Replace a Disk on a NetApp Filer
Summary: How to replace a failed disk on a NetApp Filer.
Date: Around 2015
Refactor: 7 March 2025: Checked links and formatting.
If you need to replace a disk in one of your NetApp filers (for example, because it is faulty), you can use the disk replace command:
disk replace start [-f] [-m] <disk_name> <spare_disk_name>
-f
: skips the confirmation prompt.
-m
: allows mixing disks with different characteristics. The target disk may have a rotational speed that does not match that of the majority of disks in the aggregate, and it may come from the opposite spare pool.
The disk replace command uses Rapid RAID Recovery to copy data from the specified file system disk to the specified spare disk. At the end of that process, the roles of the two disks are reversed: the spare disk replaces the file system disk in the RAID group, and the file system disk becomes a spare.
The process can be stopped with:
disk replace stop <disk_name>
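The start and stop syntax above can be captured in a small helper. This is only a sketch that builds the command strings; the function names disk_replace_start and disk_replace_stop are made up for illustration, and nothing is sent to a filer.

```python
def disk_replace_start(disk, spare, force=False, mix=False):
    """Build a 'disk replace start' command line (string only, nothing is executed)."""
    parts = ["disk", "replace", "start"]
    if force:
        parts.append("-f")  # skip the confirmation prompt
    if mix:
        parts.append("-m")  # allow a spare with different characteristics
    parts += [disk, spare]
    return " ".join(parts)


def disk_replace_stop(disk):
    """Build the matching 'disk replace stop' command line."""
    return f"disk replace stop {disk}"


# Disk names taken from the example later in this article.
print(disk_replace_start("1a.73", "1a.60"))
print(disk_replace_start("1a.73", "1a.60", force=True, mix=True))
print(disk_replace_stop("1a.73"))
```

Keeping the flags as keyword arguments makes it harder to pass -f by accident, which matters because -f removes the only confirmation step.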
Recognizing a Failed Disk
Sometimes a disk no longer functions well but does not report a failure yet. In the NetApp OnCommand manager, this shows up as one disk running at 100% while the other disks are not.
Because the disks are in the same aggregate, and NetApp uses WAFL for its file layout, all disks should have roughly the same usage percentage (unless you have hot spots, but even then I would not expect a difference like this).
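The "one disk at 100% while the others are not" check can be sketched in a few lines. This is an illustration only: the utilization numbers are invented, the function name find_busy_outliers and the 40-point margin are arbitrary assumptions, and how you collect the percentages from your filer is left open.

```python
from statistics import median


def find_busy_outliers(utilization, margin=40.0):
    """Return disks whose utilization exceeds the aggregate median by more than `margin` points."""
    mid = median(utilization.values())
    return sorted(disk for disk, pct in utilization.items() if pct - mid > margin)


# Hypothetical utilization percentages for disks in one aggregate.
sample = {"1a.25": 22.0, "1a.56": 25.0, "1a.73": 100.0, "1d.40": 24.0, "1d.42": 21.0}
print(find_busy_outliers(sample))
```

Comparing against the median rather than the mean keeps a single saturated disk from dragging the baseline up.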
Monitoring and Removing the Replaced Disk
Monitoring
You can monitor the progress with sysconfig -r. The output looks like this:
Aggregate aggr1 (online, raid_dp) (block checksums)
  Plex /aggr1/plex0 (online, normal, active, pool0)
    RAID group /aggr1/plex0/rg0 (normal, block checksums)

      RAID Disk Device HA  SHELF BAY CHAN Pool Type RPM   Used (MB/blks)    Phys (MB/blks)
      --------- ------ ------------- ---- ---- ---- ----- --------------    --------------
      dparity   1a.39  1a    2   7   FC:B 0    ATA  7200  635555/1301618176 635858/1302238304
      parity    1a.27  1a    1   11  FC:B 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1d.42  1d    2   10  FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1d.40  1d    2   8   FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1d.55  1d    3   7   FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1a.56  1a    3   8   FC:B 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1a.25  1a    1   9   FC:B 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1d.75  1d    4   11  FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1a.73  1a    4   9   FC:B 0    ATA  7200  635555/1301618176 635858/1302238304 (replacing, copy in progress)
      -> copy   1a.60  1a    3   12  FC:B 0    ATA  7200  635555/1301618176 635858/1302238304 (copy 0% completed)
      data      1d.72  1d    4   8   FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1d.58  1d    3   10  FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1d.23  1d    1   7   FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1a.71  1a    4   7   FC:B 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1d.43  1d    2   11  FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
You can also monitor the progress using aggr status -r aggr1. This will give you roughly the same output.
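If you capture that output in a script, the copy progress can be picked out with a regular expression. The sketch below assumes the "-> copy" row and its "(copy N% completed)" note sit on the same line, as in the sample shown here; the function name copy_progress is made up, and how you fetch the output from the filer is not covered.

```python
import re

# Matches the spare's device name and the percentage on a "-> copy" row.
COPY_RE = re.compile(r"->\s*copy\s+(\S+)\s.*\(copy\s+(\d+)% completed\)")


def copy_progress(output):
    """Return {spare_device: percent} for every in-progress copy found in the output."""
    progress = {}
    for line in output.splitlines():
        match = COPY_RE.search(line)
        if match:
            progress[match.group(1)] = int(match.group(2))
    return progress


# Three rows copied from the sysconfig -r output in this article.
sample = """\
data      1a.73  1a    4   9   FC:B 0    ATA  7200  635555/1301618176 635858/1302238304 (replacing, copy in progress)
-> copy   1a.60  1a    3   12  FC:B 0    ATA  7200  635555/1301618176 635858/1302238304 (copy 0% completed)
data      1d.72  1d    4   8   FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
"""
print(copy_progress(sample))
```

An empty result means no copy row was found, which is what you would expect once the replacement has finished.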
Remove the Replaced Drive
Once the reconstruction has completed, you will want to remove the drive. In order to help you properly identify the drive, you can have the RED LED blink on the drive in a consistent manner to make it obvious to the person who will be pulling the drive:
priv set advanced
blink_on 0c.32
or
led_on 0c.32
priv set admin
As you can see, blink_on and led_on are privileged commands. Also note that these commands only have effect for a little while. After some time (I am not sure exactly how long) the red LED goes off again.
Note: ONTAP 8.1 broke the led_on and blink_on commands.
Example
Use disk replace to replace a faulty disk:
filer01a> disk replace start 1a.73 1a.60

*** You are about to copy and replace the following file system disk ***

  Disk /aggr1/plex0/rg0/1a.73

    RAID Disk Device HA  SHELF BAY CHAN Pool Type RPM   Used (MB/blks)    Phys (MB/blks)
    --------- ------ ------------- ---- ---- ---- ----- --------------    --------------
    data      1a.73  1a    4   9   FC:B 0    ATA  7200  635555/1301618176 635858/1302238304

***
Really replace disk 1a.73 with 1a.60? y
disk replace: Disk 1a.73 was marked for replacing.
Monitor progress:
filer01a> sysconfig -r
Aggregate aggr1 (online, raid_dp) (block checksums)
  Plex /aggr1/plex0 (online, normal, active, pool0)
    RAID group /aggr1/plex0/rg0 (normal, block checksums)

      RAID Disk Device HA  SHELF BAY CHAN Pool Type RPM   Used (MB/blks)    Phys (MB/blks)
      --------- ------ ------------- ---- ---- ---- ----- --------------    --------------
      dparity   1a.39  1a    2   7   FC:B 0    ATA  7200  635555/1301618176 635858/1302238304
      parity    1a.27  1a    1   11  FC:B 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1d.42  1d    2   10  FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1d.40  1d    2   8   FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1d.55  1d    3   7   FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1a.56  1a    3   8   FC:B 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1a.25  1a    1   9   FC:B 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1d.75  1d    4   11  FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1a.73  1a    4   9   FC:B 0    ATA  7200  635555/1301618176 635858/1302238304 (replacing, copy in progress)
      -> copy   1a.60  1a    3   12  FC:B 0    ATA  7200  635555/1301618176 635858/1302238304 (copy 0% completed)
      data      1d.72  1d    4   8   FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1d.58  1d    3   10  FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1d.23  1d    1   7   FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1a.71  1a    4   7   FC:B 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1d.43  1d    2   11  FC:A 0    ATA  7200  635555/1301618176 635858/1302238304

filer01a> aggr status -r aggr1
Aggregate aggr1 (online, raid_dp) (block checksums)
  Plex /aggr1/plex0 (online, normal, active, pool0)
    RAID group /aggr1/plex0/rg0 (normal, block checksums)

      RAID Disk Device HA  SHELF BAY CHAN Pool Type RPM   Used (MB/blks)    Phys (MB/blks)
      --------- ------ ------------- ---- ---- ---- ----- --------------    --------------
      dparity   1a.39  1a    2   7   FC:B 0    ATA  7200  635555/1301618176 635858/1302238304
      parity    1a.27  1a    1   11  FC:B 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1d.42  1d    2   10  FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1d.40  1d    2   8   FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1d.55  1d    3   7   FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1a.56  1a    3   8   FC:B 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1a.25  1a    1   9   FC:B 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1d.75  1d    4   11  FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1a.73  1a    4   9   FC:B 0    ATA  7200  635555/1301618176 635858/1302238304 (replacing, copy in progress)
      -> copy   1a.60  1a    3   12  FC:B 0    ATA  7200  635555/1301618176 635858/1302238304 (copy 0% completed)
      data      1d.72  1d    4   8   FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1d.58  1d    3   10  FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1d.23  1d    1   7   FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1a.71  1a    4   7   FC:B 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1d.43  1d    2   11  FC:A 0    ATA  7200  635555/1301618176 635858/1302238304

    RAID group /aggr1/plex0/rg1 (normal, block checksums)

      RAID Disk Device HA  SHELF BAY CHAN Pool Type RPM   Used (MB/blks)    Phys (MB/blks)
      --------- ------ ------------- ---- ---- ---- ----- --------------    --------------
      dparity   1a.59  1a    3   11  FC:B 0    ATA  7200  635555/1301618176 635858/1302238304
      parity    1d.41  1d    2   9   FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1d.57  1d    3   9   FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1a.24  1a    1   8   FC:B 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1a.74  1a    4   10  FC:B 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1a.26  1a    1   10  FC:B 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1a.44  1a    2   12  FC:B 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1a.76  1a    4   12  FC:B 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1d.45  1d    2   13  FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1d.61  1d    3   13  FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1d.29  1d    1   13  FC:A 0    ATA  7200  635555/1301618176 635858/1302238304
      data      1a.77  1a    4   13  FC:B 0    ATA  7200  635555/1301618176 635858/1302238304
Physically replace the disk and then assign the newly inserted disk to the filer:
filer01a> disk show -n
DISK         OWNER               POOL  SERIAL NUMBER        HOME
------------ -------------       ----- -------------        -------------
1a.73        Not Owned           NONE  N034TX1L

filer01a> disk show -n
DISK         OWNER               POOL  SERIAL NUMBER        HOME
------------ -------------       ----- -------------        -------------
1a.73        Not Owned           NONE  N034TX1L

filer01a> disk assign 1a.73 -o filer01a

filer01a> disk show -n
disk show: No disks match option -n.

filer01a> disk show -v
DISK         OWNER               POOL  SERIAL NUMBER        HOME
------------ -------------       ----- -------------        -------------
0d.52        filer01b(151000001) Pool0 3SJ0WW3V00009035Q971 filer01b(151000001)
0c.51        filer01a(151000000) Pool0 J0XP8DVN             filer01a(151000000)
0c.61        filer01a(151000000) Pool0 6SJ4XYHJ0000B2021M8H filer01a(151000000)
0d.54        filer01b(151000001) Pool0 3SJ0WPA000009035HK94 filer01b(151000001)
...<cut>...
1a.61        filer01a(151000000) Pool0 P8H78HMF             filer01a(151000000)
1a.73        filer01a(151000000) Pool0 N034TX1L             filer01a(151000000)
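The assignment step in the transcript above can be sketched as a small parser that reads disk show -n output and proposes one disk assign command per unowned disk. The function names unowned_disks and assign_commands are hypothetical, the sketch only builds strings, and it assumes unowned rows carry "Not Owned" in the OWNER column as in the output shown.

```python
def unowned_disks(output):
    """Return the disk names of rows whose OWNER column reads 'Not Owned'."""
    disks = []
    for line in output.splitlines():
        fields = line.split()
        # A data row looks like: 1a.73  Not Owned  NONE  N034TX1L
        if len(fields) >= 3 and fields[1:3] == ["Not", "Owned"]:
            disks.append(fields[0])
    return disks


def assign_commands(output, owner):
    """Build one 'disk assign' command per unowned disk (strings only)."""
    return [f"disk assign {disk} -o {owner}" for disk in unowned_disks(output)]


# Output copied from the transcript in this article.
sample = """\
DISK         OWNER               POOL  SERIAL NUMBER        HOME
------------ -------------       ----- -------------        -------------
1a.73        Not Owned           NONE  N034TX1L
"""
print(assign_commands(sample, "filer01a"))
```

Reviewing the generated commands before running them is a cheap safeguard in an HA pair, where assigning a disk to the wrong controller is easy to do.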
Disk Replace Message About Wrong Size
When the spare disk is larger than the disk it replaces, disk replace warns that only part of the spare will be used:
filer01a> disk replace start 1d.29 1a.73

*** You are about to copy and replace the following file system disk ***

  Disk /aggr1/plex0/rg1/1d.29

    RAID Disk Device HA  SHELF BAY CHAN Pool Type RPM   Used (MB/blks)    Phys (MB/blks)
    --------- ------ ------------- ---- ---- ---- ----- --------------    --------------
    data      1d.29  1d    1   13  FC:A 0    ATA  7200  635555/1301618176 635858/1302238304

***
Disk 1a.73 is bigger than disk 1d.29. Only 636 GB will be used on disk 1a.73.
Really replace disk 1d.29 with 1a.73? y
disk replace: Disk 1d.29 was marked for replacing.
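The numbers in that message can be checked against the Used and Phys columns. The sketch below assumes 512-byte blocks and binary megabytes, which is an inference that happens to fit the figures in the output rather than something stated in it; likewise, reading the 636 GB as the Used value in MB divided by 1000 and rounded is only a guess that matches.

```python
BLOCK_SIZE = 512  # bytes per block; an assumption that fits the numbers shown


def blocks_to_mb(blocks, block_size=BLOCK_SIZE):
    """Convert a block count to whole megabytes (1 MB = 1024 * 1024 bytes here)."""
    return blocks * block_size // (1024 * 1024)


# Values from the Used (MB/blks) and Phys (MB/blks) columns for disk 1d.29.
used_mb, used_blks = 635555, 1301618176
phys_mb, phys_blks = 635858, 1302238304

print(blocks_to_mb(used_blks) == used_mb)   # block count agrees with the MB figure
print(blocks_to_mb(phys_blks) == phys_mb)
print(round(used_mb / 1000))                # the GB figure quoted in the warning
```

Under these assumptions the warning simply says that the larger spare is limited to the same used size as the other disks in the RAID group.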