wiki.getshifting.com

--- Sjoerd Hooft's InFormation Technology ---

Replace a Disk on a NetApp Filer

Summary: How to replace a failed disk on a NetApp filer.
Date: Around 2015
Refactor: 7 March 2025: Checked links and formatting.

If you need to replace a disk in one of your NetApp filers (for example, because it is faulty), you can use the disk replace command:

disk replace start [-f] [-m] <disk_name> <spare_disk_name>
  • -f: skips the confirmation prompt.
  • -m: allows mixing disks with different characteristics. The target (spare) disk may have a rotational speed that does not match that of the majority of disks in the aggregate, and it may come from the opposite spare pool.

The disk replace command uses Rapid RAID Recovery to copy data from the specified file system disk to the specified spare disk. At the end of that process the roles of the two disks are reversed: the spare disk takes the place of the file system disk in the RAID group, and the file system disk becomes a spare.

The process can be stopped with:

disk replace stop <disk_name>
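As a small illustration of the syntax above, the following Python sketch assembles the start and stop command lines from their parts. The function names are hypothetical helpers and not part of any NetApp tooling; the sketch only builds strings and does not connect to a filer.

```python
def disk_replace_start(disk, spare, force=False, allow_mixing=False):
    """Build a 'disk replace start' command line (hypothetical helper)."""
    parts = ["disk", "replace", "start"]
    if force:
        parts.append("-f")  # skip the confirmation prompt
    if allow_mixing:
        parts.append("-m")  # allow mismatched RPM or the opposite spare pool
    parts += [disk, spare]
    return " ".join(parts)


def disk_replace_stop(disk):
    """Build the matching 'disk replace stop' command line."""
    return f"disk replace stop {disk}"


# Disk names taken from the example further down this page.
print(disk_replace_start("1a.73", "1a.60"))
print(disk_replace_stop("1a.73"))
```

Generating the command text this way can help avoid swapping the file system disk and the spare disk, since the arguments are named.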

Recognizing a Failed Disk

Sometimes a disk no longer performs well but has not yet been reported as failed. In the NetApp OnCommand manager, that looks like this:

netappdiskreplace02.jpg


As you can see, one disk is running at 100% while the other disks are not. Because they are in the same aggregate, and NetApp uses WAFL as its file layout, all disks should show roughly the same usage percentage (hot spots could cause some variation, but even then I would not expect a difference like this).
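The same check can be sketched in Python: given a busy percentage per disk in one aggregate, flag any disk that sits far above the median. The sample values, the function name, and the thresholds (at least 50% busy and more than twice the median) are assumptions chosen for illustration; they are not values from the screenshot or from NetApp documentation.

```python
from statistics import median


def find_busy_outliers(utilization, factor=2.0, floor=50.0):
    """Return disks that are at least `floor` percent busy and more than
    `factor` times as busy as the median disk in the same aggregate."""
    mid = median(utilization.values())
    return sorted(
        disk
        for disk, busy in utilization.items()
        if busy >= floor and busy > factor * mid
    )


# Hypothetical busy percentages for a few disks of one aggregate.
sample = {
    "1a.73": 100.0,
    "1d.42": 22.0,
    "1d.40": 25.0,
    "1d.55": 21.0,
    "1a.56": 24.0,
}
print(find_busy_outliers(sample))  # ['1a.73']
```

Comparing against the median rather than the mean keeps a single saturated disk from pulling the baseline up.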

Monitoring and Removing the Replaced Disk

Monitoring

You can monitor the progress with sysconfig -r. The output looks like this:

Aggregate aggr1 (online, raid_dp) (block checksums)
  Plex /aggr1/plex0 (online, normal, active, pool0)
    RAID group /aggr1/plex0/rg0 (normal, block checksums)

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      dparity   1a.39   1a    2   7   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      parity    1a.27   1a    1   11  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.42   1d    2   10  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.40   1d    2   8   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.55   1d    3   7   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.56   1a    3   8   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.25   1a    1   9   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.75   1d    4   11  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.73   1a    4   9   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304 (replacing, copy in progress)
      -> copy   1a.60   1a    3   12  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304 (copy 0% completed)
      data      1d.72   1d    4   8   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.58   1d    3   10  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.23   1d    1   7   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.71   1a    4   7   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.43   1d    2   11  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304

You can also monitor the progress using aggr status -r aggr1. This will give you roughly the same output.
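If you capture this output to a file or a variable, the copy progress can be extracted with a short script. The following Python sketch assumes the line format shown above ("replacing, copy in progress" followed by a "-> copy" line); the function name and regular expressions are my own and may need adjusting for other ONTAP versions.

```python
import re

# Line of the disk being replaced, e.g. "data 1a.73 ... (replacing, copy in progress)"
REPLACING_LINE = re.compile(r"^\s*\S+\s+(\S+)\s.*\(replacing, copy in progress\)")
# Line of the target spare, e.g. "-> copy 1a.60 ... (copy 0% completed)"
COPY_LINE = re.compile(r"->\s+copy\s+(\S+)\s.*\(copy\s+(\d+)%\s+completed\)")


def copy_progress(output):
    """Return (source_disk, target_disk, percent) tuples found in the output."""
    results = []
    source = None
    for line in output.splitlines():
        match = REPLACING_LINE.search(line)
        if match:
            source = match.group(1)
            continue
        match = COPY_LINE.search(line)
        if match:
            results.append((source, match.group(1), int(match.group(2))))
            source = None
    return results


sample = """\
      data      1d.75   1d    4   11  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.73   1a    4   9   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304 (replacing, copy in progress)
      -> copy   1a.60   1a    3   12  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304 (copy 0% completed)
      data      1d.72   1d    4   8   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
"""
print(copy_progress(sample))  # [('1a.73', '1a.60', 0)]
```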

Remove the Replaced Drive

Once the copy has completed, you will want to remove the replaced drive. To help identify it, you can make the red LED on the drive blink so that it is obvious to the person pulling the drive which one to remove:

priv set advanced
blink_on 0c.32

    or

led_on 0c.32
priv set admin

As you can see, the blink_on and led_on commands are privileged commands. Also note that these commands only have effect for a limited time: after a while (I'm not sure exactly how long) the red LED goes off again.

Note: ONTAP 8.1 broke the led_on and blink_on commands.

Example

Use disk replace to replace a faulty disk:

filer01a> disk replace start 1a.73 1a.60
*** You are about to copy and replace the following file system disk ***
  Disk /aggr1/plex0/rg0/1a.73

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      data      1a.73   1a    4   9   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
***
Really replace disk 1a.73 with 1a.60? y
disk replace: Disk 1a.73 was marked for replacing.

Monitor progress:

filer01a> sysconfig -r
Aggregate aggr1 (online, raid_dp) (block checksums)
  Plex /aggr1/plex0 (online, normal, active, pool0)
    RAID group /aggr1/plex0/rg0 (normal, block checksums)

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      dparity   1a.39   1a    2   7   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      parity    1a.27   1a    1   11  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.42   1d    2   10  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.40   1d    2   8   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.55   1d    3   7   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.56   1a    3   8   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.25   1a    1   9   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.75   1d    4   11  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.73   1a    4   9   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304 (replacing, copy in progress)
      -> copy   1a.60   1a    3   12  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304 (copy 0% completed)
      data      1d.72   1d    4   8   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.58   1d    3   10  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.23   1d    1   7   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.71   1a    4   7   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.43   1d    2   11  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304


filer01a> aggr status -r aggr1
Aggregate aggr1 (online, raid_dp) (block checksums)
  Plex /aggr1/plex0 (online, normal, active, pool0)
    RAID group /aggr1/plex0/rg0 (normal, block checksums)

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      dparity   1a.39   1a    2   7   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      parity    1a.27   1a    1   11  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.42   1d    2   10  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.40   1d    2   8   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.55   1d    3   7   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.56   1a    3   8   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.25   1a    1   9   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.75   1d    4   11  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.73   1a    4   9   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304 (replacing, copy in progress)
      -> copy   1a.60   1a    3   12  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304 (copy 0% completed)
      data      1d.72   1d    4   8   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.58   1d    3   10  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.23   1d    1   7   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.71   1a    4   7   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.43   1d    2   11  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304

    RAID group /aggr1/plex0/rg1 (normal, block checksums)

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      dparity   1a.59   1a    3   11  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      parity    1d.41   1d    2   9   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.57   1d    3   9   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.24   1a    1   8   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.74   1a    4   10  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.26   1a    1   10  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.44   1a    2   12  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.76   1a    4   12  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.45   1d    2   13  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.61   1d    3   13  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.29   1d    1   13  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.77   1a    4   13  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304

Physically replace the disk and then assign the newly inserted disk to the filer:

filer01a> disk show -n
  DISK       OWNER                      POOL   SERIAL NUMBER         HOME
------------ -------------              -----  -------------         -------------
1a.73        Not Owned                  NONE   N034TX1L

filer01a> disk assign 1a.73 -o filer01a
filer01a> disk show -n
disk show: No disks match option -n.
filer01a> disk show -v
  DISK       OWNER                      POOL   SERIAL NUMBER         HOME
------------ -------------              -----  -------------         -------------
0d.52        filer01b(151000001)    Pool0  3SJ0WW3V00009035Q971  filer01b(151000001)
0c.51        filer01a(151000000)    Pool0  J0XP8DVN              filer01a(151000000)
0c.61        filer01a(151000000)    Pool0  6SJ4XYHJ0000B2021M8H  filer01a(151000000)
0d.54        filer01b(151000001)    Pool0  3SJ0WPA000009035HK94  filer01b(151000001)
...<cut>...
1a.61        filer01a(151000000)    Pool0  P8H78HMF              filer01a(151000000)
1a.73        filer01a(151000000)    Pool0  N034TX1L              filer01a(151000000)
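When several disks are swapped at once, the assign step can be scripted. The Python sketch below assumes the disk show -n layout shown above, where unowned disks carry "Not Owned" in the OWNER column; the function names are hypothetical, and the script only prints the disk assign commands rather than running them.

```python
def unowned_disks(output):
    """Return the names of disks listed as 'Not Owned' in 'disk show -n' output."""
    disks = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows look like: "1a.73  Not Owned  NONE  N034TX1L"
        if len(fields) >= 3 and fields[1:3] == ["Not", "Owned"]:
            disks.append(fields[0])
    return disks


def assign_commands(disks, owner):
    """Build one 'disk assign' command per disk for the given owner."""
    return [f"disk assign {disk} -o {owner}" for disk in disks]


sample = """\
  DISK       OWNER                      POOL   SERIAL NUMBER         HOME
------------ -------------              -----  -------------         -------------
1a.73        Not Owned                  NONE   N034TX1L
"""
for command in assign_commands(unowned_disks(sample), "filer01a"):
    print(command)  # disk assign 1a.73 -o filer01a
```

Printing the commands first leaves room to review them before pasting them into the filer console.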

Disk Replace Message About Wrong Size

filer01a> disk replace start 1d.29 1a.73
*** You are about to copy and replace the following file system disk ***
  Disk /aggr1/plex0/rg1/1d.29

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      data      1d.29   1d    1   13  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
***
Disk 1a.73 is bigger than disk 1d.29.
Only 636 GB will be used on disk 1a.73.
Really replace disk 1d.29 with 1a.73? y
disk replace: Disk 1d.29 was marked for replacing.

OnCommand Status

You can also see the pending status in OnCommand:

netappdiskreplace01.jpg