3Ware tw_cli getting DEGRADED and ECC-ERROR on rebuild
I have some older servers still running 3Ware RAID cards. They work great and have a nice command-line interface, tw_cli, for managing things. I recently had a drive fail, and when I went in to do the rebuild it failed with an ECC error and never finished.
Versions:
CentOS 6.9
3Ware 9500S-8 Card
tw_cli
These are the steps I took to resolve the issue and get everything back into working order. First we’re going to remove the DEGRADED or FAILED disk.
Find the failed RAID drive
Each of my servers has a different cX card number, so I always issue a show first to find the RAID card, and then find the failed drive.
[root@host3 ~]# tw_cli show
Ctl Model Ports Drives Units NotOpt RRate VRate BBU
------------------------------------------------------------------------
c0 9500S-8 8 8 2 0 1 1 -
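If you look after several of these servers, pulling the controller ID out of that output can be scripted. This is only a minimal sketch, assuming the output layout shown above, where each controller row starts with c followed by digits. The name list_controllers is a hypothetical helper of mine, not part of tw_cli.

```shell
# Hypothetical helper: reads `tw_cli show` output on stdin and prints
# each controller ID (c0, c1, ...), one per line.
# Assumption: controller rows start with "c" followed by digits,
# as in the output shown in this post.
list_controllers() {
    awk '$1 ~ /^c[0-9]+$/ { print $1 }'
}
```

It would be used as tw_cli show | list_controllers, which on this machine should print c0.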
Now that we know we’re using c0 as the card, we can issue show on the card to see the devices and units that have been configured.
[root@host3 log]# tw_cli /c0 show
Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-5 DEGRADED - - 64K 2793.91 ON OFF
u2 SINGLE OK - - - 465.651 ON OFF
Port Status Unit Size Blocks Serial
---------------------------------------------------------------
p0 OK u0 465.76 GB 976773168 WD-WCAUF1191785
p1 OK u0 931.51 GB 1953525168 JP9960HZ1GT8RU
p2 ECC-ERROR u0 465.76 GB 976773168 WD-WCAUF1191407
p3 OK u0 931.51 GB 1953525168 Z4YEVK1V
p4 OK u0 465.76 GB 976773168 WD-WCAUF1217402
p5 DEGRADED u0 931.51 GB 1953525168 Z4YEVMAL
p6 OK u2 465.76 GB 976773168 6QG3SVAM
p7 OK u0 465.76 GB 976773168 6QG3SEQK
Yeah, this machine is pretty old; it’s got 500GB drives. This is what my error looked like: the unit is DEGRADED and one of the drives has an ECC-ERROR.
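With eight ports it is easy to eyeball the bad ones, but the same check can be done with a small filter. This is a rough sketch, assuming port rows start with p followed by digits and carry the status in the second column, as in the output above. The name bad_ports is a hypothetical helper, not a tw_cli command.

```shell
# Hypothetical helper: reads `tw_cli /cX show` output on stdin and prints
# "port status" for every port whose status is anything other than OK.
# Assumption: port rows look like "p5 DEGRADED u0 ..." as in this post.
bad_ports() {
    awk '$1 ~ /^p[0-9]+$/ && $2 != "OK" { print $1, $2 }'
}
```

Run as tw_cli /c0 show | bad_ports. Note that it reports every non-OK status, so an empty slot shown as NOT-PRESENT would be listed too.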
Remove failed drive from RAID
Next we’re going to issue a command to remove the failed drive from the RAID. Again, this can be the degraded drive or a drive that has completely failed in the RAID.
From the output above, we identify that p5 is the drive that is not working.
[root@host3 log]# tw_cli /c0/p5 remove
You can double check it was removed
[root@host3 log]# tw_cli /c0 show
Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-5 DEGRADED - - 64K 2793.91 ON OFF
u2 SINGLE OK - - - 465.651 ON OFF
Port Status Unit Size Blocks Serial
---------------------------------------------------------------
p0 OK u0 465.76 GB 976773168 WD-WCAUF1191785
p1 OK u0 931.51 GB 1953525168 JP9960HZ1GT8RU
p2 ECC-ERROR u0 465.76 GB 976773168 WD-WCAUF1191407
p3 OK u0 931.51 GB 1953525168 Z4YEVK1V
p4 OK u0 465.76 GB 976773168 WD-WCAUF1217402
p5 NOT-PRESENT - - - -
p6 OK u2 465.76 GB 976773168 6QG3SVAM
p7 OK u0 465.76 GB 976773168 6QG3SEQK
You can see p5 is NOT-PRESENT at this point. This is where you would pull the drive and replace it with a new one if the drive itself failed in the RAID. In my case, I had already installed a new drive and it was the rebuild that failed, so I know the drive is good; the rebuild failed because of the ECC error.
Add new spare drive for rebuild
After you have inserted the replacement disk, we need the controller to rescan the drives.
[root@host3 log]# tw_cli /c0 rescan
You can double check that the drive is now showing up by issuing a show command. Now let’s add the new drive as type spare.
[root@host3 log]# tw_cli /c0 add type=spare disk=5
If you run show after adding the spare, you’ll see that we just added u1, which is type spare. This will be used by the RAID when performing a rebuild.
[root@host3 log]# tw_cli /c0 show
Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-5 DEGRADED - - 64K 2793.91 ON OFF
u1 SPARE OK - - - 931.505 - OFF
u2 SINGLE OK - - - 465.651 ON OFF
Port Status Unit Size Blocks Serial
---------------------------------------------------------------
p0 OK u0 465.76 GB 976773168 WD-WCAUF1191785
p1 OK u0 931.51 GB 1953525168 JP9960HZ1GT8RU
p2 OK u0 465.76 GB 976773168 WD-WCAUF1191407
p3 OK u0 931.51 GB 1953525168 Z4YEVK1V
p4 OK u0 465.76 GB 976773168 WD-WCAUF1217402
p5 OK u1 931.51 GB 1953525168 Z4YEVMAL
p6 OK u2 465.76 GB 976773168 6QG3SVAM
p7 OK u0 465.76 GB 976773168 6QG3SEQK
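Before kicking off the rebuild it can be worth confirming from a script that the spare really is there. This is a minimal sketch, assuming the unit rows look like the output above, with the unit type in the second column and the status in the third. The name has_spare is a hypothetical helper, not something tw_cli provides.

```shell
# Hypothetical helper: reads `tw_cli /cX show` output on stdin and
# succeeds (exit 0) only if some unit has type SPARE and status OK.
# Assumption: unit rows look like "u1 SPARE OK ..." as in this post.
has_spare() {
    awk '$1 ~ /^u[0-9]+$/ && $2 == "SPARE" && $3 == "OK" { found = 1 }
         END { exit !found }'
}
```

For example, tw_cli /c0 show | has_spare && echo "spare ready" would only print the message once the spare unit shows up as OK.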
Rebuild the RAID and Ignore ECC
Now that we’ve got the new drive in and ready to go, we’ll need to issue a command to rebuild the RAID.
We’re going to use ignoreecc to get the rebuild to work. This will get us past the problem, but be aware there could be some undetected corruption on disk where the ECC error occurred that we’re ignoring.
I’ve done this a few times and have lucked out in that it didn’t affect any data, but you never know.
[root@host3 log]# tw_cli /c0/u0 start rebuild disk=5 ignoreecc
[root@host3 log]# tw_cli /c0 show
Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-5 REBUILDING 0 - 64K 2793.91 ON OFF
u2 SINGLE OK - - - 465.651 ON OFF
Port Status Unit Size Blocks Serial
---------------------------------------------------------------
p0 OK u0 465.76 GB 976773168 WD-WCAUF1191785
p1 OK u0 931.51 GB 1953525168 JP9960HZ1GT8RU
p2 OK u0 465.76 GB 976773168 WD-WCAUF1191407
p3 OK u0 931.51 GB 1953525168 Z4YEVK1V
p4 OK u0 465.76 GB 976773168 WD-WCAUF1217402
p5 DEGRADED u0 931.51 GB 1953525168 Z4YEVMAL
p6 OK u2 465.76 GB 976773168 6QG3SVAM
p7 OK u0 465.76 GB 976773168 6QG3SEQK
Depending on the size of your RAID, it will take a few hours or more to rebuild.
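Since the rebuild runs for hours, it is handy to check on it without reading the whole table each time. This is a rough sketch, assuming the unit row layout shown above, with the status in the third column and %RCmpl in the fourth. Both unit_progress and wait_for_rebuild are hypothetical helper names, and the 10 minute polling interval is an arbitrary choice.

```shell
# Hypothetical helper: reads `tw_cli /cX show` output on stdin and prints
# the Status and %RCmpl columns for the given unit (defaults to u0).
unit_progress() {
    awk -v unit="${1:-u0}" '$1 == unit { print $3, $4 }'
}

# Hypothetical helper: blocks until u0 on c0 no longer reports REBUILDING.
# Assumption: controller c0 and unit u0, as on the machine in this post.
wait_for_rebuild() {
    while tw_cli /c0 show | unit_progress u0 | grep -q '^REBUILDING'; do
        sleep 600   # poll every 10 minutes
    done
}
```

Run tw_cli /c0 show | unit_progress u0 for a one-off check. When wait_for_rebuild returns, run show again to see whether the unit ended up OK or dropped back to DEGRADED.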