3Ware tw_cli getting DEGRADED and ECC-ERROR on rebuild

I have some older servers still running 3Ware RAID cards.  They work great and have a nice command-line interface, tw_cli, for managing things.  I recently had a drive fail, and when I went in to do the rebuild it stopped with an ECC error and never finished.

Versions:
CentOS 6.9
3Ware 9500S-8 Card
tw_cli

These are the steps I took to resolve the issue and get everything back in working order.  First we’re going to remove the DEGRADED or FAILED disk.

Find the failed RAID drive

Each of my servers has a different cX card number, so I always issue a show first to find the RAID card, and then find the failed drive.

[root@host3 ~]# tw_cli show

Ctl   Model        Ports   Drives   Units   NotOpt   RRate   VRate   BBU
------------------------------------------------------------------------
c0    9500S-8      8       8        2       0        1       1       -        
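
If you manage several of these boxes, the controller lookup can be scripted. This is only a minimal sketch: list_controllers is a name I made up, and it assumes the column layout shown above, where each controller row starts with an ID like c0.

```shell
# Print the controller IDs (c0, c1, ...) found in `tw_cli show` output.
# Reads the output on stdin, so it can also be tried against saved text.
# list_controllers is a hypothetical helper name, not part of tw_cli.
list_controllers() {
    # The header row starts with "Ctl", which does not match c<digits>.
    awk '$1 ~ /^c[0-9]+$/ { print $1 }'
}

# On a live host (assumes tw_cli is in PATH):
#   tw_cli show | list_controllers
```
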

Now that we know we’re using c0 as the card, we can issue show on the card to see the devices and units that have been configured.

[root@host3 log]# tw_cli /c0 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    DEGRADED       -       -       64K     2793.91   ON     OFF    
u2    SINGLE    OK             -       -       -       465.651   ON     OFF    

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     465.76 GB   976773168     WD-WCAUF1191785     
p1     OK               u0     931.51 GB   1953525168    JP9960HZ1GT8RU      
p2     ECC-ERROR        u0     465.76 GB   976773168     WD-WCAUF1191407     
p3     OK               u0     931.51 GB   1953525168    Z4YEVK1V            
p4     OK               u0     465.76 GB   976773168     WD-WCAUF1217402     
p5     DEGRADED         u0     931.51 GB   1953525168    Z4YEVMAL            
p6     OK               u2     465.76 GB   976773168     6QG3SVAM            
p7     OK               u0     465.76 GB   976773168     6QG3SEQK            

Yeah, this machine is pretty old; it’s got 500GB drives.  This is what my error looked like: the u0 unit is DEGRADED, p5 is DEGRADED, and p2 shows an ECC-ERROR.
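
Picking the bad ports out of that table by eye is easy on one server, but it can also be filtered. A rough sketch, assuming the port rows keep the layout above (port in the first column, status in the second); bad_ports is a hypothetical helper name. I skip NOT-PRESENT here on the assumption that an empty slot is not a fault.

```shell
# Print "<port> <status>" for every port that is not OK.
# Reads `tw_cli /cX show` output on stdin.
# bad_ports is a hypothetical helper name, not part of tw_cli.
bad_ports() {
    # Only rows whose first field looks like p0, p1, ... are port rows;
    # unit rows (u0, u2) and the header rows are ignored.
    awk '$1 ~ /^p[0-9]+$/ && $2 != "OK" && $2 != "NOT-PRESENT" { print $1, $2 }'
}

# On a live host (assumes tw_cli is in PATH):
#   tw_cli /c0 show | bad_ports
```

Against the output above this should list p2 and p5 with their statuses.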

Remove failed drive from RAID

Next we’re going to issue a command to remove the failed drive from the RAID.  Again, this can be the degraded drive or a drive that has completely failed.

From the output above, we identify that p5 is the drive that is not working.

[root@host3 log]# tw_cli /c0/p5 remove

You can double check it was removed

[root@host3 log]# tw_cli /c0 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    DEGRADED       -       -       64K     2793.91   ON     OFF    
u2    SINGLE    OK             -       -       -       465.651   ON     OFF    

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     465.76 GB   976773168     WD-WCAUF1191785     
p1     OK               u0     931.51 GB   1953525168    JP9960HZ1GT8RU      
p2     ECC-ERROR        u0     465.76 GB   976773168     WD-WCAUF1191407     
p3     OK               u0     931.51 GB   1953525168    Z4YEVK1V            
p4     OK               u0     465.76 GB   976773168     WD-WCAUF1217402     
p5     NOT-PRESENT      -      -           -             -
p6     OK               u2     465.76 GB   976773168     6QG3SVAM            
p7     OK               u0     465.76 GB   976773168     6QG3SEQK            

You can see it’s not present at this time.  This is where you would pull the drive and replace it with a new one if the drive itself had failed.  In my case, I had already installed a new drive and it was the rebuild that failed, so I know the drive is good; the rebuild failed because of the ECC error.

Add new spare drive for rebuild

After you have inserted the replacement disk, we need the controller to rescan the drives.

[root@host3 log]# tw_cli /c0 rescan

You can double check that the drive is now showing up by issuing a show command.  Now let’s add the new drive as type spare.

[root@host3 log]# tw_cli /c0 add type=spare disk=5

If you run show after adding the spare, you’ll see that we just added u1, which is type spare.  This will be used by the RAID when performing a rebuild.

[root@host3 log]# tw_cli /c0 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    DEGRADED       -       -       64K     2793.91   ON     OFF    
u1    SPARE     OK             -       -       -       931.505   -      OFF    
u2    SINGLE    OK             -       -       -       465.651   ON     OFF    

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     465.76 GB   976773168     WD-WCAUF1191785     
p1     OK               u0     931.51 GB   1953525168    JP9960HZ1GT8RU      
p2     OK               u0     465.76 GB   976773168     WD-WCAUF1191407     
p3     OK               u0     931.51 GB   1953525168    Z4YEVK1V            
p4     OK               u0     465.76 GB   976773168     WD-WCAUF1217402     
p5     OK               u1     931.51 GB   1953525168    Z4YEVMAL            
p6     OK               u2     465.76 GB   976773168     6QG3SVAM            
p7     OK               u0     465.76 GB   976773168     6QG3SEQK            
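
Before kicking off the rebuild, it can be worth confirming from a script that the spare really registered. A small sketch under the same layout assumption (unit in column one, type in column two, status in column three); ok_spares is a hypothetical helper name.

```shell
# Print spare units that report OK, reading `tw_cli /cX show` output on stdin.
# The exit status is non-zero when no usable spare is found.
# ok_spares is a hypothetical helper name, not part of tw_cli.
ok_spares() {
    awk '$1 ~ /^u[0-9]+$/ && $2 == "SPARE" && $3 == "OK" { print $1; found = 1 }
         END { exit !found }'
}

# On a live host (assumes tw_cli is in PATH):
#   tw_cli /c0 show | ok_spares || echo "no spare available"
```
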

Rebuild the RAID and Ignore ECC

Now that we’ve got the new drive in and ready to go, we’ll need to issue a command to rebuild the RAID.

We’re going to use the ignoreecc option to get the rebuild to work.  This will get us past the problem, but be aware there could be some undetectable corruption on disk where the ECC error occurred that we’re ignoring.

I’ve done this a few times and have lucked out in that it didn’t affect any data, but you never know.

[root@host3 log]# tw_cli /c0/u0 start rebuild disk=5 ignoreecc
[root@host3 log]# tw_cli /c0 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    REBUILDING     0       -       64K     2793.91   ON     OFF    
u2    SINGLE    OK             -       -       -       465.651   ON     OFF    

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     465.76 GB   976773168     WD-WCAUF1191785     
p1     OK               u0     931.51 GB   1953525168    JP9960HZ1GT8RU      
p2     OK               u0     465.76 GB   976773168     WD-WCAUF1191407     
p3     OK               u0     931.51 GB   1953525168    Z4YEVK1V            
p4     OK               u0     465.76 GB   976773168     WD-WCAUF1217402     
p5     DEGRADED         u0     931.51 GB   1953525168    Z4YEVMAL            
p6     OK               u2     465.76 GB   976773168     6QG3SVAM            
p7     OK               u0     465.76 GB   976773168     6QG3SEQK            

Depending on the size of your RAID, the rebuild will take at least a few hours.
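
Rather than re-running show by hand for hours, the progress can be polled. This is a sketch, not a polished tool: unit_status and watch_rebuild are names I made up, the c0/u0/300-second defaults are just illustrative, and it assumes the unit row layout shown above (status in the third column, %RCmpl in the fourth).

```shell
# Print "<status> <%RCmpl>" for one unit, reading `tw_cli /cX show` on stdin.
# unit_status and watch_rebuild are hypothetical helper names.
unit_status() {
    awk -v unit="$1" '$1 == unit { print $3, $4 }'
}

# Poll until the unit is no longer REBUILDING.
# Assumes tw_cli is in PATH; arguments: controller, unit, seconds between polls.
watch_rebuild() {
    ctl="${1:-c0}"
    unit="${2:-u0}"
    interval="${3:-300}"
    while :; do
        line=$(tw_cli "/$ctl" show | unit_status "$unit")
        echo "$(date '+%F %T') $unit: $line"
        # Stop as soon as the status is anything other than REBUILDING
        # (OK, DEGRADED again, or no matching row at all).
        [ "${line%% *}" = "REBUILDING" ] || break
        sleep "$interval"
    done
}

# On a live host:
#   watch_rebuild c0 u0 600
```

The loop exits on any status other than REBUILDING, so check the last line it prints: OK means the rebuild finished, while DEGRADED means it stopped again.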

 
