What do you do when a Ceph OSD is nearfull?

I set up a cluster of four servers with three disks each, using a combination of 3TB and 1TB drives I had lying around at the time.

When I run ceph osd status, I see that one of the 1TB OSDs is nearfull (by default, the nearfull warning trips at 85% utilization), which isn’t right.  You never want an OSD to fill up to 100%, so I need to make some changes. Here is what my OSD status looks like right now.

[root@dwlaxosd1 ~]# ceph osd status
+----+-------------------------+-------+-------+--------+---------+--------+---------+--------------------+
| id |           host          |  used | avail | wr ops | wr data | rd ops | rd data |       state        |
+----+-------------------------+-------+-------+--------+---------+--------+---------+--------------------+
| 0  | dwlaxosd1.deasilnet.com | 1430G | 1364G |    0   |     0   |    0   |     0   |     exists,up      |
| 1  | dwlaxosd2.deasilnet.com | 1620G | 1174G |    0   |     0   |    0   |     0   |     exists,up      |
| 2  |                         |    0  |    0  |    0   |     0   |    0   |     0   |   autoout,exists   |
| 3  | dwlaxosd4.deasilnet.com |  454G | 2340G |    0   |     0   |    0   |     0   |     exists,up      |
| 4  | dwlaxosd1.deasilnet.com | 1196G | 1598G |    0   |     0   |    0   |     0   |     exists,up      |
| 5  | dwlaxosd1.deasilnet.com | 1084G | 1709G |    0   |     0   |    0   |     0   |     exists,up      |
| 6  | dwlaxosd2.deasilnet.com |  812G |  119G |    0   |     0   |    0   |     0   | exists,nearfull,up |
| 7  | dwlaxosd3.deasilnet.com |  771G |  160G |    0   |     0   |    0   |     0   |     exists,up      |
| 8  | dwlaxosd3.deasilnet.com |  657G |  273G |    0   |     0   |    0   |     0   |     exists,up      |
| 9  | dwlaxosd4.deasilnet.com |  427G |  504G |    0   |     0   |    0   |     0   |     exists,up      |
| 10 | dwlaxosd4.deasilnet.com |  387G |  544G |    0   |     0   |    0   |     0   |     exists,up      |
+----+-------------------------+-------+-------+--------+---------+--------+---------+--------------------+
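Before changing anything, it helps to get a per-OSD utilization breakdown. On recent releases, ceph osd df (output omitted here) shows each OSD’s size, raw use, %USE, and current reweight value, which makes the outliers easy to spot:

ceph osd df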

Options for redistributing storage

We have two options: add more OSDs, which will help redistribute the placement groups (PGs), or reweight the existing OSDs, which is what I’m going to show today, so Ceph redistributes the PGs without actually adding more storage.

ceph osd reweight-by-utilization [percentage] 

Run with no arguments, the command lowers the weight of at most four OSDs whose utilization exceeds 120% of the cluster average, by at most 0.05 each.
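There is also a dry-run variant that reports what would change without committing anything. Assuming a reasonably recent release, the optional arguments are the overload threshold (as a percent of the average), the maximum weight change per OSD, and the maximum number of OSDs to touch; the values below are just the defaults spelled out:

ceph osd test-reweight-by-utilization 120 0.05 4

I find the dry run handy for sanity-checking before letting Ceph start moving data.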

Alternatively, we can change one OSD at a time by hand:

ceph osd reweight osd.X <weight>

Where X is the OSD number (e.g., osd.6) and the weight defaults to 1.0. I can change it to something like 0.90; the weight doesn’t need to move much, small fractions are enough.
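For example, to drop the nearfull osd.6 to 90% (the value here is just illustrative):

ceph osd reweight osd.6 0.90

Keep in mind this reweight is an override on top of the CRUSH weight; it shows up in the REWEIGHT column of ceph osd tree rather than changing the WEIGHT column.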

Rebalance cluster using reweight

So, I’m going to run the reweight-by-utilization command on my cluster; this will automatically lower the weights of the over-utilized OSDs to redistribute PGs and rebalance the cluster.

[root@dwlaxosd1 ~]# ceph osd reweight-by-utilization
moved 10 / 512 (1.95312%)
avg 51.2
stddev 21.9399 -> 22.4535 (expected baseline 6.78823)
min osd.10 with 20 -> 20 pgs (0.390625 -> 0.390625 * mean)
max osd.3 with 87 -> 90 pgs (1.69922 -> 1.75781 * mean)

oload 120
max_change 0.05
max_change_osds 4
average_utilization 0.4747
overload_utilization 0.5697
osd.6 weight 1.0000 -> 0.9500
osd.7 weight 1.0000 -> 0.9500
osd.8 weight 1.0000 -> 0.9500
osd.1 weight 1.0000 -> 0.9500

We can see that Ceph adjusted four OSDs and lowered each weight by only 0.05 (the default max_change); as I mentioned, only a small change is necessary.

I’ll wait a little while and see how the rebalance looks…
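One way to keep an eye on the rebalance is to watch the recovery/backfill activity and the per-OSD numbers while data moves:

ceph -s
ceph osd df

ceph -s will show PGs backfilling until the cluster settles back to HEALTH_OK.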

I checked again six hours later, and it hadn’t given me the desired effect, so I re-ran the reweight-by-utilization command.  Six hours after that, a different OSD was starting to show nearfull, so I reweighted one more time.  After waiting a couple more days, this is my result.

[root@dwlaxosd1 ~]# ceph osd status
+----+-------------------------+-------+-------+--------+---------+--------+---------+-----------+
| id |           host          |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+----+-------------------------+-------+-------+--------+---------+--------+---------+-----------+
| 0  | dwlaxosd1.deasilnet.com | 1352G | 1441G |    0   |     0   |    0   |     0   | exists,up |
| 1  | dwlaxosd2.deasilnet.com | 1104G | 1689G |    0   |     0   |    0   |     0   | exists,up |
| 2  | dwlaxosd3.deasilnet.com | 1800G |  994G |    0   |     0   |    0   |     0   | exists,up |
| 3  | dwlaxosd4.deasilnet.com | 1764G | 1030G |    0   |     0   |    0   |     0   | exists,up |
| 4  | dwlaxosd1.deasilnet.com | 1185G | 1608G |    0   |     0   |    0   |     0   | exists,up |
| 5  | dwlaxosd1.deasilnet.com | 1107G | 1686G |    0   |     0   |    0   |     0   | exists,up |
| 6  | dwlaxosd2.deasilnet.com |  614G |  316G |    0   |     0   |    0   |     0   | exists,up |
| 7  | dwlaxosd3.deasilnet.com |  370G |  560G |    0   |     0   |    0   |     0   | exists,up |
| 8  | dwlaxosd3.deasilnet.com |  411G |  520G |    0   |     0   |    0   |     0   | exists,up |
| 9  | dwlaxosd4.deasilnet.com |  493G |  438G |    0   |     0   |    0   |     0   | exists,up |
| 10 | dwlaxosd4.deasilnet.com |  285G |  645G |    0   |     0   |    0   |     0   | exists,up |
+----+-------------------------+-------+-------+--------+---------+--------+---------+-----------+

You’ll notice that the storage was redistributed. Now let’s look at how the reweighting turned out for each OSD.

[root@dwlaxosd1 ~]# ceph osd tree
ID CLASS WEIGHT   TYPE NAME          STATUS REWEIGHT PRI-AFF 
-1       20.92242 root default                               
-3        8.18697     host dwlaxosd1                         
 0   hdd  2.72899         osd.0          up  1.00000 1.00000 
 4   hdd  2.72899         osd.4          up  1.00000 1.00000 
 5   hdd  2.72899         osd.5          up  1.00000 1.00000 
-5        3.63869     host dwlaxosd2                         
 1   hdd  2.72899         osd.1          up  0.80005 1.00000 
 6   hdd  0.90970         osd.6          up  0.75006 1.00000 
-7        4.54839     host dwlaxosd3                         
 2   hdd  2.72899         osd.2          up  1.00000 1.00000 
 7   hdd  0.90970         osd.7          up  0.75006 1.00000 
 8   hdd  0.90970         osd.8          up  0.75006 1.00000 
-9        4.54839     host dwlaxosd4                         
 3   hdd  2.72899         osd.3          up  0.95001 1.00000 
 9   hdd  0.90970         osd.9          up  1.00000 1.00000 
10   hdd  0.90970         osd.10         up  1.00000 1.00000 

Looking at the REWEIGHT column, you can see the effect of reweighting a couple of times: the 1TB devices received the largest changes, along with a couple of 3TB devices that Ceph decided needed a different distribution.
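A note on the columns: the REWEIGHT override is separate from the CRUSH weight, and you can clear an override at any time by setting it back to 1.0, for example:

ceph osd reweight osd.1 1.0

The CRUSH weights in the WEIGHT column (2.72899 for the 3TB drives, 0.90970 for the 1TB drives) reflect raw capacity in TiB and are left alone by reweight-by-utilization.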

Right now my Ceph status is HEALTH_OK, so we’re good, but I’m going to go ahead and run reweight one more time since I can see that osd.2 and osd.6 are still not evenly balanced.  Let’s see what happens.

[root@dwlaxosd1 ~]# ceph osd reweight-by-utilization
moved 15 / 512 (2.92969%)
avg 46.5455
stddev 24.0354 -> 23.7194 (expected baseline 6.50493)
min osd.10 with 17 -> 17 pgs (0.365234 -> 0.365234 * mean)
max osd.2 with 92 -> 86 pgs (1.97656 -> 1.84766 * mean)

oload 120
max_change 0.05
max_change_osds 4
average_utilization 0.4897
overload_utilization 0.5876
osd.6 weight 0.7501 -> 0.7001
osd.2 weight 1.0000 -> 0.9500
osd.3 weight 0.9500 -> 0.9000
osd.1 weight 0.8000 -> 0.8500

Just as I thought, it adjusted osd.2 and osd.6, and it also tweaked osd.3 and osd.1.
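One detail from that last run: osd.1’s weight actually went up (0.8000 -> 0.8500) because it had become under-utilized. If you’d rather the command only ever lower weights, recent releases accept a --no-increasing flag (the numeric values below are just the defaults):

ceph osd reweight-by-utilization 120 0.05 4 --no-increasing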

That should be good for now.
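As a closing aside: on Luminous and newer clusters, the balancer module can automate this whole dance. A minimal sketch, assuming the module is available and every client is Luminous-capable:

ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap
ceph balancer on
ceph balancer status

In upmap mode Ceph remaps individual PGs instead of nudging weights, which generally produces a more even distribution than repeated reweight runs.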