What do you do when a Ceph OSD is nearfull?
I set up a cluster of four servers with three disks each; I used a combination of 3TB and 1TB drives that I had lying around at the time.
When I ran ceph osd status, I saw that one of the 1TB OSDs was nearfull, which isn’t right. You never want an OSD to fill up completely; once an OSD reaches its full ratio, Ceph stops accepting writes. So I need to make some changes. Here is what my OSD status looks like right now.
[root@dwlaxosd1 ~]# ceph osd status
+----+-------------------------+-------+-------+--------+---------+--------+---------+--------------------+
| id |          host           |  used | avail | wr ops | wr data | rd ops | rd data |       state        |
+----+-------------------------+-------+-------+--------+---------+--------+---------+--------------------+
| 0  | dwlaxosd1.deasilnet.com | 1430G | 1364G |    0   |    0    |    0   |    0    | exists,up          |
| 1  | dwlaxosd2.deasilnet.com | 1620G | 1174G |    0   |    0    |    0   |    0    | exists,up          |
| 2  |                         |     0 |     0 |    0   |    0    |    0   |    0    | autoout,exists     |
| 3  | dwlaxosd4.deasilnet.com |  454G | 2340G |    0   |    0    |    0   |    0    | exists,up          |
| 4  | dwlaxosd1.deasilnet.com | 1196G | 1598G |    0   |    0    |    0   |    0    | exists,up          |
| 5  | dwlaxosd1.deasilnet.com | 1084G | 1709G |    0   |    0    |    0   |    0    | exists,up          |
| 6  | dwlaxosd2.deasilnet.com |  812G |  119G |    0   |    0    |    0   |    0    | exists,nearfull,up |
| 7  | dwlaxosd3.deasilnet.com |  771G |  160G |    0   |    0    |    0   |    0    | exists,up          |
| 8  | dwlaxosd3.deasilnet.com |  657G |  273G |    0   |    0    |    0   |    0    | exists,up          |
| 9  | dwlaxosd4.deasilnet.com |  427G |  504G |    0   |    0    |    0   |    0    | exists,up          |
| 10 | dwlaxosd4.deasilnet.com |  387G |  544G |    0   |    0    |    0   |    0    | exists,up          |
+----+-------------------------+-------+-------+--------+---------+--------+---------+--------------------+
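If you want more detail than ceph osd status gives, a few other commands show the same problem from different angles. This is just a quick sketch of the checks I find useful; it assumes a reasonably recent Ceph release (Luminous or later) for the ratio lines in ceph osd dump.

# Lists exactly which OSDs Ceph considers nearfull and why
ceph health detail

# Per-OSD utilization, CRUSH weight, reweight value, and PG count
ceph osd df

# The nearfull/backfillfull/full ratios currently configured on the cluster
ceph osd dump | grep -i ratio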
Redistribute storage options
We have two options. One is adding more OSDs, which would help redistribute the placement groups (PGs). The other, which I’m going to show today, is reweighting the OSDs so that Ceph redistributes the PGs without actually adding more storage.
ceph osd reweight-by-utilization [percentage]
By default, running the command adjusts at most 4 OSDs whose utilization is 120% or more of the cluster average, lowering each weight by no more than 0.05 per run.
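If you’d rather see what it would do before committing, Ceph also ships a dry-run form of the same command. A minimal sketch, assuming a recent release; the three arguments below are just the defaults (overload threshold, maximum weight change, maximum number of OSDs) written out explicitly, not values tuned for my cluster:

# Dry run: report which OSDs would be reweighted and by how much, without changing anything
ceph osd test-reweight-by-utilization

# The same defaults spelled out: 120% overload threshold, at most 0.05 change, at most 4 OSDs per run
ceph osd reweight-by-utilization 120 0.05 4

We can also manually change one OSD at a time.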
ceph osd reweight osd.X [weight]
Where X is the OSD number, e.g., osd.6. The reweight value is 1.0 by default, and I can change it to something like 0.90; the weight doesn’t need to change much, just small fractions.
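For example, to pull a little data off the nearfull osd.6 by hand, the manual route would look like the sketch below (I didn’t actually run this; the documented form of the command takes the numeric OSD id):

# Lower osd.6's reweight from 1.00 to 0.90 so CRUSH maps fewer PGs to it
ceph osd reweight 6 0.90

# Confirm the new value in the REWEIGHT column
ceph osd tree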
Rebalance cluster using reweight
So, I’m going to run the reweight-by-utilization command on my cluster; this will automatically lower the weight of the over-utilized OSDs so that Ceph redistributes the PGs and rebalances my cluster.
[root@dwlaxosd1 ~]# ceph osd reweight-by-utilization
moved 10 / 512 (1.95312%)
avg 51.2
stddev 21.9399 -> 22.4535 (expected baseline 6.78823)
min osd.10 with 20 -> 20 pgs (0.390625 -> 0.390625 * mean)
max osd.3 with 87 -> 90 pgs (1.69922 -> 1.75781 * mean)
oload 120
max_change 0.05
max_change_osds 4
average_utilization 0.4747
overload_utilization 0.5697
osd.6 weight 1.0000 -> 0.9500
osd.7 weight 1.0000 -> 0.9500
osd.8 weight 1.0000 -> 0.9500
osd.1 weight 1.0000 -> 0.9500
We can see that Ceph made adjustments to 4 OSDs and only lowered each weight by 0.05; like I mentioned, just a small amount is necessary.
I’ll wait a little while and see how the rebalance looks…
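While the cluster shuffles PGs around, a few generic commands are handy for watching progress. This is just a sketch of the checks, nothing specific to my cluster:

# Overall cluster health plus any ongoing recovery/backfill activity
ceph -s

# Live stream of cluster events as PGs move
ceph -w

# Re-check per-OSD utilization every 30 seconds
watch -n 30 ceph osd df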
I checked again 6 hours later, and it hadn’t given me the desired effect, so I re-ran the reweight-by-utilization command. Another 6 hours later, a different OSD was starting to show nearfull, so I reweighted one more time. Now, after waiting a couple more days, this is my final result.
[root@dwlaxosd1 ~]# ceph osd status
+----+-------------------------+-------+-------+--------+---------+--------+---------+-----------+
| id |          host           |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+----+-------------------------+-------+-------+--------+---------+--------+---------+-----------+
| 0  | dwlaxosd1.deasilnet.com | 1352G | 1441G |    0   |    0    |    0   |    0    | exists,up |
| 1  | dwlaxosd2.deasilnet.com | 1104G | 1689G |    0   |    0    |    0   |    0    | exists,up |
| 2  | dwlaxosd3.deasilnet.com | 1800G |  994G |    0   |    0    |    0   |    0    | exists,up |
| 3  | dwlaxosd4.deasilnet.com | 1764G | 1030G |    0   |    0    |    0   |    0    | exists,up |
| 4  | dwlaxosd1.deasilnet.com | 1185G | 1608G |    0   |    0    |    0   |    0    | exists,up |
| 5  | dwlaxosd1.deasilnet.com | 1107G | 1686G |    0   |    0    |    0   |    0    | exists,up |
| 6  | dwlaxosd2.deasilnet.com |  614G |  316G |    0   |    0    |    0   |    0    | exists,up |
| 7  | dwlaxosd3.deasilnet.com |  370G |  560G |    0   |    0    |    0   |    0    | exists,up |
| 8  | dwlaxosd3.deasilnet.com |  411G |  520G |    0   |    0    |    0   |    0    | exists,up |
| 9  | dwlaxosd4.deasilnet.com |  493G |  438G |    0   |    0    |    0   |    0    | exists,up |
| 10 | dwlaxosd4.deasilnet.com |  285G |  645G |    0   |    0    |    0   |    0    | exists,up |
+----+-------------------------+-------+-------+--------+---------+--------+---------+-----------+
You’ll notice that the storage was redistributed. Now let’s look at the final result and see how the reweighting turned out for each OSD.
[root@dwlaxosd1 ~]# ceph osd tree
ID CLASS WEIGHT   TYPE NAME           STATUS REWEIGHT PRI-AFF
-1       20.92242 root default
-3        8.18697     host dwlaxosd1
 0   hdd  2.72899         osd.0           up  1.00000 1.00000
 4   hdd  2.72899         osd.4           up  1.00000 1.00000
 5   hdd  2.72899         osd.5           up  1.00000 1.00000
-5        3.63869     host dwlaxosd2
 1   hdd  2.72899         osd.1           up  0.80005 1.00000
 6   hdd  0.90970         osd.6           up  0.75006 1.00000
-7        4.54839     host dwlaxosd3
 2   hdd  2.72899         osd.2           up  1.00000 1.00000
 7   hdd  0.90970         osd.7           up  0.75006 1.00000
 8   hdd  0.90970         osd.8           up  0.75006 1.00000
-9        4.54839     host dwlaxosd4
 3   hdd  2.72899         osd.3           up  0.95001 1.00000
 9   hdd  0.90970         osd.9           up  1.00000 1.00000
10   hdd  0.90970         osd.10          up  1.00000 1.00000
Now we can see how the reweighting looks. The WEIGHT column is the CRUSH weight derived from each disk’s size, while REWEIGHT is the temporary override that the reweight commands adjust. You can see that I reweighted a couple of times: some of the 1TB devices got the largest changes, along with a couple of 3TB devices that Ceph decided needed a different distribution.
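If I add more disks later and want to hand control back to the automatic distribution, these overrides can simply be set back to their default. A sketch, using osd.6 as the example rather than something I’m running now:

# Restore osd.6's reweight override to the default of 1.00
ceph osd reweight 6 1.0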
Right now my Ceph status is HEALTH_OK, so we’re good, but I’m going to go ahead and run the reweight one more time, since I can see that a couple of OSDs (osd.2 and osd.6) are still not distributed evenly. Let’s see what happens.
[root@dwlaxosd1 ~]# ceph osd reweight-by-utilization
moved 15 / 512 (2.92969%)
avg 46.5455
stddev 24.0354 -> 23.7194 (expected baseline 6.50493)
min osd.10 with 17 -> 17 pgs (0.365234 -> 0.365234 * mean)
max osd.2 with 92 -> 86 pgs (1.97656 -> 1.84766 * mean)
oload 120
max_change 0.05
max_change_osds 4
average_utilization 0.4897
overload_utilization 0.5876
osd.6 weight 0.7501 -> 0.7001
osd.2 weight 1.0000 -> 0.9500
osd.3 weight 0.9500 -> 0.9000
osd.1 weight 0.8000 -> 0.8500
Just as I thought, it made adjustments to osd.2 and osd.6, and also adjusted osd.3 and osd.1.
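Once the data movement from this last pass finishes, a quick final check confirms that everything has settled. Again, just a sketch of the commands I’d run, not captured output:

# Make sure no nearfull warnings remain once backfill finishes
ceph health detail

# All PGs should eventually report active+clean
ceph pg stat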
That should be good for now.