How to resolve Ceph pool getting active+remapped+backfill_toofull
Ceph Storage Cluster
Ceph is a clustered storage solution that can use any number of commodity servers and hard drives. These can then be made available as object, block, or file system storage through a unified interface to your applications or servers. The data can be replicated based on your needs, so a single disk or server failure does not affect your data or the availability of the storage cluster.
Checking the Cluster
We monitor our Ceph cluster health using Nagios with Ceph plugins, and we recently had an alert that needed to be resolved.
This issue started when we added an additional 6TB drive to the cluster; as the cluster was backfilling (redistributing) data, we got an alert for a health warning of active+remapped+backfill_toofull. This is the process I went through to resolve it.
First Tried Reweighting the OSDs
I previously had a similar issue where an OSD was nearfull, and I ran a reweight to help resolve the issue:
ceph osd reweight-by-utilization
This is what the cluster looked like before starting the reweight process.
[root@osd1 ~]# ceph -s
  cluster:
    id:     ffdb9e09-fdca-48bb-b7fb-cd17151d5c09
    health: HEALTH_ERR
            1 backfillfull osd(s)
            2 pool(s) backfillfull
            26199/6685016 objects misplaced (0.392%)
            Degraded data redundancy (low space): 1 pg backfill_toofull

  services:
    mon: 3 daemons, quorum osd1,osd2,osd3
    mgr: osd1(active), standbys: osd2
    mds: cephfs-2/2/2 up {0=osd1=up:active,1=osd2=up:active}, 1 up:standby
    osd: 15 osds: 15 up, 15 in; 1 remapped pgs

  data:
    pools:   2 pools, 256 pgs
    objects: 3264k objects, 12342 GB
    usage:   24773 GB used, 18898 GB / 43671 GB avail
    pgs:     26199/6685016 objects misplaced (0.392%)
             255 active+clean
             1   active+remapped+backfill_toofull

[root@osd1 ~]# ceph osd status
+----+-------------------------+-------+-------+--------+---------+--------+---------+------------------------+
| id |           host          |  used | avail | wr ops | wr data | rd ops | rd data |         state          |
+----+-------------------------+-------+-------+--------+---------+--------+---------+------------------------+
| 0  | osd1.example.com        | 1741G | 1053G |    0   |     0   |    0   |     0   | exists,up              |
| 1  | osd2.example.com        | 2034G |  760G |    0   |     0   |    0   |     0   | exists,up              |
| 2  | osd3.example.com        | 1937G |  857G |    0   |     0   |    0   |     0   | exists,up              |
| 3  | osd4.example.com        | 2031G |  763G |    0   |     0   |    0   |     0   | exists,up              |
| 4  | osd1.example.com        | 2032G |  761G |    0   |     0   |    0   |     0   | exists,up              |
| 5  | osd1.example.com        | 2033G |  761G |    0   |     0   |    0   |     0   | exists,up              |
| 6  | osd2.example.com        |  485G |  446G |    0   |     0   |    0   |     0   | exists,up              |
| 7  | osd3.example.com        |  677G |  254G |    0   |     0   |    0   |     0   | exists,up              |
| 8  | osd3.example.com        |  869G | 61.7G |    0   |     0   |    0   |     0   | backfillfull,exists,up |
| 9  | osd4.example.com        |  676G |  255G |    0   |     0   |    0   |     0   | exists,up              |
| 10 | osd4.example.com        |  194G |  736G |    0   |     0   |    0   |     0   | exists,up              |
| 11 | osd5.example.com        | 2806G | 2782G |    0   |     0   |    0   |     0   | exists,up              |
| 12 | osd5.example.com        | 1938G | 3650G |    0   |     0   |    0   |     0   | exists,up              |
| 13 | osd5.example.com        | 2901G | 2687G |    0   |     0   |    0   |     0   | exists,up              |
| 14 | osd5.example.com        | 2412G | 3067G |    0   |     0   |    0   |     0   | exists,up              |
+----+-------------------------+-------+-------+--------+---------+--------+---------+------------------------+

[root@osd1 ~]# ceph osd reweight-by-utilization
moved 9 / 512 (1.75781%)
avg 34.1333
stddev 16.7087 -> 16.5484 (expected baseline 5.64427)
min osd.6 with 8 -> 8 pgs (0.234375 -> 0.234375 * mean)
max osd.13 with 60 -> 60 pgs (1.75781 -> 1.75781 * mean)

oload 120
max_change 0.05
max_change_osds 4
average_utilization 0.5673
overload_utilization 0.6807
osd.8 weight 0.6501 -> 0.6001
osd.1 weight 0.7501 -> 0.7001
osd.5 weight 0.8852 -> 0.8353
osd.4 weight 0.9500 -> 0.9000
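In hindsight, it is worth previewing what a reweight will change before committing to it. Recent Ceph releases also ship a dry-run variant of the command that reports the proposed weight changes without applying them, using the same defaults (oload 120, max_change 0.05, 4 OSDs) that appear in the output above. A minimal sketch:

# Dry run: report which OSD weights would change, without applying anything
ceph osd test-reweight-by-utilization

# Apply for real once the proposed changes look sensible
ceph osd reweight-by-utilization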
This process will take a while to run based on the size of your cluster and your configuration.
For me it took about 24 hours to complete, and it didn't resolve my issue, so I attempted another reweight. Again, 24 hours later, I now had two OSDs with a status of backfillfull, so I obviously needed to look into another way of getting this resolved.
Second Tried Increasing PG
I did some additional checking and looked further into the issue.
I first went through the OSD troubleshooting and then the PG troubleshooting documentation, and tracked the problem down to a PG issue.
It looks like pg 1.33 is running low on space and not continuing with the backfill. We have misplaced objects rather than missing objects, which is good; the cluster is still running during this process.
[root@osd1 ~]# ceph health detail
HEALTH_ERR 2 backfillfull osd(s); 2 pool(s) backfillfull; 70105/6685016 objects misplaced (1.049%); Degraded data redundancy (low space): 1 pg backfill_toofull
OSD_BACKFILLFULL 2 backfillfull osd(s)
    osd.8 is backfill full
    osd.9 is backfill full
POOL_BACKFILLFULL 2 pool(s) backfillfull
    pool 'cephfs_data' is backfillfull
    pool 'cephfs_metadata' is backfillfull
OBJECT_MISPLACED 70105/6685016 objects misplaced (1.049%)
PG_DEGRADED_FULL Degraded data redundancy (low space): 1 pg backfill_toofull
    pg 1.33 is active+remapped+backfill_toofull, acting [12,4]

[root@osd1 ~]# ceph osd status
+----+-------------------------+-------+-------+--------+---------+--------+---------+------------------------+
| id |           host          |  used | avail | wr ops | wr data | rd ops | rd data |         state          |
+----+-------------------------+-------+-------+--------+---------+--------+---------+------------------------+
| 0  | osd1.example.com        | 1741G | 1053G |    0   |     0   |    0   |     0   | exists,up              |
| 1  | osd2.example.com        | 1937G |  856G |    0   |     0   |    0   |     0   | exists,up              |
| 2  | osd3.example.com        | 2033G |  760G |    0   |     0   |    0   |     0   | exists,up              |
| 3  | osd4.example.com        | 2180G |  614G |    0   |     0   |    0   |     0   | exists,up              |
| 4  | osd1.example.com        | 1936G |  857G |    0   |     0   |    0   |     0   | exists,up              |
| 5  | osd1.example.com        | 1840G |  954G |    0   |     0   |    0   |     0   | exists,up              |
| 6  | osd2.example.com        |  485G |  446G |    0   |     0   |    0   |     0   | exists,up              |
| 7  | osd3.example.com        |  677G |  254G |    0   |     0   |    0   |     0   | exists,up              |
| 8  | osd3.example.com        |  869G | 61.7G |    0   |     0   |    0   |     0   | backfillfull,exists,up |
| 9  | osd4.example.com        |  867G | 64.3G |    0   |     0   |    0   |     0   | backfillfull,exists,up |
| 10 | osd4.example.com        |  194G |  737G |    0   |     0   |    0   |     0   | exists,up              |
| 11 | osd5.example.com        | 2806G | 2782G |    0   |     0   |    0   |     0   | exists,up              |
| 12 | osd5.example.com        | 1938G | 3650G |    0   |     0   |    0   |     0   | exists,up              |
| 13 | osd5.example.com        | 2901G | 2687G |    0   |     0   |    0   |     0   | exists,up              |
| 14 | osd5.example.com        | 2412G | 3067G |    0   |     0   |    0   |     0   | exists,up              |
+----+-------------------------+-------+-------+--------+---------+--------+---------+------------------------+
We can see that I now have two OSDs that are backfillfull, which isn't good, and that pg 1.33 seems to be the one giving us a problem.
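To dig into a specific placement group like this, it can also be queried directly, and per-OSD utilization can be checked to see which OSDs are sitting above the backfillfull ratio. A couple of read-only commands that help here (output omitted):

# Show the detailed state of the problem PG, including the OSDs it is
# trying to backfill to and why it is blocked
ceph pg 1.33 query

# Show per-OSD utilization, weight and PG count, arranged by CRUSH tree
ceph osd df tree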
After doing some additional research, I was able to determine that when I set up my Ceph cluster I only had <10 OSDs; now I'm running 16 OSDs. I had made a bad assumption that there was a single OSD per server, but in fact we have 4 drives in each server, which gives us 4 OSDs per physical server. Each OSD manages an individual storage device.
Based on the Ceph documentation, the calculation to determine the number of PGs you want in a pool is roughly (OSDs * 100) / Replicas. In my case I now have 16 OSDs and 2 copies of each object.
16 * 100 / 2 = 800
The number of PGs should be a power of 2, so the next power of 2 above 800 is 1024. I checked our pool's PG settings and attempted to make adjustments to see if that helps.
Remember: when making changes to pg_num, also increase pgp_num to match.
[root@osd1 ~]# ceph osd lspools
1 cephfs_data,2 cephfs_metadata,
[root@osd1 ~]# ceph osd pool get cephfs_data size
size: 2
[root@osd1 ~]# ceph osd pool get cephfs_data min_size
min_size: 1
[root@osd1 ~]# ceph osd pool get cephfs_data pg_num
pg_num: 128
[root@osd1 ~]# ceph osd pool get cephfs_data pgp_num
pgp_num: 128
We can see that when I created the pool I used the default of 128, not realizing that I was going to be adding OSDs over time and that it's recommended to adjust pg_num and pgp_num as the number of OSDs grows. So I attempted to increase pg_num from 128 to 1024.
[root@osd1 ~]# ceph osd pool set cephfs_data pg_num 1024
Error E2BIG: specified pg_num 1024 is too large (creating 920 new PGs on ~15 OSDs exceeds per-OSD max of 32)
I’m not able to make such a radical jump from 128 to 1024, so I did a smaller increase from 128 to 256.
[root@osd1 ~]# ceph osd pool set cephfs_data pg_num 256
set pool 1 pg_num to 256
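As noted above, pgp_num has to be raised to match the new pg_num, otherwise the newly split PGs stay where they are and the data is not actually re-placed. A minimal sketch of the matching change and a quick verification (same pool and value as above):

# Raise the placement number to match pg_num so the new PGs get re-placed
ceph osd pool set cephfs_data pgp_num 256

# Confirm the two values now agree
ceph osd pool get cephfs_data pg_num
ceph osd pool get cephfs_data pgp_num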
These changes have initiated a rebalance of the pool, and it will take some time for the cluster to recover. I'm going to wait for this to complete before making any further adjustments.
So you can see what my Ceph health check looks like, here is where we are now after making those changes.
[root@osd1 ~]# ceph -s
  cluster:
    id:     ffdb9e09-fdca-48bb-b7fb-cd17151d5c09
    health: HEALTH_ERR
            2 backfillfull osd(s)
            2 pool(s) backfillfull
            2830303/6685016 objects misplaced (42.338%)
            Degraded data redundancy: 2/6685016 objects degraded (0.000%), 1 pg degraded
            Degraded data redundancy (low space): 2 pgs backfill_toofull

  services:
    mon: 3 daemons, quorum osd1,osd2,osd3
    mgr: osd1(active), standbys: osd2
    mds: cephfs-2/2/2 up {0=osd1=up:active,1=osd2=up:active}, 1 up:standby
    osd: 15 osds: 15 up, 15 in; 130 remapped pgs

  data:
    pools:   2 pools, 384 pgs
    objects: 3264k objects, 12342 GB
    usage:   24915 GB used, 18756 GB / 43671 GB avail
    pgs:     2/6685016 objects degraded (0.000%)
             2830303/6685016 objects misplaced (42.338%)
             253 active+clean
             120 active+remapped+backfill_wait
             8   active+remapped+backfilling
             2   active+remapped+backfill_wait+backfill_toofull
             1   active+recovery_wait+degraded

  io:
    recovery: 95900 kB/s, 24 objects/s

[root@osd1 ~]# ceph health detail
HEALTH_ERR 2 backfillfull osd(s); 2 pool(s) backfillfull; 2792612/6685016 objects misplaced (41.774%); Degraded data redundancy: 2/6685016 objects degraded (0.000%), 1 pg degraded; Degraded data redundancy (low space): 2 pgs backfill_toofull
OSD_BACKFILLFULL 2 backfillfull osd(s)
    osd.8 is backfill full
    osd.9 is backfill full
POOL_BACKFILLFULL 2 pool(s) backfillfull
    pool 'cephfs_data' is backfillfull
    pool 'cephfs_metadata' is backfillfull
OBJECT_MISPLACED 2792612/6685016 objects misplaced (41.774%)
PG_DEGRADED Degraded data redundancy: 2/6685016 objects degraded (0.000%), 1 pg degraded
    pg 1.3a is active+recovery_wait+degraded, acting [11,2]
PG_DEGRADED_FULL Degraded data redundancy (low space): 2 pgs backfill_toofull
    pg 1.33 is active+remapped+backfill_wait+backfill_toofull, acting [12,4]
    pg 1.a6 is active+remapped+backfill_wait+backfill_toofull, acting [7,14]
Earlier, when I started, only pg 1.33 was showing backfill_toofull, and now we have both pg 1.33 and pg 1.a6 showing it. Let's wait for the dust to settle after our last change before making any more adjustments.
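As an aside, if backfill ever gets completely stuck because the target OSDs are over the backfillfull threshold, it is also possible to temporarily raise that threshold a little and lower it again once recovery completes. I did not need this here, but a cautious sketch, assuming the Luminous-style ratio commands, would be:

# Check the current ratios (backfillfull defaults to 0.90)
ceph osd dump | grep -i ratio

# Temporarily allow backfill onto slightly fuller OSDs; revert once recovery completes
ceph osd set-backfillfull-ratio 0.92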
The Recovery Process
After 24 hours it's looking good: our OSDs no longer show any backfillfull warnings, but the cluster is still going through the recovery process. We're down from 42% to 18% objects misplaced, so it looks like we're on the right path.
[root@osd1 ~]# ceph -s
  cluster:
    id:     ffdb9e09-fdca-48bb-b7fb-cd17151d5c09
    health: HEALTH_ERR
            1235611/6685016 objects misplaced (18.483%)
            Degraded data redundancy (low space): 5 pgs backfill_toofull

  services:
    mon: 3 daemons, quorum osd1,osd2,osd3
    mgr: osd1(active), standbys: osd2
    mds: cephfs-2/2/2 up {0=osd1=up:active,1=osd2=up:active}, 1 up:standby
    osd: 15 osds: 15 up, 15 in; 57 remapped pgs

  data:
    pools:   2 pools, 384 pgs
    objects: 3264k objects, 12342 GB
    usage:   25062 GB used, 18609 GB / 43671 GB avail
    pgs:     1235611/6685016 objects misplaced (18.483%)
             327 active+clean
             49  active+remapped+backfill_wait
             5   active+remapped+backfill_wait+backfill_toofull
             3   active+remapped+backfilling

  io:
    recovery: 38584 kB/s, 9 objects/s

[root@osd1 ~]# ceph -s
  cluster:
    id:     ffdb9e09-fdca-48bb-b7fb-cd17151d5c09
    health: HEALTH_ERR
            1235327/6685016 objects misplaced (18.479%)
            Degraded data redundancy (low space): 5 pgs backfill_toofull

  services:
    mon: 3 daemons, quorum osd1,osd2,osd3
    mgr: osd1(active), standbys: osd2
    mds: cephfs-2/2/2 up {0=osd1=up:active,1=osd2=up:active}, 1 up:standby
    osd: 15 osds: 15 up, 15 in; 57 remapped pgs

  data:
    pools:   2 pools, 384 pgs
    objects: 3264k objects, 12342 GB
    usage:   25063 GB used, 18608 GB / 43671 GB avail
    pgs:     1235327/6685016 objects misplaced (18.479%)
             327 active+clean
             49  active+remapped+backfill_wait
             5   active+remapped+backfill_wait+backfill_toofull
             3   active+remapped+backfilling

  io:
    recovery: 32430 kB/s, 8 objects/s

[root@osd1 ~]# ceph osd status
+----+-------------------------+-------+-------+--------+---------+--------+---------+-----------+
| id |           host          |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+----+-------------------------+-------+-------+--------+---------+--------+---------+-----------+
| 0  | osd1.example.com        | 1789G | 1004G |    0   |     0   |    0   |     0   | exists,up |
| 1  | osd2.example.com        | 2228G |  566G |    0   |     0   |    0   |     0   | exists,up |
| 2  | osd3.example.com        | 2270G |  524G |    0   |     0   |    0   |     0   | exists,up |
| 3  | osd4.example.com        | 2164G |  629G |    0   |     0   |    0   |     0   | exists,up |
| 4  | osd1.example.com        | 2069G |  725G |    0   |     0   |    0   |     0   | exists,up |
| 5  | osd1.example.com        | 1454G | 1339G |    0   |     0   |    0   |     0   | exists,up |
| 6  | osd2.example.com        |  485G |  446G |    0   |     0   |    0   |     0   | exists,up |
| 7  | osd3.example.com        |  437G |  494G |    0   |     0   |    0   |     0   | exists,up |
| 8  | osd3.example.com        |  627G |  303G |    0   |     0   |    0   |     0   | exists,up |
| 9  | osd4.example.com        |  771G |  159G |    0   |     0   |    0   |     0   | exists,up |
| 10 | osd4.example.com        |  339G |  591G |    0   |     0   |    0   |     0   | exists,up |
| 11 | osd5.example.com        | 2464G | 3124G |    0   |     0   |    0   |     0   | exists,up |
| 12 | osd5.example.com        | 2174G | 3414G |    0   |     0   |    0   |     0   | exists,up |
| 13 | osd5.example.com        | 3418G | 2170G |    0   |     0   |    0   |     0   | exists,up |
| 14 | osd5.example.com        | 2367G | 3112G |    0   |     0   |    0   |     0   | exists,up |
+----+-------------------------+-------+-------+--------+---------+--------+---------+-----------+
[root@osd1 ~]#
The recovery process is looking good. I’ll check back again tomorrow to make sure it’s finished and all of our alerts have cleared.
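While waiting, the progress can be followed without re-running commands by hand, for example:

# Stream the cluster status and log as recovery progresses
ceph -w

# Or re-run the status summary every 30 seconds
watch -n 30 ceph -s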
Once that is done, I'll make one more adjustment to pg_num to bring it up to the right level for the number of OSDs we have.
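For reference, a rough sketch of that remaining adjustment (not something I have run yet): keep doubling pg_num and pgp_num once the cluster is back to healthy, waiting for each round of backfill to finish, and staying within the per-OSD limit of 32 new PGs that the earlier E2BIG error enforced.

# Next step once the cluster is healthy again: 256 -> 512
ceph osd pool set cephfs_data pg_num 512
ceph osd pool set cephfs_data pgp_num 512

# After that backfill completes, repeat towards the 1024 target,
# subject to the same per-OSD split limit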