How to resolve a Ceph pg stuck in active+remapped+backfill_toofull

Ceph Storage Cluster

Ceph is a clustered storage solution that can use any number of commodity servers and hard drives. These can then be made available as object, block, or file system storage through a unified interface to your applications or servers. The data can be replicated based on your needs, so any single disk or server failure does not affect your data or the availability of the storage cluster.

Checking the Cluster

We monitor our Ceph cluster's health using Nagios with Ceph plugins, and recently received an alert that needed to be resolved.

This issue started when we added an additional 6TB drive to the cluster. As the cluster was backfilling (redistributing) data, we got an alert: a health warning of active+remapped+backfill_toofull. This is the process I went through to resolve it.

First, I Tried Reweighting the OSDs

I previously had a similar issue where an OSD was nearfull, and I ran a reweight to help resolve that issue:

ceph osd reweight-by-utilization
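
As a side note, if you want to preview what a reweight would do before committing to it, recent Ceph releases include a dry-run variant that reports the proposed weight changes without applying them; it accepts the same optional parameters (oload, max_change, max_osds) as the real command:

ceph osd test-reweight-by-utilization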

This is what the cluster looked like before starting the reweight process.

[root@osd1 ~]# ceph -s 
  cluster:
    id:     ffdb9e09-fdca-48bb-b7fb-cd17151d5c09
    health: HEALTH_ERR
            1 backfillfull osd(s)
            2 pool(s) backfillfull
            26199/6685016 objects misplaced (0.392%)
            Degraded data redundancy (low space): 1 pg backfill_toofull
 
  services:
    mon: 3 daemons, quorum osd1,osd2,osd3
    mgr: osd1(active), standbys: osd2
    mds: cephfs-2/2/2 up  {0=osd1=up:active,1=osd2=up:active}, 1 up:standby
    osd: 15 osds: 15 up, 15 in; 1 remapped pgs
 
  data:
    pools:   2 pools, 256 pgs
    objects: 3264k objects, 12342 GB
    usage:   24773 GB used, 18898 GB / 43671 GB avail
    pgs:     26199/6685016 objects misplaced (0.392%)
             255 active+clean
             1   active+remapped+backfill_toofull
 
[root@osd1 ~]# ceph osd status
+----+-------------------------+-------+-------+--------+---------+--------+---------+------------------------+
| id |           host          |  used | avail | wr ops | wr data | rd ops | rd data |         state          |
+----+-------------------------+-------+-------+--------+---------+--------+---------+------------------------+
| 0  | osd1.example.com | 1741G | 1053G |    0   |     0   |    0   |     0   |       exists,up        |
| 1  | osd2.example.com | 2034G |  760G |    0   |     0   |    0   |     0   |       exists,up        |
| 2  | osd3.example.com | 1937G |  857G |    0   |     0   |    0   |     0   |       exists,up        |
| 3  | osd4.example.com | 2031G |  763G |    0   |     0   |    0   |     0   |       exists,up        |
| 4  | osd1.example.com | 2032G |  761G |    0   |     0   |    0   |     0   |       exists,up        |
| 5  | osd1.example.com | 2033G |  761G |    0   |     0   |    0   |     0   |       exists,up        |
| 6  | osd2.example.com |  485G |  446G |    0   |     0   |    0   |     0   |       exists,up        |
| 7  | osd3.example.com |  677G |  254G |    0   |     0   |    0   |     0   |       exists,up        |
| 8  | osd3.example.com |  869G | 61.7G |    0   |     0   |    0   |     0   | backfillfull,exists,up |
| 9  | osd4.example.com |  676G |  255G |    0   |     0   |    0   |     0   |       exists,up        |
| 10 | osd4.example.com |  194G |  736G |    0   |     0   |    0   |     0   |       exists,up        |
| 11 | osd5.example.com | 2806G | 2782G |    0   |     0   |    0   |     0   |       exists,up        |
| 12 | osd5.example.com | 1938G | 3650G |    0   |     0   |    0   |     0   |       exists,up        |
| 13 | osd5.example.com | 2901G | 2687G |    0   |     0   |    0   |     0   |       exists,up        |
| 14 | osd5.example.com | 2412G | 3067G |    0   |     0   |    0   |     0   |       exists,up        |
+----+-------------------------+-------+-------+--------+---------+--------+---------+------------------------+
[root@osd1 ~]# ceph osd reweight-by-utilization
moved 9 / 512 (1.75781%)
avg 34.1333
stddev 16.7087 -> 16.5484 (expected baseline 5.64427)
min osd.6 with 8 -> 8 pgs (0.234375 -> 0.234375 * mean)
max osd.13 with 60 -> 60 pgs (1.75781 -> 1.75781 * mean)

oload 120
max_change 0.05
max_change_osds 4
average_utilization 0.5673
overload_utilization 0.6807
osd.8 weight 0.6501 -> 0.6001
osd.1 weight 0.7501 -> 0.7001
osd.5 weight 0.8852 -> 0.8353
osd.4 weight 0.9500 -> 0.9000

This process can take a while to run, depending on the size of your cluster and your configuration.

For me it took about 24 hours to complete, and it didn't resolve my issue, so I attempted another reweight. After another 24 hours I now had two OSDs with a status of backfillfull, so I obviously needed to find another way to get this resolved.
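
Before moving on, it is worth getting a per-OSD view of how the data is spread. The command below shows size, usage, weight, reweight, and PG count for every OSD, which makes it easy to spot which ones are approaching their backfillfull ratio:

ceph osd df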

Second, I Tried Increasing the PG Count

I did some additional checking and looked further into the issue.

I first went through the OSD troubleshooting documentation and then the PG troubleshooting documentation, and tracked the problem down to a PG issue.

It looks like pg 1.33 is running low on space and is not continuing with the backfill. We have misplaced objects rather than missing objects, which is good; the cluster stays up and running during this process.
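
The low-space thresholds behind these warnings are cluster-wide ratios stored in the OSD map, so it is worth confirming what they are currently set to; the grep pattern here is just a convenience:

ceph osd dump | grep -i ratio

Raising backfillfull_ratio (with ceph osd set-backfillfull-ratio) is possible, but it only buys time; the real fix is a better data distribution.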

[root@osd1 ~]# ceph health detail
HEALTH_ERR 2 backfillfull osd(s); 2 pool(s) backfillfull; 70105/6685016 objects misplaced (1.049%); Degraded data redundancy (low space): 1 pg backfill_toofull
OSD_BACKFILLFULL 2 backfillfull osd(s)
    osd.8 is backfill full
    osd.9 is backfill full
POOL_BACKFILLFULL 2 pool(s) backfillfull
    pool 'cephfs_data' is backfillfull
    pool 'cephfs_metadata' is backfillfull
OBJECT_MISPLACED 70105/6685016 objects misplaced (1.049%)
PG_DEGRADED_FULL Degraded data redundancy (low space): 1 pg backfill_toofull
    pg 1.33 is active+remapped+backfill_toofull, acting [12,4]
[root@osd1 ~]# ceph osd status
+----+-------------------------+-------+-------+--------+---------+--------+---------+------------------------+
| id |           host          |  used | avail | wr ops | wr data | rd ops | rd data |         state          |
+----+-------------------------+-------+-------+--------+---------+--------+---------+------------------------+
| 0  | osd1.example.com | 1741G | 1053G |    0   |     0   |    0   |     0   |       exists,up        |
| 1  | osd2.example.com | 1937G |  856G |    0   |     0   |    0   |     0   |       exists,up        |
| 2  | osd3.example.com | 2033G |  760G |    0   |     0   |    0   |     0   |       exists,up        |
| 3  | osd4.example.com | 2180G |  614G |    0   |     0   |    0   |     0   |       exists,up        |
| 4  | osd1.example.com | 1936G |  857G |    0   |     0   |    0   |     0   |       exists,up        |
| 5  | osd1.example.com | 1840G |  954G |    0   |     0   |    0   |     0   |       exists,up        |
| 6  | osd2.example.com |  485G |  446G |    0   |     0   |    0   |     0   |       exists,up        |
| 7  | osd3.example.com |  677G |  254G |    0   |     0   |    0   |     0   |       exists,up        |
| 8  | osd3.example.com |  869G | 61.7G |    0   |     0   |    0   |     0   | backfillfull,exists,up |
| 9  | osd4.example.com |  867G | 64.3G |    0   |     0   |    0   |     0   | backfillfull,exists,up |
| 10 | osd4.example.com |  194G |  737G |    0   |     0   |    0   |     0   |       exists,up        |
| 11 | osd5.example.com | 2806G | 2782G |    0   |     0   |    0   |     0   |       exists,up        |
| 12 | osd5.example.com | 1938G | 3650G |    0   |     0   |    0   |     0   |       exists,up        |
| 13 | osd5.example.com | 2901G | 2687G |    0   |     0   |    0   |     0   |       exists,up        |
| 14 | osd5.example.com | 2412G | 3067G |    0   |     0   |    0   |     0   |       exists,up        |
+----+-------------------------+-------+-------+--------+---------+--------+---------+------------------------+

We can see that I now have 2 OSDs that are backfillfull, which isn't good, and that pg 1.33 seems to be the one giving us a problem.
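
To dig deeper into a single problem PG, you can query it directly. For pg 1.33 from the health output above, the query returns its current state, up/acting sets, and recovery details:

ceph pg 1.33 query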

After doing some additional research, I determined that when I set up my Ceph cluster I had fewer than 10 OSDs, whereas the cluster is now running 15 OSDs. I had made the bad assumption that there was a single OSD per server, but in fact each server holds multiple drives, and since each OSD manages an individual storage device, each drive is its own OSD.
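
An easy way to see how the OSDs actually map onto hosts is the CRUSH tree view, which lists each host with the OSDs and weights underneath it:

ceph osd tree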

Based on the Ceph documentation, the rough calculation for the number of PGs a pool should have is (OSDs * 100) / Replicas. In my case I now have 15 OSDs and 2 copies of each object:

15 * 100 / 2 = 750

The PG count should be a power of 2, so rounding up gives 1024. So I checked our pool's current PG settings and attempted to make adjustments to see if that helps.

Remember that when you increase pg_num you also need to increase pgp_num, otherwise the new placement groups will not actually be rebalanced across the cluster.
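
For reference, pgp_num is changed with the same pool set command; after a pg_num increase you would follow up with something along these lines, using whatever new value you set for pg_num:

ceph osd pool set cephfs_data pgp_num <new_pg_num>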

[root@osd1 ~]# ceph osd lspools
1 cephfs_data,2 cephfs_metadata,
[root@osd1 ~]# ceph osd pool get cephfs_data size
size: 2
[root@osd1 ~]# ceph osd pool get cephfs_data min_size
min_size: 1
[root@osd1 ~]# ceph osd pool get cephfs_data pg_num
pg_num: 128
[root@osd1 ~]# ceph osd pool get cephfs_data pgp_num
pgp_num: 128

We can see that when I created the pool I used the default of 128, not realizing that I would be adding OSDs over time and that it's recommended to adjust pg_num and pgp_num as the number of OSDs grows. So I attempted to increase pg_num from 128 to 1024.

[root@osd1 ~]# ceph osd pool set cephfs_data pg_num 1024
Error E2BIG: specified pg_num 1024 is too large (creating 920 new PGs on ~15 OSDs exceeds per-OSD max of 32)

I’m not able to make such a radical jump from 128 to 1024, so I did a smaller increase from 128 to 256.

[root@osd1 ~]# ceph osd pool set cephfs_data pg_num 256
set pool 1 pg_num to 256

This initiated the changes to the pool, and it will take some time for the cluster to recover. I'm going to wait for that to complete before making any further changes.
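
To keep an eye on recovery without re-running commands by hand, something as simple as the following works; the 60-second interval is an arbitrary choice, and ceph -w will also stream cluster status changes as they happen:

watch -n 60 ceph -s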

So you can see what my Ceph health check looks like, here is where we are now after making those changes.

[root@osd1 ~]# ceph -s 
  cluster:
    id:     ffdb9e09-fdca-48bb-b7fb-cd17151d5c09
    health: HEALTH_ERR
            2 backfillfull osd(s)
            2 pool(s) backfillfull
            2830303/6685016 objects misplaced (42.338%)
            Degraded data redundancy: 2/6685016 objects degraded (0.000%), 1 pg degraded
            Degraded data redundancy (low space): 2 pgs backfill_toofull
 
  services:
    mon: 3 daemons, quorum osd1,osd2,osd3
    mgr: osd1(active), standbys: osd2
    mds: cephfs-2/2/2 up  {0=osd1=up:active,1=osd2=up:active}, 1 up:standby
    osd: 15 osds: 15 up, 15 in; 130 remapped pgs
 
  data:
    pools:   2 pools, 384 pgs
    objects: 3264k objects, 12342 GB
    usage:   24915 GB used, 18756 GB / 43671 GB avail
    pgs:     2/6685016 objects degraded (0.000%)
             2830303/6685016 objects misplaced (42.338%)
             253 active+clean
             120 active+remapped+backfill_wait
             8   active+remapped+backfilling
             2   active+remapped+backfill_wait+backfill_toofull
             1   active+recovery_wait+degraded
 
  io:
    recovery: 95900 kB/s, 24 objects/s
 
[root@osd1 ~]# ceph health detail
HEALTH_ERR 2 backfillfull osd(s); 2 pool(s) backfillfull; 2792612/6685016 objects misplaced (41.774%); Degraded data redundancy: 2/6685016 objects degraded (0.000%), 1 pg degraded; Degraded data redundancy (low space): 2 pgs backfill_toofull
OSD_BACKFILLFULL 2 backfillfull osd(s)
    osd.8 is backfill full
    osd.9 is backfill full
POOL_BACKFILLFULL 2 pool(s) backfillfull
    pool 'cephfs_data' is backfillfull
    pool 'cephfs_metadata' is backfillfull
OBJECT_MISPLACED 2792612/6685016 objects misplaced (41.774%)
PG_DEGRADED Degraded data redundancy: 2/6685016 objects degraded (0.000%), 1 pg degraded
    pg 1.3a is active+recovery_wait+degraded, acting [11,2]
PG_DEGRADED_FULL Degraded data redundancy (low space): 2 pgs backfill_toofull
    pg 1.33 is active+remapped+backfill_wait+backfill_toofull, acting [12,4]
    pg 1.a6 is active+remapped+backfill_wait+backfill_toofull, acting [7,14]

Earlier, when I started, only pg 1.33 was showing backfill_toofull; now pg 1.33 and pg 1.a6 are both showing it. Let's wait for the dust to settle after our last change before making any more adjustments.

The Recovery Process

After 24 hours it's looking better, though the cluster is still working through the recovery process. We're down from 42% to 18% objects misplaced, and our OSDs no longer report backfillfull, so it looks like we're on the right path.

[root@osd1 ~]# ceph -s 
  cluster:
    id:     ffdb9e09-fdca-48bb-b7fb-cd17151d5c09
    health: HEALTH_ERR
            1235611/6685016 objects misplaced (18.483%)
            Degraded data redundancy (low space): 5 pgs backfill_toofull
 
  services:
    mon: 3 daemons, quorum osd1,osd2,osd3
    mgr: osd1(active), standbys: osd2
    mds: cephfs-2/2/2 up  {0=osd1=up:active,1=osd2=up:active}, 1 up:standby
    osd: 15 osds: 15 up, 15 in; 57 remapped pgs
 
  data:
    pools:   2 pools, 384 pgs
    objects: 3264k objects, 12342 GB
    usage:   25062 GB used, 18609 GB / 43671 GB avail
    pgs:     1235611/6685016 objects misplaced (18.483%)
             327 active+clean
             49  active+remapped+backfill_wait
             5   active+remapped+backfill_wait+backfill_toofull
             3   active+remapped+backfilling
 
  io:
    recovery: 38584 kB/s, 9 objects/s
 
[root@osd1 ~]# ceph -s 
  cluster:
    id:     ffdb9e09-fdca-48bb-b7fb-cd17151d5c09
    health: HEALTH_ERR
            1235327/6685016 objects misplaced (18.479%)
            Degraded data redundancy (low space): 5 pgs backfill_toofull
 
  services:
    mon: 3 daemons, quorum osd1,osd2,osd3
    mgr: osd1(active), standbys: osd2
    mds: cephfs-2/2/2 up  {0=osd1=up:active,1=osd2=up:active}, 1 up:standby
    osd: 15 osds: 15 up, 15 in; 57 remapped pgs
 
  data:
    pools:   2 pools, 384 pgs
    objects: 3264k objects, 12342 GB
    usage:   25063 GB used, 18608 GB / 43671 GB avail
    pgs:     1235327/6685016 objects misplaced (18.479%)
             327 active+clean
             49  active+remapped+backfill_wait
             5   active+remapped+backfill_wait+backfill_toofull
             3   active+remapped+backfilling
 
  io:
    recovery: 32430 kB/s, 8 objects/s
 
[root@osd1 ~]# ceph osd status
+----+-------------------------+-------+-------+--------+---------+--------+---------+-----------+
| id |           host          |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+----+-------------------------+-------+-------+--------+---------+--------+---------+-----------+
| 0  | osd1.example.com | 1789G | 1004G |    0   |     0   |    0   |     0   | exists,up |
| 1  | osd2.example.com | 2228G |  566G |    0   |     0   |    0   |     0   | exists,up |
| 2  | osd3.example.com | 2270G |  524G |    0   |     0   |    0   |     0   | exists,up |
| 3  | osd4.example.com | 2164G |  629G |    0   |     0   |    0   |     0   | exists,up |
| 4  | osd1.example.com | 2069G |  725G |    0   |     0   |    0   |     0   | exists,up |
| 5  | osd1.example.com | 1454G | 1339G |    0   |     0   |    0   |     0   | exists,up |
| 6  | osd2.example.com |  485G |  446G |    0   |     0   |    0   |     0   | exists,up |
| 7  | osd3.example.com |  437G |  494G |    0   |     0   |    0   |     0   | exists,up |
| 8  | osd3.example.com |  627G |  303G |    0   |     0   |    0   |     0   | exists,up |
| 9  | osd4.example.com |  771G |  159G |    0   |     0   |    0   |     0   | exists,up |
| 10 | osd4.example.com |  339G |  591G |    0   |     0   |    0   |     0   | exists,up |
| 11 | osd5.example.com | 2464G | 3124G |    0   |     0   |    0   |     0   | exists,up |
| 12 | osd5.example.com | 2174G | 3414G |    0   |     0   |    0   |     0   | exists,up |
| 13 | osd5.example.com | 3418G | 2170G |    0   |     0   |    0   |     0   | exists,up |
| 14 | osd5.example.com | 2367G | 3112G |    0   |     0   |    0   |     0   | exists,up |
+----+-------------------------+-------+-------+--------+---------+--------+---------+-----------+
[root@osd1 ~]# 

The recovery process is looking good.  I’ll check back again tomorrow to make sure it’s finished and all of our alerts have cleared.

Once that is done, I'll make one more adjustment to pg_num (and pgp_num) to bring it up to the right level for the number of OSDs we have.
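
The plan for that final step, sketched below, is to keep doubling pg_num and pgp_num and to wait for the cluster to return to active+clean between each step; the 512 shown here is simply the next doubling and is not something I have run yet:

ceph osd pool set cephfs_data pg_num 512
ceph osd pool set cephfs_data pgp_num 512

Repeating that pattern, while staying under the per-OSD split limit we hit earlier, should eventually get the pool to the 1024 target.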