Rook Ceph Failed to complete rook-ceph-mon0: signal: aborted (core dumped)

I’ve got a Rook + Ceph installation running on our self-hosted Kubernetes environment, and after running for a few days to a week, two of the three mons stop working. They go into a CrashLoopBackOff, and I haven’t been able to recover them. When that happens, a number of pods that rely on Rook/Ceph also have issues. I’ve had to completely remove our Rook installation and start over fresh about three times now, so we needed to figure out what on earth was causing the problem.

We’re using the following versions:

Ceph 12.2.4
Rook v0.7.0-40.g284c1b3
CentOS 7.4
Kubernetes 1.9.6

The error I’m getting is:

failed to run mon. failed to start mon: Failed to complete rook-ceph-mon0: signal: aborted (core dumped)

I’ve spent hours trying to diagnose and fix the issue. The Rook website does document a procedure that sounds exactly like our situation, recovering from a failed mon, but I’ve tried it twice and neither attempt got me out of the mess.
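If you’re in a similar state and the rook-tools pod is still running, you can use it to confirm whether the mons have lost quorum and how many file systems exist. These are standard Ceph commands run through the toolbox; the `rook` namespace and `rook-tools` pod name are from our setup:

```shell
# Check overall cluster health and whether the mons still have quorum
kubectl -n rook exec -it rook-tools -- ceph status

# List the mons and their current quorum state
kubectl -n rook exec -it rook-tools -- ceph mon stat

# List the CephFS file systems; more than one entry here was our red flag
kubectl -n rook exec -it rook-tools -- ceph fs ls
```

If only one mon is still responding, `ceph status` will hang or report lost quorum, which matches what we saw before the crashes cascaded.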

I even jumped into the Rook Slack channel, and that’s where I finally found the cause of my issue. It comes down to the fact that CephFS is experimental. We had created multiple file systems for groups of pods that needed to share the same data. That sounded like a good way of sharing data, but it’s exactly what caused the issue.

According to the folks in the Slack channel, the multiple-filesystem implementation is experimental. I didn’t see that anywhere in the documentation; the docs mention kernel version requirements, but say nothing about avoiding multiple file systems. Based on that understanding, we started over again with a clean Rook installation and set up each of the different pods to use a different sub-directory on a single common file system.
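For reference, here’s a minimal sketch of what we moved to: a single shared Filesystem resource, with pods isolating their data via sub-directories instead of separate file systems. This follows the Rook v0.7 CRD format; the name `sharedfs` and the pool sizes are placeholders from our setup, not anything Rook requires:

```yaml
apiVersion: rook.io/v1alpha1
kind: Filesystem
metadata:
  name: sharedfs
  namespace: rook
spec:
  # One metadata pool and one data pool, replicated across 3 OSDs
  metadataPool:
    replicated:
      size: 3
  dataPools:
    - replicated:
        size: 3
  # A single active MDS with a standby, instead of one MDS pair per filesystem
  metadataServer:
    activeCount: 1
    activeStandby: true
```

Each consuming pod then mounts this one file system and points its `volumeMounts` at its own directory (for example with Kubernetes `subPath`), so groups of pods can still share data without needing separate CephFS instances.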

To determine whether you’re having a similar issue, here is what our pods looked like at the time. The issue always starts with two of the mons going into CrashLoopBackOff; once that started, other things would be affected, and we couldn’t find a way to recover.

$ kubectl get pods -n rook 
NAME                                      READY     STATUS             RESTARTS   AGE
rook-ceph-mds-myfs-5f74b67c6d-8cbrz      0/1       CrashLoopBackOff   85         9h
rook-ceph-mds-myfs-5f74b67c6d-pp9x9      0/1       CrashLoopBackOff   85         9h
rook-ceph-mds-fsvolume-65d5985578-5snlm   0/1       Error              86         9h
rook-ceph-mds-fsvolume-65d5985578-p65hq   1/1       Running            86         9h
rook-ceph-mgr0-cfccfd6b8-4gwhg            1/1       Running            0          9h
rook-ceph-mon0-jdgnw                      0/1       CrashLoopBackOff   11         32m
rook-ceph-mon1-j5vpm                      0/1       CrashLoopBackOff   9          22m
rook-ceph-mon2-z7pnd                      0/1       CrashLoopBackOff   10         30m
rook-ceph-osd-5fj7q                       1/1       Running            1          1d
rook-ceph-osd-kt5zb                       1/1       Running            2          1d
rook-ceph-osd-nqlp6                       1/1       Running            1          1d
rook-ceph-osd-wzzjm                       1/1       Running            0          1d
rook-tools                                1/1       Running            0          25m

Also, here’s the output of the logs. Note the last line, where the mon aborts with a core dump:

$ kubectl -n rook logs rook-ceph-mon0-8khz6
2018-04-17 04:35:17.285811 I | rook: starting Rook v0.7.0-40.g284c1b3 with arguments '/usr/local/bin/rook mon --config-dir=/var/lib/rook --name=rook-ceph-mon0 --port=6790 --fsid=d4f1a1ca-b919-4c5b-89f2-1aed3d913a97'
.
.
.
2018-04-17 04:35:27.536807 I | rook-ceph-mon0:     -8> 2018-04-17 04:35:27.510728 7f382bd67700  5 -- 10.107.181.203:6790/0 >> 10.111.191.96:6790/0 conn(0x55d39147a800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1518 cs=1 l=0). rx mon.2 seq 776240659 0x55d3913c4000 global_id  (34096) v1
2018-04-17 04:35:27.536836 I | rook-ceph-mon0:     -7> 2018-04-17 04:35:27.510857 7f382bd67700  5 -- 10.107.181.203:6790/0 >> 10.111.191.96:6790/0 conn(0x55d39147a800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1518 cs=1 l=0). rx mon.2 seq 776240660 0x55d3913c4200 global_id  (34096) v1
2018-04-17 04:35:27.536855 I | rook-ceph-mon0:     -6> 2018-04-17 04:35:27.510920 7f382bd67700  5 -- 10.107.181.203:6790/0 >> 10.111.191.96:6790/0 conn(0x55d39147a800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1518 cs=1 l=0). rx mon.2 seq 776240661 0x55d3913c4400 global_id  (34096) v1
2018-04-17 04:35:27.536879 I | rook-ceph-mon0:     -5> 2018-04-17 04:35:27.510987 7f382bd67700  5 -- 10.107.181.203:6790/0 >> 10.111.191.96:6790/0 conn(0x55d39147a800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1518 cs=1 l=0). rx mon.2 seq 776240662 0x55d3913c4600 global_id  (34096) v1
2018-04-17 04:35:27.536905 I | rook-ceph-mon0:     -4> 2018-04-17 04:35:27.511058 7f382bd67700  5 -- 10.107.181.203:6790/0 >> 10.111.191.96:6790/0 conn(0x55d39147a800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1518 cs=1 l=0). rx mon.2 seq 776240663 0x55d3913c4800 global_id  (34096) v1
2018-04-17 04:35:27.536923 I | rook-ceph-mon0:     -3> 2018-04-17 04:35:27.511108 7f382bd67700  5 -- 10.107.181.203:6790/0 >> 10.111.191.96:6790/0 conn(0x55d39147a800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1518 cs=1 l=0). rx mon.2 seq 776240664 0x55d3913c4a00 global_id  (34096) v1
2018-04-17 04:35:27.536941 I | rook-ceph-mon0:     -2> 2018-04-17 04:35:27.511235 7f382bd67700  5 -- 10.107.181.203:6790/0 >> 10.111.191.96:6790/0 conn(0x55d39147a800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1518 cs=1 l=0). rx mon.2 seq 776240665 0x55d3913c4c00 global_id  (34096) v1
2018-04-17 04:35:27.536997 I | rook-ceph-mon0:     -1> 2018-04-17 04:35:27.511298 7f382bd67700  5 -- 10.107.181.203:6790/0 >> 10.111.191.96:6790/0 conn(0x55d39147a800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1518 cs=1 l=0). rx mon.2 seq 776240666 0x55d3913c4e00 global_id  (34096) v1
2018-04-17 04:35:27.537028 I | rook-ceph-mon0:      0> 2018-04-17 04:35:27.518367 7f382fd6f700 -1 /build/ceph-12.2.4/src/mds/FSMap.cc: In function 'void FSMap::assign_standby_replay(mds_gid_t, fs_cluster_id_t, mds_rank_t)' thread 7f382fd6f700 time 2018-04-17 04:35:27.510001
2018-04-17 04:35:27.537047 I | rook-ceph-mon0: /build/ceph-12.2.4/src/mds/FSMap.cc: 876: FAILED assert(mds_roles.at(standby_gid) == FS_CLUSTER_ID_NONE)
2018-04-17 04:35:27.537070 I | rook-ceph-mon0: 
2018-04-17 04:35:27.537096 I | rook-ceph-mon0:  ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
2018-04-17 04:35:27.537122 I | rook-ceph-mon0:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55d387221672]
2018-04-17 04:35:27.537146 I | rook-ceph-mon0:  2: (FSMap::assign_standby_replay(mds_gid_t, int, int)+0x457) [0x55d3874b9eb7]
2018-04-17 04:35:27.537180 I | rook-ceph-mon0:  3: (MDSMonitor::try_standby_replay(MDSMap::mds_info_t const&, Filesystem const&, MDSMap::mds_info_t const&)+0x222) [0x55d38719bfe2]
2018-04-17 04:35:27.537198 I | rook-ceph-mon0:  4: (MDSMonitor::maybe_promote_standby(std::shared_ptr)+0xc7c) [0x55d3871a004c]
2018-04-17 04:35:27.537226 I | rook-ceph-mon0:  5: (MDSMonitor::tick()+0x8ea) [0x55d3871a6c8a]
2018-04-17 04:35:27.537244 I | rook-ceph-mon0:  6: (MDSMonitor::on_active()+0x28) [0x55d38719bc88]
2018-04-17 04:35:27.537287 I | rook-ceph-mon0:  7: (PaxosService::_active()+0x40a) [0x55d3870fb77a]
2018-04-17 04:35:27.537306 I | rook-ceph-mon0:  8: (Context::complete(int)+0x9) [0x55d386fd1629]
2018-04-17 04:35:27.537325 I | rook-ceph-mon0:  9: (void finish_contexts(CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0x20b) [0x55d386fdb01b]
2018-04-17 04:35:27.537345 I | rook-ceph-mon0:  10: (Paxos::finish_round()+0x188) [0x55d3870f3358]
2018-04-17 04:35:27.537364 I | rook-ceph-mon0:  11: (Paxos::handle_last(boost::intrusive_ptr)+0xf9d) [0x55d3870f486d]
2018-04-17 04:35:27.537382 I | rook-ceph-mon0:  12: (Paxos::dispatch(boost::intrusive_ptr)+0x263) [0x55d3870f51c3]
2018-04-17 04:35:27.537405 I | rook-ceph-mon0:  13: (Monitor::dispatch_op(boost::intrusive_ptr)+0xefe) [0x55d386fc72ce]
2018-04-17 04:35:27.537430 I | rook-ceph-mon0:  14: (Monitor::_ms_dispatch(Message*)+0x6db) [0x55d386fc7e5b]
2018-04-17 04:35:27.537456 I | rook-ceph-mon0:  15: (Monitor::ms_dispatch(Message*)+0x23) [0x55d386ff7d93]
2018-04-17 04:35:27.537480 I | rook-ceph-mon0:  16: (DispatchQueue::entry()+0xf4a) [0x55d38752282a]
2018-04-17 04:35:27.537506 I | rook-ceph-mon0:  17: (DispatchQueue::DispatchThread::entry()+0xd) [0x55d3872d1a8d]
2018-04-17 04:35:27.537530 I | rook-ceph-mon0:  18: (()+0x76ba) [0x7f3837abc6ba]
2018-04-17 04:35:27.537548 I | rook-ceph-mon0:  19: (clone()+0x6d) [0x7f38362e641d]
2018-04-17 04:35:27.537576 I | rook-ceph-mon0:  NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.
2018-04-17 04:35:27.537595 I | rook-ceph-mon0: 
2018-04-17 04:35:27.537619 I | rook-ceph-mon0: --- logging levels ---
2018-04-17 04:35:27.537644 I | rook-ceph-mon0:    0/ 5 none
2018-04-17 04:35:27.537669 I | rook-ceph-mon0:    0/ 1 lockdep
2018-04-17 04:35:27.537726 I | rook-ceph-mon0:    0/ 1 context
2018-04-17 04:35:27.537746 I | rook-ceph-mon0:    1/ 1 crush
2018-04-17 04:35:27.537769 I | rook-ceph-mon0:    1/ 5 mds
2018-04-17 04:35:27.537793 I | rook-ceph-mon0:    1/ 5 mds_balancer
2018-04-17 04:35:27.537820 I | rook-ceph-mon0:    1/ 5 mds_locker
2018-04-17 04:35:27.537844 I | rook-ceph-mon0:    1/ 5 mds_log
2018-04-17 04:35:27.537869 I | rook-ceph-mon0:    1/ 5 mds_log_expire
2018-04-17 04:35:27.537894 I | rook-ceph-mon0:    1/ 5 mds_migrator
2018-04-17 04:35:27.537919 I | rook-ceph-mon0:    0/ 1 buffer
2018-04-17 04:35:27.537945 I | rook-ceph-mon0:    0/ 1 timer
2018-04-17 04:35:27.537969 I | rook-ceph-mon0:    0/ 1 filer
2018-04-17 04:35:27.537993 I | rook-ceph-mon0:    0/ 1 striper
2018-04-17 04:35:27.538019 I | rook-ceph-mon0:    0/ 1 objecter
2018-04-17 04:35:27.538038 I | rook-ceph-mon0:    0/ 0 rados
2018-04-17 04:35:27.538053 I | rook-ceph-mon0:    0/ 5 rbd
2018-04-17 04:35:27.538071 I | rook-ceph-mon0:    0/ 5 rbd_mirror
2018-04-17 04:35:27.538088 I | rook-ceph-mon0:    0/ 5 rbd_replay
2018-04-17 04:35:27.538105 I | rook-ceph-mon0:    0/ 5 journaler
2018-04-17 04:35:27.538122 I | rook-ceph-mon0:    0/ 5 objectcacher
2018-04-17 04:35:27.538139 I | rook-ceph-mon0:    0/ 5 client
2018-04-17 04:35:27.538157 I | rook-ceph-mon0:    0/ 0 osd
2018-04-17 04:35:27.538176 I | rook-ceph-mon0:    0/ 5 optracker
2018-04-17 04:35:27.538194 I | rook-ceph-mon0:    0/ 5 objclass
2018-04-17 04:35:27.538213 I | rook-ceph-mon0:    0/ 0 filestore
2018-04-17 04:35:27.538231 I | rook-ceph-mon0:    0/ 0 journal
2018-04-17 04:35:27.538248 I | rook-ceph-mon0:    0/ 5 ms
2018-04-17 04:35:27.538272 I | rook-ceph-mon0:    0/ 0 mon
2018-04-17 04:35:27.538290 I | rook-ceph-mon0:    0/10 monc
2018-04-17 04:35:27.538317 I | rook-ceph-mon0:    1/ 5 paxos
2018-04-17 04:35:27.538336 I | rook-ceph-mon0:    0/ 5 tp
2018-04-17 04:35:27.538362 I | rook-ceph-mon0:    1/ 5 auth
2018-04-17 04:35:27.538387 I | rook-ceph-mon0:    1/ 5 crypto
2018-04-17 04:35:27.538412 I | rook-ceph-mon0:    1/ 1 finisher
2018-04-17 04:35:27.538438 I | rook-ceph-mon0:    1/ 1 reserver
2018-04-17 04:35:27.538462 I | rook-ceph-mon0:    1/ 5 heartbeatmap
2018-04-17 04:35:27.538488 I | rook-ceph-mon0:    1/ 5 perfcounter
2018-04-17 04:35:27.538513 I | rook-ceph-mon0:    1/ 5 rgw
2018-04-17 04:35:27.538538 I | rook-ceph-mon0:    1/10 civetweb
2018-04-17 04:35:27.538563 I | rook-ceph-mon0:    1/ 5 javaclient
2018-04-17 04:35:27.538588 I | rook-ceph-mon0:    1/ 5 asok
2018-04-17 04:35:27.538613 I | rook-ceph-mon0:    1/ 1 throttle
2018-04-17 04:35:27.538638 I | rook-ceph-mon0:    0/ 0 refs
2018-04-17 04:35:27.538663 I | rook-ceph-mon0:    1/ 5 xio
2018-04-17 04:35:27.538712 I | rook-ceph-mon0:    1/ 5 compressor
2018-04-17 04:35:27.538733 I | rook-ceph-mon0:    0/ 0 bluestore
2018-04-17 04:35:27.538750 I | rook-ceph-mon0:    1/ 5 bluefs
2018-04-17 04:35:27.538777 I | rook-ceph-mon0:    1/ 3 bdev
2018-04-17 04:35:27.538796 I | rook-ceph-mon0:    1/ 5 kstore
2018-04-17 04:35:27.538825 I | rook-ceph-mon0:    4/ 5 rocksdb
2018-04-17 04:35:27.538843 I | rook-ceph-mon0:    0/ 0 leveldb
2018-04-17 04:35:27.538860 I | rook-ceph-mon0:    4/ 5 memdb
2018-04-17 04:35:27.538877 I | rook-ceph-mon0:    1/ 5 kinetic
2018-04-17 04:35:27.538893 I | rook-ceph-mon0:    1/ 5 fuse
2018-04-17 04:35:27.538910 I | rook-ceph-mon0:    1/ 5 mgr
2018-04-17 04:35:27.538927 I | rook-ceph-mon0:    1/ 5 mgrc
2018-04-17 04:35:27.538944 I | rook-ceph-mon0:    1/ 5 dpdk
2018-04-17 04:35:27.538963 I | rook-ceph-mon0:    1/ 5 eventtrace
2018-04-17 04:35:27.539006 I | rook-ceph-mon0:   -2/-2 (syslog threshold)
2018-04-17 04:35:27.539025 I | rook-ceph-mon0:   -1/-1 (stderr threshold)
2018-04-17 04:35:27.539042 I | rook-ceph-mon0:   max_recent     10000
2018-04-17 04:35:27.539072 I | rook-ceph-mon0:   max_new         1000
2018-04-17 04:35:27.539091 I | rook-ceph-mon0:   log_file /dev/stdout
2018-04-17 04:35:27.539119 I | rook-ceph-mon0: --- end dump of recent events ---
[... the same assertion failure and backtrace are printed three more times ...]
failed to run mon. failed to start mon: Failed to complete rook-ceph-mon0: signal: aborted (core dumped)

Since that change, we’ve been running for almost a week without any problems.