Context
I help operate two geographically separate OpenStack/Ceph clusters
consisting of 32 servers each, of which 16 (per cluster) are dedicated
OSD servers. Each OSD server currently has six OSDs. Each OSD runs
on a dedicated 4TB SAS disk. Each server also has SSDs, which are
mostly used for OSD write journals.
We monitor our systems using the venerable Nagios.
My colleague Alessandro has written many checks specific to
infrastructure services such as Ceph. Some of them periodically
scan the logs for messages that might not be harmless. It can be
interesting to understand such messages and get down to their root
cause. Here's one from early this morning (edited for readability):
monitor writes:
> Service: L_check_logfiles
> State: WARNING
> Output: WARN - WARNING - (30 warnings in check_logfiles.protocol-2015-02-06-03-35-35) - File=/var/log/ceph/ceph.log Message=2015-02-06 03:35:26.877633 osd.1 [2001:db8:625:ca1e:100::1021]:6800/219476 2257 : [WRN] slow request 30.185039 seconds old, received at 2015-02-06 03:34:56.692108: osd_op(client.1932427.0:27645811 rbd_data.1fd765491f48ea.00000000000000a9 [stat,set-alloc-hint object_size 8388608 write_size 8388608,write 52720648192] 5.4220493a ack+ondisk+write e9526) v4 currently no flag points reached ...
The "Output" line tells us that a write operation to osd.1 was stuck in a queue for 30+ seconds around 03:35. Why did that happen in the middle of the night, when utilization is low?
Graphite to the Rescue
We recently set up a CollectD/Graphite monitoring
infrastructure for each site. It collects data at ten-second
intervals. The ten-second samples are only retained for three
hours, then aggregated into one-minute samples that are retained
for 24 hours, and so on. Because this event happened at night, the
fine-grained samples had already expired by the time I looked, so
all the graphs shown here have one-minute temporal resolution.
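In Graphite terms, that retention policy corresponds to a storage-schemas.conf entry roughly like the following (a sketch: the section name, the pattern, and any retention levels beyond the first two are assumptions, since the "and so on" isn't spelled out here):

[collectd]
pattern = ^collectd\.
retentions = 10s:3h,1m:24h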
The event is visible from the "#Ceph" dashboard on our Graphite installation. I extracted a few graphs from it and focused them on the event in question.
Here is a graph that shows block I/O (actually just "I") operations per second of all OSD processes summed up for each OSD server:
These patterns (a few OSD servers reading heavily) indicate
"scrubbing" activity (scrubbing checks existing data for
correctness).
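For an ad-hoc view of the same per-process I/O on a single server, something like pidstat from sysstat works too (a sketch; it reports per-process read/write rates in kB/s rather than ops/s, but that is enough to spot a busy OSD daemon):

$ ssh ceph21 'pidstat -d -C ceph-osd 60 3'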
We can look at the block I/O read rates on the individual OSD
disks on ceph21:
and see that /dev/sdc, /dev/sdd and /dev/sde were quite busy, while the other three OSD disks were mostly idle.
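The per-disk picture can also be confirmed ad hoc with iostat (a sketch, assuming sysstat is installed on the OSD host; three 60-second samples of the three busy disks):

$ ssh ceph21 'iostat -xd sdc sdd sde 60 3'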
The busy devices correspond to osd.0, osd.1 and osd.2:
$ ssh ceph21 'mount | egrep "/dev/sd[cde]"'
/dev/sde1 on /var/lib/ceph/osd/ceph-0 type xfs (rw,noatime)
/dev/sdc1 on /var/lib/ceph/osd/ceph-2 type xfs (rw,noatime)
/dev/sdd1 on /var/lib/ceph/osd/ceph-1 type xfs (rw,noatime)
So we have strong confirmation that osd.1 (which was slow) was being scrubbed (heavily read from). If we look for "scrub" messages in the OSD log around that time, we see that there were in fact three deep-scrubs finishing between 03:38:11 and 03:38:40:
$ ssh ceph21 'fgrep scrub <(gzip -dc /var/log/ceph/ceph-osd.*.log.1.gz) | sort -n' | egrep ' 03:[345]'
2015-02-06 03:38:11.058160 7f3376f7f700 0 log [INF] : 5.13a deep-scrub ok
2015-02-06 03:38:36.608224 7ffe093c4700 0 log [INF] : 5.111 deep-scrub ok
2015-02-06 03:38:40.711687 7f29ac56c700 0 log [INF] : 5.15a deep-scrub ok
The OSD logs unfortunately don't tell us when these scrubs were started, but from looking at the graphs, in particular the second one, we can guess with high confidence that they all started between 03:21 and 03:22.
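Had we caught the scrubs in the act, the affected PGs would have been visible directly by their state (a sketch; the exact state string, something like active+clean+scrubbing+deep, may vary between Ceph versions):

$ ceph pg dump | grep scrubbing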
Now we can check which OSDs these PGs map to, and we find that,
indeed, they respectively include OSDs 0, 1, and 2:
$ ceph pg dump all | egrep '\<(5\.1(3a|11|5a))\>' | awk '{ print $1, $14 }'
dumped all in format plain
5.15a [0,56,34]
5.13a [1,64,70]
5.111 [2,66,86]
Conclusions
- Deep-scrubbing has an impact on Ceph performance
- It can happen that, on a given server, multiple OSDs are busy
performing deep scrubs at the same time
- When three deep-scrubs happen in parallel on the same server, the
impact can be very visible and lead to >30s queues. This also seems
to affect write OPs, not just reads.
I am somewhat surprised by this, as I would have thought that the
impact is mostly due to "per-spindle" limitations (random read
op/s), but maybe there's another bottleneck. One possibly
interesting observation is that the three disks in question are
connected to the same SAS host adapter, namely the one at PCI
address 05:00.0 (there's a second host adapter on 83:00.0):
$ ssh ceph21 'cd /dev/disk/by-path && /bin/ls -l *-lun-0 | cut -c 38-'
pci-0000:05:00.0-sas-0x4433221100000000-lun-0 -> ../../sdc
pci-0000:05:00.0-sas-0x4433221101000000-lun-0 -> ../../sde
pci-0000:05:00.0-sas-0x4433221102000000-lun-0 -> ../../sdd
pci-0000:05:00.0-sas-0x4433221103000000-lun-0 -> ../../sdf
pci-0000:83:00.0-sas-0x4433221100000000-lun-0 -> ../../sdg
pci-0000:83:00.0-sas-0x4433221101000000-lun-0 -> ../../sdh
pci-0000:83:00.0-sas-0x4433221102000000-lun-0 -> ../../sdi
pci-0000:83:00.0-sas-0x4433221103000000-lun-0 -> ../../sdj
pci-0000:83:00.0-sas-0x4433221104000000-lun-0 -> ../../sdk
pci-0000:83:00.0-sas-0x4433221105000000-lun-0 -> ../../sdl
pci-0000:83:00.0-sas-0x4433221106000000-lun-0 -> ../../sdm
pci-0000:83:00.0-sas-0x4433221107000000-lun-0 -> ../../sdn
Maybe that SAS adapter was the bottleneck in this case.
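For the record, lspci tells us what actually sits at that address (output omitted here; -s selects the PCI slot):

$ ssh ceph21 'lspci -v -s 05:00.0'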
Possible avenues for improvement
Better foreground/background I/O isolation
Ceph could do a (much) better job isolating actual user I/O from "background" I/O caused by tasks such as scrubbing or rebalancing. See Loïc Dachary's post on Lowering Ceph scrub I/O priority for something that can be configured on recent-enough versions of Ceph. (Thanks for the pointer, Harry!)
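Concretely, the settings from that post can be injected into running OSDs roughly like this (a sketch; the options only take effect when the OSD disks use the CFQ I/O scheduler, and they should also be persisted in ceph.conf):

$ ceph tell 'osd.*' injectargs '--osd_disk_thread_ioprio_class idle --osd_disk_thread_ioprio_priority 7'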
Better scrub scheduling
Ceph could do a (much) better job spreading out deep-scrubs over time. The effect described here is not an isolated occurrence - earlier I had observed periods of massive deep-scrubbing, with multi-day periods of no deep-scrubbing at all between them. For example, this is the block read-rate graph across our other Ceph cluster over the past 60 hours:
You see that all the deep-scrubbing is done across one 20-hour period. Crazy! This should be evenly spread out over a week (the global deep-scrub interval).
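For reference, the scrub intervals an OSD actually runs with can be read from its admin socket (a sketch; osd_deep_scrub_interval defaults to 604800 seconds, i.e. one week):

$ ssh ceph21 'ceph daemon osd.0 config show' | egrep 'scrub.*interval'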
OSD/controller mapping on our hosts
We could do a better job distributing (consecutive) OSDs across controllers. While we're at it, we should also make sure that journals are distributed nicely across all SSDs, and that we never get confused by changing kernel device names for our SAS disks. And I want a pony.
Any other ideas?