Simon's Musings: February 2015

Context

I help operate two geographically separate OpenStack/Ceph clusters consisting of 32 servers each, of which 16 (per cluster) are dedicated OSD servers. Each OSD server currently has six OSDs. Each OSD runs on a dedicated 4TB SAS disk. Each server also has SSDs, which are mostly used for OSD write journals.

We monitor our systems using the venerable Nagios. My colleague Alessandro has written many specific checks for infrastructure services such as Ceph. Some of them periodically check the logs for possibly non-harmless messages. It can be interesting to try to understand these messages and get down to their root cause. Here's one from early this morning (edited for readability):

monitor writes:
> Service: L_check_logfiles
> State: WARNING
> Output: WARN - WARNING - (30 warnings in check_logfiles.protocol-2015-02-06-03-35-35) - File=/var/log/ceph/ceph.log Message=2015-02-06 03:35:26.877633 osd.1 [2001:db8:625:ca1e:100::1021]:6800/219476 2257 : [WRN] slow request 30.185039 seconds old, received at 2015-02-06 03:34:56.692108: osd_op(client.1932427.0:27645811 rbd_data.1fd765491f48ea.00000000000000a9 [stat,set-alloc-hint object_size 8388608 write_size 8388608,write 52720648192] 5.4220493a ack+ondisk+write e9526) v4 currently no flag points reached ...

The "Output" line tells us that a write operation to osd.1 was stuck in a queue for 30+ seconds around 03:35. Why did that happen in the middle of the night, when utilization is low?

Graphite to the Rescue

Lately we have set up a CollectD/Graphite monitoring infrastructure for each site. It collects data in ten-second intervals. The ten-second samples are only retained for three hours, then aggregated to one-minute samples that are retained for 24 hours, and so on. Because this event happened at night, I missed the fine-grained samples, so all the graphs shown here have one-minute temporal resolution.

The event is visible from the "#Ceph" dashboard on our Graphite installation. I extracted a few graphs from it and focused them on the event in question.

Here is a graph that shows block I/O (actually just "I") operations per second of all OSD processes summed up for each OSD server:

These patterns (a few OSD servers reading heavily) indicate "scrubbing" activity (scrubbing checks existing data for correctness).

We can look at the block I/O read rates on the individual OSD disks on ceph21:

and see that /dev/sdc, /dev/sdd and /dev/sde were quite busy, while the other three OSD disks were mostly idle.

The busy devices correspond to osd.0, osd.1 and osd.2:

    $ ssh ceph21 'mount | egrep "/dev/sd[cde]"'
    /dev/sde1 on /var/lib/ceph/osd/ceph-0 type xfs (rw,noatime)
    /dev/sdc1 on /var/lib/ceph/osd/ceph-2 type xfs (rw,noatime)
    /dev/sdd1 on /var/lib/ceph/osd/ceph-1 type xfs (rw,noatime)

So we have strong confirmation that osd.1 (which was slow) was being scrubbed (heavily read from). If we look for "scrub" messages in the OSD log around that time, we see that there were in fact three deep-scrubs finishing between 03:38:11 and 03:38:40:

    $ ssh ceph21 'fgrep scrub &lt;(gzip -dc /var/log/ceph/ceph-osd.*.log.1.gz) | sort -n' | egrep ' 03:[345]'
    2015-02-06 03:38:11.058160 7f3376f7f700  0 log [INF] : 5.13a deep-scrub ok
    2015-02-06 03:38:36.608224 7ffe093c4700  0 log [INF] : 5.111 deep-scrub ok
    2015-02-06 03:38:40.711687 7f29ac56c700  0 log [INF] : 5.15a deep-scrub ok

The OSD logs unfortunately don't tell us when these scrubs were started, but from looking at the the graphs, in particular the second one, we can guess with high confidence that they all started between 03:21 and 03:22.

Now we can check which OSDs these PGs map to, and we find that, indeed, they respectively include OSDs 0, 1, and 2:

    $ ceph pg dump all | egrep '\&lt;(5\.1(3a|11|5a))\&gt;' | awk '{ print $1, $14 }'
    dumped all in format plain
    5.15a [0,56,34]
    5.13a [1,64,70]
    5.111 [2,66,86]

Conclusions

Deep-scrubbing has an impact on Ceph performance
It can happen that, on a given server, multiple OSDs are busy performing deep scrubs at the same time
When three deep-scrubs happen in parallel on the same server, the impact can be very visible and lead to >30s queues. This also seems to affect write OPs, not just reads.

I am somewhat surprised by this, as I would have thought that the impact is mostly due to "per-spindle" limitations (random read op/s), but maybe there's another bottleneck. One possibly interesting observation is that the three disks in question are connected to the same SAS host adapter, namely the one at PCI address 05:00.0 (there's a second host adapter on 83:00.0):

$ ssh ceph21 'ls -l /dev/disk/by-path/*-lun-0'
$ ssh zhdk0021.zhdk.cloud.switch.ch 'cd /dev/disk/by-path && /bin/ls -l *-lun-0 | cut -c 38-'
 pci-0000:05:00.0-sas-0x4433221100000000-lun-0 -> ../../sdc
 pci-0000:05:00.0-sas-0x4433221101000000-lun-0 -> ../../sde
 pci-0000:05:00.0-sas-0x4433221102000000-lun-0 -> ../../sdd
 pci-0000:05:00.0-sas-0x4433221103000000-lun-0 -> ../../sdf
 pci-0000:83:00.0-sas-0x4433221100000000-lun-0 -> ../../sdg
 pci-0000:83:00.0-sas-0x4433221101000000-lun-0 -> ../../sdh
 pci-0000:83:00.0-sas-0x4433221102000000-lun-0 -> ../../sdi
 pci-0000:83:00.0-sas-0x4433221103000000-lun-0 -> ../../sdj
 pci-0000:83:00.0-sas-0x4433221104000000-lun-0 -> ../../sdk
 pci-0000:83:00.0-sas-0x4433221105000000-lun-0 -> ../../sdl
 pci-0000:83:00.0-sas-0x4433221106000000-lun-0 -> ../../sdm
 pci-0000:83:00.0-sas-0x4433221107000000-lun-0 -> ../../sdn

Maybe that SAS adapter was the bottleneck in this case.

Possible avenues for improvement

Better foreground/background I/O isolation

Ceph could do a (much) better job isolating actual user I/O from "background" I/O caused by tasks such as scrubbing or rebalancing. See Loïc Dachary's post on Lowering Ceph scrub I/O priority for something that can be configured on recent-enough versions of Ceph. (Thanks for the pointer, Harry!)

Better scrub scheduling

Ceph could do a (much) better job spreading out deep-scrubs over time. The effect described here is not an isolated occurrence - earlier I had observed periods of massive deep-scrubbing, with multi-day periods of no deep-scrubbing at all between them. For example, this is the block read-rate graph across our other Ceph cluster over the past 60 hours:

You see that all the deep-scrubbing is done across one 20-hour period. Crazy! This should be evenly spread out over a week (the global deep-scrub interval).

OSD/controller mapping on our hosts

We could do a better job distributing (consecutive) OSDs across controllers. While we're at it, we should also make sure that journals are distributed nicely across all SSDs, and that we never get confused by changing kernel device names for our SAS disks. And I want a pony.

Any other ideas?

Simon's Musings

Friday, February 13, 2015

BloFELD

Friday, February 06, 2015

Ceph Deep-Scrubbing Impact Study