Sunday, April 17, 2016
Google Code Jam 2016 in Lisp: Round 1A
Sunday, April 10, 2016
Google Code Jam 2016 Qualification Round in Lisp
The puzzles were lots of fun, as usual. A (Counting Sheep) and B (Revenge of the Pancakes) were easy. When I looked at C (Coin Jam), I found it tricky, so I put it aside and did D (Fractiles) next. Explaining the problem to my 19-year-old son helped me find a good approach. After implementing it and submitting the solutions, I turned back to puzzle C. My solution looked complex, and I suspected that there would be a more elegant way. Still, I was happy with it, as it produced a result for the large set very quickly, so I submitted the results and was done. My solutions can be found here.
So how did I fare?
I should have written:
(if (< (* s c) k)
    "IMPOSSIBLE"
    ...)
...but I used:
(let ((min-s (truncate k c)))
  (if (< s min-s)
      "IMPOSSIBLE"
      ...))
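With hypothetical values (taking s, c and k as the students, complexity and tiles of the Fractiles problem), a quick Python check shows the difference between the two tests:

```python
def correct_impossible(s, c, k):
    # Correct test: s students at complexity c cover at most s*c tiles,
    # so the case is impossible exactly when s*c < k.
    return s * c < k

def lax_impossible(s, c, k):
    # The buggy test: floor division drops the remainder of k/c,
    # so it misses cases where k is not a multiple of c.
    return s < k // c

# s=2, c=3, k=7: 2*3 = 6 < 7, so this case really is IMPOSSIBLE...
print(correct_impossible(2, 3, 7))  # True
# ...but 7 // 3 == 2, and 2 < 2 is false, so the lax test lets it through.
print(lax_impossible(2, 3, 7))      # False
```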
My test was too lax, and I sometimes output a "solution" when I should have printed IMPOSSIBLE: truncate discards the remainder of k/c, so whenever k is not a multiple of c, my bound was one too low. This stupid mistake cost me more than 2000 ranks. It would be nice if I managed to avoid this in the next round. But realistically I'll make even more of these mistakes, because the problems will be harder and the time pressure much higher.

How did the other Lispers do?
The highest ranking Lisper was Ipetru at #594 with all solutions in Lisp (and of course all correct). I looked at his solutions for C and D, and they are so incredibly compact that I couldn't believe my eyes. D used the same approach as I had, just very elegantly written—the code proper is about two lines; much harder to hide stupid mistakes in there! C used a completely different approach, deterministically generating Jamcoins rather than checking candidates as I had.
The second-ranking Lisper was DarkKnight. at #720. He only wrote one solution in Lisp. In fact he used different languages for all solutions, and I mean all 8 solutions, not all 4 puzzles! bc, OCaml, Lisp, Racket, Lua, Perl, Octave, R. Impressive! :-)
Friday, February 13, 2015
BloFELD
Panopticlick illustrates that, even without cookies etc., modern browsers usually send enough sufficiently specific information to Web servers that you can be tracked. The various useful standard HTTP headers effectively work like a fingerprint of your browser/OS/hardware combination.
When I look at the Panopticlick results of my favorite browser (I get 22.26 bits of "identifying information", while more senior users have reported scores up to full 24 :-), one thing that stands out is a long list of "System Fonts". Arguably it is useful for me when Web sites know what fonts I have installed on my system, so that they can present Web pages with fonts that I actually have rather than having to send me the fonts as well. So the intention is good, but the implementation discloses too much of my typographic tastes. What can we do to fix this?
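To put that number in perspective: n bits of identifying information mean that only about one browser in 2^n shares the same fingerprint. A quick back-of-the-envelope check:

```python
# Panopticlick's score for my browser, in bits of identifying information
bits = 22.26

# How many browsers would you have to look at before one matches mine?
population = 2 ** bits
print(round(population))  # about 5 million
```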
Well, that should be quite obvious: instead of my browser sending the full list of fonts, it could send a Bloom filter that matches the fonts that I have. When a Web server wants to render a document for me, it can check for candidate fonts whether I have them. Bloom filters are approximate and will sometimes generate false positives, but one Comic Sans Web page in 1'000 or so should be a small price to pay to get my privacy back.
You may respond that a priori the Bloom filter discloses as much of my privacy as the full list of fonts. But! I can simply send a new Bloom filter ("salted" differently) to each site I visit. Voilà how I'll defeat all traceability of Web visits, undermine the business model of the Internet economy, and destroy the entire Western civilization. Muahaha!
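For the curious, here is a minimal sketch of the scheme in Python. The class name, parameters, and the per-site salt handling are all made up for illustration; a real browser implementation would need careful sizing of the filter:

```python
import hashlib

class SaltedBloomFilter:
    """Tiny Bloom filter; a per-site salt makes filters built from the
    same font list incomparable across sites (hypothetical scheme)."""

    def __init__(self, salt, size_bits=1024, num_hashes=4):
        self.salt = salt
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from salt, hash index, and item.
        for i in range(self.k):
            h = hashlib.sha256(f"{self.salt}:{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

# Browser side: build one filter per site, salted with a per-site nonce.
fonts = ["Helvetica", "Georgia", "DejaVu Sans"]
f = SaltedBloomFilter(salt="nonce-for-example.org")
for name in fonts:
    f.add(name)

# Server side: probe candidate fonts. False positives are possible,
# false negatives are not.
print(f.might_contain("Georgia"))     # True (added fonts are always found)
print(f.might_contain("Comic Sans"))  # almost certainly False
```

Because the salt changes per site, the same font list yields unrelated bit patterns on each visit, which is exactly what defeats cross-site matching.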
-- BloFELD
(Bloom Filter-Enabled Limited Disclosure)
(Apologies to smb and Bill Cheswick, who seem to have fully baked and published a better version of this idea in 2007, and to the probably numerous others)
Friday, February 06, 2015
Ceph Deep-Scrubbing Impact Study
Context
I help operate two geographically separate OpenStack/Ceph clusters consisting of 32 servers each, of which 16 (per cluster) are dedicated OSD servers. Each OSD server currently has six OSDs. Each OSD runs on a dedicated 4TB SAS disk. Each server also has SSDs, which are mostly used for OSD write journals.
We monitor our systems using the venerable Nagios. My colleague Alessandro has written many specific checks for infrastructure services such as Ceph. Some of them periodically check the logs for possibly non-harmless messages. It can be interesting to try to understand these messages and get down to their root cause. Here's one from early this morning (edited for readability):
monitor writes:
> Service: L_check_logfiles
> State: WARNING
> Output: WARN - WARNING - (30 warnings in check_logfiles.protocol-2015-02-06-03-35-35) - File=/var/log/ceph/ceph.log Message=2015-02-06 03:35:26.877633 osd.1 [2001:db8:625:ca1e:100::1021]:6800/219476 2257 : [WRN] slow request 30.185039 seconds old, received at 2015-02-06 03:34:56.692108: osd_op(client.1932427.0:27645811 rbd_data.1fd765491f48ea.00000000000000a9 [stat,set-alloc-hint object_size 8388608 write_size 8388608,write 52720648192] 5.4220493a ack+ondisk+write e9526) v4 currently no flag points reached ...
The "Output" line tells us that a write operation to osd.1 was stuck in a queue for 30+ seconds around 03:35. Why did that happen in the middle of the night, when utilization is low?
Graphite to the Rescue
Lately we have set up a CollectD/Graphite monitoring infrastructure for each site. It collects data in ten-second intervals. The ten-second samples are only retained for three hours, then aggregated to one-minute samples that are retained for 24 hours, and so on. Because this event happened at night, I missed the fine-grained samples, so all the graphs shown here have one-minute temporal resolution.
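That retention policy corresponds to a Carbon storage-schemas.conf along these lines (the section name and pattern are assumptions; the coarser levels beyond one minute are omitted because I only described the first two above):

```ini
[collectd]
pattern = ^collectd\.
# 10-second samples for 3 hours, then 1-minute samples for 24 hours;
# further, coarser retention levels follow in the real config
retentions = 10s:3h,1m:1d
```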
The event is visible from the "#Ceph" dashboard on our Graphite installation. I extracted a few graphs from it and focused them on the event in question.
Here is a graph that shows block I/O (actually just "I") operations per second of all OSD processes summed up for each OSD server:
These patterns (a few OSD servers reading heavily) indicate "scrubbing" activity (scrubbing checks existing data for correctness).
We can look at the block I/O read rates on the individual OSD disks on ceph21:
and see that /dev/sdc, /dev/sdd and /dev/sde were quite busy, while the other three OSD disks were mostly idle.
The busy devices correspond to osd.0, osd.1 and osd.2:
$ ssh ceph21 'mount | egrep "/dev/sd[cde]"'
/dev/sde1 on /var/lib/ceph/osd/ceph-0 type xfs (rw,noatime)
/dev/sdc1 on /var/lib/ceph/osd/ceph-2 type xfs (rw,noatime)
/dev/sdd1 on /var/lib/ceph/osd/ceph-1 type xfs (rw,noatime)
So we have strong confirmation that osd.1 (which was slow) was being scrubbed (heavily read from). If we look for "scrub" messages in the OSD log around that time, we see that there were in fact three deep-scrubs finishing between 03:38:11 and 03:38:40:
$ ssh ceph21 'fgrep scrub <(gzip -dc /var/log/ceph/ceph-osd.*.log.1.gz) | sort -n' | egrep ' 03:[345]'
2015-02-06 03:38:11.058160 7f3376f7f700 0 log [INF] : 5.13a deep-scrub ok
2015-02-06 03:38:36.608224 7ffe093c4700 0 log [INF] : 5.111 deep-scrub ok
2015-02-06 03:38:40.711687 7f29ac56c700 0 log [INF] : 5.15a deep-scrub ok
The OSD logs unfortunately don't tell us when these scrubs were started, but from looking at the graphs, in particular the second one, we can guess with high confidence that they all started between 03:21 and 03:22.
Now we can check which OSDs these PGs map to, and we find that, indeed, they respectively include OSDs 0, 1, and 2:
$ ceph pg dump all | egrep '\<(5\.1(3a|11|5a))\>' | awk '{ print $1, $14 }'
dumped all in format plain
5.15a [0,56,34]
5.13a [1,64,70]
5.111 [2,66,86]
Conclusions
- Deep-scrubbing has an impact on Ceph performance
- It can happen that, on a given server, multiple OSDs are busy performing deep scrubs at the same time
- When three deep-scrubs happen in parallel on the same server, the impact can be very visible and lead to >30s queues. This also seems to affect write OPs, not just reads.
I am somewhat surprised by this, as I would have thought that the impact is mostly due to "per-spindle" limitations (random read op/s), but maybe there's another bottleneck. One possibly interesting observation is that the three disks in question are connected to the same SAS host adapter, namely the one at PCI address 05:00.0 (there's a second host adapter on 83:00.0):
$ ssh ceph21 'cd /dev/disk/by-path && /bin/ls -l *-lun-0 | cut -c 38-'
pci-0000:05:00.0-sas-0x4433221100000000-lun-0 -> ../../sdc
pci-0000:05:00.0-sas-0x4433221101000000-lun-0 -> ../../sde
pci-0000:05:00.0-sas-0x4433221102000000-lun-0 -> ../../sdd
pci-0000:05:00.0-sas-0x4433221103000000-lun-0 -> ../../sdf
pci-0000:83:00.0-sas-0x4433221100000000-lun-0 -> ../../sdg
pci-0000:83:00.0-sas-0x4433221101000000-lun-0 -> ../../sdh
pci-0000:83:00.0-sas-0x4433221102000000-lun-0 -> ../../sdi
pci-0000:83:00.0-sas-0x4433221103000000-lun-0 -> ../../sdj
pci-0000:83:00.0-sas-0x4433221104000000-lun-0 -> ../../sdk
pci-0000:83:00.0-sas-0x4433221105000000-lun-0 -> ../../sdl
pci-0000:83:00.0-sas-0x4433221106000000-lun-0 -> ../../sdm
pci-0000:83:00.0-sas-0x4433221107000000-lun-0 -> ../../sdn
Maybe that SAS adapter was the bottleneck in this case.
Possible avenues for improvement
Better foreground/background I/O isolation
Ceph could do a (much) better job isolating actual user I/O from "background" I/O caused by tasks such as scrubbing or rebalancing. See Loïc Dachary's post on Lowering Ceph scrub I/O priority for something that can be configured on recent-enough versions of Ceph. (Thanks for the pointer, Harry!)
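If I remember the option names from Loïc's post correctly, the idea is to move the OSDs' disk threads into the idle I/O scheduling class (this requires the CFQ scheduler on the OSD disks), roughly like this in ceph.conf:

```ini
[osd]
# run scrub/recovery disk threads in the CFQ "idle" class,
# so client I/O gets serviced first
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 7
```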
Better scrub scheduling
Ceph could do a (much) better job spreading out deep-scrubs over time. The effect described here is not an isolated occurrence - earlier I had observed periods of massive deep-scrubbing, with multi-day periods of no deep-scrubbing at all between them. For example, this is the block read-rate graph across our other Ceph cluster over the past 60 hours:
You see that all the deep-scrubbing is done across one 20-hour period. Crazy! This should be evenly spread out over a week (the global deep-scrub interval).
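One way a scheduler could achieve that: derive each PG's deep-scrub time deterministically from its id, so the scrubs land uniformly across the interval instead of clustering. This is only a sketch of the idea, not how Ceph's scheduler actually works:

```python
import hashlib

DEEP_SCRUB_INTERVAL = 7 * 24 * 3600  # spread deep-scrubs over one week

def scrub_offset(pgid):
    # Hash the PG id into a deterministic offset within the interval;
    # a good hash spreads PGs approximately uniformly over the week.
    h = hashlib.sha256(pgid.encode()).digest()
    return int.from_bytes(h[:8], "big") % DEEP_SCRUB_INTERVAL

for pg in ["5.13a", "5.111", "5.15a"]:
    print(pg, "->", round(scrub_offset(pg) / 3600, 1), "hours into the week")
```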
OSD/controller mapping on our hosts
We could do a better job distributing (consecutive) OSDs across controllers. While we're at it, we should also make sure that journals are distributed nicely across all SSDs, and that we never get confused by changing kernel device names for our SAS disks. And I want a pony.
Any other ideas?
Wednesday, June 25, 2014
Riding the SDN Hype Wave
As a networking veteran who has fought in the OSI and ATM wars, I like making fun of these fads as much as the next person (buzzword bingo, anyone?). But actually, I consider the SDN hype a really good thing. Why? Because, to quote Cisco CTO Padmasree Warrior, "networking is cool again", and that's not just good for her company but a breath of fresh air for the industry as a whole.
What I like in particular is that SDN (because nobody knows exactly what it means and where its limits are...) legitimizes ideas that would have quickly been rejected before ("it has been known for years that this doesn't scale/that this-and-that needs to be done in hardware/..."). Of course this also means that many tired ideas will get another chance by being rebranded as SDN, but personally I think that this does less damage.


