Sunday, April 17, 2016
Google Code Jam 2016 in Lisp: Round 1A
Sunday, April 10, 2016
Google Code Jam 2016 Qualification Round in Lisp
The puzzles were lots of fun, as usual. A (Counting Sheep) and B (Revenge of the Pancakes) were easy. When I looked at C (Coin Jam), I found it tricky, so I put it aside and did D (Fractiles) next. Explaining the problem to my 19-year-old son helped me find a good approach. After implementing it and submitting the solutions, I turned back to puzzle C. My solution looked complex, and I suspected that there would be a more elegant way. Still, I was happy with it, as it produced a result for the large set very quickly, so I submitted the results and was done. My solutions can be found here.
So how did I fare?
I should have written:
(if (< (* s c) k)
    "IMPOSSIBLE"
    ...)
...but I used:
(let ((min-s (truncate k c)))
  (if (< s min-s)
      "IMPOSSIBLE"
      ...))
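With hypothetical values (taking s, c and k as the students, complexity and tiles of the Fractiles problem), a quick Python check shows the difference between the two tests:

```python
def correct_impossible(s, c, k):
    # Correct test: s students at complexity c cover at most s*c tiles,
    # so the case is impossible exactly when s*c < k.
    return s * c < k

def lax_impossible(s, c, k):
    # The buggy test: floor division drops the remainder of k/c,
    # so it misses cases where k is not a multiple of c.
    return s < k // c

# s=2, c=3, k=7: 2*3 = 6 < 7, so this case really is IMPOSSIBLE...
print(correct_impossible(2, 3, 7))  # True
# ...but 7 // 3 == 2, and 2 < 2 is false, so the lax test lets it through.
print(lax_impossible(2, 3, 7))      # False
```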
My test was too lax, and I sometimes output a "solution" when I should have printed IMPOSSIBLE: truncate discards the remainder of k/c, so whenever k is not a multiple of c, my bound was one too low. This stupid mistake cost me more than 2000 ranks. It would be nice if I managed to avoid this in the next round. But realistically I'll make even more of these mistakes, because the problems will be harder and the time pressure much higher.

How did the other Lispers do?
The highest ranking Lisper was Ipetru at #594 with all solutions in Lisp (and of course all correct). I looked at his solutions for C and D, and they are so incredibly compact that I couldn't believe my eyes. D used the same approach as I had, just very elegantly written—the code proper is about two lines; much harder to hide stupid mistakes in there! C used a completely different approach, deterministically generating Jamcoins rather than checking candidates as I had.
The second-ranking Lisper was DarkKnight. at #720. He only wrote one solution in Lisp. In fact he used different languages for all solutions, and I mean all 8 solutions, not all 4 puzzles! bc, OCaml, Lisp, Racket, Lua, Perl, Octave, R. Impressive! :-)
Friday, February 13, 2015
BloFELD
Panopticlick illustrates that, even without cookies etc., modern browsers usually send enough sufficiently specific information to Web servers that you can be tracked. The various useful standard HTTP headers effectively work like a fingerprint of your browser/OS/hardware combination.
When I look at the Panopticlick results of my favorite browser (I get 22.26 bits of "identifying information", while more senior users have reported scores up to full 24 :-), one thing that stands out is a long list of "System Fonts". Arguably it is useful for me when Web sites know what fonts I have installed on my system, so that they can present Web pages with fonts that I actually have rather than having to send me the fonts as well. So the intention is good, but the implementation discloses too much of my typographic tastes. What can we do to fix this?
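To put that number in perspective: n bits of identifying information mean that only about one browser in 2^n shares the same fingerprint. A quick back-of-the-envelope check:

```python
# Panopticlick's score for my browser, in bits of identifying information
bits = 22.26

# How many browsers would you have to look at before one matches mine?
population = 2 ** bits
print(round(population))  # about 5 million
```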
Well, that should be quite obvious: instead of my browser sending the full list of fonts, it could send a Bloom filter that matches the fonts that I have. When a Web server wants to render a document for me, it can check for candidate fonts whether I have them. Bloom filters are approximate and will sometimes generate false positives, but one Comic Sans Web page in 1'000 or so should be a small price to pay to get my privacy back.
You may respond that a priori the Bloom filter discloses as much of my privacy as the full list of fonts. But! I can simply send a new Bloom filter ("salted" differently) to each site I visit. Voilà how I'll defeat all traceability of Web visits, undermine the business model of the Internet economy, and destroy the entire Western civilization. Muahaha!
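For the curious, here is a minimal sketch of the scheme in Python. The class name, parameters, and the per-site salt handling are all made up for illustration; a real browser implementation would need careful sizing of the filter:

```python
import hashlib

class SaltedBloomFilter:
    """Tiny Bloom filter; a per-site salt makes filters built from the
    same font list incomparable across sites (hypothetical scheme)."""

    def __init__(self, salt, size_bits=1024, num_hashes=4):
        self.salt = salt
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from salt, hash index, and item.
        for i in range(self.k):
            h = hashlib.sha256(f"{self.salt}:{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

# Browser side: build one filter per site, salted with a per-site nonce.
fonts = ["Helvetica", "Georgia", "DejaVu Sans"]
f = SaltedBloomFilter(salt="nonce-for-example.org")
for name in fonts:
    f.add(name)

# Server side: probe candidate fonts. False positives are possible,
# false negatives are not.
print(f.might_contain("Georgia"))     # True (added fonts are always found)
print(f.might_contain("Comic Sans"))  # almost certainly False
```

Because the salt changes per site, the same font list yields unrelated bit patterns on each visit, which is exactly what defeats cross-site matching.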
-- BloFELD
(Bloom Filter-Enabled Limited Disclosure)
(Apologies to smb and Bill Cheswick, who seem to have fully baked and published a better version of this idea in 2007, and to the probably numerous others)
Friday, February 06, 2015
Ceph Deep-Scrubbing Impact Study
Context
I help operate two geographically separate OpenStack/Ceph clusters consisting of 32 servers each, of which 16 (per cluster) are dedicated OSD servers. Each OSD server currently has six OSDs. Each OSD runs on a dedicated 4TB SAS disk. Each server also has SSDs, which are mostly used for OSD write journals.
We monitor our systems using the venerable Nagios. My colleague Alessandro has written many specific checks for infrastructure services such as Ceph. Some of them periodically check the logs for possibly non-harmless messages. It can be interesting to try to understand these messages and get down to their root cause. Here's one from early this morning (edited for readability):
monitor writes:
> Service: L_check_logfiles
> State: WARNING
> Output: WARN - WARNING - (30 warnings in check_logfiles.protocol-2015-02-06-03-35-35) - File=/var/log/ceph/ceph.log Message=2015-02-06 03:35:26.877633 osd.1 [2001:db8:625:ca1e:100::1021]:6800/219476 2257 : [WRN] slow request 30.185039 seconds old, received at 2015-02-06 03:34:56.692108: osd_op(client.1932427.0:27645811 rbd_data.1fd765491f48ea.00000000000000a9 [stat,set-alloc-hint object_size 8388608 write_size 8388608,write 52720648192] 5.4220493a ack+ondisk+write e9526) v4 currently no flag points reached ...
The "Output" line tells us that a write operation to osd.1 was stuck in a queue for 30+ seconds around 03:35. Why did that happen in the middle of the night, when utilization is low?
Graphite to the Rescue
Lately we have set up a CollectD/Graphite monitoring infrastructure for each site. It collects data in ten-second intervals. The ten-second samples are only retained for three hours, then aggregated to one-minute samples that are retained for 24 hours, and so on. Because this event happened at night, I missed the fine-grained samples, so all the graphs shown here have one-minute temporal resolution.
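That retention policy corresponds to a Carbon storage-schemas.conf along these lines (the section name and pattern are assumptions; the coarser levels beyond one minute are omitted because I only described the first two above):

```ini
[collectd]
pattern = ^collectd\.
# 10-second samples for 3 hours, then 1-minute samples for 24 hours;
# further, coarser retention levels follow in the real config
retentions = 10s:3h,1m:1d
```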
The event is visible from the "#Ceph" dashboard on our Graphite installation. I extracted a few graphs from it and focused them on the event in question.
Here is a graph that shows block I/O (actually just "I") operations per second of all OSD processes summed up for each OSD server:
These patterns (a few OSD servers reading heavily) indicate "scrubbing" activity (scrubbing checks existing data for correctness).
We can look at the block I/O read rates on the individual OSD disks on ceph21:
and see that /dev/sdc, /dev/sdd and /dev/sde were quite busy, while the other three OSD disks were mostly idle.
The busy devices correspond to osd.0, osd.1 and osd.2:
$ ssh ceph21 'mount | egrep "/dev/sd[cde]"'
/dev/sde1 on /var/lib/ceph/osd/ceph-0 type xfs (rw,noatime)
/dev/sdc1 on /var/lib/ceph/osd/ceph-2 type xfs (rw,noatime)
/dev/sdd1 on /var/lib/ceph/osd/ceph-1 type xfs (rw,noatime)
So we have strong confirmation that osd.1 (which was slow) was being scrubbed (heavily read from). If we look for "scrub" messages in the OSD log around that time, we see that there were in fact three deep-scrubs finishing between 03:38:11 and 03:38:40:
$ ssh ceph21 'fgrep scrub <(gzip -dc /var/log/ceph/ceph-osd.*.log.1.gz) | sort -n' | egrep ' 03:[345]'
2015-02-06 03:38:11.058160 7f3376f7f700 0 log [INF] : 5.13a deep-scrub ok
2015-02-06 03:38:36.608224 7ffe093c4700 0 log [INF] : 5.111 deep-scrub ok
2015-02-06 03:38:40.711687 7f29ac56c700 0 log [INF] : 5.15a deep-scrub ok
The OSD logs unfortunately don't tell us when these scrubs were started, but from looking at the graphs, in particular the second one, we can guess with high confidence that they all started between 03:21 and 03:22.
Now we can check which OSDs these PGs map to, and we find that, indeed, they respectively include OSDs 0, 1, and 2:
$ ceph pg dump all | egrep '\<(5\.1(3a|11|5a))\>' | awk '{ print $1, $14 }'
dumped all in format plain
5.15a [0,56,34]
5.13a [1,64,70]
5.111 [2,66,86]
Conclusions
- Deep-scrubbing has an impact on Ceph performance
- It can happen that, on a given server, multiple OSDs are busy performing deep scrubs at the same time
- When three deep-scrubs happen in parallel on the same server, the impact can be very visible and lead to >30s queues. This also seems to affect write OPs, not just reads.
I am somewhat surprised by this, as I would have thought that the impact is mostly due to "per-spindle" limitations (random read op/s), but maybe there's another bottleneck. One possibly interesting observation is that the three disks in question are connected to the same SAS host adapter, namely the one at PCI address 05:00.0 (there's a second host adapter on 83:00.0):
$ ssh ceph21 'cd /dev/disk/by-path && /bin/ls -l *-lun-0 | cut -c 38-'
pci-0000:05:00.0-sas-0x4433221100000000-lun-0 -> ../../sdc
pci-0000:05:00.0-sas-0x4433221101000000-lun-0 -> ../../sde
pci-0000:05:00.0-sas-0x4433221102000000-lun-0 -> ../../sdd
pci-0000:05:00.0-sas-0x4433221103000000-lun-0 -> ../../sdf
pci-0000:83:00.0-sas-0x4433221100000000-lun-0 -> ../../sdg
pci-0000:83:00.0-sas-0x4433221101000000-lun-0 -> ../../sdh
pci-0000:83:00.0-sas-0x4433221102000000-lun-0 -> ../../sdi
pci-0000:83:00.0-sas-0x4433221103000000-lun-0 -> ../../sdj
pci-0000:83:00.0-sas-0x4433221104000000-lun-0 -> ../../sdk
pci-0000:83:00.0-sas-0x4433221105000000-lun-0 -> ../../sdl
pci-0000:83:00.0-sas-0x4433221106000000-lun-0 -> ../../sdm
pci-0000:83:00.0-sas-0x4433221107000000-lun-0 -> ../../sdn
Maybe that SAS adapter was the bottleneck in this case.
Possible avenues for improvement
Better foreground/background I/O isolation
Ceph could do a (much) better job isolating actual user I/O from "background" I/O caused by tasks such as scrubbing or rebalancing. See Loïc Dachary's post on Lowering Ceph scrub I/O priority for something that can be configured on recent-enough versions of Ceph. (Thanks for the pointer, Harry!)
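If I remember the option names from Loïc's post correctly, the idea is to move the OSDs' disk threads into the idle I/O scheduling class (this requires the CFQ scheduler on the OSD disks), roughly like this in ceph.conf:

```ini
[osd]
# run scrub/recovery disk threads in the CFQ "idle" class,
# so client I/O gets serviced first
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 7
```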
Better scrub scheduling
Ceph could do a (much) better job spreading out deep-scrubs over time. The effect described here is not an isolated occurrence - earlier I had observed periods of massive deep-scrubbing, with multi-day periods of no deep-scrubbing at all between them. For example, this is the block read-rate graph across our other Ceph cluster over the past 60 hours:
You see that all the deep-scrubbing is done across one 20-hour period. Crazy! This should be evenly spread out over a week (the global deep-scrub interval).
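One way a scheduler could achieve that: derive each PG's deep-scrub time deterministically from its id, so the scrubs land uniformly across the interval instead of clustering. This is only a sketch of the idea, not how Ceph's scheduler actually works:

```python
import hashlib

DEEP_SCRUB_INTERVAL = 7 * 24 * 3600  # spread deep-scrubs over one week

def scrub_offset(pgid):
    # Hash the PG id into a deterministic offset within the interval;
    # a good hash spreads PGs approximately uniformly over the week.
    h = hashlib.sha256(pgid.encode()).digest()
    return int.from_bytes(h[:8], "big") % DEEP_SCRUB_INTERVAL

for pg in ["5.13a", "5.111", "5.15a"]:
    print(pg, "->", round(scrub_offset(pg) / 3600, 1), "hours into the week")
```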
OSD/controller mapping on our hosts
We could do a better job distributing (consecutive) OSDs across controllers. While we're at it, we should also make sure that journals are distributed nicely across all SSDs, and that we never get confused by changing kernel device names for our SAS disks. And I want a pony.
Any other ideas?
Wednesday, June 25, 2014
Riding the SDN Hype Wave
As a networking veteran who has fought in the OSI and ATM wars, I like making fun of these fads as much as the next person (buzzword bingo, anyone?). But actually, I consider the SDN hype a really good thing. Why? Because, to quote Cisco CTO Padmasree Warrior, "networking is cool again", and that's not just good for her company but a breath of fresh air for the industry as a whole.
What I like in particular is that SDN (because nobody knows exactly what it means and where its limits are...) legitimizes ideas that would have quickly been rejected before ("it has been known for years that this doesn't scale/that this-and-that needs to be done in hardware/..."). Of course this also means that many tired ideas will get another chance by being rebranded as SDN, but personally I think that this does less damage.


