Friday, February 13, 2015


Panopticlick illustrates that even without cookies etc., modern browsers usually send enough specific information to Web servers that you can be tracked. The various useful standard HTTP headers effectively work like a fingerprint of your browser/OS/hardware combination.

When I look at the Panopticlick results of my favorite browser (I get 22.26 bits of "identifying information", while more senior users have reported scores up to the full 24 :-), one thing that stands out is a long list of "System Fonts". Arguably it is useful for me when Web sites know what fonts I have installed on my system, so that they can present Web pages with fonts I actually have rather than having to send me the fonts as well. So the intention is good, but the implementation discloses too much of my typographic tastes. What can we do to fix this?

Well, that should be quite obvious: Instead of my browser sending the full list of fonts, it could send a Bloom filter that matches the fonts I have. When a Web server wants to render a document for me, it can check candidate fonts against the filter to see whether I have them. Bloom filters are approximate and will sometimes generate false positives, but one Comic Sans Web page in 1'000 or so should be a small price to pay to get my privacy back.

You may respond that a priori the Bloom filter discloses as much of my privacy as the full list of fonts. But! I can simply send a new Bloom filter ("salted" differently) to each site I visit. Voilà how I'll defeat all traceability of Web visits, undermine the business model of the Internet economy, and destroy the entire Western civilization. Muahaha!
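To make the scheme concrete, here is a small sketch of such a salted font filter. This is my own illustration (class name, filter size, and hash construction all invented for the example), not any browser's or Panopticlick's code:

```python
import hashlib

class SaltedBloomFilter:
    """Tiny Bloom filter whose hash functions depend on a salt."""

    def __init__(self, salt, m=1024, k=4):
        self.salt = salt  # per-site salt
        self.m = m        # filter size in bits
        self.k = k        # number of hash functions
        self.bits = 0     # bit vector, stored as a big integer

    def _positions(self, item):
        # Derive k bit positions from SHA-256 over salt, index, and item.
        for i in range(self.k):
            data = ("%s:%d:%s" % (self.salt, i, item)).encode()
            yield int(hashlib.sha256(data).hexdigest(), 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))

fonts = ["Helvetica", "Palatino", "Comic Sans MS"]

# A different salt per site: same fonts, unrelated-looking filters.
f1 = SaltedBloomFilter(salt="example.org")
f2 = SaltedBloomFilter(salt="example.net")
for font in fonts:
    f1.add(font)
    f2.add(font)

assert all(font in f1 for font in fonts)  # no false negatives
assert f1.bits != f2.bits                 # filters cannot be correlated
```

Because the salt enters every hash, the same font set produces unrelated bit patterns for different sites, which is exactly what defeats cross-site correlation.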

-- BloFELD
(Bloom Filter-Enabled Limited Disclosure)

(Apologies to smb and Bill Cheswick, who seem to have fully baked and published a better version of this idea in 2007, and to the probably numerous others)

Friday, February 06, 2015

Ceph Deep-Scrubbing Impact Study


I help operate two geographically separate OpenStack/Ceph clusters consisting of 32 servers each, of which 16 (per cluster) are dedicated OSD servers. Each OSD server currently has six OSDs. Each OSD runs on a dedicated 4TB SAS disk. Each server also has SSDs, which are mostly used for OSD write journals.

We monitor our systems using the venerable Nagios. My colleague Alessandro has written many specific checks for infrastructure services such as Ceph. Some of them periodically check the logs for possibly non-harmless messages. It can be interesting to try to understand these messages and get down to their root cause. Here's one from early this morning (edited for readability):

monitor writes:
> Service: L_check_logfiles
> State: WARNING
> Output: WARN - WARNING - (30 warnings in check_logfiles.protocol-2015-02-06-03-35-35) - File=/var/log/ceph/ceph.log Message=2015-02-06 03:35:26.877633 osd.1 [2001:db8:625:ca1e:100::1021]:6800/219476 2257 : [WRN] slow request 30.185039 seconds old, received at 2015-02-06 03:34:56.692108: osd_op(client.1932427.0:27645811 rbd_data.1fd765491f48ea.00000000000000a9 [stat,set-alloc-hint object_size 8388608 write_size 8388608,write 52720648192] 5.4220493a ack+ondisk+write e9526) v4 currently no flag points reached ...

The "Output" line tells us that a write operation to osd.1 was stuck in a queue for 30+ seconds around 03:35. Why did that happen in the middle of the night, when utilization is low?

Graphite to the Rescue

Lately we have set up a CollectD/Graphite monitoring infrastructure for each site. It collects data in ten-second intervals. The ten-second samples are only retained for three hours, then aggregated to one-minute samples that are retained for 24 hours, and so on. Because this event happened at night, I missed the fine-grained samples, so all the graphs shown here have one-minute temporal resolution.
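In Carbon's terms, such a retention policy is expressed in storage-schemas.conf. A sketch of what ours could look like; the section name, the pattern, and the last retention level are assumptions, only the 10-second/3-hour and 1-minute/24-hour levels are from the text above:

    # storage-schemas.conf (sketch)
    [collectd]
    pattern = ^collectd\.
    retentions = 10s:3h,1m:24h,10m:90d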

The event is visible from the "#Ceph" dashboard on our Graphite installation. I extracted a few graphs from it and focused them on the event in question.

Here is a graph that shows block I/O (actually just "I") operations per second of all OSD processes summed up for each OSD server:

These patterns (a few OSD servers reading heavily) indicate "scrubbing" activity (scrubbing checks existing data for correctness).

We can look at the block I/O read rates on the individual OSD disks on ceph21:

and see that /dev/sdc, /dev/sdd and /dev/sde were quite busy, while the other three OSD disks were mostly idle.

The busy devices correspond to osd.0, osd.1 and osd.2:

    $ ssh ceph21 'mount | egrep "/dev/sd[cde]"'
    /dev/sde1 on /var/lib/ceph/osd/ceph-0 type xfs (rw,noatime)
    /dev/sdc1 on /var/lib/ceph/osd/ceph-2 type xfs (rw,noatime)
    /dev/sdd1 on /var/lib/ceph/osd/ceph-1 type xfs (rw,noatime)

So we have strong confirmation that osd.1 (which was slow) was being scrubbed (heavily read from). If we look for "scrub" messages in the OSD log around that time, we see that there were in fact three deep-scrubs finishing between 03:38:11 and 03:38:40:

    $ ssh ceph21 'fgrep scrub <(gzip -dc /var/log/ceph/ceph-osd.*.log.1.gz) | sort -n' | egrep ' 03:[345]'
    2015-02-06 03:38:11.058160 7f3376f7f700  0 log [INF] : 5.13a deep-scrub ok
    2015-02-06 03:38:36.608224 7ffe093c4700  0 log [INF] : 5.111 deep-scrub ok
    2015-02-06 03:38:40.711687 7f29ac56c700  0 log [INF] : 5.15a deep-scrub ok

The OSD logs unfortunately don't tell us when these scrubs were started, but from looking at the graphs, in particular the second one, we can guess with high confidence that they all started between 03:21 and 03:22.

Now we can check which OSDs these PGs map to, and we find that, indeed, they respectively include OSDs 0, 1, and 2:

    $ ceph pg dump all | egrep '\<(5\.1(3a|11|5a))\>' | awk '{ print $1, $14 }'
    dumped all in format plain
    5.15a [0,56,34]
    5.13a [1,64,70]
    5.111 [2,66,86]


Conclusions

  • Deep-scrubbing has an impact on Ceph performance
  • It can happen that, on a given server, multiple OSDs are busy performing deep scrubs at the same time
  • When three deep-scrubs happen in parallel on the same server, the impact can be very visible and lead to >30s queues. This also seems to affect write OPs, not just reads.

I am somewhat surprised by this, as I would have thought that the impact is mostly due to "per-spindle" limitations (random read op/s), but maybe there's another bottleneck. One possibly interesting observation is that the three disks in question are connected to the same SAS host adapter, namely the one at PCI address 05:00.0 (there's a second host adapter on 83:00.0):

$ ssh ceph21 'cd /dev/disk/by-path && /bin/ls -l *-lun-0 | cut -c 38-'
 pci-0000:05:00.0-sas-0x4433221100000000-lun-0 -> ../../sdc
 pci-0000:05:00.0-sas-0x4433221101000000-lun-0 -> ../../sde
 pci-0000:05:00.0-sas-0x4433221102000000-lun-0 -> ../../sdd
 pci-0000:05:00.0-sas-0x4433221103000000-lun-0 -> ../../sdf
 pci-0000:83:00.0-sas-0x4433221100000000-lun-0 -> ../../sdg
 pci-0000:83:00.0-sas-0x4433221101000000-lun-0 -> ../../sdh
 pci-0000:83:00.0-sas-0x4433221102000000-lun-0 -> ../../sdi
 pci-0000:83:00.0-sas-0x4433221103000000-lun-0 -> ../../sdj
 pci-0000:83:00.0-sas-0x4433221104000000-lun-0 -> ../../sdk
 pci-0000:83:00.0-sas-0x4433221105000000-lun-0 -> ../../sdl
 pci-0000:83:00.0-sas-0x4433221106000000-lun-0 -> ../../sdm
 pci-0000:83:00.0-sas-0x4433221107000000-lun-0 -> ../../sdn

Maybe that SAS adapter was the bottleneck in this case.

Possible avenues for improvement

Better foreground/background I/O isolation

Ceph could do a (much) better job isolating actual user I/O from "background" I/O caused by tasks such as scrubbing or rebalancing. See Loïc Dachary's post on Lowering Ceph scrub I/O priority for something that can be configured on recent-enough versions of Ceph. (Thanks for the pointer, Harry!)
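If I read that post correctly, the relevant knobs would look roughly like this in ceph.conf; a sketch, not tested configuration, and it only takes effect on recent-enough Ceph versions with the CFQ I/O scheduler on the OSD disks:

    [osd]
    # run the OSD disk thread (which does the scrubbing) at idle I/O priority
    osd_disk_thread_ioprio_class = idle
    osd_disk_thread_ioprio_priority = 7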

Better scrub scheduling

Ceph could do a (much) better job spreading out deep-scrubs over time. The effect described here is not an isolated occurrence - earlier I had observed periods of massive deep-scrubbing, with multi-day periods of no deep-scrubbing at all between them. For example, this is the block read-rate graph across our other Ceph cluster over the past 60 hours:

You see that all the deep-scrubbing is done across one 20-hour period. Crazy! This should be evenly spread out over a week (the global deep-scrub interval).
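One simple scheme for spreading: derive each PG's deep-scrub slot deterministically from a hash of its ID, modulo the scrub interval. This is my illustration of the idea, not what Ceph actually does:

```python
import hashlib

WEEK = 7 * 24 * 3600  # global deep-scrub interval, in seconds

def scrub_offset(pgid, interval=WEEK):
    """Stable per-PG offset, roughly uniform over the interval."""
    digest = hashlib.sha256(pgid.encode()).hexdigest()
    return int(digest, 16) % interval

# With many PGs, the offsets spread out over the whole week instead of
# bunching up into one 20-hour burst:
pgids = ["5.%x" % i for i in range(512)]
offsets = sorted(scrub_offset(p) for p in pgids)
largest_gap = max(b - a for a, b in zip(offsets, offsets[1:]))
assert largest_gap < WEEK // 4
```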

OSD/controller mapping on our hosts

We could do a better job distributing (consecutive) OSDs across controllers. While we're at it, we should also make sure that journals are distributed nicely across all SSDs, and that we never get confused by changing kernel device names for our SAS disks. And I want a pony.

Any other ideas?

Wednesday, June 25, 2014

Riding the SDN Hype Wave

In case you haven't noticed, Software-Defined Networking has become the guiding meme for most innovation in networks over the past few years.  It's a great meme because it sounds cool and slightly mysterious.  The notion was certainly inspired by Software-Defined Radio, which had a relatively well-defined meaning, and has since spread to other areas such as Software-Defined Storage and, coming soon, the Software-Defined Economy.
As a networking veteran who has fought in the OSI and ATM wars, I like making fun of these fads as much as the next person—buzzword bingo, anyone? But actually, I consider the SDN hype a really good thing.  Why? Because, to quote Cisco CTO Padmasree Warrior, "networking is cool again", and that's not just good for her company but a breath of fresh air for the industry as a whole.
What I like in particular is that SDN (because nobody knows exactly what it means and where its limits are...) legitimates ideas that would have quickly been rejected before ("it has been known for years that this doesn't scale/that this-and-that needs to be done in hardware/...").  Of course this also means that many tired ideas will get another chance by being rebranded as SDN, but personally I think that does comparatively little damage.

SDN Beyond OpenFlow

The public perception of SDN has been pretty much driven by OpenFlow's vision of separating forwarding plane (the "hardware" function) and control plane, and using external software to drive networks, usually using a "logically centralized" control approach.
The Open Networking Foundation attempts to codify this close association of SDN and OpenFlow by publishing their own SDN definition.  OpenFlow has huge merits as a concrete proposal that can be (and is) implemented and used in real systems.  Therefore it deserves a lot of credit for making people take the "SDN" vision seriously.  But I think the SDN meme is too beautiful to be left confined to OpenFlow-based and "logically centralized" approaches.  I much prefer JR Rivers's (Cumulus Networks) suggestion for what SDN should be: "How can I write software to do things that used to be super hard and do them super easy?" That's certainly more inclusive!

x86 as a Viable High-Speed Packet Processing Platform

Something that I definitely consider an SDN approach is to revisit generic computing hardware (mostly defined as "x86" these days) and see what you can do in terms of interesting packet processing on such platforms.  It turns out that these boxes have come a long way over the past few years! In particular, recent Intel server CPUs (Sandy Bridge/Ivy Bridge) have massively increased memory bandwidth compared to previous generations, and have CPU cores to spare.  On the interface front, most/all of today's 10 Gigabit Ethernet adapters have many helpful performance features such as multiple receive/transmit queues, segmentation offload, hardware virtualization support and so on.  So is it now possible to do line-rate 10Gb/s packet processing on this platform?
The dirty secret is that even the leading companies in ASIC-based backbone routers are already using regular multi-core CPUs for high-performance middleboxes such as firewalls (as opposed to previous generations that had to use network processors, FPGAs and/or ASICs, all of which are expensive to design and program).
Intel has its DPDK (Data Plane Development Kit) to support high-performance applications using their network adapters and processors, and there are several existence proofs now that you can do interesting packet processing on multiple 10Gb/s interfaces using one core or less per interface—and you can get many of those cores in fairly inexpensive boxes.

Snabb Switch

One of my favorite projects in this space is Luke Gorrie's Snabb Switch.  If CPU-based forwarding approaches are at the fringe of SDN, Snabb Switch is at the fringe of CPU-based forwarding approaches... hm, maybe I just like being different.
Snabb Switch is based on the Lua scripting language and on the excellent LuaJIT implementation.  It runs entirely in user space, which means that it can avoid all user/kernel interface issues that make high performance difficult, but also means that it has to implement its own device drivers in user space! Fortunately Lua is a much friendlier platform for developing those, and one of Luke's not-so-secret missions for Snabb is that "writing device drivers should be fun again".
The Snabb Switch project has gained a lot of traction over the year or so since its inception.  A large operator is investigating its use in an ambitious backbone/NFV project; high-performance integration with the QEMU/KVM hypervisor has been upstreamed; and the integration into OpenStack Networking is making good progress, with some hope of significant parts being integrated for the "Juno" release.  And my colleague and long-time backbone engineering teammate Alex Gall is developing a standalone L2VPN (Ethernet over IPv6) appliance based on Snabb Switch, with the ultimate goal of removing our only business requirement for MPLS in the backbone.  Now that should convince even the curmudgeons who fought in the X.25 wars, no?
The final proof that Snabb Switch's world domination is inevitable is that it was featured in the pilot episode of Ivan Pepelnjak's new Software Gone Wild podcast.
(In fact that is the very reason for this post, because yours truly also had a (small) appearance in that episode, and I had to give up the address of my blog... and now I'm afraid that some of the venerable readers of Ivan's blog will follow the link and find that nothing has been posted here lately, even less so related to networking.  Welcome anyway!)

Come on in, the water's fine!

So turn your bullshit detectors way down, embrace the hype, and enjoy the ride! There are plenty of good ideas waiting to be implemented once we free ourselves from the rule of those ASIC-wielding vertically integrated giga-companies...

Sunday, May 13, 2012

Hello World with ØMQ (ZeroMQ), Part I

Why I am interested in ØMQ

Several months ago, I stumbled across an interview with Pieter Hintjens about ØMQ (ZeroMQ) in episode 195 of FLOSS Weekly by Randal L. Schwartz. From what I got from the interview, ØMQ is a pretty powerful "message queue" system that is somehow implemented in a light-weight way as a linkable library. There are also many language bindings, including all fashionable and many exotic languages (but, sadly, not Common Lisp).
I had heard about message queue systems for a long time, but had never really used any, and they always seemed a little scary. The currently popular message queue system seems to be RabbitMQ, and despite the cute name, I hear that it is somewhat big. At the same time, I'm sure that message queues serve a useful purpose, and may be a great basis for distributed systems with fewer reinvented wheels and, thus, better behavior (including performance), so they probably deserve a closer look. And ØMQ seems to be successful in several respects, and at the same time "lightweight" enough for me to understand something. This makes it an attractive system to investigate.

First steps

I have finally found time to start reading the ØMQ Guide during a train ride from Geneva to Zurich.
The introduction ("Fixing the world") at first looks a little pompous, but is in fact full of very good thoughts, both original and convincing, about big problems in programming large distributed software systems. Apparently ØMQ aspires to solve an important part of these problems. Judging from the introduction, the people who wrote this seem very smart. And from the interview, I know that the designers have had a lot of practical experience building real systems, and they knew the deficiencies (but also the achievements) of other messaging systems before they started rolling their own.

Walking through the "Hello World" example

So now I'm reading through the first, "Hello World", example in the guide, trying to understand as much as possible about ØMQ's concepts. The example starts with the server side (which responds "World"). I cut & paste the complete code to a file server.c, which compiles easily enough on my Ubuntu 12.04 system with libzmq-dev installed:
    : leinen@momp2[zmq]; gcc -c server.c
    : leinen@momp2[zmq]; gcc -o server server.o -lzmq
    : leinen@momp2[zmq]; 
When trying to understand an ØMQ API call, I first guess a little what the names and arguments could mean, then I look at the respective man page to validate my guesses and to learn the parts that I was unable to guess.
    void *context = zmq_init (1);
Why is the result type void *, rather than something like ZmqContext *? I'll explain in a moment why I would strongly prefer the latter.
And wouldn't it be nice if the size of the thread pool (the 1 in the call) could be left undefined? The ØMQ system could either optimize it dynamically, or it could be controlled by standardized external configuration. But maybe this isn't really practical anyway.
    //  Socket to talk to clients
    void *responder = zmq_socket (context, ZMQ_REP);
The socket call is simple enough, taking just one argument beyond the necessarily required context - the socket type (here: ZMQ_REP), "which determines the semantics of communication over the socket." So what does ZMQ_REP mean, and what other types are available? Aha, "REP" is for "reply", and the complementary type is ZMQ_REQ, for "request". So these must be for the standard request/response pattern in classical client/server protocols.
Other types include ZMQ_PUB/ZMQ_SUB for pub/sub protocols, ZMQ_PUSH/ZMQ_PULL for processing pipelines, and a few others related to, if I understand correctly, load balancing, request routing etc.
That's a great choice of patterns to support, because they cover a huge subspace of real-world socket applications.
    zmq_bind (responder, "tcp://*:5555");
In passing, I notice that this call doesn't take a "context" argument. This probably means that it gets the context from the socket (here: "responder"). Convenient. But, coming back to my previous complaint, responder is also of type void *. So what happens when someone is only slightly confused and passes context to zmq_bind? The compiler certainly has no way of catching this. In fact, when I deliberately introduce this error, I find that even the library doesn't catch this at runtime! I just get a server program that mysteriously sits there and doesn't listen on any TCP port. That is really not nice. Maybe real programmers don't make these kinds of mistakes, but I have my doubts. As an old Lisper, I'm certainly not religious about static type checking. But type checking at some point would really be beneficial.
OK, back to the zmq_bind call. We have a responding socket, so we need to bind it to a "listening" port. The API uses URL-like strings to specify endpoint addresses (tcp://*:5555). This is fine, although I'm slightly worried whether the API developers try to adhere to any standards here (is there a standard for "tcp" URLs?), or whether they just make things up.
The URL-like string approach is certainly superior to what people using the C socket API have to put up with: Use the right sockaddr structures, use getaddrinfo() (and NOT use gethostbyname() anymore! :-), do address resolution error handling by hand, etc.

ØMQ could support IPv6, but does it?

One can easily imagine that the library just Does The Right Thing (DTRT) concerning multiple network-protocol support, e.g. that the above call results in a socket that accepts connections over both IPv4 and IPv6 if those are supported.

Not just yet, it seems.

Unfortunately this doesn't seem to be the case: when I run the compiled server binary under system-call tracing, I get
    : leinen@momp2[zmq]; strace -e bind ./server
    bind(16, {sa_family=AF_INET, sin_port=htons(5555), sin_addr=inet_addr("")}, 16) = 0
      C-c C-c: leinen@momp2[zmq]; 
So this is obviously an IPv4-only socket. Still, if I look inside libzmq.a (in GNU Emacs in case you are curious, though nm | grep would also work), I notice that it references getaddrinfo, but not gethostbyname. So there is at least some chance that someone thought of IPv6. Maybe one has to set special options, or maybe the people who built the Ubuntu (or Debian) package forgot to activate IPv6, or whatever?
Looking at the man page of zmq_tcp, ("ØMQ unicast transport using TCP"), it only talks about "IPv4 addresses". The way I interpret this is that they only support IPv4 right now, but at least they don't ignore the existence of IPv6, otherwise they would have just said "IP addresses" and still meant IPv4 only. So there is at least a weak hope that IPv6 could once be supported.
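For comparison, here is what a DTRT dual-stack bind looks like at the plain-sockets level. I'm sketching it in Python for brevity; the point is the IPV6_V6ONLY option, not ØMQ's API. One IPv6 listening socket with IPV6_V6ONLY cleared also accepts IPv4 clients, which appear as ::ffff:a.b.c.d mapped addresses:

```python
import socket

def listen_dual_stack(port=0):
    """One listening socket that accepts both IPv4 and IPv6 clients."""
    s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # 0 = also accept IPv4 connections (as IPv4-mapped IPv6 addresses)
    s.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
    s.bind(("::", port))  # port 0: let the kernel pick one
    s.listen(5)
    return s

server = listen_dual_stack()
assert server.getsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY) == 0
server.close()
```

A library could do this behind a single tcp://*:5555 endpoint without the application ever noticing.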
When I briefly had connectivity (because the train stopped at a station with WiFi), I googled for [zmq ipv6] and found a few entries that suggest that there has been some work on this. Maybe the most promising result I got from Google was this:
[zeromq-dev] IPv6 support - Grokbase
15 Aug 2011 – (5 replies) Hi all, Steven McCoy's IPv6 patches were merged into the master. The change should be completely backward compatible.

So this is fairly new. I also noticed that the libzmq in my brand-new Ubuntu 12.04 is only version 2.1.11, and some other Google hits suggest that IPv6 support is planned for ØMQ 3.0. So when I get home I'll check whether I can find ØMQ 3 or better, and use that in preference to the Ubuntu package.

What about multi-transport support?

So we learned that multiple network-layer support (e.g. IPv4 and IPv6) seems possible, and even planned. What about support for multiple different transports? This is certainly not possible using low-level sockets, and it seems difficult, but not impossible, to provide this transparently using a single ZMQ-level socket. For example, a server endpoint could listen for requests on both TCP and SCTP connections.
I have a hunch that ØMQ doesn't support this, because the standard interface has only a single scalar socket-type argument. In some sense this is a pity, because having a single socket that supports multiple protocols would make it easier to write programs that live in a multi-protocol world, just like support for multiple network protocols in a single socket makes it easier to write programs that live in a world with multiple such protocols, like IPv4 and IPv6 these days.
Conceptually, ØMQ looks powerful enough to support this without too much pain. It might also be possible to build "Happy Eyeballs" functionality into the library, so that in such multi-protocol contexts, the library could make sure that a reasonably well-performing substrate is always used, so that developers using the library don't have to worry about possible problems when they add multi-protocol support.
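To make the multi-transport idea concrete at the plain-sockets level: serving several transports at once is "just" multiplexing over several listening sockets, which is exactly the part a library could hide behind one endpoint. A sketch, with two ordinary TCP listeners standing in for "TCP and SCTP" since Python's standard library has no portable SCTP support:

```python
import selectors
import socket

def serve_once(listeners):
    """Accept one connection from whichever listener is ready and
    answer it -- a multi-transport endpoint in miniature."""
    sel = selectors.DefaultSelector()
    for s in listeners:
        s.setblocking(False)
        sel.register(s, selectors.EVENT_READ)
    for key, _ in sel.select(timeout=5):
        conn, _addr = key.fileobj.accept()
        conn.sendall(b"World")
        conn.close()
    sel.close()

def tcp_listener():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))  # ephemeral port
    s.listen(1)
    return s

a, b = tcp_listener(), tcp_listener()
client = socket.create_connection(a.getsockname())
serve_once([a, b])
reply = client.recv(16)
client.close()
a.close()
b.close()
assert reply == b"World"
```

The application only talks to serve_once(); which listener (read: which transport) the request arrived on is an internal detail, which is the property one would want from a multi-transport ØMQ socket.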

Swiss trains go too fast, or Switzerland is too small

So I only got through a whopping two lines of Hello World code for now. But I hope my digressions and confused thoughts didn't turn you off, and you still agree with me that ØMQ is something worth watching. I sure hope I'll find the time to continue exploring it.

Tuesday, February 07, 2012

Chrome Beta for Android, first impressions

So Google published a beta of Chrome for Android.  It's only available for Android 4.0 "Ice Cream Sandwich", which caused many complaints.  I find this somewhat understandable because Chrome uses fancy graphics, e.g. for the interface for switching between multiple tabs.  What I have a harder time understanding is why they restricted it to a handful of countries in Android Market.  Fortunately the comments on a Google+ post contain hints on how this can be circumvented - thanks, +AJ Stang!
First impressions from using this for a few minutes on a Nexus S: The multiple-tabs feature seems very powerful, and the UI for changing between them seems to work really well on a small mobile device.  Being able to use the Chrome development tools (profiler, DOM inspector etc.) over USB is also quite cool.  It does seem a little slower than the standard Web browser in Android though.  As a heroic experiment on myself I'm making this my default browser for now.