Mark McLoughlin

Reporting Fedora Virtualization Bugs

When reporting bugs against Fedora, the BugsAndFeatureRequests wiki page is a great place to get some information on the kind of things you can do to provide useful information in the bug report.

I’ve just added a page on reporting virtualization bugs. Hopefully it’ll help people narrow down bugs, find log files, get debug spew etc.

And, of course, it’s a wiki page – so go ahead and add your own tips!

Checksums, Scatter-Gather I/O and Segmentation Offload

When dealing with virtualization and networking in the kernel, a number of fairly difficult concepts come up regularly. Last week, while tracking down some bugs with KVM and virtio in this area, I decided to write up some notes on this stuff.

Thanks to Herbert Xu for checking over these.

Also, a good resource I came across while looking into this stuff is Dave Miller’s How SKBs Work. If you don’t know your headroom from your tailroom, that’s not a bad place to start.

Checksumming

TCP (like some other protocols) has a checksum in its header, computed over the TCP header, the payload and a “pseudo header” consisting of the source and destination addresses, the protocol number and the length of the TCP header and payload.

A TCP checksum is the inverted ones-complement sum of the pseudo header and TCP header and payload, with the checksum field set to zero during the computation.
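To make that definition concrete, here is a rough user-space sketch in C. It is not the kernel's implementation: the names ones_complement_sum() and tcp_checksum() are made up for this example, and it assumes the bytes are summed as 16-bit big-endian words, as described in RFC 1071.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones-complement sum of a byte buffer, read as 16-bit big-endian words.
 * 'sum' lets the caller chain several buffers (pseudo header, then segment).
 * Every chained buffer except the last must have an even length. */
static uint16_t ones_complement_sum(const uint8_t *buf, size_t len, uint16_t sum)
{
    uint64_t acc = sum;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        acc += (uint32_t)((buf[i] << 8) | buf[i + 1]);
    if (i < len)                 /* an odd trailing byte is padded with zero */
        acc += (uint32_t)(buf[i] << 8);
    while (acc >> 16)            /* fold the carries back into the low 16 bits */
        acc = (acc & 0xffff) + (acc >> 16);
    return (uint16_t)acc;
}

/* Full checksum: inverted sum of the pseudo header plus TCP header and payload.
 * The checksum field inside 'segment' must be zero when this is called. */
static uint16_t tcp_checksum(const uint8_t *pseudo, size_t pseudo_len,
                             const uint8_t *segment, size_t segment_len)
{
    uint16_t sum;

    assert(pseudo_len % 2 == 0);
    sum = ones_complement_sum(pseudo, pseudo_len, 0);
    sum = ones_complement_sum(segment, segment_len, sum);
    return (uint16_t)~sum;
}
```

The eight example bytes from RFC 1071 (00 01 f2 03 f4 f5 f6 f7) sum to 0xddf2, which inverts to a checksum of 0x220d.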

Some hardware can do this checksum, so the networking stack will pass the packet to the driver without the checksum computed, and the hardware will insert the checksum before transmitting.

Now, with (para-)virtualization, we have a reliable transmission medium between a guest and its host and any other guests, so a PV network driver can claim to do hardware checksumming, but just pass the packet to the host without the checksum. If it ever gets forwarded through a physical device to a physical network, the checksum will be computed at that point.

What we actually do with virtualization is compute a partial checksum of everything except the TCP header and payload (i.e. of the pseudo header), invert that (getting the ones-complement sum again) and store that in the checksum field. We then instruct the other side that in order to compute the complete checksum, it merely needs to sum the contents of the TCP header and payload (without zeroing the checksum field) and invert the result.

This is accomplished in a generic way using the csum_start and csum_offset fields – csum_start denotes the point at which to start summing and csum_offset gives the offset from csum_start at which the result should be stored.
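The partial checksum trick can be sketched in user space like this. It is only an illustration under stated assumptions: csum16(), csum_partial_prepare() and csum_partial_finish() are hypothetical names, the packet is a plain byte array rather than an skb, and csum_offset is assumed to be even so the checksum field lines up with a 16-bit word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones-complement sum over 16-bit big-endian words, with carries folded. */
static uint16_t csum16(const uint8_t *buf, size_t len, uint16_t sum)
{
    uint64_t acc = sum;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        acc += (uint32_t)((buf[i] << 8) | buf[i + 1]);
    if (i < len)
        acc += (uint32_t)(buf[i] << 8);
    while (acc >> 16)
        acc = (acc & 0xffff) + (acc >> 16);
    return (uint16_t)acc;
}

/* Sender: leave the (non-inverted) pseudo header sum in the checksum field,
 * which lives at pkt[csum_start + csum_offset]. */
static void csum_partial_prepare(uint8_t *pkt, size_t csum_start, size_t csum_offset,
                                 const uint8_t *pseudo, size_t pseudo_len)
{
    uint16_t sum = csum16(pseudo, pseudo_len, 0);

    assert(csum_offset % 2 == 0 && pseudo_len % 2 == 0);
    pkt[csum_start + csum_offset] = (uint8_t)(sum >> 8);
    pkt[csum_start + csum_offset + 1] = (uint8_t)(sum & 0xff);
}

/* Receiver: sum from csum_start to the end of the packet WITHOUT zeroing the
 * checksum field, invert, and store the result in the checksum field. */
static void csum_partial_finish(uint8_t *pkt, size_t len,
                                size_t csum_start, size_t csum_offset)
{
    uint16_t csum;

    assert(csum_offset % 2 == 0 && csum_start + csum_offset + 2 <= len);
    csum = (uint16_t)~csum16(pkt + csum_start, len - csum_start, 0);
    pkt[csum_start + csum_offset] = (uint8_t)(csum >> 8);
    pkt[csum_start + csum_offset + 1] = (uint8_t)(csum & 0xff);
}
```

Because the pseudo header sum is already sitting in the checksum field, summing the header and payload over it gives the same total as summing the pseudo header, header and payload with the field zeroed. The finishing side never needs to know what the pseudo header looked like.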

Scatter-Gather I/O

If you’ve ever used readv()/writev(), you know the basic idea here. Rather than passing around one large buffer with a bunch of data, you pass a number of smaller buffers which make up a larger logical buffer. For example, you might have an array of buffer descriptors like:

    struct iovec {
        void  *iov_base;    /* Starting address */
        size_t iov_len;     /* Number of bytes to transfer */
    };
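As a quick illustration of the idea, here is a small sketch that hands several separate buffers to a single writev() call. The helper name gather_write() and the MAX_PARTS limit are made up for this example; writev() and struct iovec are the standard POSIX interfaces.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define MAX_PARTS 8

/* Write 'n' separate strings to 'fd' with one writev() call.
 * Returns the number of bytes written, or -1 on error. */
static ssize_t gather_write(int fd, const char *parts[], int n)
{
    struct iovec iov[MAX_PARTS];
    int i;

    assert(n >= 0 && n <= MAX_PARTS);
    for (i = 0; i < n; i++) {
        iov[i].iov_base = (void *)parts[i];   /* writev() only reads from it */
        iov[i].iov_len = strlen(parts[i]);
    }
    return writev(fd, iov, n);
}
```

Called with the three strings "head:", "body:" and "tail", the reader on the other end of the descriptor sees one contiguous "head:body:tail", even though the data was never copied into a single buffer by the caller.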

In the case of network drivers, (non-linear) data can be scattered across page size fragments:

    struct skb_frag_struct {
        struct page *page;
        __u32 page_offset;
        __u32 size;
    };

sk_buff (well, skb_shared_info) is designed to be able to hold a 64k frame in page size[1] fragments (skb_shinfo::nr_frags and skb_shinfo::frags). The NETIF_F_SG feature flag lets the core networking stack know that the driver supports scatter-gather across this paged data.

Note, the skb_shared_info frag_list member is not used for maintaining paged data, but rather it is used for fragmentation purposes. The NETIF_F_FRAGLIST feature flag relates to this.

Another aspect of SG is a flag, NETIF_F_HIGHDMA, which specifies whether the driver can handle fragment buffers that were allocated out of high memory.

You can see all these flags in action in dev_queue_xmit() where if any of these conditions are not met, skb_linearize() is called which coalesces any fragments into the skb buffer.
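To give a feel for what “coalescing fragments” means, here is a toy user-space model. The toy_frag and toy_skb structures and toy_linearize() are hypothetical stand-ins invented for this sketch. The real skb_linearize() works on struct sk_buff and has to deal with reallocating the buffer, which this does not attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FRAGS 4

/* Hypothetical stand-in for skb_frag_struct. */
struct toy_frag {
    const unsigned char *page;   /* start of the "page" holding the data */
    size_t page_offset;          /* where the fragment starts in the page */
    size_t size;                 /* fragment length in bytes */
};

/* Hypothetical stand-in for a paged (non-linear) skb. */
struct toy_skb {
    unsigned char linear[256];   /* linear buffer */
    size_t linear_len;           /* bytes used in the linear buffer */
    size_t nr_frags;
    struct toy_frag frags[MAX_FRAGS];
};

/* Copy every fragment onto the end of the linear buffer, in order, and drop
 * the fragments. Returns 0 on success, -1 if the linear buffer is too small. */
static int toy_linearize(struct toy_skb *skb)
{
    size_t i, total = skb->linear_len;

    assert(skb->nr_frags <= MAX_FRAGS);
    for (i = 0; i < skb->nr_frags; i++)
        total += skb->frags[i].size;
    if (total > sizeof(skb->linear))
        return -1;

    for (i = 0; i < skb->nr_frags; i++) {
        const struct toy_frag *f = &skb->frags[i];

        memcpy(skb->linear + skb->linear_len, f->page + f->page_offset, f->size);
        skb->linear_len += f->size;
    }
    skb->nr_frags = 0;
    return 0;
}
```

The copy is the cost: a driver that advertises NETIF_F_SG gets to skip it and hand the fragment list straight to the hardware.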

[1] – These are also known as non-linear skbs, or paged skbs. This is what “pskb” stands for in some APIs.

Segmentation Offload

TCP Segmentation Offload (TSO). UDP Fragmentation Offload (UFO). Generic Segmentation Offload (GSO). Yeah, that stuff.

TSO is the ability of some network devices to take a frame and break it down to smaller (i.e. MTU) sized frames before transmitting. This is done by breaking the TCP payload into segments and using the same IP header with each of the segments.

GSO is a generalisation of this in the kernel. The idea is that you delay segmenting a packet until the latest possible moment. In the case where a device doesn’t support TSO, this would be just before passing the skb to the driver. If the device does support TSO, the unsegmented skb would be passed to the driver.

See dev_hard_start_xmit() for where dev_gso_segment() is used to segment a frame before passing to the driver in the case where the device does not support GSO.

With paravirtualization, the guest driver has the ability to transfer much larger frames to the host, so the need for segmentation can be avoided completely. The reason this is so important is that GSO enables a much larger *effective* MTU between guests themselves and to their host. The ability to transmit such large frames significantly increases throughput.

An skb’s skb_shinfo contains information on how the frame should be segmented.

gso_size is the size of the segments which the payload should be broken down into. With TCP, this would usually be the Maximum Segment Size (MSS), which is the MTU minus the size of the TCP and IP headers.

gso_segs is the number of segments that should result from segmentation. gso_type indicates e.g. whether the payload is UDP or TCP.
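The arithmetic behind gso_size and gso_segs is simple enough to sketch. The helper names below are made up, and the 20-byte IPv4 and 20-byte TCP header sizes (no options) are an assumption of this example rather than something the kernel hard-codes.

```c
#include <assert.h>

/* MSS for a given MTU, assuming 20-byte IPv4 and 20-byte TCP headers
 * with no options (an assumption of this sketch). */
static unsigned int mss_for_mtu(unsigned int mtu)
{
    const unsigned int ip_hdr = 20, tcp_hdr = 20;

    assert(mtu > ip_hdr + tcp_hdr);
    return mtu - ip_hdr - tcp_hdr;
}

/* Number of segments needed to carry 'payload_len' bytes in chunks of
 * at most 'gso_size' bytes, rounding up for a short final segment. */
static unsigned int gso_segs_for(unsigned int payload_len, unsigned int gso_size)
{
    assert(gso_size > 0);
    return (payload_len + gso_size - 1) / gso_size;
}
```

With a 1500-byte MTU this gives 1460 bytes per segment, so a 64000-byte payload would be split into 44 segments, the last one carrying the remaining 1220 bytes.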

drivers/net/loopback.c:emulate_large_send_offload() provides a nice simple example of the actions a TSO device is expected to perform – i.e. breaking the TCP payload into segments and transmitting each of them as individual frames after updating the IP and TCP headers with the length of the packet, sequence number, flags etc.
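The per-segment bookkeeping described above might look roughly like this in a user-space model. Everything here is hypothetical (struct toy_segment, toy_tso_segment()), and it only tracks the header fields rather than building real frames. In particular, keeping FIN and PSH only on the last segment is this sketch's reading of what “flags” covers, and things like the IP ID and IP header checksum are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-segment header fields a TSO device would fill in. */
struct toy_segment {
    uint32_t seq;          /* TCP sequence number of the first payload byte */
    uint16_t ip_tot_len;   /* IP total length: headers plus this payload */
    size_t payload_off;    /* offset of this segment in the original payload */
    size_t payload_len;
    int fin, psh;          /* only kept on the last segment */
};

/* Split 'payload_len' bytes into segments of at most 'gso_size' bytes.
 * 'hdr_len' is the combined IP and TCP header length copied to each segment.
 * Returns the number of segments written to 'out' (at most 'max_out'). */
static size_t toy_tso_segment(uint32_t seq, size_t payload_len, size_t gso_size,
                              size_t hdr_len, int fin, int psh,
                              struct toy_segment *out, size_t max_out)
{
    size_t off = 0, n = 0;

    assert(gso_size > 0);
    while (off < payload_len && n < max_out) {
        size_t chunk = payload_len - off;
        int last;

        if (chunk > gso_size)
            chunk = gso_size;
        last = (off + chunk == payload_len);

        out[n].seq = seq + (uint32_t)off;        /* advance by bytes already sent */
        out[n].ip_tot_len = (uint16_t)(hdr_len + chunk);
        out[n].payload_off = off;
        out[n].payload_len = chunk;
        out[n].fin = last ? fin : 0;
        out[n].psh = last ? psh : 0;
        off += chunk;
        n++;
    }
    return n;
}
```

For a 3000-byte payload starting at sequence number 1000 with a gso_size of 1460 and 40 bytes of headers, this produces three segments starting at sequence numbers 1000, 2460 and 3920, the last one carrying only 80 bytes.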

Git Workflow

Havoc’s recent post on git was interesting because it shows how frustrating git can be if you try and treat it as “just another CVS”. From that perspective, git seems like some bizarre way for kernel hackers to torture those who just want to get work done.

I turned that corner with git when I learned about “git-rebase -i” and came to the startling realisation that git’s history is editable. Basically, this allows you to change your workflow such that you can hack away at will, commit often and then rewrite the history of your hacking session so that you have a coherent set of patches/commits at the end of it with a useful changelog.

e.g. you can go from:

A1---B1---A2---A3---C1---B2---C2---C3

to:

A1---A2---A3---B1---B2---C1---C2---C3

or even:

A'---B'---C'

Using git rebasing, I found that I could use a similar workflow to using quilt with CVS, or mercurial with its patch queue (mq) extension. The revision history becomes less about tracking the progress of your work, and more a malleable mechanism for preparing patches before submitting upstream.

Red Hat Magazine has a nice article explaining all this, and I even picked up some new tricks to try out:

git-merge --squash : merge a branch/tag into the current branch, but squash all the commits together as an uncommitted change to the working tree. When you go to commit the result, the changelog of all the merged commits is available in the commit message editor so you can munge them together into a useful changelog.
git-cherry-pick --no-commit : apply the changes from a given commit to your working tree, but do not commit it. Could be used to achieve something similar to a squashed merge, but where you selectively merge only some of the commits.
git-add --patch/--interactive : add some changes from the working tree to the index, but e.g. selectively add only some of the patch hunks from a given file. Allows you to make a bunch of changes to a file, but commit the changes as individual commits.

Fedora 9 Xen pv_ops

For the past couple of weeks, I’ve been helping out with the Fedora 9 pv_ops effort, specifically helping get the pv_ops based dom0 kernel going.

Well, following on from sct getting dom0 booting, I made a nice breakthrough this morning – a pv_ops dom0 booting a pv_ops domU:

  $> dmesg | grep paravirt
  Booting paravirtualized kernel on Xen
  $> virsh create ./test-domu.xml
  Domain Test created from ./test-domu.xml
  $> virsh console Test | grep paravirt
  Booting paravirtualized kernel on Xen

What’s this pv_ops business all about? Well, as Dan explained, for a long time we’ve been forward-porting Xensource’s (now 2.6.18 based) kernel tree in an effort to try and have our Xen kernel not lag behind Fedora’s bare-metal kernel. Now that the upstream kernel has gained the ability to run on Xen using pv_ops (but only as i386 DomU, currently) we’ve taken the decision to stop wasting our time forward porting Xensource’s tree and put all our focus into improving the feature set of pv_ops based Xen.

pv_ops itself is a set of hooks in the kernel so that support for running on different hypervisors can be cleanly added to the kernel, with the added bonus that the kernel can detect at runtime which hypervisor it is running on and adapt itself accordingly. This means that, in the long run, Xen support should be more akin to a device driver than a huge fork of the kernel.

(Note: for any others who ever need to debug Xen’s booting of a guest, here’s a tiny Xen domain builder)

Dublin Marathon

Thanks to Olav, I can post here again after nearly 10 months (!). Not that I had anything to say anyway 😛

But for the past couple of months I’ve been writing about stuff like hiking, running and sailing on another blog and today’s tidbit is that I finished my first marathon yesterday.

Happy Finisher

Woo!

Virtual networking

Dan and I have been discussing how to “fix virtual networking”, not just Xen’s networking but also getting something sane wrt. QEMU/KVM etc.

Anyone interested should read this writeup. To discuss, libvirt-list is probably the best place.

QEMU Networking

QEMU has a number of really nice ways to set up networking for its guests. It can be a little bewildering to figure out how each of the options works, so I thought I’d write up what I found. Excuse the ‘orrid ascii art 🙂

GNOME SVN and jhbuild

If you’re wondering how to move your GNOME jhbuild from CVS now that the SVN migration has happened … here’s what I had to do.

Checkout jhbuild from SVN:

  $> mkdir -p /gnome/head/svn && cd /gnome/head/svn
  $> svn co svn+ssh://markmc@svn.gnome.org/svn/jhbuild/trunk jhbuild
  $> cd jhbuild && make install

Update ~/.jhbuildrc so that it contains e.g.:

  repos['svn.gnome.org'] = 'svn+ssh://markmc@svn.gnome.org/svn/'
  checkoutdir = '/gnome/head/svn'

Copy /gnome/head/cvs/pkgs to /gnome/head/svn/pkgs so that you won’t have to download as many new tarballs
Run jhbuild build

Note, this is with the gnome-2.18 moduleset. Things are still a little in flux right now.

Xen and X Pointer Issues

Just back from a nice relaxing holiday and, at first, I was totally perplexed by all this talk of the Xen “absolute pointer” problem. “It’s just VNC”, I thought, “it can’t be that hard. It must be just a simple bug somewhere”.

The background is:

In Xen guests we have a “xenfb” driver, which acts just like a normal framebuffer device as far as the Xserver is concerned, but the contents of the framebuffer are exported to Dom0 via XenBus and shared memory.
Similarly, we have a “xenkbd” driver, which takes input events from Dom0 and makes them available to the Xserver.
In Dom0, we have a little daemon which acts as a VNC server. It exports the framebuffer contents from the guest and injects input events into the guest.

The problem here is that pointer motion events arrive at the Xserver as if they came directly from hardware. And just like normal mouse events, they are relative – i.e. you move your mouse up X amount and across Y amount.

This is unusual, because a VNC server receives motion events with absolute co-ordinates and can normally warp the pointer to those exact co-ordinates.

What we have might not be too bad – we might be able to reliably control the absolute pointer position in X by injecting events with relative co-ordinates – except that these events are subject to acceleration. If we try and move the pointer by injecting an event that says “move 100 pixels to the right”, the Xserver may accelerate that and move it, say, 200 pixels (with a ratio of 2/1). So, Pete’s first going to come up with a quick hack to disable acceleration.

It’s still stupid to try and move the pointer to an absolute position by injecting relative pointer motion events, though. The ideal solution is that the pointer device in the Xen guest behaves just like a graphics tablet. We would pass the absolute pointer co-ordinates to the guest and the driver would pass those on to the Xserver as though it were a tablet device.
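The difference between the two approaches can be shown with a toy model. This is not how the Xserver implements acceleration (the real thing is more involved than a fixed ratio); struct toy_pointer and both helper functions are invented purely for illustration.

```c
#include <assert.h>

/* Toy model of the X pointer. accel_num/accel_den is the acceleration
 * ratio the server applies to relative motion. */
struct toy_pointer {
    int x, y;
    int accel_num, accel_den;
};

/* Relative event (mouse-like): the server scales the delta before applying it. */
static void toy_move_relative(struct toy_pointer *p, int dx, int dy)
{
    assert(p->accel_den != 0);
    p->x += dx * p->accel_num / p->accel_den;
    p->y += dy * p->accel_num / p->accel_den;
}

/* Absolute event (tablet-like): the position is taken exactly as given. */
static void toy_move_absolute(struct toy_pointer *p, int x, int y)
{
    p->x = x;
    p->y = y;
}
```

With a 2/1 ratio, asking for a relative move of 100 pixels lands the pointer at 200, so the VNC client's idea of where the pointer is drifts away from the guest's. An absolute event has no such problem, whatever the acceleration settings are.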

The Wind That Shakes the Barley

We went to see The Wind That Shakes the Barley last night. I went along expecting some Michael Collins or Braveheart romanticised brit-bashing light entertainment, but no.

This one wasn’t easy to watch. It’s set during the Irish War of Independence and Irish Civil War which is only now about to drop out of living memory in Ireland. The emphasis isn’t so much on the fighting, but on the heartbreaking impact it had on families.

I like this comment on the IMDb page:

I saw this film at a private screening and found it difficult yet beautiful to watch.

…

This film is a template for what film makers can achieve with a small budget, dedicated performers and a timeless topic.

…

The sacrifices made 80 years ago still resonate today but the Republic of Ireland is now the third richest country in Europe. The question still debated is Was it Worth it? The question we ask is how’s Scotland and Wales doing?