Checksums, Scatter-Gather I/O and Segmentation Offload

When dealing with virtualization and networking in the kernel, a number of fairly difficult concepts come up regularly. Last week, while tracking down some bugs with KVM and virtio in this area, I decided to write up some notes on this stuff.

Thanks to Herbert Xu for checking over these.

Also, a good resource I came across while looking into this stuff is Dave Miller’s How SKBs Work. If you don’t know your headroom from your tailroom, that’s not a bad place to start.


TCP (like a number of other protocols) has a checksum in its header which covers the TCP header, the payload and a “pseudo header” consisting of the source and destination addresses, the protocol number and the length of the TCP header and payload.

A TCP checksum is the inverted ones-complement sum of the pseudo header and TCP header and payload, with the checksum field set to zero during the computation.
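As a concrete illustration, here is a minimal RFC 1071-style routine for that sum. This is a standalone sketch, not the kernel's optimised implementation (which lives behind csum_partial() and friends and is often written in assembly):

```c
#include <stddef.h>
#include <stdint.h>

/* Ones'-complement checksum over a byte buffer, RFC 1071 style.
 * Sums 16-bit words, folds the carries back in, then inverts. */
static uint16_t csum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;

    while (len > 1) {
        sum += (uint32_t)buf[0] << 8 | buf[1];
        buf += 2;
        len -= 2;
    }
    if (len)                        /* odd trailing byte */
        sum += (uint32_t)buf[0] << 8;
    while (sum >> 16)               /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;          /* invert to get the checksum */
}
```

Feeding the example data from RFC 1071 (00 01 f2 03 f4 f5 f6 f7) through this gives a sum of 0xddf2 and hence a checksum of 0x220d.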

Some hardware can do this checksum, so the networking stack will pass the packet to the driver without the checksum computed, and the hardware will insert the checksum before transmitting.

Now, with (para-)virtualization, we have a reliable transmission medium between a guest and its host and any other guests, so a PV network driver can claim to do hardware checksumming, but just pass the packet to the host without the checksum. If it ever gets forwarded through a physical device to a physical network, the checksum will be computed at that point.

What we actually do with virtualization is compute a partial checksum covering just the pseudo header, invert that (getting the ones-complement sum back again) and store it in the checksum field. We then instruct the other side that, in order to compute the complete checksum, it merely needs to sum the contents of the TCP header and payload (without zeroing the checksum field) and invert the result.

This is accomplished in a generic way using the csum_start and csum_offset fields – csum_start denotes the point at which to start summing and csum_offset gives the location at which the result should be stored.
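The two halves of that scheme can be sketched as follows. This is a hypothetical model, with csum_start and csum_offset as plain byte offsets into a flat buffer; the kernel's real code is in skb_checksum_help() and the drivers:

```c
#include <stddef.h>
#include <stdint.h>

/* 16-bit ones'-complement sum over a buffer, carries folded in. */
static uint16_t sum16(const uint8_t *p, size_t len)
{
    uint32_t s = 0;

    while (len > 1) {
        s += (uint32_t)p[0] << 8 | p[1];
        p += 2;
        len -= 2;
    }
    if (len)
        s += (uint32_t)p[0] << 8;
    while (s >> 16)
        s = (s & 0xffff) + (s >> 16);
    return (uint16_t)s;
}

/* Sender side: store the (non-inverted) pseudo-header sum in the
 * checksum field of the not-yet-checksummed packet. */
static void store_partial(uint8_t *pkt, size_t csum_start,
                          size_t csum_offset, uint16_t pseudo_sum)
{
    pkt[csum_start + csum_offset]     = pseudo_sum >> 8;
    pkt[csum_start + csum_offset + 1] = pseudo_sum & 0xff;
}

/* Completing side: sum from csum_start to the end of the packet,
 * checksum field included, invert, and write the result back at
 * csum_offset. */
static void complete_csum(uint8_t *pkt, size_t len,
                          size_t csum_start, size_t csum_offset)
{
    uint16_t c = ~sum16(pkt + csum_start, len - csum_start);

    pkt[csum_start + csum_offset]     = c >> 8;
    pkt[csum_start + csum_offset + 1] = c & 0xff;
}
```

Because the pseudo-header sum is already sitting in the checksum field, summing over it is equivalent to having summed the pseudo header itself, so the completed checksum verifies exactly as a normally-computed one would.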

Scatter-Gather I/O

If you’ve ever used readv()/writev(), you know the basic idea here. Rather than passing around one large buffer with a bunch of data, you pass a number of smaller buffers which make up a larger logical buffer. For example, you might have an array of buffer descriptors like:

    struct iovec {
        void  *iov_base;    /* Starting address */
        size_t iov_len;     /* Number of bytes to transfer */
    };
In the case of network drivers, (non-linear) data can be scattered across page size fragments:

    struct skb_frag_struct {
        struct page *page;
        __u32 page_offset;
        __u32 size;
    };
sk_buff (well, skb_shared_info) is designed to be able to hold a 64k frame in page size[1] fragments (skb_shinfo::nr_frags and skb_shinfo::frags). The NETIF_F_SG feature flag lets the core networking stack know that the driver supports scatter-gather across this paged data.

Note, the skb_shared_info frag_list member is not used for maintaining paged data, but rather it is used for fragmentation purposes. The NETIF_F_FRAGLIST feature flag relates to this.

Another aspect of SG is the NETIF_F_HIGHDMA flag, which specifies whether the driver can handle fragment buffers that were allocated out of high memory.

You can see all these flags in action in dev_queue_xmit(): if any of these conditions are not met, skb_linearize() is called, which coalesces any fragments into the skb's linear buffer.

[1] – These are also known as non-linear skbs, or paged skbs. This is what “pskb” stands for in some APIs.

Segmentation Offload

TCP Segmentation Offload (TSO). UDP Fragmentation Offload (UFO). Generic Segmentation Offload (GSO). Yeah, that stuff.

TSO is the ability of some network devices to take a frame and break it down to smaller (i.e. MTU) sized frames before transmitting. This is done by breaking the TCP payload into segments and using the same IP header with each of the segments.

GSO is a generalisation of this in the kernel. The idea is that you delay segmenting a packet until the latest possible moment. In the case where a device doesn’t support TSO, this would be just before passing the skb to the driver. If the device does support TSO, the unsegmented skb would be passed to the driver.

See dev_hard_start_xmit() for where dev_gso_segment() is used to segment a frame before passing it to the driver, in the case where the device cannot handle the unsegmented frame itself.

With paravirtualization, the guest driver has the ability to transfer much larger frames to the host, so the need for segmentation can be avoided completely. The reason this is so important is that GSO enables a much larger *effective* MTU between guests themselves and to their host. The ability to transmit such large frames significantly increases throughput.

An skb’s skb_shinfo contains information on how the frame should be segmented.

gso_size is the size of the segments which the payload should be broken down into. With TCP, this would usually be the Maximum Segment Size (MSS), which is the MTU minus the size of the TCP and IP headers.

gso_segs is the number of segments that should result from segmentation. gso_type indicates e.g. whether the payload is UDP or TCP.
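The relationship between the two is simple round-up division, something like the following sketch (a hypothetical helper; the kernel computes this when the skb is built):

```c
#include <stddef.h>

/* Number of segments a payload of payload_len bytes yields when
 * carved into gso_size-byte segments: DIV_ROUND_UP in kernel terms. */
static unsigned int calc_gso_segs(size_t payload_len, size_t gso_size)
{
    return (unsigned int)((payload_len + gso_size - 1) / gso_size);
}
```

So a 65496-byte TCP payload with the common Ethernet MSS of 1460 bytes segments into 45 frames, 44 full-sized and one short final one.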

drivers/net/loopback.c:emulate_large_send_offload() provides a nice simple example of the actions a TSO device is expected to perform – i.e. breaking the TCP payload into segments and transmitting each of them as individual frames after updating the IP and TCP headers with the length of the packet, sequence number, flags etc.
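The sequence-number bookkeeping at the heart of that loop can be sketched like this. This is a hypothetical helper covering only the carving and sequence numbers; a real implementation must also rewrite the IP length, IP ID, flags and checksums for every segment:

```c
#include <stddef.h>
#include <stdint.h>

/* One output segment: its TCP sequence number and payload length. */
struct seg {
    uint32_t seq;
    size_t len;
};

/* Carve a payload of payload_len bytes, starting at sequence number
 * start_seq, into mss-sized segments. Fills at most max entries in
 * out[] and returns the number of segments produced. */
static size_t segment_payload(uint32_t start_seq, size_t payload_len,
                              size_t mss, struct seg *out, size_t max)
{
    size_t n = 0, off = 0;

    while (off < payload_len && n < max) {
        out[n].seq = start_seq + (uint32_t)off;
        out[n].len = (payload_len - off < mss) ? payload_len - off : mss;
        off += out[n].len;
        n++;
    }
    return n;
}
```

Each segment's sequence number is just the original sequence number advanced by the bytes already carved off, which is why the receiver can reassemble the stream as if the sender had transmitted the small frames itself.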

Fedora 9 Xen pv_ops

For the past couple of weeks, I’ve been helping out with the Fedora 9 pv_ops effort, specifically helping get the pv_ops based dom0 kernel going.

Well, following on from sct getting dom0 booting, I made a nice breakthrough this morning – a pv_ops dom0 booting a pv_ops domU:

$> dmesg | grep paravirt
Booting paravirtualized kernel on Xen
$> virsh create ./test-domu.xml
Domain Test created from ./test-domu.xml
$> virsh console Test | grep paravirt
Booting paravirtualized kernel on Xen

What’s this pv_ops business all about? Well, as Dan explained, for a long time we’ve been forward-porting XenSource’s (now 2.6.18 based) kernel tree in an effort to have our Xen kernel not lag behind Fedora’s bare-metal kernel. Now that the upstream kernel has gained the ability to run on Xen using pv_ops (but only as an i386 domU, currently) we’ve taken the decision to stop spending our time forward-porting XenSource’s tree and put all our focus into improving the feature set of pv_ops based Xen.

pv_ops itself is a set of hooks in the kernel so that support for running on different hypervisors can be cleanly added to the kernel, with the added bonus that the kernel can detect at runtime which hypervisor it is running on and adapt itself accordingly. This means that, in the long run, Xen support should be more akin to a device driver than a huge fork of the kernel.

(Note: for anyone else who ever needs to debug Xen’s booting of a guest, here’s a tiny Xen domain builder)