IPC Performance: The return of the report

20/05/2010

Last week, I published a report comparing various IPC methods when transferring a lot of data between two programs. In this report, I compared how DBus performed against a custom IPC system using UNIX sockets. The test case was transferring the result set of a large query ran against an SQLite database (300 000 rows, around 220MB to pass from server to client). As expected, the UNIX socket system was much faster and cheaper, around 20% slower than directly accessing the database.

We got quite a lot of useful feedback about the report. In particular, Alexander Larsson suggested that we look at a feature present in DBus 1.3 allowing to pass UNIX file descriptors over the bus. The idea was to use DBus for “high level” stuff (client/server discovery, preparing query etc.) and a pipe for passing results from the server to the client. And he was also kind enough to help me when I was trying to use vmsplice in the server, he therefore deserves a beer at next GUADEC 🙂

So I compiled a recent DBus, and started experimenting with FD passing. It turns out it is remarkably easy to do, FDs are passed in messages just like any other DBus parameter. So what happens is that the server creates a pipe when asked to prepare a request. The client then calls Fetch, and the server sends the results over the pipe.

In our benchmarks, this solution scored even better than the UNIX socket based IPC. I also tried to experiment with vmsplice to save one memory copy, but didn’t succeed because of some kernel bits missing.

The updated report, along with a spreadsheet containing the result of my benchmarks are available in the same git repository that also holds all the test programs and measure scripts.

Also, thanks to the Planet GNOME friends who added me!

Exposing a SQLite database remotely: comparison of various IPC methods

Adrien Bustany
Computer Sciences student
National Superior School of Informatics and Applied Mathematics of Grenoble (ENSIMAG)

This report was originally published on May 31, 2010 and was updated on May 19, 2010

This study aims at comparing the overhead of an IPC layer when accessing a SQLite database. The two IPC methods included in this comparison are DBus, a generic message passing system, and a custom IPC method using UNIX sockets. As a reference, we also include in the results the performance of a client directly accessing the SQLite database, without involving any IPC layer. Newer versions¹ of DBus can pass a file descriptor in messages, allowing the creation of a direct pipe between the client and server. This hybrid approach (between bare socket and DBus) is also evaluated in this report.

Comparison methodology

In this section, we detail what the client and server are supposed to do during the test, regardless of the IPC method used.

The server has to:

Open the SQLite database and listen to the client requests
Prepare a query at the client’s request
Send the resulting rows at the client’s request

Queries are only “SELECT” queries, no modification is performed on the database. This restriction is not enforced on server side though.

The client has to:

Connect to the server
Prepare a “SELECT” query
Fetch all the results
Copy the results in memory (not just fetch and forget them), so that memory pages are really used

Test dataset

For testing, we use a SQLite database containing only one table. This table has 31 columns, the first one is the identifier and the 30 others are columns of type TEXT. The table is filled with 300 000 rows, with randomly generated strings of 20 ASCII lowercase characters.

Implementation details

In this section, we explain how the server and client for both IPC methods were implemented.

Custom IPC (UNIX socket based)

In this case, we use a standard UNIX socket to communicate between the client and the server. The socket protocol is a binary protocol, and is detailed below. It has been designed to minimize CPU usage (there is no marshalling/demarshalling on strings, nor intensive computation to decode the message). It is fast over a local socket, but not suitable for other types of sockets, like TCP sockets.

Message types

There are two types of operations, corresponding to the two operations of the test: prepare a query, and fetch results.

Message format

All numbers are encoded in little endian form.

Prepare

Client sends:

Size	Contents
4 bytes (int)	Prepare opcode (0x50)
4 bytes (int)	Size of the query (without trailing )
…	Query, in ASCII

Server answers:

Size	Contents
4 bytes (int)	Return code of the sqlite3_prepare_v2 call

Fetch

Client sends:

Size	Contents
4 bytes (int)	Fetch opcode (0x46)

Server sends rows grouped in fixed size buffers. Each buffer contains a variable number of rows. Each row is complete. If some padding is needed (when a row doesn’t fit in a buffer, but there is still space left in the buffer), the server adds an “End of Page” marker. The “End of page” marker is the byte 0xFF. Rows that are larger than the buffer size are not supported.

Each row in a buffer has the following format:

Size	Contents
4 bytes (int)	SQLite return code. This is generally SQLITE_ROW (there is a row to read), or SQLITE_DONE (there are no more rows to read). When the return code is not SQLITE_ROW, the rest of the message must be ignored.
4 bytes (int)	Number of columns in the row
4 bytes (int)	Index of trailing for first column (index is 0 after the “number of columns” integer, that is, index is equal to 0 8 bytes after the message begins)
4 bytes (int)	Index of trailing for second column
…
4 bytes (int)	Index of trailing for last column
…	Row data. All columns are concatenated together, and separated by

For the sake of clarity, we describe here an example row
100 4 1 7 13 19 1aaaaabbbbbccccc

The first 100 is the return code, in this case SQLITE_ROW. This row has 4 columns. The 4 following numbers are the offset of the terminating each column in the row data. Finally comes the row data.

Memory usage

We try to minimize the calls to malloc and memcpy in the client and server. As we know the size of a buffer, we allocate the memory only once, and then use memcpy to write the results to it.

DBus (without file descriptor passing)

The DBus server exposes two methods, Prepare and Fetch.

Prepare

The Prepare method accepts a query string as a parameter, and returns nothing. If the query preparation fails, an error message is returned.

Fetch

Ideally, we should be able to send all the rows in one batch. DBus, however, puts a limitation on the message size. In our case, the complete data to pass over the IPC is around 220MB, which is more than the maximum size allowed by DBus (moreover, DBus marshalls data, which augments the message size a little). We are therefore obliged to split the result set.

The Fetch method accepts an integer parameter, which is the number of rows to fetch, and returns an array of rows, where each row is itself an array of columns. Note that the server can return less rows than asked. When there are no more rows to return, an empty array is returned.

DBus (with file descriptor passing)

In this case, we use DBus for high level communication between the server and client, but use a dedicated pipe for transferring the data itself. This allows very fast transmission rates, while keeping the comfort of using DBus for remote method calls.

We keep the same methods, Prepare and Fetch, although their prototype change a bit.

Prepare

The prepare method accepts a query, and returns a file descriptor from where to read the results. In case the preparation fails, an error message is returned. Internally, the server is creating a pipe, and returning its descriptor to the client.

Fetch

When the Fetch method is called, the server will write the results on the previously returned file descriptor. The protocol used is the same as with UNIX sockets.

Results

All tests are ran against the dataset described above, on a warm disk cache (the database is accessed several time before every run, to be sure the entire database is in disk cache). We use SQLite 3.6.22, on a 64 bit Linux system (kernel 2.6.33.3). All test are ran 5 times, and we use the average of the 5 intermediate results as the final number.

For both the custom IPC and the method using DBus with file descriptor passing, we test with various buffer sizes varying from 1 to 256 kilobytes. For simple DBus, we fetch 75000 rows with every Fetch call, which is close to the maximum we can fetch with each call (see the paragraph on DBus message size limitation).

The first tests were to determine the optimal buffer size when transmitting data over a socket or a pipe. The following graph describes the time needed to fetch all rows, depending on the buffer size:

The graph shows that the buffer size leading to the best throughput is 64kB. Those results depend on the type of system used, and might have to be tuned for different platforms. On Linux, a memory page is (generally) 4096 bytes, as a consequence buffers smaller than 4kB will use a full memory page when sent over the socket and waste memory bandwidth.

After determining the best buffer size, we run tests for speed and memory usage, using a buffer size of 64kb for both the UNIX socket based and the DBus using file descriptors method.

Speed

We measure the time it takes for various methods to fetch a result set. Without any surprise, the time needed to fetch the results grows linearly with the amount of rows to fetch.

IPC method	Best time
None (direct access)	2910 ms
UNIX socket	3470 ms
DBus	12300 ms
DBus with FD passing	3329 ms

Memory usage

Memory usage varies greatly (actually, so much that we had to use a log scale) between IPC methods. Memory usage when using DBus without file descriptor passing is explained by the fact that we fetch 75 000 rows at a time, and that DBus has to allocate all the message before sending it, while the socket/pipe based IPC methods use 64 kB buffers.

A note about vmsplice

While a good share of the CPU time is obviously spent in SQLite itself, memory copies are also pretty expensive. When experimenting with DBus file descriptors and pipes, I tried to use the vmsplice function of the Linux kernel. vmsplice used with its SPLICE_F_GIFT flag allows us to map some user data directly into a pipe, saving therefore a memory copy. The memory being gifted to the kernel, it must not be modified until it’s not being read by anyone (else, data is being read and modified at the same time). In our case, the server is using a ring buffer (allocated once at the start of the program), which can’t be reused while the client is still reading data. The trick to be sure that mapped memory is not in use anymore is to map to the pipe a buffer twice as large as the maximum pipe buffer size. When vmsplice returns, we are sure that at least a half of our buffer has been “consumed”, so we can use it again and write to its first half (since its second half might still be under IO). However, the system call to get the maximum pipe buffer size, F_GETPSZ, is not implemented as of today. I wasn’t therefore able to produce a vmsplice based version that was robust against memory corruption. Anyway the performance gains observed the few times the test program using vmsplice worked were not significant.

Alexander Larsson deserves for that part a huge thanks for his help on trying to get things right with vmsplice.

Conclusions

The results clearly show that in such a specialized case, designing a custom IPC system can highly reduce the IPC overhead. The overhead of a UNIX socket based IPC is around 19%, while the overhead of DBus is 322%. However, it is important to take into account the fact that DBus is a much more flexible system, offering far more features and flexibility than our socket protocol. Comparing DBus and our custom UNIX socket based IPC is like comparing an axe with a swiss knife: it’s much harder to cut the tree with the swiss knife, but it also includes a tin can opener, a ball pen and a compass (nowadays some of them even include USB keys).

Actually, the new file descriptor passing feature of DBus (added by Lennart Poettering) brings us the best of both worlds: high level, flexible IPC for method calls, and fast data transfer using a pipe for heavy transfers. On the benchmarks, pipes are even a bit faster than sockets.

The code source used to obtain these results, as well as the numbers and graphs used in this document can be checked out from the following git repository: git://git.mymadcat.com/ipc-performance . Please check the various README files to see how to reproduce them and/or how to tune the parameters.

1DBus 1.3, not released as “stable” as of today

Posted by Adrien Bustany
Filed in Uncategorized

6 Comments »

6 Responses to “IPC Performance: The return of the report”

Peter Lund Says:

20/05/2010 at 3:37 pm
You should probably change the caption “Page size (in bytes)” to “Buffer size (in bytes)”.
Will Stephenson Says:

21/05/2010 at 7:58 am
Wow, that’s the longest statement ever that DCOP wasn’t so bad after all ;).
- Adrien Bustany Says:
  
  21/05/2010 at 11:01 am
  @Will: Can you elaborate ? I don’t know at all how DCOP was working internally
  - Will Stephenson Says:
    
    21/05/2010 at 1:01 pm
    The main implementation used X11 ICE, a “unix socket based protocol”:
    
    http://www.xfree86.org/current/ice.pdf
    
    There was also a UDP socket implementation, IIRC, but that was only a proof of concept because unix sockets were identified as being a barrier to KDE on Windows.
    - Adrien Bustany Says:
      
      21/05/2010 at 6:31 pm
      @Will: Well, DBus uses a socket too… The “problem” when using DBus to transfer a lot of data is that messages are not chunked, and marshalling has a cost. I don’t know how DCOP would have handled that.
Tweets that mention » IPC Performance: The return of the report Experiments in GNOMEland -- Topsy.com Says:

21/05/2010 at 9:15 pm
[…] This post was mentioned on Twitter by Maykel Moya, Zuissi. Zuissi said: Planet Gnome: Adrien Bustany: IPC Performance: The return of the report: Last week, I published a report comparing… http://bit.ly/bMrxFZ […]

Comments are closed.

Experiments in GNOMEland

IPC Performance: The return of the report

20/05/2010

Comparison methodology

Test dataset

Implementation details

Custom IPC (UNIX socket based)

Message types

Message format

Prepare

Fetch

Memory usage

DBus (without file descriptor passing)

Prepare

Fetch

DBus (with file descriptor passing)

Prepare

Fetch

Results

Speed

Memory usage

A note about vmsplice

Conclusions

6 Responses to “IPC Performance: The return of the report”

Tags