SQLite, VACUUM, and auto_vacuum

6 January 2015 Jim 4 Comments

The week before Christmas I hunkered down and took on a nagging issue in Geary I’m almost embarrassed to discuss. The long and short of it is, prior to this commit, Geary never deleted emails from its internal SQLite database, even if they were deleted on the server. It was a classic problem of garbage collection and reference counting. (If you’re interested in the problem and why Geary didn’t simply delete an email when it was not visible in a mail folder, the original ticket is a good place to start.)

When I first ran the new garbage collector on my four-year-old 2.7GB SQLite database, it detected a tremendous number of emails to remove: almost 50,000 messages. (Yeah, fixing this bug was long overdue.) After it deleted them I quickly checked the size of the file to see what kind of disk savings I’d just earned for myself. As you might imagine, I was surprised to see the file size had not changed, but only for a moment.

As is true so often in programming, closing one door tends to open two or more new doors. In this case, by deleting a significant amount of data from the database I now confronted a question all applications using SQLite face at some point: to VACUUM or not to VACUUM?

What is VACUUM?

Why would you want to VACUUM an SQLite database? To conserve disk space and improve performance. (There’s a third, more esoteric reason: certain configuration changes require a VACUUM. More on that later.)

Conserving disk space and improving performance are really tantalizing things. If you look around the Internet, you’ll find lots of people asking how to achieve one or both with SQLite. VACUUM sounds like a magic pill, this one-shot solution that will solve all your problems. But the solution comes at a price, and a pretty steep one.

To step back, what exactly does VACUUM do? According to SQLite’s documentation, VACUUM “rebuilds the entire database.” Okay…so what does that do?

Here’s my guess as to the simplified steps SQLite actually performs when instructed to VACUUM a database:

1. SQLite creates an empty database file in the same directory as the original database file.
2. It reads the original database’s entire schema and applies it to the new one. This creates the tables, indices, and so forth. Certain PRAGMA settings are also copied.
3. For each table in the original, SQLite reads each row sequentially and writes it to the new database. (Note that rowids are not preserved unless they were marked as INTEGER PRIMARY KEY.)
4. SQLite deletes the original database file.
5. Finally, it renames the new database to the original’s file name.

There’s more to the process than this, but I suspect conceptually that’s the bulk of the operation. Note that SQLite is doing this in a transaction-safe way, that is, if the power fails at any step, you’re not hosed.

Why VACUUM?

What does this buy you? It helps to think of the SQLite database as a kind of dynamically resizable filesystem-in-miniature. Like a filesystem, it has free pages and pages in-use. Writing a new row to a table is like writing a new file: free pages are removed from a free list, the data is written to those pages, and then they’re linked to an internal index (which marks them as in-use). The difference is, most filesystems have a static number of pages to work with (i.e. a fixed partition size). With SQLite, if it runs out of free pages, it expands the file to create new ones. When you delete a file the filesystem merely marks those pages as free. Same with SQLite; when you delete a row, its pages are added to the free list and can be re-used later.

Imagine if you insert 1,000 rows into an empty SQLite database. SQLite will expand the file to generate the space needed to store all that information. Now delete 999 rows. In its default configuration, SQLite does not magically move that last row to the beginning of the file and truncate (thereby returning the free space to the operating system). Instead, the last row remains in place and the pages for the 999 deleted rows are marked as free. If your 1,000 row database takes 1MB on disk and you delete 999 rows, the file remains 1MB in size.

Judging from the SQLite forums and Stack Overflow questions I’ve read, this simple detail has caused considerable anguish. But from an implementation point of view, the approach has a certain logic. If you needed 1,000 rows at one point, you’ll probably need most or all of that space again later.

Now take that 1MB one-row database and run it through steps 1–5 above. You’ll quickly understand why VACUUM can conserve disk space. Because SQLite is lazy about allocating pages, the new database file will only be large enough for that one row. 1MB becomes 1K (or so).
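The effect is easy to reproduce. What follows is a minimal Python sketch, not Geary's code: the table name `t`, the payload size, and the helper name `demo_vacuum_shrinks_file` are invented for illustration. It fills a scratch database, deletes all but one row, and records the file size at each stage.

```python
import os
import sqlite3
import tempfile


def demo_vacuum_shrinks_file(row_count=1000):
    """Return file sizes (bytes) after insert, after delete, and after VACUUM."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    # isolation_level=None puts the connection in autocommit mode, so no
    # transaction is left open when VACUUM runs.
    conn = sqlite3.connect(path, isolation_level=None)
    try:
        # Make the default explicit: freed pages stay inside the file.
        conn.execute("PRAGMA auto_vacuum = NONE")
        conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload BLOB)")
        conn.execute("BEGIN")
        conn.executemany(
            "INSERT INTO t (payload) VALUES (?)",
            ((b"x" * 1000,) for _ in range(row_count)),
        )
        conn.execute("COMMIT")
        size_full = os.path.getsize(path)

        # Delete everything except the first row; the pages go to the free list.
        conn.execute("DELETE FROM t WHERE id > 1")
        size_after_delete = os.path.getsize(path)

        # Rebuild the database; only the pages still in use are written back.
        conn.execute("VACUUM")
        size_after_vacuum = os.path.getsize(path)
    finally:
        conn.close()
    return size_full, size_after_delete, size_after_vacuum
```

On a stock SQLite build the second number should be no smaller than the first, and the third should be far smaller than both. The exact sizes depend on the page size in use.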

What about improved performance? Vacuuming a database sequentially re-serializes each row’s data in the new file. If the row requires multiple pages, those pages may be spread all over the original file. Vacuuming stores them side-by-side in the new file. That means VACUUM can create cache locality effects. Related data will be located closer together, so traditional performance gains (fewer disk seeks, filesystem cache hits, L2 cache hits, and so forth) come into play. Not only will your data be closer together, but SQLite’s internal lookup tables will be compacted as well, improving access times. Remember disk defragmenters? VACUUM defragments your database. That’s the performance improvement.

A few catches

So VACUUM sounds like a pretty sweet deal: less disk space, speedier performance, what’s not to like? There are, of course, trade-offs and warnings to be aware of.

The first one is the deal-breaker for most people: the database is completely locked during a VACUUM, which can take minutes (or tens of minutes) to complete, depending on its size. No database access is possible during a VACUUM. And VACUUM is not an asynchronous or non-blocking operation; if you want to update the user interface while it’s working (a progress bar or busy spinner, for example), you’ll need to VACUUM in a background thread or a separate process. Oh yeah—you’ll want to read SQLite’s warnings about threads before going down that path.

The technical note in step #3 (above) about INTEGER PRIMARY KEY should not be ignored. If you forgot to include explicit primary key declarations in your schema and your relational structure cannot handle every single row being assigned a new rowid, then VACUUM is not for you.

VACUUM creates a new file. It never happens in-place. That means your disk requirements could as much as double while vacuuming.

The amount of time VACUUM takes is dependent on the number of used pages in your database. Vacuuming a database you’ve just vacuumed will not complete instantly. It will take just as much time the second iteration as the first, only to produce a byte-for-byte copy.

I know of no way to cancel a VACUUM. I’d be happy to hear about a safe, sane way of doing so.

When to VACUUM

When I was researching the issue I closely studied this 2011 blog post on how the Liferea RSS reader attacked the question of vacuuming. (It appears the Liferea developers have been worried about VACUUM since 2008. I know VACUUM has been a recurring issue for Firefox too, which uses SQLite for maintaining bookmarks, history, places, and more.) The Liferea post details many different questions and discoveries about vacuuming and it’s well worth a read.

At one point Liferea was running vacuum every time the application started, and later once a day, causing the blog writer to state:

…running vacuum everyday provides no significant benefit (it only provides a benefit when there’s a significant amount of stuff to vacuum). If you’re vacuuming on every startup, you’re using a nuclear weapon to dig a hole in a small garden.

I wholeheartedly agree. VACUUM sounds like a magic solution, but it’s more like a magic drastic solution. VACUUM is not a fine-tuned solver of problems, it’s a big red RESET button on your database’s on-disk layout. It really is the nuclear option.

One page on the Firefox wiki (“Avoid SQLite In Your Next Firefox Feature”—overkill, but it makes some good points) lays down this dictum:

You should have a vacuum plan from the get-go. This is a requirement for all new code.

I have to agree: any application using SQLite should devise a VACUUM policy early on. It’s almost as important as planning your table schemas.

My advice? Anyone considering VACUUM should first ask themselves if it’s necessary at all.

When not to VACUUM

Your database may never need to be vacuumed, ever.

One bit of lore spread around is that an SQLite database will grow without bounds if you don’t VACUUM. That’s not true; free pages will be reused. Without vacuuming, an SQLite database will expand to its largest size and remain that size. That’s a little different than growing without bounds.

If you’re worried about disk space, ask yourself at what threshold the savings are worth it. If VACUUM saves anything less than 50MB for a modern desktop application, I would say it’s not worth it. (Some might even go as high as 500MB, or higher.) You should also consider auto_vacuum, which I explain below.

As for page fragmentation, that’s a harder metric to come up with, but at least understand that if your application is slow, VACUUM is not a magic bullet to make it faster. Projects wanting to speed up SQLite should look closer at their SQL, schemas, and parallelization before turning to VACUUM.

Trigger mechanisms, heuristics

If vacuuming is necessary, I recommend coding a trigger mechanism, one or more heuristics that decides when vacuuming is due. Don’t just blindly fire off VACUUMs at regular intervals hoping they will solve your performance woes.

Heuristic #1: Disk space

If you’re interested in conserving disk space, the database has to have free pages to release back to the filesystem; it’s as simple as that. One heuristic is to examine the free versus in-use page counts and trigger a vacuum when a certain threshold is reached (say, 25% of the total pages are free). That information is available via SQLite’s page_count and freelist_count PRAGMAs.

Another possibility is to multiply the free pages by the page_size PRAGMA and trigger when an absolute size is reached. This is smarter than vacuuming at a certain ratio, since 25% of 1MB on a modern desktop machine does not represent significant savings in an era of terabyte hard drives.
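Both variants of the disk space heuristic can be sketched in a few lines of Python. The function names and the default thresholds below (25% free, 50MB free) are placeholders rather than recommendations; the three PRAGMAs are the ones named above.

```python
import sqlite3


def free_space_stats(conn):
    """Return (free_ratio, free_bytes) for the main database."""
    page_count = conn.execute("PRAGMA page_count").fetchone()[0]
    free_pages = conn.execute("PRAGMA freelist_count").fetchone()[0]
    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    # A brand-new, empty database reports zero pages; avoid dividing by zero.
    free_ratio = free_pages / page_count if page_count else 0.0
    return free_ratio, free_pages * page_size


def vacuum_worthwhile(conn, min_ratio=0.25, min_bytes=50 * 1024 * 1024):
    """True only when both the ratio and the absolute savings clear their bars."""
    free_ratio, free_bytes = free_space_stats(conn)
    return free_ratio >= min_ratio and free_bytes >= min_bytes
```

Requiring both conditions is a design choice for this sketch: the ratio filters out large databases with a thin layer of free pages, and the byte count filters out small databases where 25% is a trivial amount.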

Heuristic #2: Elapsed time

I also suggest subsequent vacuums only occur after a period of time has elapsed.

Consider a user who opens an application, waits for a vacuum to complete, deletes a lot of data, closes the app, then re-runs it—and has to endure another vacuum. It makes more sense to wait a period of time (a day, a week, a month, or more) before allowing a second vacuum, regardless of the free page count. Not only is this good for the user’s sanity, it acknowledges that those freed pages stand to be re-used in the near future.

A more advanced heuristic might only trigger when a certain amount of free pages exists for a duration of time, suggesting the user won’t be needing them soon, if ever.
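One way to implement the elapsed-time check is to keep a timestamp inside the database itself. This is only a sketch: the `vacuum_meta` table, its `last_vacuum` key, and the one-week default are all assumptions made for the example.

```python
import sqlite3
import time

ONE_WEEK = 7 * 24 * 60 * 60  # seconds; an arbitrary default for this sketch


def _ensure_meta(conn):
    # Hypothetical bookkeeping table; any persistent store would do.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS vacuum_meta ("
        "key TEXT PRIMARY KEY, value REAL NOT NULL)"
    )


def vacuum_allowed(conn, min_interval=ONE_WEEK, now=None):
    """True if no vacuum is on record or the last one is old enough."""
    now = time.time() if now is None else now
    _ensure_meta(conn)
    row = conn.execute(
        "SELECT value FROM vacuum_meta WHERE key = 'last_vacuum'"
    ).fetchone()
    return row is None or now - row[0] >= min_interval


def record_vacuum(conn, now=None):
    """Remember when the most recent VACUUM finished."""
    now = time.time() if now is None else now
    _ensure_meta(conn)
    with conn:  # commit the timestamp, or roll back on error
        conn.execute(
            "INSERT OR REPLACE INTO vacuum_meta (key, value) "
            "VALUES ('last_vacuum', ?)",
            (now,),
        )
```

The application would call `vacuum_allowed()` alongside its free-page check and `record_vacuum()` once VACUUM returns.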

Heuristic #3: Fragmentation

I’ve yet to find a straightforward way to retrieve or calculate page fragmentation. One could imagine a number of black-box schemes to determine when performance is suboptimal—timing transactions, for example—but that risks running a VACUUM for bogus reasons, such as another application hammering on the disk.

If the rows in your tables are, on average, of similar size, or if your database is not hit particularly hard with UPDATEs, INSERTs, and/or DELETEs, then it may make sense to never VACUUM it for fragmentation reasons. (For example, a web browser’s bookmark file.) If your tables are populated with data blobs of widely varied sizes and are actively updated (such as a web browser’s cache), then periodic vacuuming might make sense even if the free page ratio is low, just to defragment it.

It might sound like over-engineering, but you could track the number of UPDATE/INSERT/DELETE transactions made on the database (in, say, a special table) and code your heuristic to vacuum when a threshold is reached. It’s no guarantee of significant fragmentation, though, just an indicator of the activity that can lead to it.
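A rough sketch of such a counter in Python follows. The single-row `write_activity` table and the `counted_write` wrapper are hypothetical; a real application would more likely hook this into its existing database layer.

```python
import sqlite3


def init_write_counter(conn):
    """Create the one-row counter table if it does not exist yet."""
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS write_activity ("
            "id INTEGER PRIMARY KEY CHECK (id = 1), "
            "writes INTEGER NOT NULL)"
        )
        conn.execute("INSERT OR IGNORE INTO write_activity VALUES (1, 0)")


def counted_write(conn, sql, params=()):
    """Run one write statement and bump the counter in the same transaction."""
    with conn:
        conn.execute(sql, params)
        conn.execute("UPDATE write_activity SET writes = writes + 1")


def writes_since_reset(conn):
    return conn.execute("SELECT writes FROM write_activity").fetchone()[0]


def reset_write_counter(conn):
    """Call after a successful VACUUM."""
    with conn:
        conn.execute("UPDATE write_activity SET writes = 0")
```

The heuristic would then compare `writes_since_reset()` against a threshold chosen by experiment and reset the counter once a vacuum completes.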

Other reasons to VACUUM: page_size, auto_vacuum

There is another reason your application may need to vacuum the database. If you adjust the database’s page_size or auto_vacuum PRAGMAs after you’ve created any tables, those changes won’t go into effect until a VACUUM is performed. This is a big motivator to get those values right the first time.

I discuss auto_vacuum later so I won’t go into that yet. As for page_size, I’ll pass on an anecdote.

During Geary’s development, we discovered a major problem with startup time after adding a Full-Text Search (FTS) table to the database. When the FTS table was first touched by our code, SQLite would block for an excruciatingly long amount of time—database locked—with no indication why. It didn’t matter what kind of transaction was performed, simple or complex, read or write, the first transaction against the FTS table always set off a storm of disk thrashing.

Without going into the whole story, how we solved that problem was fairly simple: we increased the page_size from its default (1024 bytes on most Linux systems) to 4096 bytes and vacuumed the database. That was it. VACUUM didn’t solve the problem, page_size did.

For any non-trivial application using SQLite on a modern desktop operating system, I strongly recommend a page size of at least 4096 bytes. (Read the discussion about page_size in SQLite’s documentation before deciding on the value right for your project.) Larger page sizes might make sense for your needs, but I doubt anything less than 4KB is right for any desktop application. (Mobile and embedded devices are a different story.) New applications should also investigate write-ahead logging.
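Changing the page size of an existing database comes down to two statements issued on the same connection. The Python helper below is a sketch with an invented name, `change_page_size`. It assumes the database uses a rollback journal; SQLite's documentation indicates the new page size is not applied while the database is in WAL mode.

```python
import sqlite3


def change_page_size(path, new_size=4096):
    """Rebuild the database at `path` with a new page size; return the result."""
    conn = sqlite3.connect(path, isolation_level=None)  # autocommit mode
    try:
        # The new size is only remembered at this point ...
        conn.execute(f"PRAGMA page_size = {int(new_size)}")
        # ... and takes effect when VACUUM rebuilds the file.
        conn.execute("VACUUM")
        return conn.execute("PRAGMA page_size").fetchone()[0]
    finally:
        conn.close()
```

Comparing the returned value with the requested one is a cheap way to confirm the change actually happened.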

If you’re using journaling or specialized locking modes, it’s also worth reading the documentation on the journal_size_limit PRAGMA and how VACUUM affects the rollback file.

How to VACUUM

First, a few tips about running VACUUM.

As mentioned earlier, executing the VACUUM command will block the current thread, so it makes sense to run it in a background thread if your application needs to remain responsive to the user.

VACUUM cannot run from within a transaction. This will fail:

BEGIN TRANSACTION;
VACUUM;
END TRANSACTION;

It also cannot be executed when transactions are open on other connections; it will fail immediately. It’s up to you, the application writer, to ensure VACUUM is invoked when no other work is being performed on the database.
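In Python's sqlite3 module, a guard along these lines avoids the first failure and absorbs the second. `try_vacuum` is a name invented for this sketch; `in_transaction` is a real attribute of the module's connection objects.

```python
import sqlite3


def try_vacuum(conn):
    """Run VACUUM if this connection has no open transaction.

    Returns True on success, False if VACUUM was skipped or SQLite refused it.
    """
    if conn.in_transaction:
        # VACUUM would fail here; let the caller commit and retry later.
        return False
    try:
        conn.execute("VACUUM")
        return True
    except sqlite3.OperationalError:
        # For example, another connection is holding a lock on the database.
        return False
```

A False result is a signal to try again at a quieter moment, not an error worth reporting to the user.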

The Firefox wiki has some interesting numbers on improving VACUUM’s speed. Adjusting the journal_mode and synchronous PRAGMAs can improve the time it takes to complete a VACUUM at the cost of database integrity. Turning off synchronous risks data loss if VACUUM is interrupted at the wrong time. Changing journal_mode to MEMORY or OFF risks losing rollback and atomicity, but it’s unclear to me how important those are for VACUUM. The original database is not altered until it’s deleted and replaced with the new file, which is where synchronous, not journal_mode, is important. (And would you ever want to roll back a VACUUM?) But don’t take my word for it.

If you change synchronous or journal_mode to speed up VACUUM, remember to reset them to their original values when it completes.
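Saving and restoring the two settings around the VACUUM might look like the following Python sketch. The name `vacuum_fast` is made up, the connection is assumed to be in autocommit mode, and the database is assumed not to be in WAL mode (switching journal modes there has extra implications not covered here).

```python
import sqlite3


def vacuum_fast(conn):
    """VACUUM with relaxed durability, then restore the previous settings.

    `conn` is assumed to be in autocommit mode (isolation_level=None).
    """
    old_sync = conn.execute("PRAGMA synchronous").fetchone()[0]      # integer level
    old_journal = conn.execute("PRAGMA journal_mode").fetchone()[0]  # e.g. "delete"
    try:
        conn.execute("PRAGMA synchronous = OFF")
        conn.execute("PRAGMA journal_mode = MEMORY").fetchone()
        conn.execute("VACUUM")
    finally:
        # Restore the originals even if VACUUM raised.
        conn.execute(f"PRAGMA synchronous = {int(old_sync)}")
        conn.execute(f"PRAGMA journal_mode = {old_journal}").fetchone()
```

The `finally` clause matters: if VACUUM fails partway, the connection should not be left running with durability switched off.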

With that out of the way, how can your application work around VACUUM’s limitations? What follows is a compilation of approaches I’ve discovered while searching around the web. All assume you’re using a trigger mechanism with a heuristic and not simply vacuuming the database every time.

1. Delayed startup

At start time, throw up a busy indicator for the user and run VACUUM in a background thread.

This is the most common approach and the one developers and users dread. For some people, this is simply unacceptable.
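For illustration, here is a bare-bones Python version of the background thread. The function name and the `on_done` callback are assumptions of this sketch, and how the busy indicator is shown and hidden depends entirely on the UI toolkit.

```python
import sqlite3
import threading


def vacuum_in_background(path, on_done):
    """Start VACUUM on a worker thread; call on_done(error_or_None) when finished."""

    def worker():
        error = None
        try:
            # Open a connection that belongs to this thread only.
            conn = sqlite3.connect(path, isolation_level=None)
            try:
                conn.execute("VACUUM")
            finally:
                conn.close()
        except sqlite3.Error as exc:
            error = exc
        # Note: this runs on the worker thread, not the UI thread.
        on_done(error)

    thread = threading.Thread(target=worker, name="sqlite-vacuum")
    thread.start()
    return thread
```

Most GUI toolkits require the callback to hand its result back to the main loop before touching any widgets, and the rest of the application must stay away from the database until the callback fires.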

2. Delayed shutdown

When the application’s exiting, close or hide the UI but keep the process running as it vacuums the database.

Some users will hate you when their drive light shows activity even though nothing appears to be running. Since your application is technically executing and the database is locked, what do you do if the user turns around and re-runs your program?

Leaving the UI on the screen is a possibility (with a busy indicator), but users will hate you for that too. You’ll need to account for the variety of ways an application can be stopped, too. You can count on users starting your program, but you can’t count on it being cleanly shut down every time.

3. Ask the user

When a vacuum is required, ask the user if they want to do so. (This is discussed in the Liferea blog post.) Or, have a menu option or a button that launches VACUUM. Instead of starting immediately, either approach could store some state that indicates VACUUM should run the next time the application starts (i.e. delayed startup).

It seems fairly antiquated in 2015 to be asking the end-user a technical question about databases. Although the delayed startup approach is generally hated, users do understand when an application says “Please wait while updating…” (This often happens after a software upgrade, for example.) Asking a user “Do you want to compact your database?” wasn’t even acceptable in the 1990s. There may be an elegant way to do this, but I’m at a loss.

4. At idle

Wait until the application (or system?) is idle, indicating the user is away from the machine, and launch VACUUM. Another variation is to use a cron/at job that executes in the middle of the night.

The user isn’t inconvenienced, but note that not all applications access their database solely due to user events, so “idle” may mean different things. (For example, an email client, where mail may arrive at any time.) You also need to deal with the situation where the user returns in the middle of a VACUUM. As mentioned, I have yet to find a way to sanely cancel the command.

As for cron or at jobs, most SQLite applications don’t deal well with separate processes modifying the database. (If your program stores even a single rowid in memory between transactions, it’s one such application.) The background job would probably need to abort if it detects the application is already running. For applications people leave open all the time (again, an email client), that might be a problem. And nightly jobs are not reliably launched on devices that are often sleeping, such as notebooks.

5. VACUUM a copy in the background

Copy the database file and vacuum the duplicate in a background thread. Since the application remains open and usable, track all subsequent database changes and apply them to the vacuumed database when it’s ready.

This one was suggested in the comments section of the Liferea blog post and on Stack Overflow (I’ve lost the link to it, though). This approach sounds like the best of all possible worlds: the operation happens in the background, the user is not notified or queried, and the application is not locked. It’s also the hardest one to get right.

First, remember that VACUUM already generates a copy of the database. Even if the vacuumed file is one-half the size of the original, with this approach you have three copies on disk: the working database, the copy being vacuumed, and the final vacuumed version. For my 2.7GB Geary file, that’s 2.7GB + 2.7GB + 1.4GB = 6.8GB, all for a process whose ultimate goal is to conserve disk space. (The worst case? 8.1 gigabytes.)

Second, copying an SQLite file is not without its cost. It takes time to copy large files, and to do it right, you’ll need to block all writes to the source file while duplicating those bits.

Third, there’s the rather large question of how to apply subsequent changes to the vacuumed database. One approach is to walk every row in every table in both databases and apply detected changes. That’s not a trivial algorithm to write (and don’t forget that rowids can be reassigned without INTEGER PRIMARY KEY). Like copying the file, writes to the original database will need to be blocked while this is occurring.

Another approach would be to journal all changes, but that opens up another can of worms: where to save this journal? If it’s in memory, you risk losing the journal if the app crashes. If on disk, is that in a new database, a special table in the original database, or neither?

For both approaches, unless it’s completely generic, any change to the database’s schema or relations must be reflected in the synchronization/journaling code.

This strategy seems rife with possibilities of data loss and maintenance headaches. Writing a database synchronizer or an ad hoc journaling system for the occasional VACUUM seems a big pill to swallow. I can’t advocate this approach for any project unless they can bring considerable resources to bear.

auto_vacuum

SQLite offers auto_vacuum. You might wonder why I didn’t mention this first. After all, if SQLite will automatically vacuum the database, doesn’t that solve all the problems already discussed? Not really.

auto_vacuum has three modes: NONE (the default), FULL, and INCREMENTAL. Like page_size, enabling or disabling auto_vacuum after creating tables requires rebuilding the entire database with VACUUM. Thus, it’s worth pondering page_size and auto_vacuum (and write-ahead logging) before you start building your database. (Note that you can switch between INCREMENTAL and FULL once auto_vacuum is in place.)
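Turning auto_vacuum on for a database that already has tables follows the same pattern as changing page_size: set the PRAGMA, then VACUUM on the same connection. A minimal Python sketch, with an invented helper name and a connection assumed to be in autocommit mode:

```python
import sqlite3

# PRAGMA auto_vacuum reports its mode as an integer: 0=NONE, 1=FULL, 2=INCREMENTAL.
AUTO_VACUUM_INCREMENTAL = 2


def enable_incremental_auto_vacuum(conn):
    """Switch an existing database to auto_vacuum=INCREMENTAL; return the new mode."""
    conn.execute("PRAGMA auto_vacuum = INCREMENTAL")
    # The setting only takes effect once the file has been rebuilt.
    conn.execute("VACUUM")
    return conn.execute("PRAGMA auto_vacuum").fetchone()[0]
```

A return value equal to `AUTO_VACUUM_INCREMENTAL` confirms the rebuild picked up the new mode.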

When auto_vacuum=FULL is enabled, after each transaction SQLite moves freed pages to the end of the file and truncates them. This is the magic “re-mapping” of rows I alluded to earlier.

Unfortunately, FULL does not compact used pages, so re-mapping could actually worsen fragmentation. If pages can be re-mapped after a transaction, they are, meaning you have no control over how often it occurs or how much time it takes. That results in longer write and EXCLUSIVE lock times, especially important if your application relies on simultaneous transactions.

The way I read SQLite’s documentation, if every transaction on a database is performed with FULL auto-vacuuming, at the end of each transaction there will be no free pages in a database file. This could lead to a kind of thrashing, where successive transactions free pages, then allocate pages, then free pages, then allocate pages, and so on.

INCREMENTAL is a potential solution to FULL’s shortcomings. It only re-maps pages when you invoke PRAGMA incremental_vacuum. The PRAGMA accepts a page count as an argument, meaning you can control how many pages are re-mapped at any one time. This at least offers some degree of control over the vacuum process without locking the database for an indeterminate amount of time. The fragmentation problem remains, however. As I’ve never used INCREMENTAL, I can’t vouch for how much time it takes for SQLite to locate and move those free pages. (Remember, pages aren’t really “moved,” they’re copied from one file offset to another file offset.)

There’s a lot of confusion out there about auto_vacuum. I suspect its name is the problem, which makes it sound like a full-blown alternative to VACUUM. auto_compact or auto_release_pages would have been more appropriate.

auto-vacuum strategies

One auto_vacuum strategy would be to use auto_vacuum=FULL, eat the incurred write costs, and enjoy the disk savings while accepting potential performance degradation.

Another would be to use auto_vacuum=INCREMENTAL and run incremental_vacuum when the free page ratio hits a certain cut-off point. The maximum number of vacuumed pages could be capped to prevent long lock times.

Another is to use INCREMENTAL and blindly run incremental_vacuum at each startup with a small page count to limit the amount of time it takes. If the database file has a lot of free pages, this will chip away at them slowly over time without inconveniencing the user. You could do something similar in a background idle thread.
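The two INCREMENTAL strategies differ only in whether a ratio check comes first, so one Python sketch can cover both. The function name, the 200-page cap, and the `min_free_ratio` parameter are placeholders; leave the ratio at zero for the "blindly at startup" variant. The connection is assumed to be in autocommit mode.

```python
import sqlite3


def chip_away_free_pages(conn, max_pages=200, min_free_ratio=0.0):
    """Release up to max_pages free pages; return how many left the free list.

    Does nothing unless the database already uses auto_vacuum=INCREMENTAL.
    """
    if conn.execute("PRAGMA auto_vacuum").fetchone()[0] != 2:  # 2 = INCREMENTAL
        return 0
    page_count = conn.execute("PRAGMA page_count").fetchone()[0]
    free_before = conn.execute("PRAGMA freelist_count").fetchone()[0]
    if page_count == 0 or free_before / page_count < min_free_ratio:
        return 0
    # fetchall() steps the statement to completion before the count is re-read.
    conn.execute(f"PRAGMA incremental_vacuum({int(max_pages)})").fetchall()
    free_after = conn.execute("PRAGMA freelist_count").fetchone()[0]
    return free_before - free_after
```

Timing a few calls with different caps on a real database is probably the only reliable way to pick a page count that stays unnoticeable.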

At the end of the day, I’ve yet to see a major application relying on SQLite that wasn’t worried about its performance. If you go the auto_vacuum route and your application hits the database hard, updating it with a variety of blobs big and small, you probably need to VACUUM occasionally, if only to defragment the pages. It doesn’t seem there’s any way around it.

Announcements, California

Announcing California 0.2

30 September 2014 Jim 23 Comments

I’m pleased to announce the release of California 0.2, Yorba’s GNOME 3 calendar application. A lot has happened since we announced California (way back in March) and I’m happy to say that we got more features into this first release than I thought we’d make. Version 0.2 offers the following:

Month and Week views of events
Add and remove Google, CalDAV, and webcal (.ics) calendars
Integrates with Evolution Data Server, so your existing Evolution calendars are automatically available
Add, view, edit, and remove events (including recurring events)
A natural-language Quick Add parser for easily adding events: just type in the information and California schedules the event(s)
F1 online help (thanks Jim Campbell!)
Smooth animations and popovers for viewing information effortlessly

The California 0.2.0 tarball is available for download at https://download.gnome.org/sources/california/0.2/california-0.2.0.tar.xz

California is also available for Ubuntu Utopic (and its derivatives) at Yorba’s PPA.

Announcements, Geary, Shotwell

Announcing Shotwell 0.20 and Geary 0.8

19 September 2014 Jim 17 Comments

We’ve released Geary 0.8 and Shotwell 0.20 today and I’m pretty excited about getting these out the door to our users. Both releases include important fixes and some great new features.

Geary 0.8

While Geary 0.8 has a slew of new features and improvements, I would say the most visible for our users (compared to 0.6) are the following:

Robert Schroll’s redesign of the mail composer. Not only does it look a lot sharper and more modern than before, it also operates inline in the main window—that is, you type your reply right below the email you’re responding to. This means replying to a conversation is a more natural operation than opening a separate window or switching to a new view. You can still pop the composer out into a separate window, just press the Detach button and you’re on your way.
Gustavo Rubio’s hard work to get signature support into Geary. Now Geary will automatically insert a signature of your design into an email, whether new or replying to another. This is one of the most-requested features for Geary, so it’s good to get this in.
I’ve put in some hard work on improving database speed and IMAP connection stability. There are still a couple of kinks here and there, but I feel like 0.8 is a big step forward in making Geary the kind of application you can leave on for days at a time without worrying about it slowing down, crashing, or losing its connection to the server.

In other words, if you’re a Geary user, you really should upgrade.

That said, here’s a more formal list of improvements:

Major redesign of email composer, now presented inline in main window
Composer will automatically add signature to emails
Saving drafts to server can be disabled
Improved interface, now using GtkHeaderBar and modern widgets
Database speed optimizations to reduce lags and improve read times
Improved connection handling and reestablishment
Show attachments lacking a Content-Disposition
Important bug fixes
Updated translations

The tarball for Geary 0.8 is available here. Visit the Geary home page for more information.

Shotwell 0.20

Shotwell 0.20 has a more modest set of improvements, but it’s still growing and developing. In particular, new photo sharing plugins were added and stability fixes have been included:

Support for Rajce.net and Gallery 3 photo services
Set background image for lock screen
Better detection of corrupt images during import
Important stability bug fixes
Updated translations

The tarball for Shotwell 0.20 is available here. Visit the Shotwell home page for more information.

Yorba

The new 501(c)(3) and the future of free software in the United States

30 June 2014 Jim 138 Comments

Note: I’m a software developer, not a lawyer. I suspect I’m not the only coder whose eyes roll to the back of their head when legal or tax matters are discussed.

However, if you’re involved in the free software movement—especially in the United States—you may want to read through this, as long as it may seem. It appears that the United States’ Internal Revenue Service has strongly shifted its views of free and open-source software, and to the detriment of the movement, in my opinion.

What follows should not be construed as legal or tax advice or professional interpretation of those laws. If you have questions, please consult a professional.

Earlier this month the Yorba Foundation received a formal notice from the Internal Revenue Service (IRS) denying Yorba 501(c)(3) tax-exempt status. It’s possible this is nothing to be concerned with (at least, not unless you’re a part of Yorba). Reading their response, I believe this denial is actually a cause for concern for free software groups within the United States, and perhaps abroad.

A quick primer

501(c) is the section of the United States’ tax code dealing with tax-exempt organizations. The third type (i.e. 501(c)(3)) is for organizations that are “organized and operated exclusively for one or more of the following purposes: religious, charitable, scientific, testing for public safety, literary, educational, fostering national or international amateur sports competition, or the prevention of cruelty to children or animals”. IRS publication 557 gives the full run-down. Wikipedia has a good explanation of 501(c) and 501(c)(3) as well.

Free/libre/open software organizations such as the GNOME Foundation, Mozilla Foundation, Apache Software Foundation, Linux Kernel Organization, WordPress Foundation, Django Software Foundation and more operate under a 501(c)(3) status.

One misconception is that 501(c)(3)’s don’t pay any taxes. 501(c) only provides exemption from Federal income tax. Most states honor Federal exemption and will exempt those organizations from state income taxes as well. The organization must still fulfill other tax obligations (such as payroll, unemployment, and sales taxes).

The advantages of 501(c)(3) go beyond income tax exemption. The status also allows donors to treat their donations to the organization as a tax deduction. For those of you giving $25 or $50 that’s not much of an advantage (although those donations are most certainly appreciated!). However, Yorba has seen donors offering potentially thousands of dollars back away because of our lack of 501(c)(3) status. Many large charitable foundations and grants will only consider donating to groups with a 501(c)(3) status.

Last year there was a bit of a dust-up—a scandal to some, a distraction to others, depending on their politics—when many right-wing nonprofit organizations in the United States began complaining they were being unfairly targeted by the IRS. Media inquiries determined IRS examiners were given “BOLOs” (Be On The Lookout) for certain keywords in 501(c) applications, including “Open Source Software”. Last year I spoke with Wired about the issue.

The question of the IRS targeting certain groups has not died off, although the connection to free software has fallen off the radar screen.

Yorba’s application

The Yorba Foundation applied for 501(c)(3) in December 2009. We applied as a charitable, scientific, and educational organization. Remember that we only needed to meet the criteria for one of those to receive 501(c)(3) status.

We received two requests for clarification, one on June 23, 2010, and another on September 14, 2010, which we responded to in full. We received a notice on October 5, 2011 that our application was still being processed.

The requests for clarification contained mostly non-surprising questions. For example, “Describe whether your organization provides any goods or services for a fee.” (We don’t.) Some were odd: “Will any of your directors or employees reside at your facility [i.e. our office]?” (Ah…no.)

Other than those three notices and a couple of phone calls with our representatives at the Software Freedom Law Center, that was it.

The final determination letter, the denial of exemption, is dated May 22, 2014, almost four and a half years after we first applied. That strikes me as excessive, particularly since, as the above list of open-source foundations suggests, ample positive precedent existed.

The new 501(c)(3)

What I find alarming are some of the statements made by the IRS in their denial letter. This is what could have a direct impact on the free software movement, at least here in the United States. What follows are the most hair-raising statements in their denial letter and my interpretation and response (IRS’ statements are in italics):

You have a substantial nonexempt purpose because you develop software published under open source compatible licenses that authorize use by any person for any purpose, including nonexempt purposes such as commercial, recreational, or personal purposes, including campaign intervention and lobbying.

(To help with the legalese, remember that Yorba is applying as a tax-exempt entity, and so nonexempt purposes are those that are not charitable, scientific, etc.)

The IRS reasons that since Yorba’s open source software may be used for any purpose, Yorba is not a charity. Consider all the for-profit and non-charitable ways the Apache server is used; I’d still argue Apache is a charitable organization. (What else could it be?)

There’s a charitable organization here in San Francisco that plants trees throughout the city for the benefit of all. If the shade from one of their trees falls on a cafe table and cools the cafe’s patrons as they enjoy their espressos, does that mean the tree-planting organization is no longer a charity?

Mere publishing under open source licenses for all to use does not show that the poor and underprivileged actually use the Tools. … You do not limit your distribution and do not know who uses the Tools much less if they use them for artistic purposes. … you do not know who uses the Tools much less what kind of content they create with the Tools.

(Here and elsewhere, “Tools” is IRS shorthand for Yorba’s software.)

The IRS is correct that Yorba does not know who is using our software or for what purposes, nor does Yorba limit the distribution of our software to a particular charitable segment of society. But when I spend three milliseconds imagining how that would work, I shudder.

What’s more, these objections clash with three of the Four Software Freedoms and copyleft in general:

The freedom to run the program as you wish, for any purpose (freedom 0).
The freedom to redistribute copies so you can help your neighbor (freedom 2).
The freedom to distribute copies of your modified versions to others (freedom 3).

In other words, we (and, presumably, everyone else) cannot license our software with a GNU license and meet the IRS’ requirements of a charitable organization.

Freedom 1 (“The freedom to study how the program works”) isn’t attacked as non-charitable by the IRS, but it is defined as non-educational:

The purpose of source code is so that people can modify the code and compile it into object code that controls a computer to perform tasks. Anything learned by people studying the source code is incidental.

Which is like saying the only point of an algorithm is its final answer, and so Einstein publishing E=mc² offered nothing more to the world than a way to accurately measure the amount of energy in, say, a cube of sugar or a block of cheese. Any deeper learning is incidental.

I can directly trace the start of my career in software development to the first BASIC programs I encountered as a 9-year-old. I pressed the Break key, typed LIST, and learned. I didn’t receive any formal education in programming until my junior year in high school. I know for a fact I’m not the only one.

How many coders learned from studying and modifying existing code? Think about UNIX, BASIC, HyperCard, and just about every scripting language devised. The availability of source code and learning how to program are so fundamentally intertwined, it’s zen.

The development and distribution of software is not a public work even if published under open source or creative commons compatible licenses because software is not a facility ordinarily provided to the community at public expense. … In the face of such consistency of the key characteristics over four centuries we are constrained from extending the term public works to encompass intangibles such as software.

The “four centuries” of terminology being referenced here is that software is not a lake, dam, bridge, highway, etc. In other words, because 17th century English Common Law doesn’t mention IMAP email clients or JPEG decoding, software is not a public work.

Sarcasm aside, these statements are annoying because they create a kind of Möbius strip Catch-22 with the earlier statements I quoted. Since Yorba makes our software widely available to the public at large, we’re not truly charitable; but since software doesn’t meet the IRS’ definition of “public works”, making our software widely available is not charitably serving the public at large.

And then there’s this humdinger, which sounds like it came from a Douglas Adams novel:

…public works must serve a community. Open source licensing ensures the Tools are accessible to the world. We have not found any authority for the proposition that the world is a community within the meaning of § 501(c)(3).

There’s something delicious about the phrase “We have not found any authority for the proposition that the world is a community.” Mahatma Gandhi, Jesus Christ, and Martin Luther King Jr. are three I can name off the top of my head.

You are the copyright holder of some Tools code. Private persons are the copyright holders of the portion of Tools code you do not own. … Even though you are the copyright holder to a portion of Tools code, the portion of Tools code owned by private persons cannot be a public work within the meaning of § 501(c)(3).

I believe what the IRS is inadvertently requiring here is copyright assignment. Since Yorba does not require copyright assignment from our contributors, the IRS appears to think our software cannot be a public work.

Copyright assignment is controversial in the free software community. (A nice overview can be found here; the controversy up-close and in-person can be found here and here.)

I hope I’m wrong about this. I doubt they’re going to start enforcing this in the future for organizations that already enjoy exemption. If they do, it will be a royal mess for those projects having to contact every author of every non-trivial contribution and get them to sign over their rights. This is all a big if, of course.

Where Yorba stands

This does not spell disaster for Yorba. The Foundation’s existence does not hinge on 501(c)(3) status. It certainly would’ve been advantageous if the IRS had granted it. It certainly would’ve been a better world if the IRS hadn’t waited four and a half years to inform us of their decision.

We have no plans to appeal their decision. It looks to be an arduous legal battle we cannot afford.

I hope other open source projects will take note of this decision, especially projects considering applying for 501(c) status.

For those who think I’m being alarmist, I encourage them to consider the above statements by the IRS and ask themselves how the good projects already granted 501(c)(3) would’ve stacked up under the IRS’ new parameters.

I also recognize that I’m cherry-picking statements from the IRS for my commentary. I selected the ones I thought would be of most interest to the community.

The full PDF of the IRS’ decision can be found here.


Inline composer comes to Geary

22 May 2014 Jim 14 Comments

A year in the making, I’m pleased to announce that we’ve landed a major new feature in Geary: an inline email composer. What’s that mean? In short, when you go to reply to a conversation, instead of a new window popping up on the screen, the composer is embedded in the window right below the message you’re replying to. Want a separate window? Just press the Detach button and you’re writing emails just like Geary used to work. Old School, as the kids say.

This great addition to Geary is thanks to the hard and tireless work of Robert Schroll, who put this together on a private branch and has been maintaining it for some time now. Serendipity led Robert to San Francisco last week, and he generously spent a good chunk of his time here working with me to snap the last pieces of the puzzle together and polish the chrome. It’s pretty sweet, I must say.

The inline composer is only available in git master at the moment. It’ll be available for general release in our next stable version, Geary 0.8. In the meantime, if you’re so bold and want to give it a test drive, you can build Geary from master. Or, if you’re running Ubuntu, install it from Yorba’s Daily Build PPA (but be sure to read the warnings on that page!). The more eyeballs the better. If you find a bug, please let us know.


Geary bounties galore

10 April 2014 Jim 2 Comments

A number of Geary bounties have popped up in recent weeks that our users may want to know about. Bounties represent reward money for coder(s) who successfully land their improvements in the program. (Yorba created a bounty a few months back for Geary; you can read about it here.) Some of the new Geary bounties include:

“Add option to sort folder into read and unread” – There are a number of ways to approach this; I would be happy to simply have a switch or toggle button that filtered read conversations from the list, leaving only unread to peruse.

“Notify of new messages at startup” – This is a long-standing feature request and it would be great to get this landed in Geary. There are a number of fancy ways this could be achieved, but I think the easiest way to approach this would be for Geary to be launched at login time with a magic command-line option that hides the main window. As new messages come in, notifications are displayed. If the user clicks on the Geary icon or the notification bubble, the hidden window is displayed. The added complication here is that closing the window should merely hide it, while the Quit option would, in fact, cause Geary to exit.

“Ubuntu online accounts integration” – The basic thrust of this problem is to fetch account information from UOA and start pulling down mail with no user interaction (other than starting Geary, of course).

With all bounties, please be sure to read over the linked Bugzilla ticket and understand all the ins and outs of the task. Tickets are also the best place to ask questions for the Geary team. We’re here to help!

Some of these bounties are courtesy of our good friends at elementary while some have been initiated by independent users who simply would like to see Geary improved. Follow the above links to see how much money is up for grabs on each.

If you see a feature you really, really want to see added to Geary, consider how much it’s worth to you and pledge that amount. High dollar values encourage attention from developers and get traction and movement. And if one of the above doesn’t tickle your fancy, there’s a whole host of other outstanding bugs that are listed but have no money behind them; pledge and get them started!

As always, Yorba developers will not collect bounties, but we certainly encourage everyone out there to think about (and act upon!) how they can contribute toward improvements.

Jim Nelson + Yorba Foundation archives

Jim Nelson's blog + archives from Yorba Foundation's original blog