Safe as Milk

The Ladybird Guide to Business Intelligence

January 26, 2011 work 6 Comments

Recently I have found myself frustrated by the lack of a very simple overview of Business Intelligence explaining what problems it solves, and how.

For example, the Pentaho BI platform FAQ has a promising first question: “What is a business intelligence (BI) platform?” The answer is typical of BI overviews I have seen:

A comprehensive development and runtime environment for building complete solutions to business intelligence problems. The Pentaho BI Platform is the infrastructure and core services that integrate business intelligence components to complete the BI Suite. This includes the infrastructure necessary to build, deploy, execute and support applications.

I don’t know about you, but that gives me more questions than answers. What type of problems are business intelligence problems? What are the core services provided by a BI platform? What are BI components, and a complete BI suite? In short, what does it do?

The Wikipedia article on business intelligence is a bit better, but still gets into heavy acronyms quite early.

I think I have figured out what Don Norman calls a conceptual model which is about right, so for those who have struggled as I have recently, here is the Ladybird guide to Business Intelligence.

What problems does BI solve?

Let’s say you are the CEO of a company, and you want to track what the costs of the company are, across payroll, purchasing, marketing and sales, overall and by division. You also want to track revenues by division, product line, market and month. For each variable, you’d like to drill down when you see a figure that looks odd. Payroll in Asia increased 20% this year – did we buy a company? Are there savings to be made?

All of this information spans dozens of different computer systems, applications and databases. What you want is one application to rule them all, from which you can get nice graphical clickable data.

Let’s say you’re a free software project manager or community manager. You have lots of infrastructure for people working on the project – source control, mailing lists, forums, translation infrastructure, documentation, bug tracking, downloads, …

You want to know if your community is growing, shrinking or stagnant. You’d like to know whether the number of translators is rising, and to spot when something is wrong – we lost 3 Thai translators last cycle, does the Thai translation team have a problem? Is there a problem with wiki spam? Is there a correlation between people active on the forums and commits to the project? Some of these questions span different applications, systems, and databases. What you want is one application to rule them all, where you can get a quick overview of what’s happening in the community, and click on something to drill down into the data, or create complex queries to spot correlations and patterns across different apps.

BI software is ideally suited to helping in both of these situations.

How does it work?

Very simply, a BI platform is a web application that allows you to create queries and visualise the results across a variety of data sources. At its simplest, you bring big lumps of data together and extract some useful numbers from it. If you’ve ever used a pivot table in a spreadsheet application, you’ve written a BI query.
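To make the pivot table comparison concrete, here is a minimal sketch in Python. The divisions, months and revenue figures are invented for illustration, and the `pivot` helper is a hypothetical name; a real BI platform would run the equivalent query against your actual data sources.

```python
from collections import defaultdict

# Hypothetical revenue records: (division, month, amount). Invented data.
records = [
    ("Europe", "2010-11", 120),
    ("Europe", "2010-12", 150),
    ("Asia", "2010-11", 80),
    ("Asia", "2010-12", 95),
    ("Europe", "2010-12", 30),
]

def pivot(rows):
    """Sum amounts into a {division: {month: total}} table."""
    table = defaultdict(lambda: defaultdict(int))
    for division, month, amount in rows:
        table[division][month] += amount
    # Convert to plain dicts so the result prints and compares cleanly.
    return {division: dict(months) for division, months in table.items()}

revenue = pivot(records)
for division in sorted(revenue):
    print(division, revenue[division])
```

Each extra dimension you group by (product line, market) is another level of nesting, which is roughly what "multidimensional" means in BI jargon.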

Now we get into the acronyms and the jargon. Here’s a quick lexicon of commonly used BI terms:

ETL: Extract/Transform/Load – An ETL module allows you to script and automate the extraction of data from a funky data source (say, CSV files on a server, an auto-generated spreadsheet, or screen-scraping data from an HTTP query, or just an SQL database), and transform it into some other format (typically basic transformations like joins, mapping inputs to database fields, or applying simple arithmetic to convert to an agreed unit), and then store the result in a database.
OLAP: Online Analytical Processing – a fancy name for “queries”. There is a de facto standard query format called MDX and the database needs to be optimised for “multidimensional queries” (aka joins – like pivot tables in a spreadsheet).
Data Warehouse: A fancy name for database.
Reporting: The presentation of the results of queries in a graphical way.
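The terms above can be illustrated with a small sketch using only the Python standard library. Everything specific here is an assumption made for the example: the CSV contents, the currency rates, the table layout and the function names. It is not how Pentaho or any other BI suite implements ETL; it only shows the extract, transform, load and query steps in miniature.

```python
import csv
import io
import sqlite3

# Extract: a hypothetical CSV export; in practice this would be a file or URL.
RAW_CSV = """division,month,payroll,currency
Asia,2010-11,1000,USD
Asia,2010-12,1200,USD
Europe,2010-11,800,EUR
Europe,2010-12,800,EUR
"""

# Assumed, made-up conversion rates to the agreed unit (USD).
RATES_TO_USD = {"USD": 1.0, "EUR": 1.5}

def extract(text):
    """Read CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Map CSV fields to database fields and convert payroll to USD."""
    return [
        (row["division"], row["month"],
         float(row["payroll"]) * RATES_TO_USD[row["currency"]])
        for row in rows
    ]

def load(rows, conn):
    """Store the transformed rows in the 'warehouse' (here, SQLite)."""
    conn.execute("CREATE TABLE payroll (division TEXT, month TEXT, usd REAL)")
    conn.executemany("INSERT INTO payroll VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

# The "analytical" query: total payroll per division.
totals = conn.execute(
    "SELECT division, SUM(usd) FROM payroll GROUP BY division ORDER BY division"
).fetchall()
print(totals)
```

A reporting module would take `totals` and draw the bar chart; the query itself is plain SQL here, where an OLAP server would typically accept MDX.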

In brief, then, a BI suite provides you with a way to suck in data from a variety of sources, store the data (if you need to) in a custom database which is optimised for querying across different data sources, a nice way to define the queries in which you are interested, and then present the results of those queries in a nice graphical way.

If you don’t need to do any transformation of data, and you can operate directly on SQL databases, then you can typically provide the BI platform access to them directly. If you have any unusual data sources, or want to transform data, you will need an ETL module. If you are dealing with a lot of data and want to optimise query time, you might need a specific OLAP server. A query editor will help you create queries to get the information you want out of your data. You will need a reporting module to convert query results from raw tabular form to pie charts, bar charts and the like. And the BI server provides hooks for all of these various modules to work together, sucking in, storing, manipulating and presenting data in interesting ways.

Is this all right?

I would love to know if my mental model is flawed – so if I’m missing anything important, or I’ve said something which is a pile of rubbish, please do add a comment and let me know.

I know how hard it can be to cut through the jargon in an area where it’s ubiquitous and the first step in enterprise software is usually the hardest, so hopefully this will be useful to someone other than myself.

The Lifecycle of a Patch (or: Working Upstream)

January 14, 2011 community, freesoftware, gimp, gnome, inkscape, maemo, meego 5 Comments

Reposted from Neary Consulting

Yesterday I looked into what it means to be a maintainer of a package. Today, I’m going to examine how to effect change in a distribution like MeeGo, and what it means to work upstream. To do so, we’re going to look at how code gets from a developer’s brain into the hands of a user.

So – how can you make a change in a Linux-based distribution? Here’s what happens when everything works as it should:

You open a bug report for the feature against your distribution
You identify the module or modules you need to change to implement the new feature
You open bug reports for each of the modules concerned, detailing the feature and the changes needed in that module for the feature
You write a patch to implement the feature, and propose it (appropriately cut up for ease of review) to the maintainers of those modules
Once the code has gone through the appropriate review process, it will be committed to the source control of the module(s)
Some time later, the maintainer of each module will include that code in a stable release of the module
Some time after that, the new stable versions will be packaged and uploaded to MeeGo
Your code will be included in the next release of the distribution following the upload.

When people talk about “working upstream” in MeeGo or Linaro, this is what they mean.

To simplify matters for our analysis, let’s consider that the feature we want to implement is self-contained in one module (or related modules which release together). There are two different scenarios we’ll consider:

The module is maintained by people not associated with your distribution (for example, a GNU or GNOME project)
The module is maintained by people closely related to your distribution (for example, Unity in Ubuntu, or oFono in MeeGo)

We will also look at a third situation, where you find and fix a bug in the software you are using – that is, a released version of a distribution (the proverbial “scratching an itch”).

For each case, I will try to pick a representative feature/patch and follow it from developer through to distribution to Real Users.

What if your code changes different projects?

If your code touches several modules (for example, if you are proposing some new API in GTK+ which you want to use in the GIMP) then things can get complicated – you will need a stable version of GTK+ to be released before you can ship a stable release of the GIMP which depends on it.

This issue of staggered releases is one that Andrew Cowie pointed out a few years ago for language bindings. To avoid making bindings on shifting sands, he preferred to package new APIs once they had been included in a stable GNOME release. In turn, Java-GNOME developers rarely depend on development release bindings, and they would wait for the new API to be included in a stable bindings release. For example, the gtk_orientable_get_orientation function, added to GTK+ at the end of September 2008, was released in GTK+ 2.16, in March 2009. The first version of Java-GNOME which depended on GTK+ 2.16 was version 4.0.13, released in August 2009. That was packaged in distributions in Autumn 2009, and so most users would not have access to the newer bindings for a few months after that – perhaps early 2010 – by which point the API had been written some 18 months earlier.

And that is when you have a regular release schedule you can rely on! Pity the developer who wants to release a GIMP plug-in which depends on some API included in GIMP 2.8 – the last stable GIMP release, 2.6, came out in October 2008, and over two years later, 2.8 has still not been released. And when you combine unreliable release schedules for distributions and applications, the results are cumulative: users of the stable Debian distribution are still using GIMP 2.4 releases. GIMP 2.4 was released in October 2007. Features added to the GIMP in late 2007 are still not in the hands of users of stable Debian distributions.

Getting features to users

It is difficult to generalise when users upgrade their Linux distributions, or even to say what proportion of Linux users are new users at any given time. It would be over-simplifying to say that developers use bleeding-edge distributions, power users upgrade early to the latest and greatest, new users install the latest distributions available, but will only upgrade every 18 months or so afterwards, and conservative users stick with “Long term service” or stable distributions. Most developers I know use their computer for work (and thus want a stable distribution) and only install the latest versions of various dependencies they need to work on their project. But let’s generalise and say that this is roughly the case. So (guesstimating) about 10% of your users will be upgrading to the latest distribution very quickly after its release, a further 20% in the months after when the bugs are shaken out, and the rest will follow along in their own time, perhaps 12 or 18 months later.

To make this concrete, let’s follow the life of a single patch. This is complete anecdata, but in my defence, the patch has been chosen at random, from a project which I know has good community processes and release management in place. The patch we’re going to follow adds an extension to Inkscape to render objects along triangular paths.

Bug #226001 opened on 2008-05-03 by inductiveload, with a description of the feature to be added, and proposed code to implement it. The code, as an extension, may have a lower bar for acceptance than code which is core to a project.
Patch submission reviewed on 2008-05-03, minor comments, but patch is accepted (note: This was not the author’s first submission to Inkscape)
Patch corrected to respond to comments and committed on 2008-05-03 (did I mention these guys had good community processes!?!)
Inkscape 0.47-pre0, containing the Triangle extension, released on 2009-07-02
Inkscape 0.47-pre4 included in Ubuntu 9.10

So for a feature developed in mid 2008, most Inkscape users will still not have the feature by the end of 2009, 18 months later. This is both a typical and atypical example: in many projects, patch proposals lie unreviewed for days, weeks, sometimes months, but the 0.47 release cycle was a particularly long one for Inkscape. However, I think the lag from code written to presence on users’ hard drives of ~12 to 18 months is about correct.
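As a quick check on the arithmetic, the lag between the bug report and the first pre-release containing the feature can be computed from the two dates in the Inkscape timeline. This is only a sketch: the variable names are mine, and the month figure uses an average month length, so treat it as approximate.

```python
from datetime import date

# Dates taken from the Inkscape timeline for bug #226001.
bug_opened = date(2008, 5, 3)
first_prerelease = date(2009, 7, 2)  # Inkscape 0.47-pre0

lag_days = (first_prerelease - bug_opened).days
lag_months = lag_days / 30.44  # average month length, an approximation
print(f"{lag_days} days, roughly {lag_months:.0f} months")
```

That is about 14 months before the code even appears in a pre-release; packaging in a distribution and users upgrading to it come on top of that.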

Does it have to be this hard?

If this were the only way to get features into a distribution, trying to improve MeeGo by contributing upstream would be a very frustrating experience. Happily, there are ways to accelerate the process. Take the MeeGo kernel as an example, where Greg Kroah-Hartman recently threw in the towel on persuading people to propose patches upstream. The process is supposed to work like this:

Propose a patch for inclusion upstream. This patch will then ship in a future stable kernel release (let’s say 2.6.38).
After peer review, when the code has been accepted for inclusion in the kernel upstream, propose a backport for inclusion in the MeeGo kernel. The back-ported patch will be maintained across the next MeeGo release, and will be dropped when the kernel version included in the MeeGo project catches up with 2.6.38.

The overhead here is reduced basically to the peer review process of the upstream project, and the cumulative cost of merging a patch over the course of 6 months.
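The rule for when a back-ported patch can be dropped is simple enough to sketch in code. The patch names and the `split_backports` helper below are hypothetical, and real kernel version strings can carry suffixes (such as -rc tags) that this simple parser does not handle; it only illustrates the "carry until the distribution kernel catches up" logic.

```python
def parse_version(text):
    """Turn '2.6.38' into (2, 6, 38) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))

def split_backports(distro_kernel, backports):
    """Return (still_carried, droppable) lists of patch names.

    backports maps a hypothetical patch name to the upstream release
    that first contains it.
    """
    current = parse_version(distro_kernel)
    carried, droppable = [], []
    for name, upstream_release in sorted(backports.items()):
        if parse_version(upstream_release) <= current:
            # The distribution kernel already contains this change.
            droppable.append(name)
        else:
            carried.append(name)
    return carried, droppable

# Invented example: one patch landed upstream in 2.6.36, another in 2.6.38.
backports = {"fix-a.patch": "2.6.36", "feature-b.patch": "2.6.38"}
print(split_backports("2.6.37", backports))
```

With a distribution kernel at 2.6.37, the first patch can be dropped and the second is carried for one more cycle. A patch that never lands upstream has no release to compare against, which is exactly why it ends up being carried indefinitely.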

As a distributor (or a developer working on a specific distribution), this allows you to get code to everyone, eventually, and have that code included in your distribution as soon as you are sure that it is up to the standard expected by the community. Currently in MeeGo, the trend seems to be more towards submitting patches concurrently upstream and to MeeGo kernel maintainers (or even submitting them upstream once they have been accepted into the MeeGo kernel). In the case that a patch requires substantial modifications, or is rejected outright, upstream, the kernel maintainers are then left carrying a patch indefinitely in the distribution. For one patch, this might not be a big deal, but for thousands of patches, the maintenance and integration burden of these patches adds up.

It is also not unusual for kernel developers to maintain their own git branches for a long time. Three examples that come to mind are inotify, which Robert Love maintained for over a year for both Novell and in the kernel before it was accepted into the mainline, ReiserFS, which was maintained for several years out-of-tree before being shipped with the Linux kernel in 2001, and the fast desktop patchset which Con Kolivas maintained for almost five years on the -ck kernel branch. Distributions will occasionally ship a substantial diff to upstream if there is a maintainer committed to getting the code upstream eventually. Allocating someone to work over a long period to make everyone happy and comfortable with your code may enable you to ship a big patch to upstream, but this will not be sustainable long term.

To summarise: when working upstream, as a distribution, you should only ship with patches which have been accepted in a development version of upstream already, if you can help it.

Meetings in telephone boxes

Sometimes, however, when upstream and downstream coincide, you can simplify things considerably, while also adding a small measure of risk.

In MeeGo, to continue with that example, the distribution architects have a pretty good idea when they can expect emergency telephony to be ready for oFono and the MeeGo telephony stack, because they’re writing it. By co-ordinating the upstream release management with downstream packaging, you can make promises as a distribution which you can’t with community-developed software.

When upstream and downstream are co-ordinating with each other, we cut out the middleman. The workflow becomes:

Report a bug/feature request against a component of the distribution
Develop a patch which implements the feature, and submit it directly to the distribution bug tracker
Once it has been reviewed and accepted, you know that your patch will be included in the next version of the distribution.

This gives a distribution much more control, both over what gets done, and when, and explains both the Ayatana and MeeGo UX development projects. However, being able to plan around the release is no guarantee that the release will happen on time: GNOME has in the past been stung by planning during the 2.6 development cycle to depend on a new version of GTK+, only to find that the release was delayed. In the end, the GTK+ release shipped in time for the 2.6 release at the end of March.

Scratch scratch

The other patch lifecycle I’d like to mention, because it is so relevant to distributions, was pointed out to me by Federico Mena Quintero yesterday. What happens to a patch that someone makes and submits to a distribution when they find a bug in stable released software? This is one of the key advantages of free software – if you find a bug in the software you use, and you have the wherewithal, you can fix the bug and share that fix with everyone else.

However, as we have seen, there is typically a lag of several months between the time that software is released and the time it is being used by large numbers of users through distributions. With releases of Red Hat Enterprise Linux, Novell Suse Linux Desktop and Ubuntu LTS being supported for up to 5 years, it is possible that important bugs will be fixed in these stable versions for years after the original developers have moved on, and are no longer maintaining older stable versions.

Let’s say I find and fix a bug in Rhythmbox 0.12.5, which ships with Ubuntu 9.10. I open a bug report on Launchpad, attach a fix to the source .deb there, and I update my local copy. As a user, I’m happy – I have fixed my problem and shared the solution with others. If I’m particularly conscientious, I might open a bug on gnome.org against Rhythmbox and attach my patch there, but since the development version is now 0.13.2, the best I can hope for is that the patch applies cleanly to the master branch, and will be included in the next release. It is very unlikely that the upstream maintainers will release another update to the 0.12 series at this point.

Now imagine that you are a maintainer for Suse, and someone reports the same bug against a long-term service release. In practice, there are several different versions being maintained by different distributions, and no good way to know if the same bug has been reported and fixed by someone else. You end up searching for a fix in upstream bug trackers, and in the bug trackers of each of the other main distributions. According to Federico at the time:

Patches for old versions are traded in the black market. You have friends in another distro? You ask them first, “did you guys already fix this?” Those patches don’t ever manage to reach CVS, where everyone would be able to get them.

Ideally, you could collaborate ahead of time with other distributions to ensure that you are all using the same branch of upstream modules, and are committing patches upstream. The Linux kernel is moving to this model, and there are also discussions underway in GNOME to co-ordinate this type of activity. Mark Shuttleworth has also pushed for something similar by encouraging projects in the core Linux platform to have a regular cadence of releases, so that everyone can synchronise their longer term service offerings every couple of years.

But at the moment, the best you can hope for is that your patch will be included in an upcoming release for your distribution, at which point other users of the distro can avail of it, and that upstream will patch their development version and latest stable versions, and get your patch to everyone in a few months.

Working upstream

The goal of this article is to explain what working upstream actually means, and how to make that more palatable for a distribution that wants to get features written and included in their next release. Hopefully, by pointing out some of the shortcomings of the way patches circulate from developers to users, some of these issues can be addressed.

In any case, one thing is clear – if you are carrying a patch as a distribution without ever submitting it upstream, you are making a costly mistake. You will be carrying code that others won’t, and bearing all of the merge and maintenance burden for that code for years to come. The path to maximum happiness is to co-ordinate with other distributions and with upstream to ensure that everyone is working in the same place, and sharing work as much as possible.

What’s involved in maintaining a package?

January 13, 2011 community, freesoftware, gnome, maemo, meego 7 Comments

Reposted from Neary Consulting

An interesting question was asked on a MeeGo mailing list recently: What does it mean to be a maintainer of something? How much time does it take to maintain software? It resulted in a short discussion which went down a few back alleys, and I think has some useful general information for people working with projects like MeeGo, which are part software development, part distribution.

Are you maintaining software, or a package?

The first question is whether you are asking about maintaining something in the Debian sense, or the GNOME sense?

A Debian package maintainer:

Tracks upstream development, and ensures new releases of software are packaged and uploaded in a timely manner
Works with distribution users and other maintainers to identify bugs and integration issues
Ensures bugs and feature requests against upstream software are reported upstream, and bugs fixed upstream are propagated to the distribution packages
Fixes any packaging related issues, and maintains any distribution-specific patches which have not (yet) been accepted or released upstream

A GNOME project maintainer:

Makes regular releases of the software they maintain (typically a .tar.gz with “./configure; make; make install” to build)
Is the primary guardian of the roadmap for the module, and sets the priorities for the project
Works with packagers, documenters, translators and other contributors to the software to ensure clear communication of release schedules and priorities
Acts as a central point of contact for release planning, bug reports and patch review and integration
A typical maintainer is also the primary developer of the software in question, but this is not necessarily the case

Obviously, these two jobs are very different. One places a high priority on coding & communication, another on integration, testing, and communication.

So how much time does maintaining software take?

Well, how long is a piece of string?

To give opposite extremes as examples: Donald Knuth probably spends a median time of 0 hours per week maintaining TeX and Metafont. On the other hand, Linus Torvalds has worked full time maintaining the Linux kernel for at least the past 15 years, and has been increasingly delegating large chunks of maintenance to lieutenants. The maintenance of the Linux kernel is a full time job for perhaps dozens of people.

On a typical piece of GNOME software (let’s take Brasero as an example) much of the work is simplified by following the GNOME release schedule – the schedule codifies string freezes and interface freezes to simplify the co-ordination of translation and documentation. In addition, outside of translation commits, Brasero has had contributions from its maintainer, Philippe Rouquier, and 6 other developers in the last 3 months. Most of these changes are related to the upcoming GTK+ 3 API changes, and involve members of the GTK+ 3 team helping projects migrate.

In total since the 2.32.0 release, there have been 55 commits relating to translations, 50 commits from Philippe, 9 from Luis Medina, co-maintainer of the module, and there were 4 commits by other developers. Of Philippe’s 50 commits, 14 were related to release management or packaging (“Update NEWS file”), 5 were committing patches by other developers that had gone through a review process, and the remainder were features, bug fixes or related to the move to the new GTK+. Of Luis’s commits, 2 were packaging related, and 2 were committing patches by other developers.
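For anyone who wants to check the sums, here is the same breakdown as a few lines of Python. The numbers are the ones quoted above; the grouping of "release management or packaging" and "committing other people's patches" into a single maintenance share is my own simplification.

```python
# Commit counts since Brasero 2.32.0, as quoted in the text above.
commits = {"translations": 55, "Philippe": 50, "Luis": 9, "others": 4}
total = sum(commits.values())

# Philippe's 50 commits, broken down as described.
release_or_packaging = 14
applying_others_patches = 5
own_development = (
    commits["Philippe"] - release_or_packaging - applying_others_patches
)

# Share of Philippe's commits that were maintenance rather than development.
maintenance_share = (
    (release_or_packaging + applying_others_patches) / commits["Philippe"]
)
print(total, own_development, f"{maintenance_share:.0%}")
```

So of 118 commits overall, 31 of Philippe's 50 were his own features, fixes or porting work, and a little over a third were release management or integrating other people's patches.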

This is a lot of detail, but the point I am making is that the “maintenance” part of the work is relatively small, and that the bigger part of maintenance is actually sending out the announcements, paying attention to bug reports and performing timely patch review. I would be interested to know how much time Philippe has spent working on Brasero over the past release cycle. I would guess that he has spent a few hours (somewhere between 5 and 10) a week.

On the other hand, the Debian maintainer for the Brasero package has a different job. There are 6 bugs currently forwarded upstream from the Debian bug tracker, and another 35 or so awaiting some final determination. A number of these look like packaging bugs (“you need version X of dependency Y installed”). The last release packaged and uploaded was 2.30.3-2, dating from November, and there have been 4 releases packaged in the past 8 months, none by the maintainer.

A typical Debian maintainer is a “Debian developer” for several packages. Pedro Fragoso, the Debian maintainer of Brasero, maintains 5 packages. I think it is fair to say that the amount of time a package maintainer spends maintaining an individual package is quite low, unless it is extremely popular. Perhaps a few hours a month.

The package maintainer has little or no say (beyond interacting with the project maintainer and forwarding on bug reports & feature requests) in what happens upstream, or which features have a high priority. His influence comes primarily from the fact that he is representing a larger user base and can indicate which bugs his distro’s users are running into and reporting regularly, or which feature requests are generating a lot of feedback.

What’s in a word?

It’s clear that a package maintainer is not the same thing as a project maintainer. So when Sivan asked on the MeeGo developer list how he could become a maintainer, he clarified later to say that what he was really asking was “How can I effect change in MeeGo?” To do that, you need to write some code that changes a module, or a number of modules, and then you need to get that code into MeeGo.

How that happens, in all its gory details, is the next instalment in this series of at least 2 articles: The Lifecycle of a Patch (or: Working Upstream).

Community Building Guide

January 6, 2011 community, freesoftware, gnome, maemo 4 Comments

I wrote another guest article for the VisionMobile blog last week, which just went live yesterday, titled “Open Source community building: a guide to getting it right”.

Excerpt:

Community software development can be a powerful accelerator of adoption and development for your products, and can be a hugely rewarding experience. Working with existing community projects can save you time and money, allowing you to get to market faster, with a better product, than is otherwise possible. The old dilemma of “build or buy” has definitively changed, to “build, buy or share”.

Whether you’re developing for Android, MeeGo, Linaro or Qt, understanding community development is important. After embracing open development practices, investing resources wisely, and growing your reputation over time, you can cultivate healthy give-and-take relationships, where everyone ends up a winner. The key to success is considering communities as partners in your product development.

By avoiding the common pitfalls, and making the appropriate investment of time and effort, you will reap the rewards. Like the gardener tending his plants, with the right raw materials, tools and resources, a thousand flowers will bloom.

After focusing recently on a lot of the things that people do wrong, I wanted to identify some of the positive things that companies can do to improve their community development experiences: try to fit in, be careful who you pick to work in the community, and ensure that your developers are engaging the project well. If you are trying to grow a community development project around a piece of software, then you should ensure that you lower the barriers to entry for new contributors, ensure that you create a fair and just environment where everyone is subject to the same rules, and don’t let the project starve for lack of attention to things like patch review, communication, public roadmapping and mentoring.

The original title of the article was “Here be dragons: Best practices for community development” – I’ll let you decide whether the VisionMobile editors made a good decision to change it or not.