Redmine analytics with Python, Pandas and iPython Notebooks

ORIGINALLY POSTED ON IXA.IO

We use Redmine at Infoxchange to manage our product backlog using the Agile plugin from Redmine CRM. While this is generally good enough for making sure we don’t lose any cards and has features like burndown charting you still have to keep your own metrics. This is sort of annoying because you’re sure the data is in there somehow and what do you do when you want to explore a new metric?

This is where iPython Notebooks and the data analysis framework Pandas come in. iPython Notebooks are an interactive web-based workspace where you can intermix documentation and Python code, with graphs and tables you produce output directly into the document. Individual “cells” of the notebook are cached so you can work on part of the program, and experiment (great for data analysis) without having to run big slow number crunching or data download steps.

Pandas is a library for loading and manipulating data. It is based on the well-known numpy and scipy scientific packages and extends them to be able to load data from almost any file type or source (i.e. a Redmine CSV straight from your Redmine server) without having to know much programming. Your data becomes intuitively exposed. It also has built-in plotting and great integration with iPython Notebooks.<!–more–>The first metric I wanted to collect was to study the relationship between our story estimates (in points), our tasking estimates (in hours) and reality. First let’s plot our velocity. It is a matter of saving a custom query whose CSV export URL I can copy into my notebook. This means I can run the notebook at any time to get the latest data.

Remember that iPython notebooks cache the results of cells so I can place accessing Redmine into its own cell and I won’t have to constantly be downloading the data. I can even work offline if I want.

import pandas
data = pandas.read_csv('https://redmine/projects/devops/issues.csv?query_id=111&key={key}'.format(key=key))

Pandas exposes the result as a data frame, which is something we can manipulate in meaningful ways. For instance, we can extract all of the values of a column withdata['column name']. We can also select all of the rows that match a certain condition:

data[data['Target version'] != 'Backlog']

We can go a step further and group this data by the sprint it belongs to:

data[data['Target version'] != 'Backlog']\
    .groupby('Target version')

We can then even create a new series of data by summing the points column each group to find our velocity (as_index means we wish to make the sprint version the row identifier for each row, this will be useful when plotting the data):

data[data['Target version'] != 'Backlog']\
    .groupby('Target version', as_index=True)\
    ['Story Points'].sum()

If you’ve entered each of these into an iPython Notebook and executed them you’ll notice that the table appeared underneath your code block.

We can use this data series to do calculations. For example to find the moving average of our sprint velocity we can do this:

velocity = data[data['Target version'] != 'Backlog']\
    .groupby('Target version', as_index=True)\
    ['Story Points'].sum()

avg_velocity = pandas.rolling_mean(velocity, window=5, min_periods=1)

We can then plot our series together (note the passing of ax to the second plot, this allows us to share the graphs on one plot):

ax = velocity.plot(kind='bar', color='steelblue', label="Velocity",
                   legend=True, title="Velocity", figsize=(12,8))
avg_velocity.plot(ax=ax, color='r', style='.-', label="Average velocity (window=5)",
                  legend=True)
ax.xaxis.grid(False)

You can view the full notebook for this demo online. If you want to learn more about Pandas read 10 minutes to Pandas. Of course what makes iPython Notebooks so much more powerful than Excel, SPSS and R is our ability to use any 3rd party Python package we like. Another time I’ll show how we can use a 3rd party Python-Redmine API to load dataframes to extract metrics from the individual issues’ journals.

Returning screenshots in Gitlab CI (and Travis CI)

ORIGINALLY POSTED ON IXA.IO

Our code base includes a large suite of functional tests using Lettuce, Selenium and PhantomJS. When a test fails, we have a hook that captures the current screen contents and writes it to a file. In an ideal CI system these would be collected as a failure artifact (along with the stack trace, etc.) but that’s not currently possible withGitlab CI (hint hint Gitlab team, investigate Subunit for streaming test output).

Instead what we do on Gitlab is output our images as base64 encoded text:

if $SUCCESS; then
    echo "Success"
else
    echo "Failed ------------------------------------"
    for i in Test*.html; do echo $i; cat "$i"; done
    for i in Test*.png; do echo $i; base64 "$i"; echo "EOF"; done
fi

$SUCCESS[/bash]
Of course now you have test output full of meaningless base64’ed data. Enter Greasemonkey.

// ==UserScript==
// @name View CI Images
// @namespace io.ixa.ci
// @description View CI Images
// @version 1
// @match https://gitlabci/projects/*/builds/*
// ==/UserScript==

(function ($) {
    var text = $('#build-trace').html();
    text = text.replace(/(Test_.+\.png)\n([A-Za-z0-9\n\/\+=]+)\nEOF\n/g,
                        '&lt;h2&gt;$1&lt;/h2&gt;&lt;img src="data:image/png;base64,$2" ' +
                        'style="display:block; max-width:800px; width: auto; height: auto;" ' +
                        '/&gt;');
 
    $('#build-trace').html(text);
})(jQuery);

Web browsers (handily) can already display base64’ed images. This little user script will match builds on the CI server you specify and then replace those huge chunks of base64 with the image they represent.

This technique could easily be replicated on Travis CI by updating the jQuery selector.

Using an SSL intermediate as your CA cert with Python Requests

Originally posted on ixa.io

We recently had to work around an issue integrating with a service that did not provide the full SSL certificate chain. That is to say it had the server certificate installed but not the intermediate certificate required to chain back up to the root. As a result we could not verify the SSL connection.

Not a concern, we thought, we can just pass the appropriate intermediate certificate as the CA cert using the verify keyword to Requests. This is after all what we do in test with our custom root CA.

response = requests.post(url, body, verify=settings.INTERMEDIATE_CA_FILE)

While this worked using curl and the openssl command it continued not to work in Python. Python instead gave us a vague error about certificate validity. openssl is actually giving us the answer though by showing the root CA in the trust chain.

The problem it turns out is that you need to provide the path back to a root CA (the certificate not issued by somebody else). The openssl command does this by including the system root CAs when it considers the CA you supply, but Python considers only the CAs your provide in your new bundle. Concatenate the two certificates together into a bundle:

-----BEGIN CERTIFICATE-----
INTERMEDIATE CA CERT
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
ROOT CA CERT THAT ISSUES THE INTERMEDIATE CA CERT
-----END CERTIFICATE-----

Pass this bundle to the verify keyword.

You can read up about Intermediate Certificates on Wikipedia.

Searching for documents with arrays of objects using Elasticsearch

Originally posted on ixa.io.

Elasticsearch is pretty nifty in that searching for documents that contain an array item requires no additional work to if that document was flat. Further more, searching for documents that contain an object with a given property in an array is just as easy.

For instance, given documents with a mapping like so:

{
    "properties": {
        "tags": {
            "properties": {
                "tag": {
                    "type": "string"
                },
                "tagtype": {
                    "type": "string"
                }
            }
        }
    }
}

We can do a query to find all the documents with a tag object with tag term find me:

{
    "query": {
        "term": {
            "tags.tag": "find me"
        }
    }
}

Using a boolean query we can extend this to find all documents with a tag object with tag find me or tagtype list.

{
    "query": {
        "bool": {
            "must": [
                {"term": {"tags.tag": "find me"},
                {"term": {"tags.tagtype": "list"}
            ]
        }
    }
}

But what if we only wanted documents that contained tag objects of tagtype list *and* tag find me? While the above query would find them, and they would be scored higher for having two matches, what people often don’t expect is that hide me lists will also be returned when you don’t want them to.

This is especially surprising if you’re doing a filter instead of a query; and especially-especially surprising if you’re doing it using an abstraction API such as elasticutils and expected Django-esque filtering.

How Elasticsearch stores objects

By default the object mapping type stores the values in a flat dotted structure. So:

{
    "tags": {
        "tag": "find me",
        "tagtype": "list"
    }
}

Becomes:

{
    "tags.tag": "find me",
    "tags.tagtype": "list"
}

And for lists:

{
    "tags": [
        {
            "tag": "find me",
            "tagtype": "list"
        }
    ]
}

Becomes:

{
    "tags.tag": ["find me"],
    "tags.tagtype": ["list"]
}

This saves a whole bunch of complexity (and memory and CPU) when implementing searching, but it’s no good for us finding documents containing specific objects.

Enter: the nested type

The solution to finding what we’re looking for is to use the nested query and mark our object mappings up with the nested type. This preserves the objects and allows us to execute a query against the individual objects. Internally it maps them as separate documents and does a child query, but they’re hidden documents, and Elasticsearch takes care to keep the documents together to keep things fast.

So what does it look like? Our mapping only needs one additional property:

{
    "properties": {
        "tags": {
            "type": "nested",
            "properties": {
                "tag": {
                    "type": "string"
                },
                "tagtype": {
                    "type": "string"
                }
            }
        }
    }
}

We then make a nested query. The path is the dotted path of the array we’re searching. query is the query we want to execute inside the array. In this case it’s our boolquery from above. Because individual sub-documents have to match the subquery for the main-query to match, this is now the and operation we are looking for.

{
    "query": {
        "nested": {
            "path": "tags",
            "query": {
                "bool": ...
            }
        }
    }
}

Using nested with Elasticutils

If you are using Elasticutils, unfortunately it doesn’t support nested out of the box, and calling query_raw or filter_raw breaks your Django-esque chaining. However it’s pretty easy to add support using something like the following:

    def process_filter_nested(self, key, value, action):
        """
        Do a nested filter

        Syntax is filter(path__nested=filter).
        """

        return {
            'nested': {
                'path': key,
                'filter': self._process_filters((value,)),
            }
        }

Which you can use something like this:

S().filter(tags__nested=F(**{
    'tags.tag': 'find me',
    'tags.tagtype': 'list',
})

Running Django on Docker: a workflow and code

It has been an extremely long time between beers (10 months!). I’ve gotten out of the habit of blogging and somehow I never blogged about the talk I co-presented at PyCon AU this year on Pallet and Forklift the standard and tool we’ve developed at Infoxchange to help make it easier to develop web-applications on Docker1.

Infoxchange is one of the few places I’m aware of that runs Docker in prod. If you’re looking at using Docker to do web development, it’s worth checking out what we’ve been doing over on the Infoxchange devops blog.

  1. There’s also Straddle Carrier, a set of Puppet manifests for loading Docker containers on real infrastructure, but they’ve not been released yet as they rely too much on our custom Puppet config. []

lazy-loading class-based-views in Django

So one of the nice things with method-based views in Django is the ability to do this sort of thing to load a view at the path frontend.views.home:

urlpatterns = patterns(
    'frontend.views',

    url(r'^$', 'home', name='home'),
)

Unfortunately, if you’re using class-based-views, you can’t do this:

urlpatterns = patterns(
    'frontend.views',

    url(r'^$', 'HomeView', name='home'),
)

And instead you had to resort to importing the view and calling HomeView.as_view(). Sort of annoying when you didn’t want to import all of those views.

It turns out however that overloading the code to resolve HomeView is not that difficult, and we can do it with a pretty straightforward monkeypatch. This version uses the kwargs argument of url() to pass keyword arguments to as_view().

from django.conf import urls
from django.views.generic import View


class ClassBasedViewURLPattern(urls.RegexURLPattern):
    """
    A version of RegexURLPattern able to handle class-based-views

    Monkey-patch it in to support class-based-views
    """

    @property
    def callback(self):
        """
        Hook locating the view to handle class based views
        """

        view = super(ClassBasedViewURLPattern, self).callback

        if isinstance(view, type) and issubclass(view, View):
            view = view.as_view(**self.default_args)
            self.default_args = {}

        return view

urls.RegexURLPattern = ClassBasedViewURLPattern

A libnss plugin for Docker

So we’ve been starting to use Docker at Infoxchange. A little while ago my Infoxchange colleague Ricky Cook wrote a Twisted service to resolve Docker container names into their private IP addresses.

I thought this was a pretty neat idea, although you had to do some things like return SERVFAIL to round-robin to another DNS server if it wasn’t a Docker host. However it got problematic when something like dhclient/NetworkManager was in charge of your resolv.conf.

I realised on Friday night the the Truely UNIX solution was to write an NSS plugin. This was not quite as trivial as I thought it would be. However I spent a bit of time staring at Lennart’s mdns plugin and started writing one of my own.

docker-nss should work well enough to resolve container IDs and temporary assigned names that you get from docker ps.

There’s a bit of work to do. It can’t reverse IPs back into container names yet. Nor does it return a full list of aliases for a container. Nor does it return a list of IPs for an image name. Patches welcome! It currently depends on GLib, which drags GLib into the memory space of every process on your system, however it wouldn’t be too hard to port away from it (I’m just lazy).

It has unit tests via a mocked docker client. You do have to have modified your nsswitch.conf in order to be able to run the tests. Although if this really bothers someone we could make the test suite itself a Docker container.

Django utility methods (including New Relic deployment notification)

So we’ve moved to Github here at Infoxchange as our primary development platform because pull requests and Travis CI are much nicer than yelling across the room at each other1. To enable Travis to build our code, we’ve needed to move our little utility libraries to Github too. Since some of these were already on pip, it made sense to open source the rest of them too.

The most useful is a package called IXDjango which includes a number of generally useful management commands for Django developers. Especially useful are deploy which will run a sequence of other commands for deployment, and newrelic_notify_deploy which you can use to notify New Relic of your deployment, which annotates all of your graphs with the version number.

We hope these are useful to people.

  1. big shout out to both Github and Travis CI for supporting our not-for-profit mission with gratis private accounts []

Writing your first web app using Python and Flask

I presented a tutorial at linux.conf.au a couple of weeks ago on what there was for Python developers between CGI scripts and Django. That developers needn’t still be writing CGI scripts in 2014 (it happens) and there were frameworks that met your needs.

This tutorial introduces the microframework Flask, and shows of a whole bunch of things you can do with it, up to being a fully-fledged replacement for Django if you’re so inclined.

Video
Examples source

In other news, I am now a maintainer of Lettuce, a BDD framework for Python/Django. So expect a few more Lettuce related blog posts (if I stop ignoring my blog).

More than a side salad: behaviour driven testing and test driven design in Django with Lettuce

Been quiet lately because I am super busy getting a project out of the door. However I did find time to give this talk last night on behaviour driven testing with Lettuce at MelbDjango.

Apologies for the glitches in the PDF. reveal.js is amazing for doing presentations but I didn’t have time to fix the glitches in the PDF output. The presentation is also able to be cloned from Git (view index.html), but you require the fonts (they’re all libre but I was too lazy to use webfonts).

Hopefully, when I get a breather, I can write about some other code I’ve written, or am writing. Or maybe catch up on the rest of my life/projects.