Searching for documents with arrays of objects using Elasticsearch

Originally posted on ixa.io.

Elasticsearch is pretty nifty in that searching for documents that contain an array item requires no additional work compared to searching a flat document. Furthermore, searching for documents that contain an object with a given property in an array is just as easy.

For instance, given documents with a mapping like so:

{
    "properties": {
        "tags": {
            "properties": {
                "tag": {
                    "type": "string"
                },
                "tagtype": {
                    "type": "string"
                }
            }
        }
    }
}

We can do a term query to find all the documents containing a tag object whose tag is find me:

{
    "query": {
        "term": {
            "tags.tag": "find me"
        }
    }
}

Using a bool query we can extend this to find all documents with both a tag find me and a tagtype list.

{
    "query": {
        "bool": {
            "must": [
                {"term": {"tags.tag": "find me"},
                {"term": {"tags.tagtype": "list"}
            ]
        }
    }
}

But what if we only wanted documents that contain a tag object with both tagtype list *and* tag find me? While the above query would find them, and they would be scored higher for having two matches, what people often don’t expect is that it also matches documents where the terms are spread across separate tag objects: a document tagged with, say, {"tag": "hide me", "tagtype": "list"} and {"tag": "find me", "tagtype": "inline"} satisfies both term clauses, so these hide me lists are returned when you don’t want them to be.

This is especially surprising if you’re doing a filter instead of a query; and especially-especially surprising if you’re doing it via an abstraction API such as elasticutils and expecting Django-esque filtering.

How Elasticsearch stores objects

By default the object mapping type stores the values in a flat dotted structure. So:

{
    "tags": {
        "tag": "find me",
        "tagtype": "list"
    }
}

Becomes:

{
    "tags.tag": "find me",
    "tags.tagtype": "list"
}

And for lists:

{
    "tags": [
        {
            "tag": "find me",
            "tagtype": "list"
        }
    ]
}

Becomes:

{
    "tags.tag": ["find me"],
    "tags.tagtype": ["list"]
}

This saves a whole bunch of complexity (and memory and CPU) when implementing searching, but it’s no good when we want to find documents containing specific objects.

Enter: the nested type

The solution to finding what we’re looking for is to use the nested query and mark our object mappings up with the nested type. This preserves the objects and allows us to execute a query against the individual objects. Internally it maps them as separate documents and does a child query, but they’re hidden documents, and Elasticsearch takes care to keep the documents together to keep things fast.

So what does it look like? Our mapping only needs one additional property:

{
    "properties": {
        "tags": {
            "type": "nested",
            "properties": {
                "tag": {
                    "type": "string"
                },
                "tagtype": {
                    "type": "string"
                }
            }
        }
    }
}

We then make a nested query. The path is the dotted path of the array we’re searching; query is the query we want to execute inside the array, in this case the bool query from above. Because an individual sub-document has to match the subquery for the main query to match, this is now the and operation we are looking for.

{
    "query": {
        "nested": {
            "path": "tags",
            "query": {
                "bool": ...
            }
        }
    }
}
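
For completeness, here is the whole thing with the bool query from above spelled out. The two term clauses are unchanged; they are now just matched against each tag object individually:

{
    "query": {
        "nested": {
            "path": "tags",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"tags.tag": "find me"}},
                        {"term": {"tags.tagtype": "list"}}
                    ]
                }
            }
        }
    }
}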

Using nested with Elasticutils

If you are using elasticutils, unfortunately it doesn’t support nested out of the box, and calling query_raw or filter_raw breaks your Django-esque chaining. However, it’s pretty easy to add support using something like the following:

    # Add this to a subclass of elasticutils' S; it handles the custom
    # nested action and reuses S's own filter processing for the subfilter.
    def process_filter_nested(self, key, value, action):
        """
        Do a nested filter.

        Syntax is filter(path__nested=filter).
        """

        return {
            'nested': {
                'path': key,
                'filter': self._process_filters((value,)),
            }
        }

You can then use it something like this:

S().filter(tags__nested=F(**{
    'tags.tag': 'find me',
    'tags.tagtype': 'list',
}))

vim + pathogen + git submodules

As part of PyCon Au this weekend I did a lot of hacking on my laptop, which is not something I’ve done for a while, given how much I used to. I was frequently getting annoyed that my vim config wasn’t the same as what’s currently on my work desktop.

Back in uni, I used to keep my dotfiles in revision control on a machine I could connect to. All I needed was my ssh agent and I could get all my config.

Recently, whenever I’ve wanted to extend my vim config, people’s pages have told me to do it through pathogen, which I didn’t have set up.

Also there have been times when people have asked for my vim setup, which I wasn’t easily able to provide.

Given all of that, I decided it was time to rebuild my config using pathogen + git submodules from the ground up. As part of this, I updated quite a few plugins, and started using a few new things I hadn’t had available without pathogen, so it’s still a bit of a work in progress, but it’s here.

Installing new plugins is easy with git submodule add PATH bundle/MODULE_NAME.

If you’re a vim user, I strongly recommend this approach.

free as in gorgeous

I’m at PyCon Au.[1] I made it this year. There were no unexpected collisions in the week leading up to it.

I decided at the last moment to do a lightning talk on a piece of Django-tech I put together at Infoxchange, a thing called Crisper, which I use in a very form-heavy app I’m developing. Crisper is going to be up on Github, just as soon as I have time to extract it from the codebase it’s part of (which mostly consists of working out where it’s pointlessly coupled).

The title of my post relates to deciding to do this talk, and the amazing free and open graphic design assets I have at my fingertips to create an instant title slide. Five minutes in Inkscape, using the Tango palette and the Junction typeface (available in Fedora as tlomt-junction-fonts), to put together things like this:

[Image: Crisper title slide]

Anyway, some pretty good content today at DjangoCon Au. I especially enjoyed the food for thought about whether we should be targeting new development at Python 3. It has struck me that writing greenfields code on Python 2 is a dead end that will eventually result in my having to port it anyway. I’m thinking I will conduct a quick stocktake of the dependencies we’re using at work to evaluate a Python 3 port. I think generally we’re writing pretty compatible code using modern Python syntax, so from the point of view of our apps, a port at this stage would be fairly straightforward.

There was some interesting discussion too on the future of frameworks like Django. Is the future microframeworks plus rich JS on the frontend using Angular or Ember? How do things like Django adapt to that? Even though JS has nothing on Python, is the future node.js on the back and somethingJS on the front?[2]

~

As a nice aside, I’ve been working on a pure in-web app for visualising data from Redmine, using D3.js and Redmine’s RESTful API.[3] I intend to add configuration in localstorage (probably via Angular) once I have the visualisations looking good. It serves a practical purpose and doubles as an interesting experiment in writing web apps without a web server.

~

Almost back on topic, I had dinner with a bunch of cyclists tonight, and got to talking about my crowd-sourced cycle maps project,[4] which hasn’t gone anywhere because I still don’t have any hosting for it. I should try having dinner with cyclists who work for PaaS hosting providers.[5]

Finally, I’m trying to recruit talented web developers who have a love of Agile, test-driven design and Python. Also a technical tester. Come and talk to me.

  1. Thanks to Infoxchange, who sent me, and are also sponsoring the conference.
  2. It always cracked me up that my previous web-app was this HTML5-y app, but designed without the use of frameworks. Old and new both at once.
  3. https://github.com/danni/redmine-viz
  4. https://github.com/danni/cycle-router
  5. I need Postgres 9.2 and PostGIS 2.0. Come chat to me!

calling Melbourne Python and Perl programmers

For the last couple of months I’ve been working for Infoxchange, a not-for-profit that provides technology to other not-for-profits. I’ve been working in the webapp development team, where we mostly work on webapps in the health and community sector, using both Perl (for the older stuff) and Python/Django (for the newer stuff).

The government didn’t really work out for me, so this is a nice change. It’s relaxed; people wear jeans to work. We have fair trade tea and coffee and a delivery of CSA fruit every week.

It is busy though. We’ve got a lot going on, and not enough people to do it, so we’re looking for more. So if you’re a committed, talented Perl, Python or web/Javascript programmer or devops person who is in (or willing to move to) Melbourne and wants to make a difference, you should get in touch with me. The pay is good (you will not be working for peanuts) and the team is fun.

We love open source, we contribute upstream, we have an organisation Github account. You can run what you want on your desktop. Developers have technical ownership over their work. Development is Agile.

Some technologies we love are Perl, Mojolicious, Python, Django, Javascript, jQuery, Bootstrap, Less, NodeJS, AngularJS, Github, Puppet, Debian and PostgreSQL. If you love any of those too, you should totally get in touch. If you love more technologies, bring those along too.

First thoughts on RedHat OpenShift

I’m looking for a PaaS provider that isn’t going to cost me very much (or anything at all) and supports Flask and PostGIS. Based on J5’s recommendation on my blog the other day, I created an OpenShift account.

A free OpenShift account gives you three small gears,[1] which are individual containers you can run an app on. You can either run an app on a single gear or have it scale to multiple gears with load balancing. You then install the components you need, which OpenShift refers to by the pleasingly retro name of cartridges. For instance, Python 2.7 is one cartridge and PostgreSQL is another. You can either install all your cartridges on one gear or on separate gears, based on your resource needs.[2]

You choose your base platform cartridge (e.g. Python 2.6) and optionally give it a git URL to do an initial checkout from (which means you can deploy an app that is already arranged for OpenShift very quickly). The base cartridge sets up all the hooks for redeploying after a git push (you get a git remote that you can push to to redeploy your app). The two things you need are a root setup.py containing your pip requirements, and a wsgi/application file, which is a Python blob containing a WSGI object named application. For Python it uses virtualenv and all that awesome stuff. I assume for node.js you’d provide a package.json and it would use npm, and similarly RubyGems for Ruby, etc.
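
To illustrate, a minimal wsgi/application file could be as simple as the sketch below. The location and the name application are what OpenShift expects, as described above; the body of the app is just a placeholder (a real deployment would expose your Flask or Django WSGI object instead):

#!/usr/bin/env python
# OpenShift looks for a WSGI callable named `application` in the
# wsgi/application file of your repository.

def application(environ, start_response):
    # Placeholder response; replace with your real WSGI app
    # (a Flask instance is itself a WSGI callable, for example).
    body = 'Hello from OpenShift\n'
    start_response('200 OK', [('Content-Type', 'text/plain'),
                              ('Content-Length', str(len(body)))])
    return [body]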

There’s a nifty command line tool written in Ruby (what happened to Python-only Red Hat?) that lets you do all the cloud managementy stuff, including reloading cartridges and gears, tailing app logs and SSHing into a gear. I think an equivalent of Django’s dbshell, driven by your DB cartridge, would be really useful, but it’s not a big deal.

There are also deploy hooks you can add to your git repo to do things like create your databases. I haven’t used them yet, but again they would make deploying your app very fast.

There are also quickstart scripts for deploying things like WordPress, Rails and a Jenkins server onto a new gear. Speaking of Jenkins there’s also a Jenkins client cartridge which I think warrants experimentation.

So what’s a bit crap? Why isn’t my app running on OpenShift yet? Basically because the available cartridges are a little antique. The supported Python is 2.6, which I could port my app to; there are also community-supported 2.7 and 3.3 cartridges, so that’s fine for me (TBH, I thought my app would run on 2.6) but maybe annoying for others. There is no Celery cartridge, which I would have expected, ideally so you can farm tasks out to other gears; and although you apparently can use it, there’s very little documentation I could find on how to get it running.

Really though, the big kick in the pants is that there is no cartridge for Postgres 9.2/PostGIS 2.0. There is a community cartridge you can use on your own instance of OpenShift Origin, but that defeats the purpose. So either I’m waiting for a newer Postgres to be made available on OpenShift, or backporting my code to Postgres 8.4.

Anyway, I’m going to keep an eye on it, so stay tuned.

  1. Small gears have 1GB of disk and 512MB of RAM allocated.
  2. I think if you have a load-balancing (scalable) application, your database needs to be on its own gear so all the other gears can access it.

scratching my own itch [or: why I still love open source]

I’m visiting my parents for the long weekend. Sitting in the airport, I decided I should use my spare time to write some documentation, which made this the first time I’d tried to connect to work’s OpenVPN server. While it’s awesome that Network Manager can now import OpenVPN configs, it didn’t work, because NM doesn’t support the crucial keysize parameter.

Rather than work around the problem, as some people have done (which would annoyingly break my other OpenVPNs), I used the fact that it’s open source to fix the problem properly.

My Dad asked if I was working. No, well, not really. I’m fixing the interface to my VPN client so I can connect to work’s VPN, I replied. Unglaublich! (unbelievable!) my father remarked. Not unbelievable, because it’s open source!

IdentitiesOnly + ssh-agent

I’m really hoping that someone can provide me with some enlightenment.

I have a lot of ssh keys: six by today’s count. On my desktop I have ssh configured with IdentitiesOnly yes and an IdentityFile for each host. This works great.

I then forward my agent to my dev VM. I can see the keys with ssh-add -l. So far so good. If I then ssh into a host, I can see it trying every key from the agent in sequence, which is sometimes going to fail with too many keys tried. However, if I set IdentitiesOnly yes in my dev VM’s config, it doesn’t offer any keys; and if I add IdentityFile entries, they don’t work, because I don’t have those key files on my VM.

So what’s the solution? What I want is to specify identities by their identifier in the agent, e.g. danni@github, but I can’t see any config option to do that. Anyone got a nifty solution?

elevation data and APIs

So I found a bit of time to hack on my project today. Today’s task was to load and validate the data from Geoscience Australia’s SRTM digital elevation model, which I downloaded from their elevation data portal last week.[1] The data is Creative Commons, so I might just upload it somewhere, if I can find a place for 2GB of elevation data.

This let me load elevation data in a 3 arc-second (about 100m) grid, which I did using the ubiquitous GDAL via its Python API. Initial code is here. It doesn’t do anything super clever yet, like checking and normalising the projection, because I don’t need it to.[2]
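
The gist of reading a tile with GDAL’s Python API is something like this (a sketch; the file name is hypothetical and the real code lives in the repo linked above):

from osgeo import gdal

# Open a DEM tile and read the elevation band into a NumPy array.
dataset = gdal.Open('srtm_tile.tif')  # hypothetical file name
band = dataset.GetRasterBand(1)
elevations = band.ReadAsArray()

# The geotransform maps pixel indices to georeferenced coordinates;
# the no-data value marks masked cells (e.g. out at sea).
origin_x, pixel_width, _, origin_y, _, pixel_height = dataset.GetGeoTransform()
nodata = band.GetNoDataValue()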

Looking at plots of the values can give you a gist of what’s what (oh look, it goes out to sea, and then the data is masked out), but it doesn’t really validate anything. I could do validation runs against my GPS tracks, but for a first pass I decided it would be easier to validate using Google’s Elevation API. This is a pretty neat web service: you make a request to it and it gives you back some JSON (or XML). There are undoubtedly Python APIs to access it, but it’s pretty easy to do a simple call with urllib2 or httplib. I chose to reuse my httplib Client wrapper from my RunKeeper/HealthGraph API, and wrote the call directly in the test.
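
A bare urllib2 version of such a call looks something like this (a sketch: the endpoint and parameters are Google’s Elevation API as documented at the time, and the coordinates are just an example point in Melbourne):

import json
import urllib2

# Ask Google's Elevation API for the elevation at a single lat,lng point.
url = ('http://maps.googleapis.com/maps/api/elevation/json'
       '?locations=-37.8136,144.9631&sensor=false')
data = json.load(urllib2.urlopen(url))
print data['results'][0]['elevation']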

For a real unit test I would probably have calculated the residuals and ensured they sat within some acceptable range, but I’m lazy, so instead I just plotted the two together. Google’s data, you will notice, includes bathymetry, which is actually pretty neat.

[Image: SRTM v Google Elevation]

  1. Note to the unwary: the portal seems to be buggy in Chrome.
  2. I did write a skeleton context manager for gdal.Open. I say skeleton because it doesn’t actually do anything smart, like turning errors from GDAL into Exceptions, because I didn’t have any errors to handle.

Investigating cycling speed anomalies

As I’ve spent the last year learning Melbourne as a cyclist, there have been a few times when I’ve found myself on an absolutely staggering hill, only to go down it again, or, worse, to find there was another way that totally avoided the hill; or I’ve chosen routes that subjected me to staggering headwinds, only to be told I should have taken another route instead.

This got me thinking: with everyone tracking their rides on their smartphones, why couldn’t I feed all of this data into a model, along with data like NASA’s elevation grids or the Bureau of Meteorology’s wind observations? As something to keep me occupied over Christmas, I started a little project on the plane to Perth.

It turns out RunKeeper has this handy API that lets you access everything stored there. Unfortunately it seems that no one has really written a good Python API for this, so I put one together.

Throw in a bit of NumPy, PyProj (to convert to rectilinear coordinates) and Matplotlib (to plot it) and you can get a graph that looks like this (which thankfully looks a lot like RunKeeper’s graph):
[Image: Speed Anomalies]
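
For reference, the projection step with PyProj looks something like this (a sketch; the coordinates are made up, and UTM zone 55 south, which covers Melbourne, is one of the hard-coded assumptions I mention below):

import numpy as np
import pyproj

# Example track coordinates in degrees; the real ones come from RunKeeper.
lons = np.array([144.96, 144.97, 144.98])
lats = np.array([-37.81, -37.82, -37.83])

# Project lon/lat onto rectilinear (UTM) coordinates so distances,
# and hence speeds, come out in metres.
utm = pyproj.Proj(proj='utm', zone=55, south=True, ellps='WGS84')
x, y = utm(lons, lats)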
If we do some long-window smoothing, we can get an idea of a cyclist’s average speed and then calculate a percentage anomaly from that average. This lets us compensate for different cyclists, how tired they are, or whether they’re riding with someone else.[1]
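
A sketch of that computation in NumPy (speeds is assumed to be an array of instantaneous speeds sampled along a track, and the window length is arbitrary):

import numpy as np

def percentage_anomaly(speeds, window=101):
    # A long-window moving average approximates the cyclist's typical speed.
    kernel = np.ones(window) / window
    average = np.convolve(speeds, kernel, mode='same')
    # Positive where the cyclist is faster than usual, negative where slower.
    return (speeds - average) / average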

If we then do this for lots of tracks, and grid the results based on whether the velocity vector at each point is headed towards or away from Melbourne,[2] we can get spatial plots that look like this (blue is -1 and red is 1):
[Image: Directional Speed Anomalies]
If you squint at the graphs you can sort of see that there are many places where the blue/red are inverted, which is promising: it means something was making us faster one way and slower the other (a hill, the wind, or the pub). You can also see that I still don’t really have enough data; I tend to always cycle the same routes. If I want to start considering factors that are highly temporally variable, like wind, I’m going to need a lot more data to keep the number of datapoints (I always want to call this fold) high in my temporal bins.

The next step, I suppose, is to set up the RunKeeper download as a web service, so people can submit their RunKeeper tracks to me. This means I’m going to have to fix some hard-coded assumptions in the code, like the UTM zone for the rectilinear projection, and what constitutes an inbound or an outbound route. Unsurprisingly, this has become a lot more ambitious than a summer project.

If you feel like having a play, there is source code.

  1. This does have the side effect of reducing the signal from head/tail winds, especially on straight trips; I need to think about this more.
  2. We need this, otherwise the velocity anomaly would average out depending on which direction we’re headed up/down a hill.

Finding Ada — Elaine Miles

October 16th is Ada Lovelace Day, a day that showcases women in engineering, maths, science and technology by profiling a woman technologist, scientist, engineer or mathematician on your blog.

[Image: Elaine Miles portrait]

This year I’m writing about Elaine Miles, a researcher at the Australian Bureau of Meteorology, who I met through mutual colleagues over lunch one day. She has since become one of my go-to people whenever I require a crash-course in something. She awesomely let me interview her for Ada Lovelace Day.

Miles is a physicist working at the Centre for Australian Weather and Climate Research (CAWCR), a joint project between the Bureau of Meteorology and the Commonwealth Scientific and Industrial Research Organisation (CSIRO), where she is investigating the use of dynamic models to predict sea level in the Western Pacific.

Miles studied Applied Mathematics and Physics at the University of Melbourne. She attributes her love of maths to primary school, where she recalls being chastised by her teacher for attempting the subtraction problems further on in the workbook before they had been taught subtraction. She says she would always lament when another class ran over and cut into the maths lesson. Her love of maths originally led her to enroll in Electrical Engineering, but she didn’t like the black box thinking that engineering encourages, preferring to understand concepts from first principles.

Completing a Bachelor in Applied Mathematics, she went on to do Honours in Physics, working on a project in art conservation, which it turns out is an extremely technical field. She built a laser interferometer using off-the-shelf parts (laser, CCD camera and a laptop) to monitor canvas artworks and detect the problems caused to art by changes in microclimate.

After teaching English in Japan for a year, Miles returned to Melbourne where she began her PhD (Miles is not related to Dr Elaine Miles the glass artist). Miles says she wanted to be learning or developing new things (plus she had unfinished business in art conservation), and so a PhD was the logical progression. Her PhD focused on two areas: the science of paint drying (literally watching paint dry, she says) and subsurface imaging. She had a focus on south-east Asia, where Western art production techniques are prevalent but unsuitable because of the different climate.

Miles spent 3 months working with galleries in the Philippines where she used her own laser speckle interferometers to study artwork hanging in the gallery, in-situ. As far as she’s aware, studying art in-situ had never been done before. This work allows conservators to determine best practice and a course of action for storing and restoring works of art.

With her PhD close to being submitted, Miles began work at CAWCR where she first worked on data assimilation of weather balloon observations into weather forecasting models. She then moved on to verifying rainfall prediction models and getting weather radar data assimilated into the model.

For the last 10 months she has worked with the Pacific-Australia Climate Change Science and Adaptation Planning Program (PACCSAPP), where she investigates applying POAMA (Predictive Ocean Atmosphere Model for Australia), a dynamic, coupled ocean-atmosphere, multi-model ensemble global seasonal prediction model, to forecast global sea level anomalies 1-9 months into the future, specifically validating predictions against observations in the Western Pacific. This is the first time dynamic models have been used to predict medium-term sea level, and it forms an extremely important part of helping the Pacific adapt to the immediate effects of climate change.

As for the future, Miles looks forward to getting her PhD submitted, but would like to continue working with sea level modelling. She hopes to start leading projects in Australia and around the world.

[Image: Elaine Miles verifying data]