Searching for documents with arrays of objects using Elasticsearch

Originally posted on ixa.io.

Elasticsearch is pretty nifty in that searching for documents that contain an array item requires no additional work to if that document was flat. Further more, searching for documents that contain an object with a given property in an array is just as easy.

For instance, given documents with a mapping like so:

{
    "properties": {
        "tags": {
            "properties": {
                "tag": {
                    "type": "string"
                },
                "tagtype": {
                    "type": "string"
                }
            }
        }
    }
}

We can do a query to find all the documents with a tag object with tag term find me:

{
    "query": {
        "term": {
            "tags.tag": "find me"
        }
    }
}

Using a boolean query we can extend this to find all documents with a tag object with tag find me or tagtype list.

{
    "query": {
        "bool": {
            "must": [
                {"term": {"tags.tag": "find me"},
                {"term": {"tags.tagtype": "list"}
            ]
        }
    }
}

But what if we only wanted documents that contained tag objects of tagtype list *and* tag find me? While the above query would find them, and they would be scored higher for having two matches, what people often don’t expect is that hide me lists will also be returned when you don’t want them to.

This is especially surprising if you’re doing a filter instead of a query; and especially-especially surprising if you’re doing it using an abstraction API such as elasticutils and expected Django-esque filtering.

How Elasticsearch stores objects

By default the object mapping type stores the values in a flat dotted structure. So:

{
    "tags": {
        "tag": "find me",
        "tagtype": "list"
    }
}

Becomes:

{
    "tags.tag": "find me",
    "tags.tagtype": "list"
}

And for lists:

{
    "tags": [
        {
            "tag": "find me",
            "tagtype": "list"
        }
    ]
}

Becomes:

{
    "tags.tag": ["find me"],
    "tags.tagtype": ["list"]
}

This saves a whole bunch of complexity (and memory and CPU) when implementing searching, but it’s no good for us finding documents containing specific objects.

Enter: the nested type

The solution to finding what we’re looking for is to use the nested query and mark our object mappings up with the nested type. This preserves the objects and allows us to execute a query against the individual objects. Internally it maps them as separate documents and does a child query, but they’re hidden documents, and Elasticsearch takes care to keep the documents together to keep things fast.

So what does it look like? Our mapping only needs one additional property:

{
    "properties": {
        "tags": {
            "type": "nested",
            "properties": {
                "tag": {
                    "type": "string"
                },
                "tagtype": {
                    "type": "string"
                }
            }
        }
    }
}

We then make a nested query. The path is the dotted path of the array we’re searching. query is the query we want to execute inside the array. In this case it’s our boolquery from above. Because individual sub-documents have to match the subquery for the main-query to match, this is now the and operation we are looking for.

{
    "query": {
        "nested": {
            "path": "tags",
            "query": {
                "bool": ...
            }
        }
    }
}

Using nested with Elasticutils

If you are using Elasticutils, unfortunately it doesn’t support nested out of the box, and calling query_raw or filter_raw breaks your Django-esque chaining. However it’s pretty easy to add support using something like the following:

    def process_filter_nested(self, key, value, action):
        """
        Do a nested filter

        Syntax is filter(path__nested=filter).
        """

        return {
            'nested': {
                'path': key,
                'filter': self._process_filters((value,)),
            }
        }

Which you can use something like this:

S().filter(tags__nested=F(**{
    'tags.tag': 'find me',
    'tags.tagtype': 'list',
})

Author: Danielle

Danielle is an Australian software engineer, computer scientist and feminist. She doesn't really work on GNOME any more (sadly). Opinions and writing are solely her own and so not represent her employer, the GNOME Foundation, or anyone else but herself.

Leave a Reply

Your email address will not be published. Required fields are marked *

Creative Commons Attribution-ShareAlike 2.5 Australia
This work by Danielle Madeley is licensed under a Creative Commons Attribution-ShareAlike 2.5 Australia.