Query hierarchical structures in Python

3 minute read

Recently I need to write tests in Python for code which generates big hierarchical structures (in my case it was dicts of lists of dicts, etc.) Asserting all of that was not nice task at all. I’ve started to think about some query language for that. Quick googling shows that there are few projects which addressed that issue: ObjectPath, jsonpath-rw and jsonpath-ng.

It turns out I do not like all of them. I do not like approach, syntax.

After a dozen of minutes of thinking about my next step, I come up with an alternate self-made approach which showed itself to be very effective for my task.

This approach was inspired by .NET LINQ (for .NET collections), that I used a long time ago and MatLab/Octave which I was experimenting with recently.

So I got an idea to run queries on standard collections and chaining them as if you use extension methods for collections from LINQ. And I decided to represent all data as lists (iterables to be precise) similar to “everything is a matrix” approach from Octave. During writing this, I found it very similar to UNIX pipes and filtering/processing of output lines. I hope this will help to get the idea faster if you are familiar with this technologies.

This is a bit simpler version of what I’ve got:

import typing as t
from itertools import chain

class ListFlow:
    def __init__(self, iterable: t.Iterable[t.Any]) -> None:
        self.__iterable = iterable

    def attr(self, name: str) -> 'ListFlow':
        return ListFlow((self.__getattr(i, name) for i in self.__iterable))

    def filter(self, cond: t.Callable[[t.Any], bool]) -> 'ListFlow':
        return ListFlow((i for i in self.__iterable if cond(i)))

    def eq(self, attr: str, value: t.Any) -> 'ListFlow':
        return self.filter(lambda i: self.__getattr(i, attr) == value)

    def neq(self, attr: str, value: t.Any) -> 'ListFlow':
        return self.filter(lambda i: self.__getattr(i, attr) != value)

    def map(self, func: t.Callable[[t.Any], t.Any]) -> 'ListFlow':
        return ListFlow((func(i) for i in self.__iterable))

    def keys(self) -> 'ListFlow':
        return self.map(lambda i: set(i.keys()))

    def values(self) -> 'ListFlow':
        return self.map(lambda i: list(i.values()))

    def select(self, *names: str) -> 'ListFlow':
        def gen() -> t.Iterable[t.Any]:
            for i in self.__iterable:
                new_i = {}
                for f in names:
                    new_i[f] = self.__getattr(i, f)
                yield new_i

        return ListFlow(gen())

    def flatten(self) -> 'ListFlow':
        return ListFlow(chain.from_iterable(self.__iterable))

    def list(self) -> t.List[t.Any]:
        return list(self.__iterable)

    def set(self) -> t.MutableSet[t.Any]:
        return set(self.__iterable)

    def is_empty(self) -> bool:
        return len(self.list()) == 0

    def count(self) -> int:
        count = 0

        for _ in self.__iterable:
            count += 1

        return count

    @staticmethod
    def __getattr(item: t.Any, attr: str) -> t.Any:
        if isinstance(item, dict):
            return item[attr]
        return getattr(item, attr)

Usage examples

We are going to do out queries on iterables. If we have just single value we will create single element list like:

data = { "foo": "bar" }

q = ListFlow([data])

Now we can use .attr method to navigate by attributes (object attributes or dict values). Use .flatten to make lists flat. Use .keys or .values to get list of all keys from all dicts in current iterable or all values. Use .filter (or more specific .eq, .neq) to filter current items. We can use .select to convert items to dicts with copied values. Or finaly we can use .set, .list or .count, .is_empty to extract data.

Note that all operations are lazy.

Now imagine we have data like that:

sample_data = {
    "foo": {
        "bar": [
            {
                "a": 1,
                "foo": "bar",
                "b": "xxx",
                "c": "yyy"
            },
            {
                "a": 0,
                "foo": "abc",
                "b": None,
                "c": None
            },
            {
                "a": 0,
                "foo": "bar",
                "b": None,
                "c": None
            }
        ]
    }
}

Selecting subitems:

ListFlow([sample_data]).attr("foo").attr("bar")

gives us iterable with this structure

[
    [
        {
            "a": 1,
            "foo": "bar",
            "b": "xxx",
            "c": "yyy"
        },
        {
            "a": 0,
            "foo": "abc",
            "b": None,
            "c": None
        },
        {
            "a": 0,
            "foo": "bar",
            "b": None,
            "c": None
        }
    ]
]

Flatten internal lists:

# returns ["bar", "abc", "bar"]
ListFlow([sample_data]).attr("foo").attr("bar").flatten().attr("foo").list()

Assert that “b” and “c” always is None if “a” equal 0:

bar_data = ListFlow([sample_data]).attr("foo").attr("bar").flatten().list()

this.assertEqual(
    {None},
    ListFlow(bar_data).eq("a", 0).select("b", "c").values().flatten().set()
)

Assert that we have only “a”, “b”, “c” and “foo” keys:

this.assertEqual(
    {"a", "b", "c", "foo"},
    ListFlow(bar_data).keys().flatten().set()
)

And so on.

This class is simple but gives the ability to select data easily and expressively. It’s easy to extend functionality to match your needs (etc. indexing selection).

Now I’m thinking to create a library out of it.