pachicoHub

Thinking out loud

ClickHouse: geo search

Today we will explore how to save geographical polygons in ClickHouse to see if a set of coordinates falls inside one of them.

TL;DR

Clone the demo in https://github.com/pachico/demo-clickhouse-geo-search and follow instruction or keep reading for some notes.

Features we will use

  • ClickHouse nested type Array(Tuple(Float32, Float32)), which is how polygons are stored
  • ClickHouse JsonEachRow format for INSERTs
  • ClickHouse pointInPolygon function
  • jQ black magic

My crazy use case: in which city is my friend’s apartment?

My good friend John told me he can give me the keys of his apartment in Andorra in case I want to spend some days with my family there, now that the snow season is coming. He also sent me the coordinate of his apartment.

This is great news, thanks John!

Now I want to know where the apartment is. For that, I could just open Google Maps and introduce the coordinates… but that’s not how we’re going to do it.

The apartment coordinates are 1.5995782462755148, 42.56633318931792.

So… what if I download a gazetteer for Andorra, parse all the location geoJSONs in a way I can import them to ClickHouse and do a pointInPolygon match against those coordinates? How hard can that be?

How we can do it

This is our tables definition:

CREATE TABLE IF NOT EXISTS location_polygon (
    `id` UInt64,
    `name` String,
    `country_code` FixedString(2),
    `polygon` Array(Tuple(Float32, Float32))
) Engine = Memory;

Note 1: Since each location can have multiple polygons, I might have more than one row per location.

Note 2: Pay attention to the polygon field definition since that’s crucial to be able to do this kind of query.

Firstly, I will do a shallow clone of Who’s on first Andorra’s gazetteer and store it somewhere locally with:

git clone --depth 1 --single-branch --no-tags git@github.com:whosonfirst-data/whosonfirst-data-admin-ad.git resources/gazetteer

Here comes the fun. In a one-liner I will (in order):

  • filter out all those files that don’t match [[digit]].geojson naming since these are the only ones I’m looking for
  • filter out all those files that don’t have a polygon based geometry type
  • create a new JSON for each polygon inside each location that will contain id, name, country_code and an array of coordinate tuples forming the polygon
  • send each of these JSON documents to ClickHouse for ingestion

Here’s the magic:

find resources/gazetteer/data/ -type f -regex ".*/[0-9]*.geojson" \
    -exec cat {} + \
    | jq 'select(.geometry.type == "Polygon")' \
    | jq '{id, name:(.properties["wof:name"]), country_code:(.properties["iso:country"]) ,polygon:(.geometry["coordinates"][])}' -c \
    | docker-compose exec -T clickhouse clickhouse-client --query="INSERT INTO geo.location_polygon FORMAT JSONEachRow"

Once I have done that I should be able to launch my query and see to which location the apartment belongs to. Let’s see:

SELECT
    id,
    name
FROM geo.location_polygon
WHERE pointInPolygon((1.5995782462755148, 42.56633318931792), polygon) = 1
FORMAT PrettyCompactMonoBlock

┌───────id─┬─name────┐
 85667933  Canillo 
 85632343  Andorra 
└──────────┴─────────┘

It turns out the apartment is in Canillo, that of course falls within the Andorra polygon. Case closed!

Knowing this, I could actually download all gazetteers for the entire world and have some fun with it. Let me know if you do it!

Closing notes

I created a scripted demo that will spawn this scenario using docker-compose so you can play with it.

Find it at https://github.com/pachico/demo-clickhouse-geo-search

Hope you find it as fun as I did.