ClickHouse: geo search
Today we will explore how to store geographical polygons in ClickHouse and check whether a set of coordinates falls inside one of them.
TL;DR
Clone the demo at https://github.com/pachico/demo-clickhouse-geo-search and follow the instructions, or keep reading for some notes.
Features we will use
- The ClickHouse nested type `Array(Tuple(Float32, Float32))`, which is how polygons are stored
- The ClickHouse `JSONEachRow` format for INSERTs
- The ClickHouse `pointInPolygon` function
- `jq` black magic
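As a quick taste of the last ingredient on the ClickHouse side: `pointInPolygon` takes a point tuple and an array of vertex tuples. A minimal sketch you can run in any ClickHouse client (the square is made up for illustration):

```sql
-- (3, 3) lies inside the square (1,1)-(10,10), so this returns 1
SELECT pointInPolygon((3., 3.), [(1., 1.), (10., 1.), (10., 10.), (1., 10.)]) AS inside;
-- inside = 1
```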
My crazy use case: in which city is my friend’s apartment?
My good friend John told me he can give me the keys to his apartment in Andorra in case I want to spend some days there with my family, now that the snow season is coming. He also sent me the coordinates of his apartment.
This is great news, thanks John!
Now I want to know where the apartment is. For that, I could just open Google Maps and introduce the coordinates… but that’s not how we’re going to do it.
The apartment coordinates are `1.5995782462755148, 42.56633318931792`.
So… what if I download a gazetteer for Andorra, parse all the location GeoJSON files in a way I can import them into ClickHouse, and do a `pointInPolygon` match against those coordinates? How hard can that be?
How we can do it
This is our table definition:
CREATE TABLE IF NOT EXISTS location_polygon (
`id` UInt64,
`name` String,
`country_code` FixedString(2),
`polygon` Array(Tuple(Float32, Float32))
) Engine = Memory;
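To get a feel for the nested type, here is a hypothetical row inserted with ClickHouse's array/tuple literal syntax (the values are made up, not from the gazetteer):

```sql
-- Hypothetical row: a tiny triangle, just to show how an
-- Array(Tuple(Float32, Float32)) value is written literally
INSERT INTO location_polygon VALUES
    (1, 'Demo place', 'AD', [(1.5, 42.5), (1.7, 42.5), (1.6, 42.7)]);
```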
Note 1: Since each location can have multiple polygons, I might have more than one row per location.
Note 2: Pay attention to the `polygon` field definition, since it is crucial to be able to do this kind of query.
First, I will do a shallow clone of the Who's On First gazetteer for Andorra and store it locally with:
git clone --depth 1 --single-branch --no-tags git@github.com:whosonfirst-data/whosonfirst-data-admin-ad.git resources/gazetteer
Here comes the fun. In a one-liner I will (in order):
- filter out all files that don't match the `[[digit]].geojson` naming pattern, since these are the only ones I'm looking for
- filter out all files that don't have a polygon-based geometry type
- create a new JSON document for each polygon inside each location, containing `id`, `name`, `country_code` and an array of coordinate tuples forming the polygon
- send each of these JSON documents to ClickHouse for ingestion
Here’s the magic:
find resources/gazetteer/data/ -type f -regex ".*/[0-9]*.geojson" \
-exec cat {} + \
| jq 'select(.geometry.type == "Polygon")' \
| jq '{id, name:(.properties["wof:name"]), country_code:(.properties["iso:country"]) ,polygon:(.geometry["coordinates"][])}' -c \
| docker-compose exec -T clickhouse clickhouse-client --query="INSERT INTO geo.location_polygon FORMAT JSONEachRow"
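To see what the `jq` stage emits before ClickHouse ever gets involved, here is the same select-and-transform step run against a hypothetical, hand-written feature (the values are made up; real Who's On First records carry many more properties):

```shell
# Hypothetical minimal GeoJSON feature, shaped like the records the one-liner consumes
cat > /tmp/sample.geojson <<'EOF'
{
  "id": 85667933,
  "properties": { "wof:name": "Canillo", "iso:country": "AD" },
  "geometry": {
    "type": "Polygon",
    "coordinates": [[[1.5, 42.5], [1.7, 42.5], [1.7, 42.7], [1.5, 42.5]]]
  }
}
EOF

# Same filter as in the pipeline: one compact JSON document per polygon ring
result=$(jq -c 'select(.geometry.type == "Polygon")
  | {id, name:(.properties["wof:name"]), country_code:(.properties["iso:country"]), polygon:(.geometry["coordinates"][])}' \
  /tmp/sample.geojson)
echo "$result"
# {"id":85667933,"name":"Canillo","country_code":"AD","polygon":[[1.5,42.5],[1.7,42.5],[1.7,42.7],[1.5,42.5]]}
```

Each such line is exactly one row for the `JSONEachRow` INSERT, which is why a location with several rings ends up as several rows.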
Once I have done that, I should be able to run my query and see which location the apartment belongs to. Let's see:
SELECT
id,
name
FROM geo.location_polygon
WHERE pointInPolygon((1.5995782462755148, 42.56633318931792), polygon) = 1
FORMAT PrettyCompactMonoBlock
┌───────id─┬─name────┐
│ 85667933 │ Canillo │
│ 85632343 │ Andorra │
└──────────┴─────────┘
It turns out the apartment is in Canillo, which of course falls within the Andorra polygon. Case closed!
Knowing this, I could actually download all gazetteers for the entire world and have some fun with it. Let me know if you do it!
Closing notes
I created a scripted demo that will spawn this scenario using docker-compose
so you can play with it.
Find it at https://github.com/pachico/demo-clickhouse-geo-search
Hope you find it as fun as I did.