ClickHouse: MongoDB as source for dictionaries
Let’s talk today about how MongoDB can be used as source for ClickHouse external dictionaries.
This is a pretty recent and powerful feature that you can now explore by simply downloading the demo.
TL;DR
Clone the demo in https://github.com/pachico/demo-clickhouse-mongodb-dictionary and follow its instructions or keep reading to know more.
Our use case: where is John sailing to?
My friend John called from his satellite phone to tell me he’s sailing in the Atlantic Ocean. He’s arriving to Portugal and sees a lighthouse far away. He challenges me to find where he’s heading to by just giving me his current coordinates.
How we can do it
Once more, we could simply use Google Maps but that’s boring and we won’t learn much, so we will do it the epic way.
The only thing we knows are his coordinates (37.01763537452769, -9.032827682889476
) and that he’s heading Portugal. So let’s figure it out.
What we will do, instead, is:
- Download the entire GeoName’s gazetteer for Portugal
- Import the downloaded file (tsv) in MongoDB by using
mongoimport
. - Query for the closest lighthouse to coordinate using
greatCircleDistance
function
If you clone the repo and follow its instructions, the following is that is going to happen.
We have already used Geonames in the past so we won’t explain much about it. We will simply remember that it provides its gazetteer also by country, so we won’t have to download it all.
Unfortunately, our MongoDB docker image doesn’t provide curl
nor wget
but it’s based on Debian, so we will install curl
and unzip
to decompress the file. (I know, I could have used other tools to decompress it.)
Once it’s done, let’s download and import the data with:
curl http://download.geonames.org/export/dump/PT.zip --output /tmp/PT.zip
cd /tmp/ && unzip -o /tmp/PT.zip
mongoimport --db=geonames --collection=portugal --file=/tmp/PT.txt \
--mode=upsert --type=tsv \
--fields=_id,name,asciiname,alternatenames,latitude,longitude,feature_class,feature_code,country_code,cc2,admin1_code,admin2_code,admin3_code,admin4_code,population,elevation,dem,timezone,modification_date
At this point we will have roughly 37k features in our MongoDB collection. We should be able to browse them by opening the following url http://localhost:8081/db/geonames/portugal.
Conveniently, for each feature there is a code that illustrate what type of feature that is and, of course, its pair of coordinates.
Now we will create a dictionary in ClickHouse that will read data from MongoDB. Its DDL is the following:
CREATE DICTIONARY portugal
(
_id UInt32,
name String,
feature_class String,
feature_code String,
latitude Float32,
longitude Float32
)
PRIMARY KEY _id
SOURCE(MONGODB(
host 'mongo'
port 27017
user ''
password ''
db 'geonames'
collection 'portugal'
))
LAYOUT(DIRECT());
Note 1: The layout is set as DIRECT
, which means it won’t store any data in ClickHouse side (not in memory, not in disk). For real use cases, you might want to study the layout that fits best to your requirements.
Note 2: We’re using no authentication and this is why certain fields are empty.
Let’s see if everything has worked as planned with a simple query:
SELECT
count() AS count,
feature_class AS class
FROM geonames.portugal
GROUP BY feature_class
ORDER BY count DESC
┌─count─┬─class─┐
│ 16608 │ P │
│ 7608 │ S │
│ 4606 │ A │
│ 4375 │ H │
│ 3121 │ T │
│ 583 │ L │
│ 373 │ V │
│ 53 │ R │
│ 2 │ U │
└───────┴───────┘
Perfect! Now, according to this Geoname’s reference, lighthouses have the code LTHSE
so we’ll search the closest to John’s location.
Let’s do the query:
WITH
37.01763537452769 AS current_latitude,
-9.032827682889476 AS current_longitude
SELECT
_id,
name,
latitude,
longitude,
greatCircleDistance(current_longitude, current_latitude, longitude, latitude) AS distance
FROM geonames.portugal
WHERE feature_code = 'LTHSE'
ORDER BY distance ASC
LIMIT 1
┌─────_id─┬─name───────────────────────────┬─latitude─┬─longitude─┬──distance─┐
│ 9855568 │ Cabo de São Vicente Lighthouse │ 37.02289 │ -8.99663 │ 3266.2986 │
└─────────┴────────────────────────────────┴──────────┴───────────┴───────────┘
Got it! He’s approaching Cabo de São Vicente Lighthouse, what a spectacular place! We found you, John!
Wrapping up
Remember to check out the demo at https://github.com/pachico/demo-clickhouse-mongodb-dictionary.
MongoDB is a very popular database, which I’m sure many already have installed and working. ClickHouse, being able to read from it allows you to work combine both technologies.
References:
- https://clickhouse.tech/docs/en/sql-reference/dictionaries/external-dictionaries/external-dicts-dict-sources/#dicts-external_dicts_dict_sources-mongodb
- https://clickhouse.tech/docs/en/sql-reference/functions/geo/coordinates/#greatcircledistance
I hope you found this little demo fun. Don’t hesitate to leave a comment since I treasure all feedback