pachicoHub

Thinking out loud

ClickHouse: MongoDB as source for dictionaries

Let’s talk today about how MongoDB can be used as source for ClickHouse external dictionaries.
This is a pretty recent and powerful feature that you can now explore by simply downloading the demo.

TL;DR

Clone the demo in https://github.com/pachico/demo-clickhouse-mongodb-dictionary and follow its instructions or keep reading to know more.

Our use case: where is John sailing to?

My friend John called from his satellite phone to tell me he’s sailing in the Atlantic Ocean. He’s arriving to Portugal and sees a lighthouse far away. He challenges me to find where he’s heading to by just giving me his current coordinates.

How we can do it

Once more, we could simply use Google Maps but that’s boring and we won’t learn much, so we will do it the epic way.

The only thing we knows are his coordinates (37.01763537452769, -9.032827682889476) and that he’s heading Portugal. So let’s figure it out.

What we will do, instead, is:

  • Download the entire GeoName’s gazetteer for Portugal
  • Import the downloaded file (tsv) in MongoDB by using mongoimport.
  • Query for the closest lighthouse to coordinate using greatCircleDistance function

If you clone the repo and follow its instructions, the following is that is going to happen.

We have already used Geonames in the past so we won’t explain much about it. We will simply remember that it provides its gazetteer also by country, so we won’t have to download it all.

Unfortunately, our MongoDB docker image doesn’t provide curl nor wget but it’s based on Debian, so we will install curl and unzip to decompress the file. (I know, I could have used other tools to decompress it.) Once it’s done, let’s download and import the data with:

curl http://download.geonames.org/export/dump/PT.zip --output /tmp/PT.zip
cd /tmp/ && unzip -o /tmp/PT.zip
mongoimport --db=geonames --collection=portugal --file=/tmp/PT.txt \
    --mode=upsert --type=tsv \
    --fields=_id,name,asciiname,alternatenames,latitude,longitude,feature_class,feature_code,country_code,cc2,admin1_code,admin2_code,admin3_code,admin4_code,population,elevation,dem,timezone,modification_date

At this point we will have roughly 37k features in our MongoDB collection. We should be able to browse them by opening the following url http://localhost:8081/db/geonames/portugal.
Conveniently, for each feature there is a code that illustrate what type of feature that is and, of course, its pair of coordinates.

Now we will create a dictionary in ClickHouse that will read data from MongoDB. Its DDL is the following:

CREATE DICTIONARY portugal
(
    _id UInt32,
    name String,
    feature_class String,
    feature_code String,
    latitude Float32,
    longitude Float32
)
PRIMARY KEY _id
SOURCE(MONGODB(
    host 'mongo'
    port 27017
    user ''
    password ''
    db 'geonames'
    collection 'portugal'
))
LAYOUT(DIRECT());

Note 1: The layout is set as DIRECT, which means it won’t store any data in ClickHouse side (not in memory, not in disk). For real use cases, you might want to study the layout that fits best to your requirements.
Note 2: We’re using no authentication and this is why certain fields are empty.

Let’s see if everything has worked as planned with a simple query:

SELECT 
    count() AS count,
    feature_class AS class
FROM geonames.portugal
GROUP BY feature_class
ORDER BY count DESC

┌─count─┬─class─┐
 16608  P     
  7608  S     
  4606  A     
  4375  H     
  3121  T     
   583  L     
   373  V     
    53  R     
     2  U     
└───────┴───────┘

Perfect! Now, according to this Geoname’s reference, lighthouses have the code LTHSE so we’ll search the closest to John’s location.

Let’s do the query:

WITH 
    37.01763537452769 AS current_latitude,
    -9.032827682889476 AS current_longitude
SELECT 
    _id,
    name,
    latitude,
    longitude,
    greatCircleDistance(current_longitude, current_latitude, longitude, latitude) AS distance
FROM geonames.portugal
WHERE feature_code = 'LTHSE'
ORDER BY distance ASC
LIMIT 1

┌─────_id─┬─name───────────────────────────┬─latitude─┬─longitude─┬──distance─┐
 9855568  Cabo de São Vicente Lighthouse  37.02289   -8.99663  3266.2986 
└─────────┴────────────────────────────────┴──────────┴───────────┴───────────┘

Got it! He’s approaching Cabo de São Vicente Lighthouse, what a spectacular place! We found you, John!

Wrapping up

Remember to check out the demo at https://github.com/pachico/demo-clickhouse-mongodb-dictionary.

MongoDB is a very popular database, which I’m sure many already have installed and working. ClickHouse, being able to read from it allows you to work combine both technologies.

References:

I hope you found this little demo fun. Don’t hesitate to leave a comment since I treasure all feedback