ClickHouse: S3 table functions

Today I want to explore with you how to import and export data from and to S3 using ClickHouse.
We will also use ClickHouse’s greatCircleDistance() function to calculate the distance between a set of coordinates and all the cities in the GeoNames free gazetteer.

TL;DR

Clone the demo at https://github.com/pachico/demo-clickhouse-s3-table-functions and follow its instructions, or keep reading to learn more.

Our use case: find the closest cities to a coordinate

Once again, a fake use case to justify my demo, so here we go.
My friend John asks for help. This time, he gives me a set of coordinates (55.652743, 12.085123) and challenges me to find the cities closest to it… for some obscure reason.

How we can do it

Of course, I could use Google Maps, insert the coordinates and check what’s around them, but that would be no fun at all, so I’ll do it the hard (but cool) way.

What we will do, instead, is:

  • Upload the GeoNames gazetteer, containing all the cities in the world, to an S3 bucket (see note below) as a TSV file
  • Create a temporary table in ClickHouse where the gazetteer will be stored
  • Import the TSV into ClickHouse using the s3 table function
  • Query for the cities closest to the coordinates using the greatCircleDistance() function
  • Upload the result to the same bucket using the s3 table function

Note: Rather than using AWS S3, I will use MinIO, an open-source, S3-API-compatible object store, so we can run the demo entirely without requiring an AWS account.

If you clone the repo and follow its instructions, the following is what is going to happen.

GeoNames is a geographical database that I have used extensively in the past; it offers a freely downloadable gazetteer for multiple purposes.
Today we will use the file called cities500.zip, which contains all cities with a population >= 500. It should be more than enough for our use case.

So, once both ClickHouse and MinIO are running, I’ll download the TSV file and put it inside a MinIO bucket by running the following inside the MinIO container:

# Download the GeoNames gazetteer
curl http://download.geonames.org/export/dump/cities500.zip --output /tmp/cities500.zip
# Decompress it (gunzip can unpack single-member zip archives; -S marks .zip as the suffix to strip)
gunzip -S .zip -N /tmp/cities500.zip
# Put the file where MinIO serves the geonames bucket from, renamed to .tsv
mkdir -p /data/geonames
mv -f /tmp/cities500 /data/geonames/cities500.tsv

Luckily, the Bitnami image that we use for MinIO includes both curl and gunzip, so we don’t need any extra tools.

Now we are ready to import the file into ClickHouse using the s3 table function.
At this point we should already have a table where the data will be stored (it is created as soon as the container starts), and it looks like this:

CREATE TABLE IF NOT EXISTS geonames.cities (
    geonameid UInt32,
    name String,
    asciiname String,
    alternatenames String,
    latitude Float32,
    longitude Float32,
    feature_class FixedString(1),
    feature_code LowCardinality(String),
    country_code FixedString(2),
    cc2 String,
    admin1_code String,
    admin2_code String,
    admin3_code String,
    admin4_code String,
    population Int64,
    elevation Int32,
    dem String,
    timezone String,
    modification_date Date
) ENGINE = Memory();
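
One detail the snippet above glosses over: since the table lives in the geonames database (the INSERT below targets geonames.cities), the init script presumably creates that database first. Assuming so, that step is simply:

-- Assumed init step: create the database the cities table lives in
CREATE DATABASE IF NOT EXISTS geonames;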

And now we will import the data from the S3 object with the following:

INSERT INTO geonames.cities
SELECT
    *
FROM
    s3(
        'http://minio:9000/geonames/cities500.tsv',
        'user',
        'changeme',
        'TabSeparated',
        'geonameid UInt32, name String, asciiname String, alternatenames String, latitude Float32, longitude Float32, feature_class FixedString(1), feature_code LowCardinality(String), country_code FixedString(2), cc2 String, admin1_code String, admin2_code String, admin3_code String, admin4_code String, population Int64, elevation Int32, dem String, timezone String, modification_date Date',
        'none'
    );

Note 1: Pay attention to that last 'none'. That is the compression method, which we don’t need in our case, but it’s a very handy feature to bear in mind.
Note 2: As you can see, the user and password credentials are the ones specified in docker-compose.yaml.
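
To illustrate what that compression parameter buys you: had the file been uploaded compressed instead, say as a hypothetical cities500.tsv.gz, the same s3 call could read it transparently just by passing 'gzip' instead of 'none' (other values such as brotli, xz and zstd are also understood):

SELECT count()
FROM
    s3(
        'http://minio:9000/geonames/cities500.tsv.gz', -- hypothetical compressed upload
        'user',
        'changeme',
        'TabSeparated',
        'geonameid UInt32, name String, asciiname String, alternatenames String, latitude Float32, longitude Float32, feature_class FixedString(1), feature_code LowCardinality(String), country_code FixedString(2), cc2 String, admin1_code String, admin2_code String, admin3_code String, admin4_code String, population Int64, elevation Int32, dem String, timezone String, modification_date Date',
        'gzip' -- compression method instead of 'none'
    );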

At this point, all our data is in our ClickHouse table, so we can play with it.
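
A quick count, just as a sanity check (not part of the original demo), confirms the rows actually landed:

-- Optional sanity check: how many cities and countries did we import?
SELECT count() AS total_cities, uniqExact(country_code) AS countries
FROM geonames.cities;
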
A simple query will find the closest cities to our starting coordinates, and we will export its result to a new file inside the same bucket.

INSERT INTO FUNCTION s3(
    'http://minio:9000/geonames/closer_to_coordinates.csv', 
    'user', 
    'changeme', 
    'CSVWithNames', 
    'geonameid UInt32, name String, country_code LowCardinality(String), distance Float32', 'none')
WITH 
    55.652743 AS current_latitude,
    12.085123 AS current_longitude
SELECT 
    geonameid,
    name,
    country_code,
    greatCircleDistance(current_longitude, current_latitude, longitude, latitude) AS distance
FROM geonames.cities
ORDER BY distance ASC
LIMIT 10;
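
One detail worth spelling out about the query above: greatCircleDistance() expects longitude first and latitude second for each point, and returns the distance in meters. A standalone call, using Copenhagen’s approximate coordinates as the second point, would look like this:

-- Distance from John's coordinates to (roughly) central Copenhagen, in meters
SELECT greatCircleDistance(12.085123, 55.652743, 12.568337, 55.676098) AS meters_to_copenhagen;

which should land in the ballpark of 30 km, since the point sits just west of Copenhagen.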

Now, a new file named closer_to_coordinates.csv is available in the original bucket with the following content:

"geonameid","name","country_code","distance"
2614481,"Roskilde","DK",1283.3599
2610189,"Vindinge","DK",4718.2124
2611982,"Svogerslev","DK",4877.9355
2622061,"Fløng","DK",6473.969
2620587,"Hedehusene","DK",7107.905
2611132,"Tune","DK",8515.363
2617832,"Lejre","DK",8754.255
2620997,"Gundsømagle","DK",10115.779
2619216,"Jyllinge","DK",11199.348
2610311,"Viby","DK",12189.3125

So, the closest city centre turned out to be Roskilde, in Denmark, precisely 1283.3599 meters from the coordinates John gave me.

If you followed the demo through the little repo I created for it, you can verify this by opening MinIO’s web GUI at http://localhost:9000/minio/geonames/ and logging in with user: user and password: changeme.
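
If you would rather stay inside the ClickHouse client, you can also read the exported file straight back through the same s3 table function:

-- Read the freshly exported report back from the bucket
SELECT *
FROM
    s3(
        'http://minio:9000/geonames/closer_to_coordinates.csv',
        'user',
        'changeme',
        'CSVWithNames',
        'geonameid UInt32, name String, country_code LowCardinality(String), distance Float32',
        'none'
    )
ORDER BY distance ASC;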

Wrapping up

Remember to check out the demo at https://github.com/pachico/demo-clickhouse-s3-table-functions.

S3 table functions are extremely useful. They allow you to concurrently import data, export reports or backups directly to S3, or use S3 as part of storage policies (which we will explore in the future).
GeoNames, on the other hand, is an extremely valuable asset if you ever need a free geographical database.

I hope you found this little demo fun. Don’t hesitate to leave a comment, since I treasure all feedback.