pachicoHub

Thinking out loud

Using Vector to store Kong logs in ClickHouse

Today we will explore how to store Kong logs in ClickHouse using Vector.

TL;DR

Clone the demo at https://github.com/pachico/demo-kong-vector-clickhouse and follow the instructions, or keep reading for some notes.

Overview

Kong is the de facto standard for API gateway solutions (sometimes also called API management). It’s fast and has a number of plugins that cover the most common use cases. In our demo, we will use two of them: key-auth, to only allow authorized consumers to use our API endpoints, and tcp-log, to record details about every request handled by Kong.

To store these request events, we will point the tcp-log plugin to Vector, which will receive the events, do a number of transformations and, ultimately, send the events to ClickHouse to be stored.

Before you say anything: depending on the use case, you might prefer a metrics-oriented solution over an event-log one, but this is a demo about features, not a set of instructions to solve all use cases.

If you are curious about the difference between the two, you can read this short but useful article about the subject.

Let’s see what we can do.

Features we will use

  • Kong’s key-auth plugin
  • Kong’s tcp-log plugin
  • Vector’s socket source module
  • Vector’s remap transform module
  • Vector’s unit testing feature
  • Vector’s ClickHouse sink module
  • ClickHouse
  • Vegeta HTTP load testing tool

Let’s get things together

This demo assumes you have either cloned or are following the demo project at https://github.com/pachico/demo-kong-vector-clickhouse.

Let’s start with Kong, the front service our users will hit. We will configure Kong to resolve a route to a dummy NGINX endpoint (a vanilla Docker instance).

Fortunately, Kong can now work without a database by starting it with a declarative configuration.

We will use the following configuration:

_format_version: "1.1"

services:
  - name: my-service
    url: http://nginx:80
    tags:
      - mytags
    routes:
      - name: my-route
        methods:
          - GET
        paths:
          - /mypath
        strip_path: true

consumers:
  - username: my-user
    keyauth_credentials:
      - key: my-key

plugins:
  - name: tcp-log
    enabled: true
    config:
      host: vector
      port: 9000

  - name: key-auth
    enabled: true
    config: 
      key_names: ["apikey"]

In this configuration, pay attention to the plugins we’re using and how they are set up.

We will send all logs (request events) through the tcp-log plugin to a host named vector on port 9000.
Additionally, we will require consumers (API users) to send an API key in order to consume our services. This key must be sent with each request in the apikey header.

We have also created a consumer that has my-user as name and, more importantly, my-key as API key value.

Since both plugins are set at a global scope, no request can go through without being authenticated and all requests will be logged.

To make everything work, we have defined a service named my-service with a route named my-route that resolves GET requests to /mypath by sending them to http://nginx:80.

So far, very easy.

At the other end of the pipeline, we will set up the ClickHouse table where we want all our logs to be stored. This is a simple demo and our table will only contain a few columns, but enough to show what can be done.

Our table definition is the following:

CREATE TABLE IF NOT EXISTS `kong`.`log` (
    `timestamp` DateTime,
    `consumer_username` Nullable(String),
    `request_method` LowCardinality(String),
    `request_size` UInt32,
    `request_uri` String,
    `request_url` String,
    `service_name` String,
    `response_status` UInt16,
    `latency_kong` UInt32,
    `latency_proxy` UInt32,
    `latency_request` UInt32
) ENGINE = MergeTree()
ORDER BY
    toYYYYMMDD(timestamp);

All Kong request events will be stored here, which means this table can quickly grow to hundreds of millions of rows.
Although ClickHouse can handle billions of rows even on a laptop (yes, I have done it), it would be wise to exploit ClickHouse’s materialized view feature to aggregate data and get blazing-fast, real-time results.

Once again, this is just a demo.

Now, those familiar with ClickHouse already know that it doesn’t excel at many small inserts; instead, it prefers infrequent, massive ones, quite the opposite of how logs arrive.
To solve this, several solutions exist, like its Kafka engine (you publish data to Kafka so that ClickHouse can retrieve it from there), its RabbitMQ engine, its Buffer engine (as the name says, it buffers inserts in each node before flushing them to a table), but also external solutions like clickhouse-bulk or kittenhouse.
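To get a feel for why batching matters, here is a toy Python sketch of the idea behind all of these buffers (BatchBuffer is a made-up name for illustration; real buffers like Vector's also handle flush timeouts, retries and backpressure):

```python
# Toy illustration of insert batching: accumulate events and flush them
# in groups instead of performing one insert per event.
class BatchBuffer:
    def __init__(self, flush_fn, max_events=1000):
        self.flush_fn = flush_fn      # called with a list of buffered events
        self.max_events = max_events  # flush threshold
        self.events = []

    def push(self, event):
        self.events.append(event)
        if len(self.events) >= self.max_events:
            self.flush()

    def flush(self):
        if self.events:
            self.flush_fn(self.events)
            self.events = []  # start a fresh batch

batches = []  # stands in for "INSERT INTO kong.log ..." calls
buf = BatchBuffer(batches.append, max_events=3)
for i in range(7):
    buf.push({"id": i})
buf.flush()  # flush the remainder: 7 events become 3 inserts (3 + 3 + 1)
```

Seven events turn into three inserts instead of seven, which is exactly the access pattern ClickHouse prefers.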

Today we won’t use any of these: instead, we’ll rely on Vector’s ClickHouse sink module, which also acts as a buffer.

As we said before, logs are sent to Vector’s socket source. Let’s see it.

[sources.kong_logs]
  type = "socket" 
  address = "0.0.0.0:9000"
  max_length = 102400 
  mode = "tcp"

Again, quite clear: it accepts TCP connections from anywhere on port 9000.
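To illustrate the framing involved, here is a small Python sketch that stands in for the socket source: a TCP listener reading newline-delimited JSON, which is how tcp-log-style messages arrive over the wire (the payload below is a made-up minimal event, not a full Kong log):

```python
import json
import socket
import threading

received = []  # messages decoded by our stand-in "socket source"

def serve(server_sock):
    # Accept one connection and read newline-delimited JSON until EOF.
    conn, _ = server_sock.accept()
    with conn, conn.makefile("r") as reader:
        for line in reader:
            received.append(json.loads(line))

server = socket.socket()
server.bind(("127.0.0.1", 0))  # the demo uses port 9000; any free port here
server.listen(1)
thread = threading.Thread(target=serve, args=(server,))
thread.start()

# What Kong's tcp-log plugin effectively does: open a TCP connection and
# write one JSON document per request, terminated by a newline.
client = socket.create_connection(server.getsockname())
client.sendall((json.dumps({"started_at": 1615809924110}) + "\n").encode())
client.close()
thread.join()
server.close()
```

Each request handled by Kong becomes one such line on the wire, which Vector then parses.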

Now the fun starts. Kong sends each event as a JSON message, which is quite big (you can see a sample here).
Once a message is received, Vector adds a timestamp field marking when it arrived.
We don’t want that (and neither should you). We want to use the started_at field instead, which marks when Kong received the request.

Unfortunately, this field is a Unix epoch in milliseconds, so we can’t use it out of the box and will have to transform it. Additionally, we’re not interested in much of that JSON, and certain things should be redacted or deleted for security reasons, like the API key, which can be found in .request.headers.apikey.
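In Python terms, the conversion we need looks roughly like this (started_at_to_datetime is a hypothetical helper for illustration, not part of any of the tools involved):

```python
from datetime import datetime, timezone

def started_at_to_datetime(started_at_ms: int) -> str:
    # started_at is a millisecond epoch, e.g. 1615809924110;
    # floor it to whole seconds and render it in a format
    # ClickHouse's DateTime type accepts.
    seconds = started_at_ms // 1000
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

print(started_at_to_datetime(1615809924110))  # 2021-03-15 12:05:24
```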

To do that, we will use the remap transformation module, which uses Vector Remap Language (VRL) to manipulate the object.

This is the first transformation:

[transforms.kong_logs_shortened]
  type = "remap" 
  inputs = ["kong_logs"] 
  source = ''' 
  ., _ = parse_json(.message)
  ts, _ = .started_at / 1000
  .timestamp = to_timestamp(floor(ts))
  .request.headers = {
    "user-agent":                  .request.headers."user-agent",
    "host":                        .request.headers.host,
    "x-forwarded-for":             .request.headers."x-forwarded-for", 
    "x-consumer-id":               .request.headers."x-consumer-id",
    "x-consumer-username":         .request.headers."x-consumer-username"
  }
  .response.headers = {
    "content-length":              .response.headers."content-length",
    "content-type":                .response.headers."content-type",
    "x-kong-proxy-latency":        .response.headers."x-kong-proxy-latency",
    "x-kong-upstream-latency":     .response.headers."x-kong-upstream-latency"
  }
  .route = {
    "https_redirect_status_code":  .route."https_redirect_status_code",
    "methods":                     .route.methods,
    "name":                        .route.name,
    "paths":                       .route.paths
  }
  .service = {
    "name":                        .service.name,
    "tags":                        .service.tags
  }
  '''

See how we first parse the entire .message field (which is where the original message is stored after being received by Vector) and create a new .timestamp field, which is Kong’s started_at in seconds, floored and converted to Vector’s timestamp type.
Then we rewrite the .request.headers object to include only the headers we want, deleting every trace of potentially compromising data (such as the API key). We continue doing the same with other fields.
This creates a smaller object with the proper timestamp.
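As a rough Python equivalent of that header rewrite (shorten_headers and KEPT_REQUEST_HEADERS are illustrative names, not anything from Vector):

```python
# Keep an explicit allow-list of request headers and drop everything else,
# including the apikey header we must never persist.
KEPT_REQUEST_HEADERS = ["user-agent", "host", "x-forwarded-for",
                        "x-consumer-id", "x-consumer-username"]

def shorten_headers(headers: dict) -> dict:
    # Absent headers become None, similar to how VRL resolves
    # missing paths to null.
    return {name: headers.get(name) for name in KEPT_REQUEST_HEADERS}

raw = {"host": "localhost:8000", "user-agent": "curl/7.68.0",
       "x-consumer-username": "my-user", "apikey": "my-key"}
clean = shorten_headers(raw)
assert "apikey" not in clean  # the credential never reaches storage
```

An allow-list is safer than a deny-list here: any new sensitive header Kong starts forwarding is dropped by default.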

Still, it’s not something we want to throw at ClickHouse. (We could, and then use materialized views to normalize data into a new table but this way is safer and we’ll get to why.)

So, we will apply a new transformation that takes the result of the previous one and flattens the object into a more table-ish fashion so it can be stored in ClickHouse.

Here’s our second transformation:

[transforms.kong_logs_clickhouse_ready]
  type = "remap" 
  inputs = ["kong_logs_shortened"] 
  source = ''' 
  ts, _ = format_timestamp(.timestamp, "%F %T")
  . = {
    "timestamp": ts,
    "consumer_username": .request.headers."x-consumer-username",
    "request_method": .request.method,
    "request_size": .request.size,
    "request_uri": .request.uri,
    "request_url": .request.url,
    "service_name": .service.name,
    "response_status": .response.status,
    "latency_kong": .latencies.kong,
    "latency_proxy": .latencies.proxy,
    "latency_request": .latencies.request
  }
  '''

If you paid attention, you will notice that the resulting object matches the definition of our ClickHouse table (see above).
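If it helps, here is a hypothetical Python counterpart of this flattening step (to_clickhouse_row is an illustrative name; the real work is done by the VRL above):

```python
def to_clickhouse_row(event: dict) -> dict:
    # Flatten the shortened nested event into the flat row that the
    # kong.log table expects, one key per column.
    return {
        "timestamp": event["timestamp"],
        "consumer_username": event["request"]["headers"]["x-consumer-username"],
        "request_method": event["request"]["method"],
        "request_size": event["request"]["size"],
        "request_uri": event["request"]["uri"],
        "request_url": event["request"]["url"],
        "service_name": event["service"]["name"],
        "response_status": event["response"]["status"],
        "latency_kong": event["latencies"]["kong"],
        "latency_proxy": event["latencies"]["proxy"],
        "latency_request": event["latencies"]["request"],
    }

# A cut-down event shaped like the output of the first transformation.
event = {
    "timestamp": "2021-03-15 12:05:24",
    "request": {"headers": {"x-consumer-username": "my-user"},
                "method": "GET", "size": 100,
                "uri": "/mypath", "url": "http://localhost:8000/mypath"},
    "service": {"name": "my-service"},
    "response": {"status": 200},
    "latencies": {"kong": 2, "proxy": 1, "request": 3},
}
row = to_clickhouse_row(event)
```

Every key of the resulting row corresponds one-to-one to a column of the kong.log table.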

Now I want to show you why I prefer to transform the object in Vector rather than using a materialized view to normalize data.

Vector has an amazing feature that totally convinced me to go for it: unit tests.
Once your transformation pipelines become complex, you want to know that you are not breaking anything when you apply changes (those who have used Logstash know what I’m talking about).
This is why we have put two tests in place: one that tests only the first transformation, and another that tests the first and the second together.

This is what the tests look like:

[[tests]]
  ## This tests the first transformation which takes Kong's raw log
  name = "Kong logs shortened"
  [[tests.inputs]]
    insert_at = "kong_logs_shortened"
    type = "log"
    log_fields.message = '{"consumer":{"created_at":1615809852,"id":"4b1b701d-de2b-5588-9aa2-3b97061d9f52","username":"my-user"},"latencies":{"request":3,"kong":2,"proxy":1},"service":{"host":"nginx","created_at":1615809852,"connect_timeout":60000,"id":"0855b320-0dd2-547d-891d-601e9b38647f","protocol":"http","name":"my-service","read_timeout":60000,"port":80,"updated_at":1615809852,"write_timeout":60000,"retries":5,"ws_id":"0dc6f45b-8f8d-40d2-a504-473544ee190b","tags":["mytags"]},"request":{"querystring":{},"size":100,"uri":"/mypath","url":"http://localhost:8000/mypath","headers":{"host":"localhost:8000","x-consumer-username":"my-user","user-agent":"curl/7.68.0","accept":"*/*","x-consumer-id":"4b1b701d-de2b-5588-9aa2-3b97061d9f52","x-credential-identifier":"487ab43c-b2c9-51ec-8da5-367586ea2b61","apikey":"my-key"},"method":"GET"},"client_ip":"172.20.0.1","tries":[{"balancer_latency":0,"port":80,"balancer_start":1615809924112,"ip":"172.20.0.3"}],"authenticated_entity":{"id":"487ab43c-b2c9-51ec-8da5-367586ea2b61"},"upstream_uri":"/","response":{"headers":{"content-type":"text/html; charset=UTF-8","date":"Mon, 15 Mar 2021 12:05:24 GMT","connection":"close","via":"kong/2.3.3","content-length":"612","x-kong-proxy-latency":"2","server":"nginx/1.19.8","x-kong-upstream-latency":"1","accept-ranges":"bytes","etag":"\"60479ce6-264\"","last-modified":"Tue, 09 Mar 2021 16:05:58 GMT"},"status":200,"size":935},"route":{"strip_path":true,"methods":["GET"],"ws_id":"0dc6f45b-8f8d-40d2-a504-473544ee190b","request_buffering":true,"service":{"id":"0855b320-0dd2-547d-891d-601e9b38647f"},"name":"my-route","path_handling":"v0","id":"481a9539-f49c-51b6-b2e2-fe99ee68866c","preserve_host":false,"regex_priority":0,"paths":["/mypath"],"response_buffering":true,"updated_at":1615809852,"https_redirect_status_code":426,"protocols":["http","https"],"created_at":1615809852},"started_at":1615809924110}'
  [[tests.outputs]]
    extract_from = "kong_logs_shortened"
    [[tests.outputs.conditions]]
      type = "remap"
      source = '''
      ## Make sure time elements are OK
      .started_at == 1615809924110
      .timestamp == to_timestamp!("2021-03-15T12:05:24Z")
      
      ## Test entire payload after transformation
      encode_json(.) == "{\"authenticated_entity\":{\"id\":\"487ab43c-b2c9-51ec-8da5-367586ea2b61\"},\"client_ip\":\"172.20.0.1\",\"consumer\":{\"created_at\":1615809852,\"id\":\"4b1b701d-de2b-5588-9aa2-3b97061d9f52\",\"username\":\"my-user\"},\"latencies\":{\"kong\":2,\"proxy\":1,\"request\":3},\"request\":{\"headers\":{\"host\":\"localhost:8000\",\"user-agent\":\"curl/7.68.0\",\"x-consumer-id\":\"4b1b701d-de2b-5588-9aa2-3b97061d9f52\",\"x-consumer-username\":\"my-user\",\"x-forwarded-for\":null},\"method\":\"GET\",\"querystring\":{},\"size\":100,\"uri\":\"/mypath\",\"url\":\"http://localhost:8000/mypath\"},\"response\":{\"headers\":{\"content-length\":\"612\",\"content-type\":\"text/html; charset=UTF-8\",\"x-kong-proxy-latency\":\"2\",\"x-kong-upstream-latency\":\"1\"},\"size\":935,\"status\":200},\"route\":{\"https_redirect_status_code\":426,\"methods\":[\"GET\"],\"name\":\"my-route\",\"paths\":[\"/mypath\"]},\"service\":{\"name\":\"my-service\",\"tags\":[\"mytags\"]},\"started_at\":1615809924110,\"timestamp\":\"2021-03-15T12:05:24Z\",\"tries\":[{\"balancer_latency\":0,\"balancer_start\":1615809924112,\"ip\":\"172.20.0.3\",\"port\":80}],\"upstream_uri\":\"/\"}"
      '''

[[tests]]
  ## This tests the first and the second transformations. The output is the log object ready
  ## to be inserted into ClickHouse
  name = "Kong logs ClickHouse ready"
  [[tests.inputs]]
    insert_at = "kong_logs_shortened"
    type = "log"
    log_fields.message = '{"consumer":{"created_at":1615809852,"id":"4b1b701d-de2b-5588-9aa2-3b97061d9f52","username":"my-user"},"latencies":{"request":3,"kong":2,"proxy":1},"service":{"host":"nginx","created_at":1615809852,"connect_timeout":60000,"id":"0855b320-0dd2-547d-891d-601e9b38647f","protocol":"http","name":"my-service","read_timeout":60000,"port":80,"updated_at":1615809852,"write_timeout":60000,"retries":5,"ws_id":"0dc6f45b-8f8d-40d2-a504-473544ee190b","tags":["mytags"]},"request":{"querystring":{},"size":100,"uri":"/mypath","url":"http://localhost:8000/mypath","headers":{"host":"localhost:8000","x-consumer-username":"my-user","user-agent":"curl/7.68.0","accept":"*/*","x-consumer-id":"4b1b701d-de2b-5588-9aa2-3b97061d9f52","x-credential-identifier":"487ab43c-b2c9-51ec-8da5-367586ea2b61","apikey":"my-key"},"method":"GET"},"client_ip":"172.20.0.1","tries":[{"balancer_latency":0,"port":80,"balancer_start":1615809924112,"ip":"172.20.0.3"}],"authenticated_entity":{"id":"487ab43c-b2c9-51ec-8da5-367586ea2b61"},"upstream_uri":"/","response":{"headers":{"content-type":"text/html; charset=UTF-8","date":"Mon, 15 Mar 2021 12:05:24 GMT","connection":"close","via":"kong/2.3.3","content-length":"612","x-kong-proxy-latency":"2","server":"nginx/1.19.8","x-kong-upstream-latency":"1","accept-ranges":"bytes","etag":"\"60479ce6-264\"","last-modified":"Tue, 09 Mar 2021 16:05:58 GMT"},"status":200,"size":935},"route":{"strip_path":true,"methods":["GET"],"ws_id":"0dc6f45b-8f8d-40d2-a504-473544ee190b","request_buffering":true,"service":{"id":"0855b320-0dd2-547d-891d-601e9b38647f"},"name":"my-route","path_handling":"v0","id":"481a9539-f49c-51b6-b2e2-fe99ee68866c","preserve_host":false,"regex_priority":0,"paths":["/mypath"],"response_buffering":true,"updated_at":1615809852,"https_redirect_status_code":426,"protocols":["http","https"],"created_at":1615809852},"started_at":1615809924110}'
  [[tests.outputs]]
    extract_from = "kong_logs_clickhouse_ready"
    [[tests.outputs.conditions]]
      type = "remap"
      source = '''
      ## Test entire payload after transformation
      encode_json(.) == "{\"consumer_username\":\"my-user\",\"latency_kong\":2,\"latency_proxy\":1,\"latency_request\":3,\"request_method\":\"GET\",\"request_size\":100,\"request_uri\":\"/mypath\",\"request_url\":\"http://localhost:8000/mypath\",\"response_status\":200,\"service_name\":\"my-service\",\"timestamp\":\"2021-03-15 12:05:24\"}"
      '''

Isn’t this beautiful?
We want to know that our transformations are fine, so we execute the tests:

docker run -v `pwd`/docker/vector/:/etc/vector/ timberio/vector:0.12.1-alpine test /etc/vector/*.toml
Running tests
test Kong logs shortened ... passed
test Kong logs ClickHouse ready ... passed

If, by any chance, we had broken something, we would see output like this (I simply changed a 4 into a 5 in the expected payload to make it fail):

docker run -v `pwd`/docker/vector/:/etc/vector/ timberio/vector:0.12.1-alpine test /etc/vector/*.toml
Running tests
test Kong logs shortened ... passed
test Kong logs ClickHouse ready ... failed

failures:

test Kong logs ClickHouse ready:

check transform 'kong_logs_clickhouse_ready' failed conditions:
  condition[0]: source execution resolved to false
payloads (events encoded as JSON):
   input: {"authenticated_entity":{"id":"487ab43c-b2c9-51ec-8da5-367586ea2b61"},"client_ip":"172.20.0.1","consumer":{"created_at":1615809852,"id":"4b1b701d-de2b-5588-9aa2-3b97061d9f52","username":"my-user"},"latencies":{"kong":2,"proxy":1,"request":3},"request":{"headers":{"host":"localhost:8000","user-agent":"curl/7.68.0","x-consumer-id":"4b1b701d-de2b-5588-9aa2-3b97061d9f52","x-consumer-username":"my-user","x-forwarded-for":null},"method":"GET","querystring":{},"size":100,"uri":"/mypath","url":"http://localhost:8000/mypath"},"response":{"headers":{"content-length":"612","content-type":"text/html; charset=UTF-8","x-kong-proxy-latency":"2","x-kong-upstream-latency":"1"},"size":935,"status":200},"route":{"https_redirect_status_code":426,"methods":["GET"],"name":"my-route","paths":["/mypath"]},"service":{"name":"my-service","tags":["mytags"]},"started_at":1615809924110,"timestamp":"2021-03-15T12:05:24Z","tries":[{"balancer_latency":0,"balancer_start":1615809924112,"ip":"172.20.0.3","port":80}],"upstream_uri":"/"}
  output: {"consumer_username":"my-user","latency_kong":2,"latency_proxy":1,"latency_request":3,"request_method":"GET","request_size":100,"request_uri":"/mypath","request_url":"http://localhost:8000/mypath","response_status":200,"service_name":"my-service","timestamp":"2021-03-15 12:05:24"}

But this is not the case and everything is correct.

We then set the ClickHouse sink module as follows:

[sinks.kong_logs_clickhouse]
  type = "clickhouse" 
  inputs = ["kong_logs_clickhouse_ready"]
  compression = "gzip" 
  database = "kong"
  endpoint = "http://clickhouse:8123"
  table = "log" 

Please note that the ClickHouse sink module offers many configuration parameters, among them several related to buffering and batching.

Beautiful, let’s see if it all works.

We will use Vegeta to send requests at a rate of 1,000 rps for 15 seconds and see if Vector’s buffering worked and the data was stored correctly.

echo 'GET http://kong:8000/mypath' | vegeta attack -rate=1000 -duration=15s -header=apikey:my-key | vegeta report
Requests      [total, rate, throughput]         15000, 1000.06, 999.88
Duration      [total, attack, wait]             15.002s, 14.999s, 2.762ms
Latencies     [min, mean, 50, 90, 95, 99, max]  1.202ms, 11.348ms, 3.355ms, 4.538ms, 5.508ms, 335.361ms, 466.81ms
Bytes In      [total, mean]                     9180000, 612.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:15000  
Error Set:

(See how we send the API key using -header=apikey:my-key.)

1,000 requests per second for 15 seconds means 15,000 events should now be stored in ClickHouse. Let’s check:

docker-compose exec clickhouse bash -c "echo 'SELECT consumer_username AS u, count() c, quantile(0.95)(latency_kong) AS lk_p95, quantile(0.95)(latency_proxy) AS lp_p95 FROM kong.log GROUP BY u FORMAT PrettyCompactMonoBlock' | clickhouse-client -m -n"
┌─u───────┬─────c─┬─lk_p95─┬─lp_p95─┐
│ my-user │ 15000 │      4 │      1 │
└─────────┴───────┴────────┴────────┘

We didn’t lose a single event and life is good.

Closing notes

We now know how to store data coming from Kong in ClickHouse. If you happen to prefer Elasticsearch as a storage solution, you could simply use the Elasticsearch sink module instead and you’d be fine (or you could use both!).

Vector is powerful and, from what we can see, has a vibrant community and a number of new features on the way.

Again, if you want to run this demo yourself, just clone the repo: https://github.com/pachico/demo-kong-vector-clickhouse.