Businesses use data analytics to extract meaningful, decision-making information for their day-to-day operations. With analytics we can discover hidden relations, find unexpected patterns, learn about customer behavior, understand the past, and predict the future.

In this tutorial, we’ll learn about ClickHouse, a distributed column-based database for real-time analytics. We will explore how it works and take it for a test drive using StackPath Containers, a distributed edge compute product that lets you deploy images to multiple points of presence to reduce latency and bandwidth.

Before diving into the tutorial, first we’ll provide an overview of column-based databases, ClickHouse, and the benefits of running ClickHouse on StackPath.

Column-Based Databases

Compared with traditional row-based databases, column-based databases have a subtle yet profound difference in how data is stored on disk.

A traditional database stores data as rows.

IDCOMPANYCEO_NAMEYEAR_FROMYEAR_TO
1MicrosoftBill Gates19752000
2MicrosoftSatya Nadella2014
3AppleSteve Cook2011
4FacebookM. Zuckerberg2004
5AppleSteve Jobs19972001

A column-based database serialize them by columns.

COMPANYMicrosoft (1,2)Apple (3,5)Facebook (4)
CEO_NAMEBill Gates (1)Satya Nadella (2)Steve Cook (3)M. Zuckerberg (4)Steve Jobs (5)
YEAR_FROM1975 (1)2014 (2)2011 (3)2014 (4)1997 (5)
YEAR_TO2000 (1)2011 (5)

Our expected workload determines what type of database we should choose. Row I/O is designed to get all the information for a small set of records while column I/O is designed to get a small subset of columns for a huge set of records (or even a whole table).

Now, imagine we ask the server to calculate how many years each of the persons on the tables above were CEOs of their respective companies. A row database has to retrieve each row, then get to the columns with the relevant years and discard the rest. A column database only needs two I/O operations (regardless of the size of the table): one for each date column. No data is discarded. A column system will perform much better in this case and scales much better with large tables.

As a result of their design, column databases have the following unique properties.

ClickHouse: a Distributed Column-Based DBMS

ClickHouse is a distributed database management system (DBMS) created by Yandex, the Russian Internet giant and the second-largest web analytics platform in the world. ClickHouse is available as open-source software under the Apache 2.0 License.

Beyond Yandex, many other large companies have adopted ClickHouse for their analytics, including Cisco, CERN, Bloomberg, and Spotify.

The main features of ClickHouse include:

ClickHouse supports two mechanisms to scale out databases. Sharding distributes big tables among multiple machines and requests get automatically parallelized. Replication creates redundant copies of the data and can be used to distribute and filter data to different data centers. Also, ClickHouse supports multi-master replication.

Is ClickHouse right for your project?

ClickHouse IS good for:

While ClickHouse IS NOT good for:

Running ClickHouse on StackPath

Databases have unique requirements that can clash with common cloud offerings:

Edge computing solves these issues by moving servers as close as possible to customers and sources of data. As for StackPath, it offers the best of both worlds: the convenience and flexibility of the cloud with on premise-like performance.

Single Instance Configuration

In this section we’ll deploy a single ClickHouse edge computing node in order to get familiar with the system. Then, we’ll learn how to implement a cluster.

Prerequisites

First, we need to create a custom Docker image with our private config. Before continuing, ensure you have the following things.

Prepare the Docker image

Yandex supplies two ready-to-use Docker images:

  1. clickhouse-server is the server component. The server stores data, accepts client connections, and processes requests.
  2. clickhouse-client is the official command-line client, much like psql or mysql.

To download the images to your machine, use the following commands.

docker pull yandex/clickhouse-server
docker pull yandex/clickhouse-client

By default, the ClickHouse server comes with a single blank-password user called default. For testing purposes, having a blank password is not a problem, but since we’re going to deploy the node in production, we need to secure it.

To customize the user settings, we need to change a file called users.xml inside the clickhouse-server image.

To do this, create a directory to store the project files.

mkdir my-clickhouse-deployment
cd my-clickhouse-deployment

Start an instance of the container so we can copy out the file.

docker create -it --name clickhouse-server clickhouse server

Copy the file from the container to your machine. (Don’t forget the dot at the end of the command.)

cp clickhouse-server:/etc/clickhouse-server/users.xml

Open users.xml with a text editor and locate the following lines.

<users>
    <default>
        <!-- This is the default user with a blank password -->
        <password></password>

        . . .

    </default>
</users>

ClickHouse supports several password settings:

To generate a SHA256 password in Linux or Mac, execute the following command in a new terminal.

PASSWORD=$(base64 < /dev/urandom | head -c8); echo "$PASSWORD"; echo -n "$PASSWORD" | sha256sum | tr -d '-'

The first line printed is the password, the second is the encoded SHA256 string. Alternatively, you can use an online SHA256 generator to obtain the string. In either case, keep both the password and the SHA string at hand as we’ll need them both soon.

Going back to users.xml:

Remove the <password></password> line to get rid of the blank password. Then add the following line with your SHA256 string.

<password_sha256_hex>YOUR_SHA254_HEX_STRING</password_sha256_hex>

For example:

<users>
    <default>
    <password_sha256_hex>78bb49329e64a3df1ea5a48fb09ce22ed5223171cf140e3b08683ab3926a0b9b</password_sha256_hex>

        . . .

    </default>
</users>

Build the image

Create a file called Dockerfile with the following content.

FROM yandex/clickhouse-server
COPY users.xml /etc/clickhouse-server

Then build the image.

docker build -t my-clickhouse-server .

To test that it’s working correctly, run the following.

docker run -it -d --name my-server my-clickhouse-server

Now, try connecting using your password.

docker run -it --rm --link clickhouse:clickhouse-server yandex/clickhouse-client --host clickhouse-server --user default --password YOUR_PASSWORD

Push image to registry

Before we can use the image in StackPath, we have to send it to Docker Hub. To do this, first log in to Docker Hub with your account.

$ docker login -u YOUR_DOCKERHUB_USERNAME

Build the image using your Docker Hub username as a prefix.

$ docker build -t YOUR_DOCKERHUB_USERNAME/clickhouse-server:latest

Push the image to Docker Hub.

$ docker push YOUR_DOCKERHUB_USERNAME/clickhouse-server:latest

Deploy to StackPath

Once we have pushed the image to the Docker Registry, we can deploy it on StackPath. Head over to StackPath and login or sign up for a new account. Then click Create and select Workload.

We’ll name the workload “clickhouse-servers” and set the type to Container. For Image, type the address to your Docker image: YOUR_DOCKERHUB_USERNAME/clickhouse-server:latest

If you are using a private repository or a different registry, check the Add Image Pull Credentials box and type in the pull details.

Click Continue to Settings.

If you want the servers to be accessible to everyone, add the following ports under Public ports. This will assign Public IPs and open the ports to the Internet.

All containers get a Private IP assigned automatically.

Click Continue to Spec.

In Spec, choose the machine size. ClickHouse developers recommend a minimum of 4GB of RAM to allow for complex queries. But if you choose the smallest machine you can always scale the machine up with a few clicks later on.

Next we’ll need a volume to persist our data. Uunder Additional volume, enter /var/lib/clickhouse and set its size to your requirements.

Under Deployment Target, group servers with similar scaling needs. Type a descriptive name like “clickhouse-servers”. Then select the PoP nearest to you or your customers, set the number of instances to 1, and click Create Workload.

Wait a few seconds until the instance is on the Running state.

Test the connection

The following test will only work if you have opened ports 8123 and 9000.

First, click on the running instance to get its details. Then scroll down to the Instance details section and copy the Public IP address.

Next, open a terminal and try the /ping endpoint with Curl or browser. If the server is running it should respond with Ok.

$ curl -w "\n" YOUR_PUBLIC_IP:8123/ping
Ok

We can also connect with ClickHouse Native Client.

docker run -it yandex/clickhouse-client --host YOUR_PUBLIC_IP --user default --password YOUR_PASSWORD

Now you can run ClickHouse SQL commands on your new node.

clickhouse-servers-clickhouse-servers-jfk-0 :) create database my_db;

CREATE DATABASE my_db

Ok.

0 rows in set. Elapsed: 0.393 sec.

clickhouse-servers-clickhouse-servers-jfk-0 :) show databases;

SHOW DATABASES

┌─name────┐
│ default │
│ my_db   │
│ system  │
└─────────┘

3 rows in set. Elapsed: 0.326 sec.

Sharded Cluster Configuration

In this section, we’ll deploy two more database instances to create a 3-node edge compute cluster.

Create more instances

First, we’ll create the remaining instances so that we can get their Private IPs.

On the left navigation menu, click Workloads and select your container workload. Then click the gear icon to edit the workload configuration.

Click the Edit icon next to Spec, Storage & Deployment and increase the number of Instances per PoP to 3. Then click Save Changes.

StackPath will create the remaining instances in only a few seconds.

When the instances start running, copy the Private IP address of each one.

Configure ClickHouse shards

We have to tell ClickHouse where all the nodes are. To do this, we need to get the config.xml file from the ClickHouse image.

Open a terminal and type the following command to copy the file out.

docker cp clickhouse-server:/etc/clickhouse-server/config.xml 

Then, open the file and locate the remote_servers line.

<remote_servers incl="clickhouse_remote_servers" >
    <!-- Test only shard config for testing distributed storage -->
    <test_shard_localhost>
        <shard>
            <replica>
                <host>localhost</host>
                <port>9000</port>
            </replica>
        </shard>
    </test_shard_localhost>

        . . . 
</remote_servers>

Remove all the lines inside <remote_servers><remote_servers></remote_servers> and enter the following config, replacing the IP addresses with your servers’ Private IPs.

<remote_servers incl="clickhouse_remote_servers" >
    <my_sharded_config>
        <shard>
            <replica>
                <host>SERVER_1_PRIVATE_IP</host>
                <port>9000</port>
            </replica>
        </shard>
        <shard>
            <replica>
                <host>SERVER_2_PRIVATE_IP</host>
                <port>9000</port>
            </replica>
        </shard>
        <shard>
            <replica>
                <host>SERVER_3_PRIVATE_IP</host>
                <port>9000</port>
            </replica>
        </shard>
    </my_sharded_config>
</remote_servers>

Edit your Dockerfile to add the modified config:

FROM yandex/clickhouse-server
COPY users.xml /etc/clickhouse-server
COPY config.xml /etc/clickhouse-server

Build a new version of the image. (Notice how we’re using a different tag here.)

docker build -t YOUR_DOCKERHUB_USERNAME/clickhouse-server:sharded

Push the new image to the Docker Registry like we did before.

docker push YOUR_DOCKERHUB_USERNAME/clickhouse-server:sharded

Update the StackPath workload

The final step is to tell StackPath to use our new configuration. To do this, go to your StackPath Workload and click the gear icon again to change its configuration. Edit the Image name with the new tag: YOUR_DOCKERHUB_USERNAME/clickhouse-server:sharded

Click Save Changes.

StackPath will start recycling the containers. When this is done, you should find all of them running the newest version.

Testing the cluster

Again, we can use Curl or any browser to check each of the ClickHouse nodes. Just replace the IP addresses with your Private IPs.

$ curl -w "\n" YOUR_PUBLIC_IP_1:8123/ping
Ok

$ curl -w "\n" YOUR_PUBLIC_IP_2:8123/ping
Ok

$ curl -w "\n" YOUR_PUBLIC_IP_2:8123/ping
Ok

We can also connect to any of the instances and check the cluster health. Run the following query to see the cluster status.

select * from system.clusters;

For example:

$ docker run -it yandex/clickhouse-client --host YOUR_PUBLIC_IP --user default --password YOUR_PASSWORD

ClickHouse client version 20.3.2.1 (official build).
Connecting to 151.139.31.12:9000 as user default.
Connected to ClickHouse server version 20.3.2 revision 54433.

clickhouse-servers-clickhouse-servers-jfk-0 :) select * from system.clusters;

SELECT cluster, shard_num, host_address, is_local, errors_count
FROM system.clusters

┌─cluster────────────┬─shard_num─┬─host_address─┬─is_local─┬─errors_count─┐
│ my_sharded_config  │         1 │ 10.128.96.2  │ 0        │ 0            │
│ my_sharded_config  │         2 │ 10.128.96.4  │ 1        │ 0            │
│ my_sharded_config  │         3 │ 10.128.96.3  │ 0        │ 0            │
└────────────────────┴───────────┴──────────────┴──────────┴──────────────┘

3 rows in set. Elapsed: 2.140 sec.

And that’s it! If you made it this far, you have sucessfully deployed a 3-node cluster on StackPath.

Next Steps

For the next step in your road to mastering ClickHouse, try setting up a second cluster on a different PoP and establishing table replication between the two data centers.

If you want to learn more, check out these related docs and projects: