Connecting Multiple Kafka Clusters in ClickHouse Using Named Collections

Introduction:

ClickHouse is a columnar database known for its speed and efficiency. One of its strengths is its integration with external data sources such as Kafka. Multi-cluster setups are increasingly common in modern data architectures, and ClickHouse’s Named Collections make them easier to manage. In this guide, we’ll use this feature to set up connections to two distinct Kafka clusters.

Why Use Named Collections?

Before diving into the configuration, it helps to understand what Named Collections offer. They allow us to:

  • Reduce Repetition: Define connection settings once instead of repeating them in every table definition.
  • Centralize Management: Maintain all configurations in a single, easily manageable location.
  • Improve Security: Keep sensitive credentials out of table definitions and out of the reach of non-administrative users.

Configuring Named Collections for Kafka:

With that in mind, let’s connect to two distinct Kafka clusters: primary and secondary.

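XML Configuration:

<clickhouse>
    <named_collections>
        <!-- Primary Kafka cluster -->
        <primary_kafka_cluster>
            <kafka_broker_list>primary-kafka-cluster:9094</kafka_broker_list>
            <kafka>
                <client_id>primary_kafka_client</client_id>
                <security_protocol>SASL_PLAINTEXT</security_protocol>
                <sasl_mechanism>SCRAM-SHA-512</sasl_mechanism>
                <sasl_username>clickhouse_primary</sasl_username>
                <sasl_password>primary_secret_password</sasl_password>
            </kafka>
        </primary_kafka_cluster>
        <!-- Secondary Kafka cluster -->
        <secondary_kafka_cluster>
            <kafka_broker_list>backup-kafka-cluster:9095</kafka_broker_list>
            <kafka>
                <client_id>secondary_kafka_client</client_id>
                <security_protocol>SASL_PLAINTEXT</security_protocol>
                <sasl_mechanism>SCRAM-SHA-512</sasl_mechanism>
                <sasl_username>clickhouse_secondary</sasl_username>
                <sasl_password>secondary_secret_password</sasl_password>
            </kafka>
        </secondary_kafka_cluster>
    </named_collections>
</clickhouse>
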
For more configuration detail, refer to Pull Request #31691, available starting from ClickHouse v21.12, which provides a more streamlined approach to using named_collections.
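
Once the server has loaded the configuration, it can be worth confirming that both collections are visible before creating any tables. The query below is a minimal sketch: it assumes a ClickHouse version that exposes the system.named_collections table and a user who is allowed to view named collections. On older versions the table may not exist.

```sql
-- List the named collections the server knows about.
-- Assumes system.named_collections exists in your version and that the
-- current user has permission to view named collections.
SELECT name
FROM system.named_collections
WHERE name IN ('primary_kafka_cluster', 'secondary_kafka_cluster');
```

If both names come back, the Kafka engine tables should be able to reference them.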

Setting Up Permanent Storage: MergeTree Table

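With the Kafka connections configured, the next step is on the ClickHouse side. We need three kinds of objects per cluster: a Kafka engine table that reads from the topic, a MergeTree table that stores the data permanently, and a materialized view that moves rows from the first to the second.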

1. Kafka Engine Table:

To read from our Kafka topics, we’ll create ClickHouse tables that use the Kafka engine, passing the named collection as the engine argument. Here’s how you can define these tables:

For the primary Kafka cluster:

-- Connection settings come from the primary_kafka_cluster named collection,
-- which is passed as the engine argument; topic, consumer group, and format
-- are set per table.
CREATE TABLE kafka_cluster_a
(
    `id` UInt32,
    `first_name` String,
    `last_name` String
)
ENGINE = Kafka(primary_kafka_cluster)
SETTINGS kafka_topic_list = 'your_topic_name_for_primary',
         kafka_group_name = 'your_consumer_group_for_primary',
         kafka_format = 'JSONEachRow';

For the secondary Kafka cluster:

-- Connection settings come from the secondary_kafka_cluster named collection,
-- which is passed as the engine argument; topic, consumer group, and format
-- are set per table.
CREATE TABLE kafka_cluster_b
(
    `id` UInt32,
    `first_name` String,
    `last_name` String
)
ENGINE = Kafka(secondary_kafka_cluster)
SETTINGS kafka_topic_list = 'your_topic_name_for_secondary',
         kafka_group_name = 'your_consumer_group_for_secondary',
         kafka_format = 'JSONEachRow';
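
Both tables expect each Kafka message to be a JSON object whose keys match the column names. As a rough illustration, the query below parses one made-up message locally with the format table function, which is assumed to be available in your ClickHouse version. The values are invented for the example.

```sql
-- Parse a single hypothetical JSONEachRow message locally, without Kafka.
-- The field names mirror the columns of kafka_cluster_a and kafka_cluster_b.
SELECT *
FROM format(JSONEachRow, '{"id": 1, "first_name": "Ada", "last_name": "Lovelace"}');
```

Messages with this shape, published to the configured topics, are what the Kafka engine tables are set up to read.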

2. MergeTree Table:

We’ll use MergeTree tables to persistently store the data streamed from Kafka:

For kafka_cluster_a:

CREATE TABLE cluster_a_storage
(
    `id` UInt32,
    `first_name` String,
    `last_name` String
) ENGINE = MergeTree()
ORDER BY id;

For kafka_cluster_b:

CREATE TABLE cluster_b_storage
(
    `id` UInt32,
    `first_name` String,
    `last_name` String
) ENGINE = MergeTree()
ORDER BY id;

3. Materialized View:

The Materialized View acts as the Kafka table’s consumer: each batch read from the Kafka table is inserted into the corresponding MergeTree table.

For kafka_cluster_a:

CREATE MATERIALIZED VIEW cluster_a_mv TO cluster_a_storage AS
SELECT
    id,
    first_name,
    last_name
FROM kafka_cluster_a;

For kafka_cluster_b:

CREATE MATERIALIZED VIEW cluster_b_mv TO cluster_b_storage AS
SELECT
    id,
    first_name,
    last_name
FROM kafka_cluster_b;
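
After the views are created, a quick check can confirm that data is flowing. The queries below are a sketch under a few assumptions: messages are already being published to both topics, and for the second query, your ClickHouse version includes the system.kafka_consumers table, which is only present in newer releases.

```sql
-- Row counts in the storage tables should grow as messages arrive.
SELECT count() FROM cluster_a_storage;
SELECT count() FROM cluster_b_storage;

-- Optional: per-consumer state, assuming system.kafka_consumers is
-- available in your version.
SELECT database, table, consumer_id, num_messages_read, last_poll_time
FROM system.kafka_consumers
WHERE table IN ('kafka_cluster_a', 'kafka_cluster_b');
```

If the counts stay at zero, the broker address, credentials, and topic names are the first things to re-check.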

Practical Applications:

With this setup in place, ClickHouse continuously ingests and stores data from both Kafka clusters. Messages published to the configured topics are consumed in near real time and become queryable once they are written to the MergeTree tables. This suits businesses that need up-to-date analytics or data-driven decision-making.
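
As one possible illustration, data from both clusters can be queried as a single dataset. The sketch below uses the merge table function and the _table virtual column, and it assumes both storage tables live in the current database.

```sql
-- Summarize rows per storage table across both clusters.
-- merge() reads from every table in the given database whose name
-- matches the regular expression; _table holds the source table name.
SELECT
    _table AS source_table,
    count() AS row_count,
    uniqExact(id) AS distinct_ids
FROM merge(currentDatabase(), '^cluster_[ab]_storage$')
GROUP BY _table;
```

Because both tables share the same columns, the same pattern extends to any aggregate you would run on a single table.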

Conclusion:

With ClickHouse’s Named Collections, connecting to multiple Kafka clusters becomes not just possible but efficient and organized. Data from each cluster is available for querying shortly after it arrives, which simplifies real-time analytics.

Further Reading:

For a deeper understanding of ClickHouse’s named_collections, explore the official ClickHouse documentation.
