Clean Log Table Example

Creating a Cleaned LLM Log Table in ClickHouse

To ensure a secure integration, please follow these steps to create a new table that contains only LLM-related logs, with all sensitive data and Personally Identifiable Information (PII) removed. This process involves creating a new target table and then setting up a Materialized View that automatically filters, cleans, and copies the relevant logs from your source table into this new, secure table.

Prerequisites

Before you begin, you should identify:

  • The name of your source table containing all raw logs (e.g., customer_db.logs).
  • The criteria for identifying LLM-related logs (e.g., a specific source value, or keywords in the message field).
  • All columns that contain sensitive data or PII (e.g., user_id, ip_address, email, full_name).
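If you are unsure which columns your source table contains, ClickHouse can list them for you. This sketch assumes the source table is customer_db.logs, as in the examples below:

-- List every column and its type in the raw log table,
-- so you can decide which ones to drop, hash, or clean.
DESCRIBE TABLE customer_db.logs;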

Step 1: Create the New "Cleaned" Table

First, create a new table that will store the sanitized logs. Its structure should be based on your original table, but you should omit any columns that are entirely sensitive and not needed for the integration.

-- Creates the new table to hold the sanitized, LLM-only logs.
-- Replace 'llm_logs_cleaned' with your desired table name.
-- Adjust the columns to match what you want to expose.
CREATE TABLE customer_db.llm_logs_cleaned (
    -- Anonymized user identifier (see Step 2)
    anonymized_user_hash UInt64,

    -- Retain non-sensitive columns as-is
    event_timestamp DateTime,
    level String,

    -- The log message, which we will clean in the next step
    message String,

    -- We can add new columns derived during the cleaning process
    llm_model LowCardinality(String)
) ENGINE = MergeTree()
ORDER BY event_timestamp;
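Before wiring up the view in the next step, you can confirm the table was created with the intended structure (table name assumed from the example above):

-- Show the full definition of the newly created table.
SHOW CREATE TABLE customer_db.llm_logs_cleaned;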

Step 2: Create a Materialized View to Clean and Transfer Data

This is the core step. The Materialized View will listen for new data in your source table, apply the cleaning and filtering logic, and insert the results into customer_db.llm_logs_cleaned. Customize the SELECT query below to match your specific cleaning requirements.

-- This view acts as a continuous ETL pipeline.
-- It reads from your raw logs, transforms the data, and pushes it to the clean table.
CREATE MATERIALIZED VIEW customer_db.llm_log_cleaner_mv
TO customer_db.llm_logs_cleaned
AS
SELECT
    -- Option 1: Anonymize identifiers using a hash function.
    -- This allows for counting distinct users without exposing the original ID.
    sipHash64(toString(user_id)) AS anonymized_user_hash,

    -- Option 2: Retain columns that are not sensitive.
    event_timestamp,
    level,

    -- Option 3: Clean PII from within the log message itself.
    -- The example below removes email addresses and IP addresses from the message.
    -- You can chain multiple replaceAll/replaceRegexpAll calls.
    replaceRegexpAll(
        replaceRegexpAll(message, '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}', '[REDACTED_EMAIL]'),
        '\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}', '[REDACTED_IP]'
    ) AS message,

    -- Option 4: Extract useful, non-sensitive information.
    -- For example, extracting the model name from a JSON payload in the message.
    JSONExtractString(message, 'model') AS llm_model
FROM
    customer_db.logs  -- <-- IMPORTANT: Use your source table here.
WHERE
    -- IMPORTANT: Add your filter logic here to select ONLY LLM-related logs.
    -- For example:
    message LIKE '%LLM%' OR source = 'llm_service';
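Before attaching the view, you can dry-run the redaction expression on a sample string directly in clickhouse-client. The log line below is invented for illustration:

-- Sanity-check the PII redaction on a made-up log line.
SELECT replaceRegexpAll(
    replaceRegexpAll(
        'user jane@example.com connected from 192.168.1.10 to LLM',
        '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}', '[REDACTED_EMAIL]'
    ),
    '\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}', '[REDACTED_IP]'
) AS cleaned;
-- Expected: 'user [REDACTED_EMAIL] connected from [REDACTED_IP] to LLM'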

Step 3 (Optional): Backfill Historical Data

The Materialized View only processes rows inserted after it is created. To populate the llm_logs_cleaned table with your existing historical data, run the INSERT statement below, which applies the same cleaning and filtering logic as the view. If new rows arrived between creating the view and running the backfill, add a timestamp condition to the WHERE clause so those rows are not copied twice.

-- This command populates the new table with cleaned historical data.
INSERT INTO customer_db.llm_logs_cleaned
SELECT
    sipHash64(toString(user_id)) AS anonymized_user_hash,
    event_timestamp,
    level,
    replaceRegexpAll(
        replaceRegexpAll(message, '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}', '[REDACTED_EMAIL]'),
        '\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}', '[REDACTED_IP]'
    ) AS message,
    JSONExtractString(message, 'model') AS llm_model
FROM
    customer_db.logs -- <-- Use your source table here.
WHERE
    message LIKE '%LLM%' OR source = 'llm_service'; -- <-- Use the same filter logic.
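After the backfill, a quick row-count comparison using the same filter can confirm the historical data landed. Table and filter names below assume the examples above:

-- Compare source rows matching the filter against rows in the cleaned table.
SELECT
    (SELECT count() FROM customer_db.logs
     WHERE message LIKE '%LLM%' OR source = 'llm_service') AS source_rows,
    (SELECT count() FROM customer_db.llm_logs_cleaned) AS cleaned_rows;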

Step 4: Grant Read-Only Access

Your new, sanitized table is now ready. Please use the instructions from Option A in our previous guide ("Creating a Secure ClickHouse User") to grant our integration user read-only access to this new customer_db.llm_logs_cleaned table.

This ensures our tools can only see the cleaned, PII-free data and have no access to your raw logs.
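If you prefer to grant access directly rather than re-running the full user-creation guide, the minimal grant looks like this (integration_user is a placeholder; substitute the user created in the previous guide):

-- Read-only access to the sanitized table only; no access to raw logs.
GRANT SELECT ON customer_db.llm_logs_cleaned TO integration_user;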

Once complete, please provide us with the name of this new table (customer_db.llm_logs_cleaned).