Creating a Cleaned LLM Log Table in ClickHouse
To ensure a secure integration, follow these steps to create a new table that contains only LLM-related logs, with all sensitive data and Personally Identifiable Information (PII) removed. The process has two parts: create a new target table, then set up a Materialized View that automatically filters, cleans, and copies the relevant logs from your source table into this new, secure table.
Prerequisites
Before you begin, you should identify:
- The name of your source table containing all raw logs (e.g., customer_db.logs).
- The criteria for identifying LLM-related logs (e.g., a specific source value, or keywords in the message field).
- All columns that contain sensitive data or PII (e.g., user_id, ip_address, email, full_name).
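If you are not sure which columns your source table actually contains, ClickHouse's system tables can help you build this inventory. The query below is a sketch using the customer_db.logs example name from above; adjust the database and table to match your environment.

-- List every column and its type so you can flag the sensitive ones.
SELECT name, type
FROM system.columns
WHERE database = 'customer_db' AND table = 'logs';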
Step 1: Create the New "Cleaned" Table
First, create a new table that will store the sanitized logs. Its structure should be based on your original table, but you should omit any columns that are entirely sensitive and not needed for the integration.
-- Creates the new table to hold the sanitized, LLM-only logs.
-- Replace 'llm_logs_cleaned' with your desired table name.
-- Adjust the columns to match what you want to expose.
CREATE TABLE customer_db.llm_logs_cleaned (
-- Anonymized user identifier (see Step 2)
anonymized_user_hash UInt64,
-- Retain non-sensitive columns as-is
event_timestamp DateTime,
level String,
-- The log message, which we will clean in the next step
message String,
-- We can add new columns derived during the cleaning process
llm_model LowCardinality(String)
) ENGINE = MergeTree()
ORDER BY event_timestamp;
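Before wiring up the Materialized View, it can be worth confirming the new table has the schema you expect. A quick check, using the example table name above:

-- Show the columns and types of the newly created table.
DESCRIBE TABLE customer_db.llm_logs_cleaned;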
Step 2: Create a Materialized View to Clean and Transfer Data
This is the core step. The Materialized View will listen for new data in your source table, apply the cleaning and filtering logic, and insert the results into customer_db.llm_logs_cleaned. Customize the SELECT query below to match your specific cleaning requirements.
-- This view acts as a continuous ETL pipeline.
-- It reads from your raw logs, transforms the data, and pushes it to the clean table.
CREATE MATERIALIZED VIEW customer_db.llm_log_cleaner_mv
TO customer_db.llm_logs_cleaned
AS
SELECT
-- Option 1: Anonymize identifiers using a hash function.
-- This allows for counting distinct users without exposing the original ID.
sipHash64(toString(user_id)) AS anonymized_user_hash,
-- Option 2: Retain columns that are not sensitive.
event_timestamp,
level,
-- Option 3: Clean PII from within the log message itself.
-- The example below removes email addresses and IP addresses from the message.
-- You can chain multiple replaceAll/replaceRegexpAll calls.
replaceRegexpAll(
replaceRegexpAll(message, '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}', '[REDACTED_EMAIL]'),
'\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}', '[REDACTED_IP]'
) AS message,
-- Option 4: Extract useful, non-sensitive information.
-- For example, extracting the model name from a JSON payload in the message.
JSONExtractString(message, 'model') AS llm_model
FROM
customer_db.logs -- <-- IMPORTANT: Use your source table here.
WHERE
-- IMPORTANT: Add your filter logic here to select ONLY LLM-related logs.
-- For example:
message LIKE '%LLM%' OR source = 'llm_service';
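One way to smoke-test the view is to insert a synthetic log row into the source table and confirm it arrives redacted in the cleaned table. This sketch assumes the example source schema used above (user_id, event_timestamp, level, message, source columns); the values are made up.

-- Insert a fake LLM log containing an email address and an IP address.
INSERT INTO customer_db.logs (user_id, event_timestamp, level, message, source)
VALUES (42, now(), 'INFO', 'LLM call by jane@example.com from 10.0.0.1', 'llm_service');

-- The cleaned copy should show [REDACTED_EMAIL] and [REDACTED_IP] in the message.
SELECT message
FROM customer_db.llm_logs_cleaned
ORDER BY event_timestamp DESC
LIMIT 1;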
Step 3 (Optional): Backfill Historical Data
The Materialized View will only process new logs from this point forward. To populate the llm_logs_cleaned table with your existing historical data, run the following INSERT statement, which applies the same logic as the view to clean and copy the old records.
-- This command populates the new table with cleaned historical data.
INSERT INTO customer_db.llm_logs_cleaned
SELECT
sipHash64(toString(user_id)) AS anonymized_user_hash,
event_timestamp,
level,
replaceRegexpAll(
replaceRegexpAll(message, '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}', '[REDACTED_EMAIL]'),
'\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}', '[REDACTED_IP]'
) AS message,
JSONExtractString(message, 'model') AS llm_model
FROM
customer_db.logs -- <-- Use your source table here.
WHERE
message LIKE '%LLM%' OR source = 'llm_service'; -- <-- Use the same filter logic.
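After the backfill, a quick row-count comparison can confirm that the historical LLM logs made it across. This uses the same example names and filter as above; note that if the numbers differ, one common cause is overlap between the backfill and rows the Materialized View had already captured.

-- Compare the number of matching source rows against the cleaned table.
SELECT
(SELECT count() FROM customer_db.logs
 WHERE message LIKE '%LLM%' OR source = 'llm_service') AS source_rows,
(SELECT count() FROM customer_db.llm_logs_cleaned) AS cleaned_rows;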
Step 4: Grant Read-Only Access
Your new, sanitized table is now ready. Please use the instructions from Option A in our previous guide ("Creating a Secure ClickHouse User") to grant our integration user read-only access to this new customer_db.llm_logs_cleaned table.
This ensures our tools can only see the cleaned, PII-free data and have no access to your raw logs.
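For reference, a minimal read-only grant looks like the following. Here integration_user is a placeholder for the user you created in the earlier guide; the scope of the grant is the cleaned table only, so the raw logs remain inaccessible.

-- Grant SELECT on the cleaned table only; no access to raw logs is given.
GRANT SELECT ON customer_db.llm_logs_cleaned TO integration_user;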
Once complete, please provide us with the name of this new table (customer_db.llm_logs_cleaned).