Unified Log Schema

The Netra Apex Log Schema is an automatically generated, provider-agnostic logging schema for AIOps.

Note that this schema is generated automatically by the Netra Agent ingestion process; you do not need to create it manually or convert any log data. This page is a reference for what is available after the normalization process.

This standardized data serves as the foundation for the Co-Optimization Agent, an AI-powered system designed to work alongside your existing infrastructure. The agent leverages this schema for:

  1. Offline Analysis: Ingesting logs to a central data store to analyze cost, performance, and quality trade-offs, providing systemic insights and optimization recommendations.
  2. Real-time Middleware (Optional): Acting as an intelligent routing layer to dynamically optimize production traffic based on learned patterns and predefined business objectives.

The Log Object Structure

Each log entry is a self-contained JSON object representing a single, atomic LLM operation. The structure is designed to be comprehensive, normalized, and extensible.

Core Namespaces

These top-level objects provide the essential "who, what, where, and when" for every event.

| Namespace | Description |
| --- | --- |
| event_metadata | Information about the log event itself, ensuring traceability and versioning. |
| trace_context | OpenTelemetry-compliant data for tracing complex, multi-step workflows like RAG or agents. |
| identity_context | Data for attributing requests, managing security, and enabling FinOps controls like chargebacks. |
| application_context | Client-side application data for isolating issues in a microservices architecture. |

{
  "event_metadata": {
    "log_schema_version": "3.0.1",
    "event_id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
    "timestamp_utc": 1721948400000,
    "ingestion_source": "vllm_wrapper_v1.2"
  },
  "trace_context": {
    "trace_id": "trace-001-rag-session",
    "span_id": "span-002-llm-call",
    "parent_span_id": "span-001-retrieval",
    "span_name": "GenerateFinalAnswer",
    "span_kind": "llm"
  },
  "identity_context": {
    "user_id": "usr_anon_f4b3c2d1",
    "organization_id": "org_12345",
    "anon_api_key_hash": "api_anon_a7e8f9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8",
    "auth_method": "api_key"
  },
  "application_context": {
    "app_name": "customer-support-chatbot",
    "service_name": "report-summarizer",
    "sdk_version": "openai-python-1.8.0",
    "environment": "production",
    "anon_ip_reference": "ip_anon_f4b3c2d1"
  }
}

Request Specification

The request object is an immutable, normalized record of all parameters sent to the model, ensuring perfect reproducibility.

request.model

Normalizes model identification across providers. This is crucial for cross-provider analysis.

| Field | Type | Description |
| --- | --- | --- |
| provider | String | The entity hosting the model (e.g., openai, aws_bedrock, vllm). |
| family | String | The general model family (e.g., gpt-4, claude-3, llama-3). |
| name | String | The specific, common name of the model (e.g., gpt-4o, claude-3-opus). |
| version_id | String | The exact version or snapshot ID for reproducibility. |

Example Normalization: An API call to AWS Bedrock with modelId: "anthropic.claude-3-opus-20240229-v1:0" is parsed into:

"model": {
  "provider": "aws_bedrock",
  "family": "claude-3",
  "name": "claude-3-opus",
  "version_id": "20240229-v1:0"
}
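
The exact parsing rules live inside the ingestion process. As a rough sketch, and assuming the version segment always begins with an 8-digit date stamp, a Bedrock modelId could be decomposed like this:

import re

def parse_bedrock_model_id(model_id: str) -> dict:
    """Illustrative only: split a Bedrock modelId such as
    'anthropic.claude-3-opus-20240229-v1:0' into the unified request.model fields."""
    # Drop the vendor prefix before the first dot ("anthropic.")
    _, _, rest = model_id.partition(".")
    # Assume the version starts at the first 8-digit date stamp (e.g. 20240229)
    match = re.match(r"^(?P<name>.+?)-(?P<version>\d{8}.*)$", rest)
    name = match.group("name") if match else rest
    version_id = match.group("version") if match else None
    # Assume the family is the name minus its trailing variant token (opus, sonnet, ...)
    family = "-".join(name.split("-")[:-1]) or name
    return {
        "provider": "aws_bedrock",
        "family": family,          # "claude-3"
        "name": name,              # "claude-3-opus"
        "version_id": version_id,  # "20240229-v1:0"
    }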

request.prompt

A flexible container for all input types, from simple text to multimodal requests. For more information on redaction, see Data Redaction.

"prompt": {
  "messages": [
    { "role": "user", "content": "What is the capital of France?" }
  ],
  "system_prompt": "You are a helpful assistant.",
  "multimodal_parts": [
    {
      "part_type": "image",
      "mime_type": "image/jpeg",
      "source_uri": "s3://my-bucket/image.jpg",
      "source_base64_hash": "b3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a7e8f9b0c1d2e3f4a5b6c7d8e9f0a1b2"
    }
  ]
}
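
The hash function behind source_base64_hash is not specified on this page; the 64-character hex digest is consistent with SHA-256, which the sketch below assumes when logging a fingerprint of an inline image rather than the payload itself:

import base64
import hashlib

def multimodal_part_for_image(image_bytes: bytes, mime_type: str = "image/jpeg") -> dict:
    """Illustrative only: build a multimodal_parts entry for an inline image."""
    payload_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "part_type": "image",
        "mime_type": mime_type,
        # SHA-256 is an assumption; the schema only shows a 64-character hex digest
        "source_base64_hash": hashlib.sha256(payload_b64.encode("ascii")).hexdigest(),
    }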

request.generation_config

A canonical structure for all parameters controlling the generation process.

"generation_config": {
    "temperature": 0.7,
    "top_p": 1.0,
    "max_tokens_to_sample": 1024,
    "stop_sequences": ["\n"],
    "is_streaming": false,
    "seed": 42
}
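
Provider SDKs spell these parameters differently (OpenAI-style max_tokens versus the canonical max_tokens_to_sample, for example). A minimal sketch of folding raw parameters into the canonical keys follows; the alias table itself is an assumption, not the agent's actual mapping:

_PARAM_ALIASES = {
    # illustrative aliases; the real mapping belongs to the ingestion process
    "max_tokens": "max_tokens_to_sample",       # OpenAI-style
    "maxOutputTokens": "max_tokens_to_sample",  # Google-style
    "stop": "stop_sequences",                   # OpenAI-style
    "stopSequences": "stop_sequences",          # Google-style
    "stream": "is_streaming",
}

def normalize_generation_config(raw_params: dict) -> dict:
    """Illustrative only: rename provider-specific sampling parameters onto the
    canonical generation_config keys, passing unrecognized keys through unchanged."""
    return {_PARAM_ALIASES.get(key, key): value for key, value in raw_params.items()}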

Response Specification

The response object captures the complete output, including generated content, usage metrics, and backend system metadata.

response.completion

Contains the generative output, structured to handle multiple candidates.

| Field | Type | Description |
| --- | --- | --- |
| choices | Array | An array of generated responses, each with an index and finish_reason. |
| finish_reason | Enum | Why generation stopped (stop_sequence, max_tokens, tool_calls, safety). |
| message | Object | The standard {"role": "assistant", "content": "..."} object. |

response.usage

A normalized breakdown of token consumption, crucial for accurate FinOps.

Example Normalization:

| Unified Field | OpenAI Source | Anthropic Source | Google Source |
| --- | --- | --- | --- |
| prompt_tokens | usage.prompt_tokens | usage.input_tokens | usageMetadata.promptTokenCount |
| completion_tokens | usage.completion_tokens | usage.output_tokens | usageMetadata.candidatesTokenCount |

"usage": {
  "prompt_tokens": 512,
  "completion_tokens": 256,
  "total_tokens": 768,
  "tool_call_tokens": 64, // If applicable
  "cached_input_tokens": 0
}
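
Based on the mapping table above, the normalization step might look like the sketch below; the raw payload shapes are assumptions about each provider's response format:

def normalize_usage(provider: str, raw_response: dict) -> dict:
    """Illustrative only: map provider-specific token counters onto the unified usage fields."""
    if provider == "openai":
        prompt = raw_response["usage"]["prompt_tokens"]
        completion = raw_response["usage"]["completion_tokens"]
    elif provider == "anthropic":
        prompt = raw_response["usage"]["input_tokens"]
        completion = raw_response["usage"]["output_tokens"]
    elif provider == "google":
        prompt = raw_response["usageMetadata"]["promptTokenCount"]
        completion = raw_response["usageMetadata"]["candidatesTokenCount"]
    else:
        raise ValueError(f"no usage mapping for provider: {provider}")
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "total_tokens": prompt + completion,
    }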

Performance Specification

The performance namespace captures all metrics related to speed and infrastructure, bridging the gap between managed APIs and self-hosted servers.

performance.latency_ms

A granular breakdown of latency measurements from the client's perspective.

| Field | Description |
| --- | --- |
| total_e2e | Total wall-clock time from request start to final token receipt. |
| time_to_first_token | Time from request start until the first token is received. Critical for user perception. |
| time_per_output_token | Average time to generate each subsequent token after the first. |
| queue_duration | Time spent waiting in a server-side queue (from self-hosted servers). |
| prefill_duration | Time spent processing the input prompt (from self-hosted servers). |
| decode_duration | Time spent actively generating tokens (from self-hosted servers). |
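
As a rough illustration of how the client-side fields relate, the sketch below derives them from three timestamps captured around a streaming call. Dividing by the full completion token count matches the worked examples later on this page, but the agent's exact convention is not documented here, so treat the formula as an assumption.

def latency_fields(request_start: float, first_token_at: float,
                   last_token_at: float, completion_tokens: int) -> dict:
    """Illustrative only: derive latency_ms fields from client-side timestamps (in seconds)."""
    total_e2e = (last_token_at - request_start) * 1000.0
    time_to_first_token = (first_token_at - request_start) * 1000.0
    # Average decode pace; dividing by the full token count is an assumption
    time_per_output_token = (total_e2e - time_to_first_token) / max(completion_tokens, 1)
    return {
        "total_e2e": round(total_e2e),
        "time_to_first_token": round(time_to_first_token),
        "time_per_output_token": round(time_per_output_token, 1),
    }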

performance.self_hosted_metrics

A dedicated namespace for detailed metrics available only from self-hosted inference servers like vLLM or TGI, scraped from their /metrics endpoints.

"self_hosted_metrics": {
  "server_type": "vllm",
  "scheduler": {
    "running_requests": 4,
    "waiting_requests": 8,
    "swapped_requests": 2
  },
  "kv_cache": {
    "gpu_usage_percent": 85.5
  },
  "hardware_info": {
    "accelerator_type": "A100-80G",
    "accelerator_count": 8,
    "quantization_type": "awq"
  }
}
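
A rough sketch of populating this block from a vLLM /metrics scrape follows; the Prometheus metric names (vllm:num_requests_running, vllm:gpu_cache_usage_perc, and so on) vary across vLLM versions and should be treated as assumptions:

import requests

def scrape_vllm_metrics(base_url: str) -> dict:
    """Illustrative only: poll a vLLM Prometheus endpoint and map a few gauges
    onto the self_hosted_metrics block."""
    text = requests.get(f"{base_url}/metrics", timeout=5).text
    gauges = {}
    for line in text.splitlines():
        if line.startswith("#") or " " not in line:
            continue  # skip comment/type lines
        name, _, value = line.rpartition(" ")
        try:
            gauges[name.split("{")[0]] = float(value)
        except ValueError:
            continue
    return {
        "server_type": "vllm",
        "scheduler": {
            "running_requests": int(gauges.get("vllm:num_requests_running", 0)),
            "waiting_requests": int(gauges.get("vllm:num_requests_waiting", 0)),
        },
        "kv_cache": {
            # assumed to be reported as a 0-1 fraction; convert to percent
            "gpu_usage_percent": gauges.get("vllm:gpu_cache_usage_perc", 0.0) * 100,
        },
    }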

FinOps Specification

The finops namespace transforms the log into a detailed, auditable financial record. This data is typically enriched post-request by a processor that has access to pricing information.

"finops": {
  "cost": {
    "total_cost_usd": 0.0125,
    "prompt_cost_usd": 0.0050,
    "completion_cost_usd": 0.0075
  },
  "pricing_info": {
    "provider_rate_id": "claude3-opus-ondemand-jul2025",
    "prompt_token_rate_usd_per_million": 15.00,
    "completion_token_rate_usd_per_million": 75.00
  },
  "attribution": {
    "cost_center_id": "eng-rnd-123",
    "project_id": "proj-q3-chatbot-eval",
    "feature_name": "document_summarization"
  }
}
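
The enrichment itself is simple arithmetic: token counts multiplied by the per-million rates recorded in pricing_info. A minimal sketch (the rounding precision is an assumption):

def compute_cost(usage: dict, pricing_info: dict) -> dict:
    """Illustrative only: derive the finops.cost block from usage and pricing_info."""
    prompt_cost = usage["prompt_tokens"] * pricing_info["prompt_token_rate_usd_per_million"] / 1_000_000
    completion_cost = usage["completion_tokens"] * pricing_info["completion_token_rate_usd_per_million"] / 1_000_000
    return {
        "total_cost_usd": round(prompt_cost + completion_cost, 6),
        "prompt_cost_usd": round(prompt_cost, 6),
        "completion_cost_usd": round(completion_cost, 6),
    }

For example, at the gpt-4o rates shown in Example 1 below (5.00 and 15.00 USD per million tokens), 15 prompt tokens and 10 completion tokens come to 0.000075 + 0.00015 = 0.000225 USD.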

Full Log Examples

Example 1: Standard API Call to OpenAI

This example shows a typical log from a call to a managed provider. Note the absence of self_hosted_metrics.

{
  "event_metadata": {
    "log_schema_version": "3.0.1",
    "event_id": "c1d2e3f4-g5h6-7890-1234-567890abcdef",
    "timestamp_utc": 1721949000000,
    "ingestion_source": "openai_sdk_v1.8"
  },
  "trace_context": { "trace_id": "chat-session-xyz", "span_id": "span-001", "span_name": "InitialUserQuery" },
  "identity_context": { "user_id": "user_abc", "api_key_hash": "a7e8f9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8" },
  "application_context": { "app_name": "marketing-copy-generator", "environment": "production" },
  "request": {
    "model": { "provider": "openai", "family": "gpt-4", "name": "gpt-4o", "version_id": "2024-05-13" },
    "prompt": { "messages": [{ "role": "user", "content": "Write a tagline for a new coffee shop." }] },
    "generation_config": { "temperature": 0.8, "max_tokens_to_sample": 50 }
  },
  "response": {
    "completion": {
      "choices": [{
        "index": 0,
        "finish_reason": "stop_sequence",
        "message": { "role": "assistant", "content": "Brewing Moments, One Cup at a Time." }
      }]
    },
    "usage": { "prompt_tokens": 15, "completion_tokens": 10, "total_tokens": 25 },
    "system": { "provider_request_id": "req_abc123", "system_fingerprint": "fp_123" }
  },
  "performance": {
    "latency_ms": { "total_e2e": 1200, "time_to_first_token": 450, "time_per_output_token": 75.0 }
  },
  "finops": {
    "cost": { "total_cost_usd": 0.0001, "prompt_cost_usd": 0.000075, "completion_cost_usd": 0.000025 },
    "pricing_info": { "prompt_token_rate_usd_per_million": 5.00, "completion_token_rate_usd_per_million": 15.00 }
  }
}

Example 2: Self-Hosted Llama-3 Call via vLLM

This example highlights the detailed performance metrics captured from a self-hosted inference server, which are critical for infrastructure optimization.

{
  "event_metadata": {
    "log_schema_version": "3.0.1",
    "event_id": "d1e2f3g4-h5i6-7890-1234-abcdef567890",
    "timestamp_utc": 1721950000000,
    "ingestion_source": "custom_vllm_client_v1.0"
  },
  "trace_context": { "trace_id": "batch-job-112", "span_id": "span-987", "span_name": "SummarizeArticle" },
  "identity_context": { "user_id": "internal_system_user" },
  "application_context": { "app_name": "news-feed-processor", "environment": "staging" },
  "request": {
    "model": { "provider": "vllm", "family": "llama-3", "name": "llama-3-70b-instruct" },
    "prompt": { "messages": [{ "role": "user", "content": "Summarize the following article..." }] },
    "generation_config": { "temperature": 0.2, "max_tokens_to_sample": 256 }
  },
  "response": {
    "completion": {
      "choices": [{
        "index": 0,
        "finish_reason": "max_tokens",
        "message": { "role": "assistant", "content": "The article discusses..." }
      }]
    },
    "usage": { "prompt_tokens": 850, "completion_tokens": 256, "total_tokens": 1106 }
  },
  "performance": {
    "latency_ms": {
      "total_e2e": 3500,
      "time_to_first_token": 1500,
      "time_per_output_token": 7.8,
      "queue_duration": 400,
      "prefill_duration": 1050,
      "decode_duration": 2000
    },
    "self_hosted_metrics": {
      "server_type": "vllm",
      "scheduler": { "running_requests": 2, "waiting_requests": 5 },
      "kv_cache": { "gpu_usage_percent": 91.2 },
      "hardware_info": { "accelerator_type": "H100-PCIe", "accelerator_count": 4 }
    }
  },
  "finops": {
    "cost": { "total_cost_usd": 0.0 },
    "pricing_info": {}
  }
}