Introduction

An introduction to the Co-Optimization Agent's activation options for system-level and real-time optimization of AI workloads, designed to work alongside your existing systems.

Concept

The Co-Optimization Agent is an AI optimization platform designed to enhance the performance and cost-efficiency of your AI workloads, such as large language models (LLMs).

It complements your existing infrastructure by providing a data-driven layer to inform and automate complex configuration and supply fulfillment decisions.

The platform is designed for flexibility and can be used in multiple ways:

  • System-wide tuning through log analysis
  • Per-request adjustments via an optional real-time middleware
  • Developer tools for IDEs

Modes of Optimization

0. Model Context Protocol (MCP)

See Netra MCP.

1. System-Level Optimization (Log Analysis)

This is the foundational layer of the platform. By analyzing historical workload data, the Co-Optimization Agent identifies systemic patterns and provides strategic tuning recommendations.

These are on-demand adjustments aimed at improving your AI stack.

Key characteristics:

  • Use Case: Ideal for making infrequent, high-impact changes.
  • Process: The agent analyzes historical data (e.g., the last 30 days) for specified applications, simulates outcomes against your available Supply Options, and generates sets of recommended_config_snippets.
  • Outcome: You receive concrete, data-backed suggestions—like adjusting KV cache block sizes, enabling speculative decoding, or changing default temperature settings—that you can manually apply to your infrastructure or application code.
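As an illustration, a recommended_config_snippet could be applied by merging it over your current serving configuration. The field names and values below are hypothetical, since the actual snippet schema depends on your deployment; only the kinds of settings (KV cache block size, speculative decoding, default temperature) are taken from the recommendations described above.

```python
# Hypothetical recommended_config_snippet as returned by system-level
# optimization. Field names and values are illustrative only.
recommended_config_snippet = {
    "kv_cache_block_size": 32,     # e.g. raised from a default of 16
    "speculative_decoding": True,  # enable draft-model speculation
    "default_temperature": 0.2,    # lower for more deterministic output
}

# The current serving configuration, including settings the snippet
# does not touch.
current_config = {
    "kv_cache_block_size": 16,
    "speculative_decoding": False,
    "default_temperature": 0.7,
    "max_batch_size": 64,
}

# Merging the snippet over the current config applies the recommendations
# while preserving unrelated settings such as max_batch_size.
updated_config = {**current_config, **recommended_config_snippet}
```

Because the recommendations are plain configuration values, they can be reviewed in code review like any other change before being rolled out.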

This mode allows you to optimize your system's architecture based on observed reality, moving beyond static, intuition-based configurations.

2. Real-Time Optimization (Middleware)

For teams requiring dynamic, per-request control, the platform offers an optional real-time Middleware component. This service acts as a pre-routing decision engine, or a "brain," for your existing smart-routing and orchestration logic.

Key characteristics:

  • Use Case: Best for applications where the optimal trade-off between cost, latency, and risk changes with each request, such as interactive chatbots, coding assistants, or multi-tenant services.
  • Process: Before executing a workload, spinning up a new service, or deploying a new version, your application calls the middleware's /decide endpoint. The agent analyzes the prompt and associated metadata, evaluates it against your catalog of available models (the "supply"), and predicts the performance outcomes.
  • Outcome: The API returns a Pareto front—a set of optimal, non-dominated solutions. Each solution represents a unique, efficient trade-off. The response also includes a recommended_solution based on either a default balanced profile or user-specified weights, allowing your application to programmatically route the request.
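The flow above can be sketched as follows. Only the /decide endpoint and the recommended_solution field are named in this document; the middleware URL, the request payload shape, and the pareto_front and model field names are assumptions for illustration, so consult your API reference for the real schema.

```python
import json
import urllib.request

# Assumed host for the middleware; adjust for your deployment.
MIDDLEWARE_URL = "http://localhost:8080"


def decide(prompt: str, metadata: dict) -> dict:
    """Call the middleware's /decide endpoint and return the parsed JSON.

    The request fields (prompt, metadata) are illustrative placeholders.
    """
    payload = json.dumps({"prompt": prompt, "metadata": metadata}).encode()
    req = urllib.request.Request(
        f"{MIDDLEWARE_URL}/decide",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def route(decision: dict) -> str:
    """Pick the model to execute on.

    Uses recommended_solution when present, otherwise falls back to the
    first entry of the returned Pareto front.
    """
    solution = decision.get("recommended_solution") or decision["pareto_front"][0]
    return solution["model"]
```

Keeping route() as a thin function makes the "decoupled architecture" point concrete: the client only consumes the decision, so new models or strategies require no client-side changes.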

This decoupled architecture separates the decision logic from execution, giving you a centralized control plane to manage optimization without modifying client-side code for every new model or strategy.


Core Principle: Data-Driven Co-Optimization

All modes operate on the same core principle: making the trade-offs between Cost, Latency/Throughput, Quality, and Risk explicit and quantifiable.
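To make "non-dominated" concrete: a candidate is on the Pareto front when no other candidate is at least as good on every objective and strictly better on one. A minimal sketch over hypothetical supply options, where every objective is minimized:

```python
def pareto_front(candidates):
    """Return the names of non-dominated candidates.

    Each candidate is (name, objectives), with every objective minimized
    (e.g. cost, latency, risk). A candidate is dominated when some other
    candidate is no worse on every objective and differs on at least one.
    """
    front = []
    for name, objs in candidates:
        dominated = any(
            other != objs and all(o2 <= o1 for o1, o2 in zip(objs, other))
            for _, other in candidates
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical options scored as (cost per 1k tokens, p95 latency in s, risk):
candidates = [
    ("small-model", (0.2, 0.4, 0.3)),
    ("large-model", (1.0, 0.9, 0.1)),
    ("mid-model",   (0.5, 0.5, 0.2)),
    ("bad-model",   (0.8, 1.0, 0.4)),  # worse than mid-model on every objective
]
```

Here small-model, large-model, and mid-model each win on a different axis, so all three survive, while bad-model is filtered out; a recommended_solution can then be chosen from the front with default or user-specified weights.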

The Co-Optimization Agent provides the analytical power to move from manual tuning to a continuously improving, data-driven optimization loop that works alongside your existing systems.