Middleware

AI Co-Optimization General API Guide, Optional Activation Method

1. Introduction

The Co-Optimization Middleware is an optional real-time decision layer that optimizes AI workloads on a per-request basis.

It functions as the "brain" of your existing smart-routing stack: the pre-routing decision engine.

Before sending a workload to an LLM, your application first calls the middleware's API endpoint. The service analyzes the request against available Supply Options (Services, Models, Configs) and returns a set of optimal execution strategies, enabling your application to programmatically balance cost, latency, and risk.

This approach replaces manual, intuition-based tuning with a unified, data-driven control plane for AI optimization. The following sections detail the API endpoints, data models, and integration patterns for using the middleware in a production environment.


2. How It Works: The Real-Time Optimization Lifecycle

When a request is sent to the Co-Optimization Middleware, it passes through a multi-step process to transform a prompt into a set of actionable, optimized decisions.

Step 1: Workload Analysis

Upon receiving a request, the agent analyzes the workload. It parses the raw_prompt to understand its semantic content and extract linguistic features. For compliance with regulations like GDPR or HIPAA, prompts can optionally be pre-processed to redact or anonymize Personally Identifiable Information (PII) before being sent.

The agent uses the metadata provided in the request, especially the app_id, to attach pre-configured Service Level Objectives (SLOs) and risk profiles. For instance, an app_id of customer-support-chat might be associated with a strict latency target, while an app_id of internal-data-analysis might prioritize cost over speed.
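
As an illustration, an app_id might map to a pre-configured profile along these lines. This is a hypothetical sketch: the exact schema of SLO and risk profiles is defined in your middleware configuration, and the field names shown here are assumptions.

{
  "customer-support-chat": {
    "slo": { "ttft_ms_p95_target": 500, "tpot_ms_p95_target": 50 },
    "risk_profile": { "business_impact_score": 7, "pii_redaction": true }
  },
  "internal-data-analysis": {
    "slo": { "ttft_ms_p95_target": 5000 },
    "risk_profile": { "business_impact_score": 3, "cost_priority": "high" }
  }
}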

Step 2: Supply Evaluation & Prediction

With a workload profile, the agent consults a dynamic catalog of all available AI models. Each entry in this catalog is a comprehensive configuration for a specific model deployment, defining its operational characteristics (e.g., deployment type, API endpoint, inference parameters, cost model). The agent filters this list to find viable candidates that meet the workload's basic requirements, such as a sufficient context window.

For each viable model, the agent uses predictive models, trained on historical and real-time observability data, to forecast the expected outcomes across three vectors: cost, latency (TTFT and TPOT), and a consolidated risk score.
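
Conceptually, this step is a filter-then-predict loop over the supply catalog. The sketch below is illustrative only: evaluate_supplies and the predictors object stand in for the agent's internal logic and are not part of the public API.

# Conceptual sketch of Step 2 (not part of the public API).
def evaluate_supplies(workload, supply_catalog, predictors):
    candidates = []
    for supply in supply_catalog:
        # Filter: discard supplies that cannot meet basic requirements.
        if supply["capabilities"]["context_window_tokens"] < workload["estimated_tokens"]:
            continue
        # Predict: forecast the three objective vectors for this pairing.
        candidates.append({
            "supply_id": supply["supply_id"],
            "predicted_cost": predictors.cost(workload, supply),
            "predicted_ttft_ms_p95": predictors.ttft(workload, supply),
            "predicted_tpot_ms_p95": predictors.tpot(workload, supply),
            "predicted_risk_score": predictors.risk(workload, supply),
        })
    return candidates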

Step 3: Policy & Governance

Before the final set of solutions is formulated, it is passed through a governance layer. A policy engine checks each potential decision against global rules you have configured. For example, a rule might prevent a workload with a high business impact score from being routed to a model with a predicted risk score above a certain threshold. This layer can also flag decisions that are too high-stakes for automated selection, escalating them for human review.
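
Governance rules of the kind described above might be expressed roughly as follows. This is a hypothetical sketch: the actual policy syntax is configured in the middleware's governance layer, and the field names here are assumptions.

[
  {
    "rule_id": "block-high-impact-on-risky-models",
    "condition": {
      "workload.business_impact_score": { "gte": 8 },
      "solution.predicted_risk_score": { "gt": 6.0 }
    },
    "action": "POLICY_VIOLATION"
  },
  {
    "rule_id": "escalate-medical-queries",
    "condition": { "metadata.app_id": { "eq": "medical-qa" } },
    "action": "HUMAN_REVIEW_REQUIRED"
  }
]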


3. API Reference: The decide Endpoint

Provides a multi-objective analysis of a given workload, returning a set of optimal, policy-compliant routing decisions.

  • Method: POST
  • URL: /v2/middleware/decide
  • Authentication: API key included in the X-API-Key HTTP header.

Request Body (WorkloadDecisionRequest)

A JSON object containing the details of the workload to be analyzed.

  • raw_prompt (String, required): The full, unmodified prompt text for the workload.
  • metadata (Object, required): Contextual metadata. Must include an app_id string to identify the source application for SLO and risk profiling. Other fields such as user_id can be included for tracking.
  • utility_weights (Object, optional): Specifies the relative importance of each objective (cost, latency, risk) as float values. Guides the selection of recommended_solution. If omitted, the system uses a balanced profile.
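
A minimal request body looks like this (illustrative values; utility_weights is optional):

{
  "raw_prompt": "Summarize the attached support ticket.",
  "metadata": { "app_id": "customer-support-chat", "user_id": "user-42" },
  "utility_weights": { "cost": 1.0, "latency": 1.0, "risk": 1.0 }
}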

Response Body (WorkloadDecisionResponse)

A successful response (200 OK) contains a JSON object with the results of the optimization analysis.

  • decision_id (String, UUID): A unique identifier for this decision event.
  • workload_id (String, UUID): The unique identifier assigned to the workload profile.
  • governance_status (Enum): The outcome of the governance review: APPROVED_AUTOMATED, APPROVED_HUMAN, HUMAN_REVIEW_REQUIRED, or POLICY_VIOLATION.
  • pareto_solutions (Array of ParetoSolution): An array of optimal, non-dominated solutions (the Pareto front). Empty if governance_status is POLICY_VIOLATION.
  • recommended_solution (ParetoSolution or null): The single solution recommended by the system's utility function. null if no recommendation could be made or if governance_status is HUMAN_REVIEW_REQUIRED.

4. Core Data Models

4.1 The ParetoSolution Object

The ParetoSolution is the fundamental unit of value returned. Each object in the pareto_solutions array represents a distinct, optimal trade-off.

  • supply_id (String, UUID): The unique identifier for the supply option (model) this solution corresponds to.
  • predicted_cost (Float): The forecasted monetary cost to execute the workload, in USD.
  • predicted_ttft_ms_p95 (Float): Predicted 95th-percentile Time-To-First-Token, in milliseconds. Critical for interactive applications.
  • predicted_tpot_ms_p95 (Float): Predicted 95th-percentile Time-Per-Output-Token, in milliseconds. Reflects response streaming speed.
  • predicted_risk_score (Float): A consolidated risk score from 1 (low) to 10 (high), based on the supply's safety profile and the workload's business impact.
  • trade_off_profile (String): A human-readable tag describing the solution's bias, e.g., "Cost-Optimized", "Latency-Optimized", or "Balanced".
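
An individual ParetoSolution therefore looks like this (illustrative values):

{
  "supply_id": "a1b2c3d4-e5f6-7890-1234-openai-gpt4-turbo",
  "predicted_cost": 0.0042,
  "predicted_ttft_ms_p95": 850.0,
  "predicted_tpot_ms_p95": 35.0,
  "predicted_risk_score": 3.5,
  "trade_off_profile": "Balanced"
}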

5. Supply Configuration Examples

Each "Supply Configuration" object is more than just a model name; it's a comprehensive manifest of its technical and operational parameters. Below are conceptual examples of what these configurations might look like for both a third-party API and a self-hosted model.

Example 1: Third-Party API Model (OpenAI GPT-4)

This configuration defines a connection to a commercial, third-party model, specifying API parameters and cost structures.

{
  "supply_id": "a1b2c3d4-e5f6-7890-1234-openai-gpt4-turbo",
  "supply_name": "OpenAI GPT-4 Turbo",
  "provider": "OpenAI",
  "deployment_type": "API",
  "connection_config": {
    "api_endpoint": "[https://api.openai.com/v1/chat/completions](https://api.openai.com/v1/chat/completions)",
    "api_key_name": "OPENAI_API_KEY"
  },
  "model_config": {
    "model_slug": "gpt-4-turbo",
    "parameters": {
      "temperature": 0.7,
      "max_tokens": 4096,
      "top_p": 1.0,
      "response_format": { "type": "json_object" }
    }
  },
  "capabilities": {
    "context_window_tokens": 128000,
    "supports_tools": true,
    "supports_structured_output": true
  },
  "cost_model": {
    "input_cost_per_million_tokens": 10.00,
    "output_cost_per_million_tokens": 30.00
  }
}

Example 2: Self-Hosted Model (Llama 3 70B)

This example shows a highly tuned configuration for a self-hosted model, detailing quantization, performance settings, and infrastructure-level parameters.

{
  "supply_id": "a1b2c3d4-e5f6-7890-1234-llama3-70b-awq",
  "supply_name": "Llama 3 70B - High Throughput (AWQ)",
  "provider": "SelfHosted",
  "deployment_type": "vLLM",
  "connection_config": {
    "api_endpoint": "[http://10.0.1.55:8000/v1/generate](http://10.0.1.55:8000/v1/generate)"
  },
  "model_config": {
    "model_path": "/models/meta-llama/Llama-3-70B-Instruct",
    "quantization": {
      "level": "4-bit",
      "type": "AWQ"
    },
    "context_length_limit": 8192,
    "max_input_tokens": 7000,
    "max_output_tokens": 2048
  },
  "performance_config": {
    "batching_strategy": "dynamic",
    "max_batch_size": 128,
    "tensor_parallel_size": 4,
    "kv_cache_config": {
      "type": "paged",
      "block_size": 16,
      "gpu_memory_utilization": 0.90
    },
    "speculative_decoding": {
      "draft_model_path": "/models/meta-llama/Llama-3-8B-Instruct",
      "num_speculative_tokens": 5
    }
  },
  "cost_model": {
    "amortized_hourly_cost_usd": 12.50
  }
}
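
To make the two cost models concrete, here is roughly how a per-request predicted_cost could be derived from each. This is a simplified sketch: the agent's actual cost predictors also account for factors such as token estimation and cluster utilization, and the token counts and throughput figure below are assumptions.

# Simplified per-request cost arithmetic for the two example supplies above.

# Example 1: per-token API pricing (USD per million tokens).
input_tokens, output_tokens = 1_500, 500
api_cost = (input_tokens / 1_000_000) * 10.00 + (output_tokens / 1_000_000) * 30.00
print(f"API request cost: ${api_cost:.4f}")                  # $0.0300

# Example 2: amortized hourly cost of the self-hosted deployment, spread over
# its effective throughput (the requests-per-hour figure is an assumed value).
amortized_hourly_cost_usd = 12.50
requests_per_hour = 3_600
self_hosted_cost = amortized_hourly_cost_usd / requests_per_hour
print(f"Self-hosted request cost: ${self_hosted_cost:.4f}")  # ~$0.0035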

6. Integration Recipes

The following recipes demonstrate how to use the /v2/middleware/decide endpoint to solve common operational challenges.

Recipe 1: Balanced Optimization

Goal: Make a standard request for a general-purpose interactive chatbot. The system should automatically select the best all-around option.

cURL Request:

curl -X POST 'https://api.netrasystems.ai/v2/middleware/decide' \
-H 'Content-Type: application/json' \
-H 'X-API-Key: YOUR_API_KEY' \
-d '{
  "raw_prompt": "Tell me a fun fact about the Roman Empire.",
  "metadata": {
    "app_id": "chatbot-interactive",
    "session_id": "session-abc-123"
  }
}'

Python Client (httpx):

import httpx
import json
import asyncio

async def get_balanced_decision():
    api_key = "YOUR_API_KEY"
    url = "[https://api.netrasystems.ai/v2/middleware/decide](https://api.netrasystems.ai/v2/middleware/decide)"
    payload = {
        "raw_prompt": "Tell me a fun fact about the Roman Empire.",
        "metadata": {
            "app_id": "chatbot-interactive",
            "session_id": "session-abc-123"
        }
    }
    headers = {
        "Content-Type": "application/json",
        "X-API-Key": api_key
    }

    async with httpx.AsyncClient() as client:
        try:
            response = await client.post(url, headers=headers, json=payload, timeout=5.0)
            response.raise_for_status()
            decision = response.json()
            print(json.dumps(decision, indent=2))
        except httpx.HTTPStatusError as e:
            print(f"HTTP error occurred: {e.response.status_code} - {e.response.text}")
        except httpx.RequestError as e:
            print(f"An error occurred while requesting {e.request.url!r}.")

# To run: asyncio.run(get_balanced_decision())

Explanation: The request uses app_id: "chatbot-interactive". Since no utility_weights were provided, the system uses its default balanced profile to populate the recommended_solution field with the option that offers the best overall balance of cost, latency, and risk.

Recipe 2: Prioritizing for Minimum Cost

Goal: Find the absolute cheapest way to run a low-priority, offline task.

cURL Request:

curl -X POST 'https://api.netrasystems.ai/v2/middleware/decide' \
-H 'Content-Type: application/json' \
-H 'X-API-Key: YOUR_API_KEY' \
-d '{
  "raw_prompt": "Summarize the following meeting transcript...",
  "metadata": {
    "app_id": "batch-summarizer"
  },
  "utility_weights": {
    "cost": 3.0,
    "latency": 0.5,
    "risk": 1.0
  }
}'

Explanation: By providing utility_weights that heavily favor cost ("cost": 3.0), we explicitly instruct the system to prioritize economy. The agent uses these weights to select the "Cost-Optimized" option as the recommended_solution, even if it is significantly slower.

Recipe 3: Prioritizing for Minimum Latency

Goal: Get the fastest possible response for a highly interactive coding assistant where low TTFT is critical.

cURL Request:

curl -X POST 'https://api.netrasystems.ai/v2/middleware/decide' \
-H 'Content-Type: application/json' \
-H 'X-API-Key: YOUR_API_KEY' \
-d '{
  "raw_prompt": "import pandas as pd\n# How do I merge two dataframes on multiple columns?",
  "metadata": {
    "app_id": "code-assistant"
  },
  "utility_weights": {
    "cost": 0.5,
    "latency": 3.0,
    "risk": 1.5
  }
}'

Explanation: Providing utility_weights of {"latency": 3.0} will cause the system to recommend the solution from the Pareto front with the lowest predicted_ttft_ms_p95.

Recipe 4: Handling Governance Reviews

Goal: Submit a query for a high-risk application and correctly interpret a governance escalation. This status is a safety feature, not an error.

cURL Request:

curl -X POST 'https://api.netrasystems.ai/v2/middleware/decide' \
-H 'Content-Type: application/json' \
-H 'X-API-Key: YOUR_API_KEY' \
-d '{
  "raw_prompt": "What are the contraindications for using warfarin with amiodarone?",
  "metadata": {
    "app_id": "medical-qa",
    "user_id": "dr_smith_45"
  }
}'

Sample API Response:

{
  "decision_id": "b4b3b2b1-a0a9-8765-4321-fedcba987654",
  "workload_id": "e9e8e7e6-d5d4-3210-cba9-876543210fed",
  "governance_status": "HUMAN_REVIEW_REQUIRED",
  "pareto_solutions": [
    ...
  ],
  "recommended_solution": null
}

Explanation: The app_id: "medical-qa" is configured with a high business impact score. The agent's governance layer detects this and determines an automated decision is too risky. The governance_status is set to HUMAN_REVIEW_REQUIRED, and recommended_solution is null. This is a signal to the client application to escalate the request, for example by routing it to a pre-defined safe default model or queueing it for manual review.
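
A minimal client-side handler for the different governance_status values might look like the sketch below. The route_to_safe_default and enqueue_for_review functions are placeholders for your own application logic, not part of the middleware API.

def route_to_safe_default() -> str:
    # Placeholder: return the supply_id of a pre-approved safe default model.
    return "a1b2c3d4-e5f6-7890-1234-openai-gpt4-turbo"

def enqueue_for_review(decision_id: str) -> None:
    # Placeholder: push the decision onto your manual review queue.
    print(f"Escalated decision {decision_id} for human review")

def handle_decision(decision: dict) -> str:
    status = decision.get("governance_status")
    recommended = decision.get("recommended_solution")

    if status == "APPROVED_AUTOMATED" and recommended:
        return recommended["supply_id"]              # proceed automatically
    if status == "HUMAN_REVIEW_REQUIRED":
        enqueue_for_review(decision["decision_id"])  # escalate for manual review
        return route_to_safe_default()               # serve the user via a safe default
    # POLICY_VIOLATION, missing recommendation, or any unexpected status
    return route_to_safe_default()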

Recipe 5: Fail-safes and Fallback Logic

For a production system, it's critical to design for failure. Your application should not be tightly coupled to the availability of the Co-Optimization Agent.

  • Client-Side Timeouts: The call to the agent should have an aggressive timeout (e.g., 100-200ms). If the agent doesn't respond quickly, proceed with a default action.
  • Default/Fallback Routing: Define a "good enough" default model or execution path within your application. This path is used if the agent call fails, times out, or returns a non-approved status.
  • Circuit Breakers: For high-throughput systems, wrap the agent call in a circuit breaker. If the agent fails consistently, the breaker will "trip" and immediately route all subsequent requests to the fallback path for a period of time; see the sketch after this list.
  • Handling Governance Escalations: As shown in Recipe 4, a HUMAN_REVIEW_REQUIRED status is a deliberate fail-safe signal. Your application's fallback logic must handle this by routing the request to a separate, safe workflow.
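
The sketch below combines the first three points: an aggressive per-call timeout and a minimal in-process circuit breaker around the decide call. It is a simplified illustration; production systems would typically use a dedicated resilience library, and the thresholds shown are assumptions.

import time
import httpx

class SimpleCircuitBreaker:
    """Trips after consecutive failures; allows a trial request after a cooldown."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after_s:
            # Half-open: permit a trial request after the cooldown.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

breaker = SimpleCircuitBreaker()

def decide_with_fallback(payload: dict, api_key: str):
    """Call the decide endpoint with a tight timeout; return None on any failure."""
    if not breaker.allow():
        return None  # Breaker is open: skip the agent and use the default path.
    try:
        response = httpx.post(
            "https://api.netrasystems.ai/v2/middleware/decide",
            headers={"Content-Type": "application/json", "X-API-Key": api_key},
            json=payload,
            timeout=0.2,  # Aggressive ~200 ms budget for the decision call.
        )
        response.raise_for_status()
        breaker.record_success()
        return response.json()
    except (httpx.RequestError, httpx.HTTPStatusError):
        breaker.record_failure()
        return None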

Recipe 6: Full End-to-End Integration Pattern

Goal: Demonstrate the complete integration loop with robust fallback logic: call the agent, parse the recommendation, handle failures by routing to a default, and then call the chosen final inference provider.

Python Client (Conceptual):

import httpx
import json
import asyncio
import os

# Mock mapping of agent Supply IDs to actual provider details.
SUPPLY_PROVIDER_MAP = {
    "a1b2c3d4-e5f6-7890-1234-openai-gpt4-turbo": {
        "url": "[https://api.anthropic.com/v1/messages](https://api.anthropic.com/v1/messages)",
        "api_key_env": "ANTHROPIC_API_KEY",
        "model_name": "claude-3-haiku-20240307"
    },
    "a1b2c3d4-e5f6-7890-1234-llama3-70b-awq": {
        "url": "[http://my-self-hosted-llama.internal/generate](http://my-self-hosted-llama.internal/generate)",
        "api_key_env": None,
        "model_name": "meta-llama/Llama-3-70B-Instruct"
    }
}
DEFAULT_SUPPLY_ID = "a1b2c3d4-e5f6-7890-1234-openai-gpt4-turbo" # Fallback to a fast model

async def get_optimized_decision(prompt, metadata):
    # (Code from Recipe 1 to call the Co-Optimization Agent asynchronously)
    # Returns decision data or None on failure/timeout.
    api_key = "YOUR_API_KEY"
    url = "[https://api.netrasystems.ai/v2/middleware/decide](https://api.netrasystems.ai/v2/middleware/decide)"
    payload = {"raw_prompt": prompt, "metadata": metadata}
    headers = {"Content-Type": "application/json", "X-API-Key": api_key}
    async with httpx.AsyncClient() as client:
        try:
            response = await client.post(url, headers=headers, json=payload, timeout=2.0)
            response.raise_for_status()
            return response.json()
        except (httpx.RequestError, httpx.HTTPStatusError) as e:
            print(f"Co-Optimization Agent call failed: {e}. Proceeding with fallback.")
            return None

async def execute_workload(prompt, metadata):
    decision_data = await get_optimized_decision(prompt, metadata)
    chosen_supply_id = None

    # 1. Check for a valid, automated recommendation
    if (decision_data and
        decision_data.get("recommended_solution") and
        decision_data.get("governance_status") == "APPROVED_AUTOMATED"):
        chosen_supply_id = decision_data["recommended_solution"]["supply_id"]
        print(f"Agent recommended supply ID: {chosen_supply_id}")
    else:
        # 2. If no recommendation, use the default fallback
        print("No valid recommendation. Using default fallback.")
        chosen_supply_id = DEFAULT_SUPPLY_ID

    # 3. Look up provider details and execute
    provider_info = SUPPLY_PROVIDER_MAP.get(chosen_supply_id)
    if not provider_info:
        print(f"FATAL: Unknown supply ID '{chosen_supply_id}'.")
        return

    print(f"Routing to provider: {provider_info['model_name']}")
    # 4. Make the final call to the chosen inference provider... (mocked)
    print("Mock execution successful.")


# To run: asyncio.run(execute_workload("A prompt", {"app_id": "some-app"}))

Recipe 7: Advanced Configuration with an Orchestration Library

Goal: Set query configuration based on the agent's recommendation.

📘 Notice how the params are variable here, based on the agent's recommendation.

Python Client (using LangChain):


import asyncio
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template("Query: {query}\n\nAnswer:")

# Fallback defaults, used when the agent call fails or returns no usable recommendation.
FALLBACK_MODEL_NAME = "gpt-4-turbo"
FALLBACK_PARAMS = {"temperature": 0.7}

def get_params(solution: dict):
    # Resolve the model name and inference parameters associated with the
    # recommended solution (e.g., looked up from the supply configuration for
    # its supply_id), falling back to the defaults if a field is missing.
    model_name = solution.get("model", FALLBACK_MODEL_NAME)
    params = solution.get("parameters", FALLBACK_PARAMS)
    return model_name, params


async def execute_routed_query(query: str, agent_decision: dict):
    # Start from the fallback configuration; override it only if the agent
    # returned a usable recommendation.
    model_name = FALLBACK_MODEL_NAME
    params = FALLBACK_PARAMS

    solution = None
    if agent_decision:
        solution = agent_decision.get("recommended_solution")

    if solution:
        model_name, params = get_params(solution)

    llm = ChatOpenAI(model=model_name, **params)

    # Create and run the final chain
    chain = prompt | llm
    result = await chain.ainvoke({"query": query})

    return result.content

Explanation: This pattern uses the Co-Optimization Agent as the "brain" for a LangChain router: the recommended solution determines the model and inference parameters used to build the chain, while the hard-coded defaults provide a safe fallback when no recommendation is available.


7. Background Concepts

This section provides background on the concepts that power the middleware.

Multi-Objective Optimization

The agent's analysis is grounded in a multi-objective optimization framework that considers three primary, often conflicting, vectors:

  • Cost: The predicted monetary cost to execute the workload.
  • Latency: The predicted time-based performance (TTFT and TPOT).
  • Risk: A consolidated score representing a model's propensity for undesirable outcomes (e.g., hallucinations, PII leakage), weighted by the application's business impact.

The agent makes the trade-offs between these objectives explicit and quantifiable.

The Pareto Front: A Set of Optimal Choices

Instead of returning a single "best" answer, the agent generates a Pareto front of solutions. A Pareto front is a set of non-dominated solutions, where no single option is better on one objective (e.g., cost) without being worse on another (e.g., latency). Each solution on the front represents a unique and efficient balance of the core objectives.

This empowers the developer to implement business logic that makes the final trade-off decision based on the context of the request.
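
For example, a client can apply its own weights to the returned Pareto front instead of relying on the server-side recommendation. The sketch below uses a deliberately simple min-max normalization, and the weights in the usage comment are illustrative.

def pick_solution(pareto_solutions: list[dict], weights: dict) -> dict:
    """Pick the Pareto solution with the lowest weighted, normalized score."""
    def normalize(values):
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

    costs = normalize([s["predicted_cost"] for s in pareto_solutions])
    latencies = normalize([s["predicted_ttft_ms_p95"] for s in pareto_solutions])
    risks = normalize([s["predicted_risk_score"] for s in pareto_solutions])

    scores = [
        weights["cost"] * c + weights["latency"] * l + weights["risk"] * r
        for c, l, r in zip(costs, latencies, risks)
    ]
    return pareto_solutions[scores.index(min(scores))]

# e.g. a latency-sensitive caller:
# best = pick_solution(decision["pareto_solutions"], {"cost": 0.5, "latency": 3.0, "risk": 1.0})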

Decoupled Architecture: Optimization-as-a-Service

The agent's decoupled architecture separates the decision logic from the final execution. It functions as an external "brain" that provides a recommendation, while your application remains the "hands" that executes the final call to the inference provider. This provides two key benefits:

  • Agility: Add, remove, or test new models in the agent's supply catalog without any client-side code changes.
  • Centralized Control: Manage optimization strategies and governance rules centrally, ensuring consistency across all applications.

The Self-Improving Loop

The agent is a continuously learning system. After your application executes a workload, the resulting performance data (actual cost, latency, quality metrics) can be fed back into the system. This feedback updates the statistical profiles of the models in the supply catalog, ensuring that predictions become more accurate over time and automatically adapt to real-world phenomena like model performance drift.

Offline Analysis for Strategic Insights

The platform can also be used for offline analysis to evaluate "what-if" scenarios without impacting production traffic. This is ideal for strategic planning, such as:

  • Evaluating New Models: Assess a new model against your existing models for a representative set of workloads to understand its performance and cost profile before integration.
  • Budget Forecasting: Analyze the potential cost implications of launching a new AI feature.
  • Performance Tuning: Explore the cost vs. latency trade-offs for a new application feature before it goes live.