Middleware
AI Co-Optimization General API Guide, Optional Activation Method
1. Introduction
The Co-Optimization Middleware is an optional real-time decision layer that optimizes AI workloads on a per-request basis.
It functions as the "brain" of your existing smart-routing stack: the pre-routing decision engine.
Before sending a workload to an LLM, your application first calls the middleware's API endpoint. The service analyzes the request against available Supply Options (Services, Models, Configs) and returns a set of optimal execution strategies, enabling your application to programmatically balance cost, latency, and risk.
This approach replaces manual, intuition-based tuning with a unified, data-driven control plane for AI optimization. The following sections detail the API endpoints, data models, and integration patterns for using the middleware in a production environment.
2. How It Works: The Real-Time Optimization Lifecycle
When a request is sent to the Co-Optimization Middleware, it passes through a multi-step process to transform a prompt into a set of actionable, optimized decisions.
Step 1: Workload Analysis
Upon receiving a request, the agent analyzes the workload. It parses the raw_prompt to understand its semantic content and extract linguistic features. For compliance with regulations like GDPR or HIPAA, prompts can optionally be pre-processed to redact or anonymize Personally Identifiable Information (PII) before being sent.
The agent uses the metadata provided in the request, especially the app_id, to attach pre-configured Service Level Objectives (SLOs) and risk profiles. For instance, an app_id of customer-support-chat might be associated with a strict latency target, while an app_id of internal-data-analysis might prioritize cost over speed.
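For illustration, the kind of per-application profile the middleware might attach based on app_id is sketched below. The field names and values are hypothetical; these profiles are configured on the platform, not sent by the client.

# Hypothetical per-app_id profiles resolved by the middleware (illustrative only).
APP_PROFILES = {
    "customer-support-chat": {
        "ttft_ms_p95_target": 500,      # strict latency target for interactive support
        "business_impact": "medium",
        "cost_sensitivity": "low",
    },
    "internal-data-analysis": {
        "ttft_ms_p95_target": 5000,     # latency is relaxed for internal analysis
        "business_impact": "low",
        "cost_sensitivity": "high",     # prioritize cost over speed
    },
}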
Step 2: Supply Evaluation & Prediction
With a workload profile, the agent consults a dynamic catalog of all available AI models. Each entry in this catalog is a comprehensive configuration for a specific model deployment, defining its operational characteristics (e.g., deployment type, API endpoint, inference parameters, cost model). The agent filters this list to find viable candidates that meet the workload's basic requirements, such as a sufficient context window.
For each viable model, the agent uses predictive models, trained on historical and real-time observability data, to forecast the expected outcomes across three vectors: cost, latency (TTFT and TPOT), and a consolidated risk score.
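As a simplified illustration of the cost vector alone, a per-request forecast for an API-priced supply can be derived from predicted token counts and the supply's cost_model (see Section 5). The token counts below are placeholders; the middleware's actual predictors also cover latency and risk.

def estimate_api_cost_usd(input_tokens: int, predicted_output_tokens: int,
                          input_cost_per_million: float, output_cost_per_million: float) -> float:
    # Linear token-based pricing, as in the cost_model of Example 1 (Section 5).
    return (input_tokens * input_cost_per_million +
            predicted_output_tokens * output_cost_per_million) / 1_000_000

# e.g. 1,200 input tokens and ~400 predicted output tokens at GPT-4 Turbo pricing:
# estimate_api_cost_usd(1200, 400, 10.00, 30.00) -> 0.024 USD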
Step 3: Policy & Governance
Before the final set of solutions is formulated, it is passed through a governance layer. A policy engine checks each potential decision against global rules you have configured. For example, a rule might prevent a workload with a high business impact score from being routed to a model with a predicted risk score above a certain threshold. This layer can also flag decisions that are too high-stakes for automated selection, escalating them for human review.
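Conceptually, such a rule might look like the sketch below. The thresholds and function are illustrative only; actual policies are configured in the platform's policy engine.

MAX_RISK_FOR_HIGH_IMPACT = 6.0   # hypothetical threshold

def review_candidate(business_impact_score: float, predicted_risk_score: float) -> str:
    # High-impact workloads may not be routed to high-risk supplies.
    if business_impact_score >= 8.0 and predicted_risk_score > MAX_RISK_FOR_HIGH_IMPACT:
        return "POLICY_VIOLATION"
    # Very high-stakes requests are escalated rather than decided automatically.
    if business_impact_score >= 9.0:
        return "HUMAN_REVIEW_REQUIRED"
    return "APPROVED_AUTOMATED"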
3. API Reference: The decide Endpoint
Provides a multi-objective analysis of a given workload, returning a set of optimal, policy-compliant routing decisions.
- Method: POST
- URL: /v2/middleware/decide
- Authentication: API key included in the X-API-Key HTTP header.
Request Body (WorkloadDecisionRequest)
A JSON object containing the details of the workload to be analyzed.
Field Name | Type | Required | Description |
---|---|---|---|
raw_prompt | String | Yes | The full, unmodified prompt text for the workload. |
metadata | Object | Yes | Contextual metadata. Must include an app_id string to identify the source application for SLO and risk profiling. Other fields like user_id can be included for tracking. |
utility_weights | Object | No | Optional. Specifies the relative importance of each objective (cost, latency, risk). Values are floats. Guides the selection of recommended_solution. If omitted, the system uses a balanced profile. |
Response Body (WorkloadDecisionResponse)
A successful response (200 OK) contains a JSON object with the results of the optimization analysis.
Field Name | Type | Description |
---|---|---|
decision_id | String (UUID) | A unique identifier for this decision event. |
workload_id | String (UUID) | The unique identifier assigned to the workload profile. |
governance_status | Enum | The outcome of the governance review: APPROVED_AUTOMATED, APPROVED_HUMAN, HUMAN_REVIEW_REQUIRED, POLICY_VIOLATION. |
pareto_solutions | Array of ParetoSolution | An array of optimal, non-dominated solutions (the Pareto front). Empty if governance_status is POLICY_VIOLATION. |
recommended_solution | ParetoSolution or null | The single solution recommended by the system's utility function. null if no recommendation could be made or if governance_status is HUMAN_REVIEW_REQUIRED. |
4. Core Data Models
4.1 The ParetoSolution Object
The ParetoSolution is the fundamental unit of value returned. Each object in the pareto_solutions array represents a distinct, optimal trade-off.
Field Name | Type | Description |
---|---|---|
supply_id | String (UUID) | The unique identifier for the supply option (model) this solution corresponds to. |
predicted_cost | Float | The forecasted monetary cost to execute the workload (USD). |
predicted_ttft_ms_p95 | Float | Predicted 95th percentile Time-To-First-Token in milliseconds. Critical for interactive applications. |
predicted_tpot_ms_p95 | Float | Predicted 95th percentile Time-Per-Output-Token in milliseconds. Reflects response streaming speed. |
predicted_risk_score | Float | A consolidated risk score from 1 (low) to 10 (high), based on the supply's safety profile and the workload's business impact. |
trade_off_profile | String | A human-readable tag describing the solution's bias, e.g., "Cost-Optimized", "Latency-Optimized", or "Balanced". |
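For reference, a single ParetoSolution received by the client might look like the following example; all values are placeholders, not real predictions.

# Illustrative ParetoSolution (placeholder values).
example_solution = {
    "supply_id": "a1b2c3d4-e5f6-7890-1234-llama3-70b-awq",
    "predicted_cost": 0.0042,          # USD for this workload
    "predicted_ttft_ms_p95": 850.0,    # time to first token, p95
    "predicted_tpot_ms_p95": 22.0,     # time per output token, p95
    "predicted_risk_score": 3.5,       # 1 (low) to 10 (high)
    "trade_off_profile": "Cost-Optimized",
}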
5. Supply Configuration Examples
Each "Supply Configuration" object is more than just a model name; it's a comprehensive manifest of its technical and operational parameters. Below are conceptual examples of what these configurations might look like for both a third-party API and a self-hosted model.
Example 1: Third-Party API Model (OpenAI GPT-4)
This configuration defines a connection to a commercial, third-party model, specifying API parameters and cost structures.
{
"supply_id": "a1b2c3d4-e5f6-7890-1234-openai-gpt4-turbo",
"supply_name": "OpenAI GPT-4 Turbo",
"provider": "OpenAI",
"deployment_type": "API",
"connection_config": {
"api_endpoint": "[https://api.openai.com/v1/chat/completions](https://api.openai.com/v1/chat/completions)",
"api_key_name": "OPENAI_API_KEY"
},
"model_config": {
"model_slug": "gpt-4-turbo",
"parameters": {
"temperature": 0.7,
"max_tokens": 4096,
"top_p": 1.0,
"response_format": { "type": "json_object" }
}
},
"capabilities": {
"context_window_tokens": 128000,
"supports_tools": true,
"supports_structured_output": true
},
"cost_model": {
"input_cost_per_million_tokens": 10.00,
"output_cost_per_million_tokens": 30.00
}
}
Example 2: Self-Hosted Model (Llama 3 70B)
This example shows a highly tuned configuration for a self-hosted model, detailing quantization, performance settings, and infrastructure-level parameters.
{
"supply_id": "a1b2c3d4-e5f6-7890-1234-llama3-70b-awq",
"supply_name": "Llama 3 70B - High Throughput (AWQ)",
"provider": "SelfHosted",
"deployment_type": "vLLM",
"connection_config": {
"api_endpoint": "[http://10.0.1.55:8000/v1/generate](http://10.0.1.55:8000/v1/generate)"
},
"model_config": {
"model_path": "/models/meta-llama/Llama-3-70B-Instruct",
"quantization": {
"level": "4-bit",
"type": "AWQ"
},
"context_length_limit": 8192,
"max_input_tokens": 7000,
"max_output_tokens": 2048
},
"performance_config": {
"batching_strategy": "dynamic",
"max_batch_size": 128,
"tensor_parallel_size": 4,
"kv_cache_config": {
"type": "paged",
"block_size": 16,
"gpu_memory_utilization": 0.90
},
"speculative_decoding": {
"draft_model_path": "/models/meta-llama/Llama-3-8B-Instruct",
"num_speculative_tokens": 5
}
},
"cost_model": {
"amortized_hourly_cost_usd": 12.50
}
}
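Unlike the token-priced API example above, a self-hosted supply is typically costed by amortizing its hourly infrastructure cost across observed throughput. A rough sketch, with an illustrative throughput figure:

def estimate_self_hosted_cost_usd(total_tokens: int,
                                  amortized_hourly_cost_usd: float = 12.50,
                                  observed_tokens_per_hour: float = 5_000_000) -> float:
    # Cost per token is the hourly cost spread over the tokens served in that hour;
    # the throughput figure here is a placeholder, not a measured value.
    cost_per_token = amortized_hourly_cost_usd / observed_tokens_per_hour
    return total_tokens * cost_per_token

# e.g. estimate_self_hosted_cost_usd(1600) -> 0.004 USD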
6. Integration Recipes
The following recipes demonstrate how to use the /v2/middleware/decide endpoint to solve common operational challenges.
Recipe 1: Balanced Optimization
Goal: Make a standard request for a general-purpose interactive chatbot. The system should automatically select the best all-around option.
cURL Request:
curl -X POST 'https://api.netrasystems.ai/v2/middleware/decide' \
-H 'Content-Type: application/json' \
-H 'X-API-Key: YOUR_API_KEY' \
-d '{
"raw_prompt": "Tell me a fun fact about the Roman Empire.",
"metadata": {
"app_id": "chatbot-interactive",
"session_id": "session-abc-123"
}
}'
Python Client (httpx):
import httpx
import json
import asyncio
async def get_balanced_decision():
    api_key = "YOUR_API_KEY"
    url = "https://api.netrasystems.ai/v2/middleware/decide"
    payload = {
        "raw_prompt": "Tell me a fun fact about the Roman Empire.",
        "metadata": {
            "app_id": "chatbot-interactive",
            "session_id": "session-abc-123"
        }
    }
    headers = {
        "Content-Type": "application/json",
        "X-API-Key": api_key
    }
    async with httpx.AsyncClient() as client:
        try:
            response = await client.post(url, headers=headers, json=payload, timeout=5.0)
            response.raise_for_status()
            decision = response.json()
            print(json.dumps(decision, indent=2))
        except httpx.HTTPStatusError as e:
            print(f"HTTP error occurred: {e.response.status_code} - {e.response.text}")
        except httpx.RequestError as e:
            print(f"An error occurred while requesting {e.request.url!r}.")

# To run: asyncio.run(get_balanced_decision())
Explanation: The request uses app_id: "chatbot-interactive". Since no utility_weights were provided, the system uses its default balanced profile to populate the recommended_solution field with the option that provides the best overall balance of cost, latency, and risk.
Recipe 2: Prioritizing for Minimum Cost
Goal: Find the absolute cheapest way to run a low-priority, offline task.
cURL Request:
curl -X POST 'https://api.netrasystems.ai/v2/middleware/decide' \
-H 'Content-Type: application/json' \
-H 'X-API-Key: YOUR_API_KEY' \
-d '{
"raw_prompt": "Summarize the following meeting transcript...",
"metadata": {
"app_id": "batch-summarizer"
},
"utility_weights": {
"cost": 3.0,
"latency": 0.5,
"risk": 1.0
}
}'
Explanation: By providing utility_weights that heavily favor cost ("cost": 3.0), we explicitly instruct the system to prioritize economy. The agent uses these weights to select the "Cost-Optimized" option as the recommended_solution, even if it is significantly slower.
Recipe 3: Prioritizing for Minimum Latency
Goal: Get the fastest possible response for a highly interactive coding assistant where low TTFT is critical.
cURL Request:
curl -X POST 'https://api.netrasystems.ai/v2/middleware/decide' \
-H 'Content-Type: application/json' \
-H 'X-API-Key: YOUR_API_KEY' \
-d '{
"raw_prompt": "import pandas as pd\n# How do I merge two dataframes on multiple columns?",
"metadata": {
"app_id": "code-assistant"
},
"utility_weights": {
"cost": 0.5,
"latency": 3.0,
"risk": 1.5
}
}'
Explanation: Providing utility_weights of {"latency": 3.0} will cause the system to recommend the solution from the Pareto front with the lowest predicted_ttft_ms_p95.
Recipe 4: Handling Governance Reviews
Goal: Submit a query for a high-risk application and correctly interpret a governance escalation. This status is a safety feature, not an error.
cURL Request:
curl -X POST 'https://api.netrasystems.ai/v2/middleware/decide' \
-H 'Content-Type: application/json' \
-H 'X-API-Key: YOUR_API_KEY' \
-d '{
"raw_prompt": "What are the contraindications for using warfarin with amiodarone?",
"metadata": {
"app_id": "medical-qa",
"user_id": "dr_smith_45"
}
}'
Sample API Response:
{
"decision_id": "b4b3b2b1-a0a9-8765-4321-fedcba987654",
"workload_id": "e9e8e7e6-d5d4-3210-cba9-876543210fed",
"governance_status": "HUMAN_REVIEW_REQUIRED",
"pareto_solutions": [
...
],
"recommended_solution": null
}
Explanation: The app_id: "medical-qa" is configured with a high business impact score. The agent's governance layer detects this and determines an automated decision is too risky. The governance_status is set to HUMAN_REVIEW_REQUIRED, and recommended_solution is null. This is a signal to the client application to escalate the request, for example by routing it to a pre-defined safe default model or queueing it for manual review.
Recipe 5: Fail-safes and Fallback Logic
For a production system, it's critical to design for failure. Your application should not be tightly coupled to the availability of the Co-Optimization Agent.
- Client-Side Timeouts: The call to the agent should have an aggressive timeout (e.g., 100-200ms). If the agent doesn't respond quickly, proceed with a default action.
- Default/Fallback Routing: Define a "good enough" default model or execution path within your application. This path is used if the agent call fails, times out, or returns a non-approved status.
- Circuit Breakers: For high-throughput systems, wrap the agent call in a circuit breaker. If the agent fails consistently, the breaker "trips" and immediately routes all subsequent requests to the fallback path for a period of time (see the sketch after this list).
- Handling Governance Escalations: As shown in Recipe 4, a HUMAN_REVIEW_REQUIRED status is a deliberate fail-safe signal. Your application's fallback logic must handle this by routing the request to a separate, safe workflow.
Recipe 6: Full End-to-End Integration Pattern
Goal: Demonstrate the complete integration loop with robust fallback logic: call the agent, parse the recommendation, handle failures by routing to a default, and then call the chosen final inference provider.
Python Client (Conceptual):
import httpx
import json
import asyncio
import os
# Mock mapping of agent Supply IDs to actual provider details.
SUPPLY_PROVIDER_MAP = {
    "a1b2c3d4-e5f6-7890-1234-openai-gpt4-turbo": {
        "url": "https://api.openai.com/v1/chat/completions",
        "api_key_env": "OPENAI_API_KEY",
        "model_name": "gpt-4-turbo"
    },
    "a1b2c3d4-e5f6-7890-1234-llama3-70b-awq": {
        "url": "http://my-self-hosted-llama.internal/generate",
        "api_key_env": None,
        "model_name": "meta-llama/Llama-3-70B-Instruct"
    }
}

DEFAULT_SUPPLY_ID = "a1b2c3d4-e5f6-7890-1234-openai-gpt4-turbo"  # Fallback to a fast, known-good model
async def get_optimized_decision(prompt, metadata):
    # (Code from Recipe 1 to call the Co-Optimization Agent asynchronously)
    # Returns decision data or None on failure/timeout.
    api_key = "YOUR_API_KEY"
    url = "https://api.netrasystems.ai/v2/middleware/decide"
    payload = {"raw_prompt": prompt, "metadata": metadata}
    headers = {"Content-Type": "application/json", "X-API-Key": api_key}
    async with httpx.AsyncClient() as client:
        try:
            response = await client.post(url, headers=headers, json=payload, timeout=2.0)
            response.raise_for_status()
            return response.json()
        except (httpx.RequestError, httpx.HTTPStatusError) as e:
            print(f"Co-Optimization Agent call failed: {e}. Proceeding with fallback.")
            return None
async def execute_workload(prompt, metadata):
    decision_data = await get_optimized_decision(prompt, metadata)
    chosen_supply_id = None

    # 1. Check for a valid, automated recommendation
    if (decision_data and
            decision_data.get("recommended_solution") and
            decision_data.get("governance_status") == "APPROVED_AUTOMATED"):
        chosen_supply_id = decision_data["recommended_solution"]["supply_id"]
        print(f"Agent recommended supply ID: {chosen_supply_id}")
    else:
        # 2. If no recommendation, use the default fallback
        print("No valid recommendation. Using default fallback.")
        chosen_supply_id = DEFAULT_SUPPLY_ID

    # 3. Look up provider details and execute
    provider_info = SUPPLY_PROVIDER_MAP.get(chosen_supply_id)
    if not provider_info:
        print(f"FATAL: Unknown supply ID '{chosen_supply_id}'.")
        return

    print(f"Routing to provider: {provider_info['model_name']}")
    # 4. Make the final call to the chosen inference provider... (mocked)
    print("Mock execution successful.")

# To run: asyncio.run(execute_workload("A prompt", {"app_id": "some-app"}))
Recipe 7: Advanced Configuration with an Orchestration Library
Goal: Set the query configuration based on the agent's recommendation. Notice how the model and parameters are variable here, driven by the agent's recommendation rather than hard-coded.
Python Client (using LangChain):
import asyncio
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template("Query: {query}\n\nAnswer:")

def get_params(solution: dict, default_model: str, default_params: dict):
    # Read the model and inference parameters from the agent's recommended solution,
    # keeping the fallback values if either field is absent.
    model_name = solution.get("model", default_model)
    params = solution.get("parameters", default_params)
    return model_name, params

async def execute_routed_query(query: str, agent_decision: dict):
    # Static fallback configuration, used when no recommendation is available
    model_name = "gpt-4-turbo"
    params = {"temperature": 0.7}

    if agent_decision:
        solution = agent_decision.get("recommended_solution")
        if solution:
            model_name, params = get_params(solution, model_name, params)

    llm = ChatOpenAI(model=model_name, **params)

    # Create and run the final chain
    chain = prompt | llm
    result = await chain.ainvoke({"query": query})
    return result.content
Explanation: This pattern uses the Co-Optimization Agent as the "brain" for a LangChain router.
7. Background Concepts
This section provides background on the concepts that power the middleware.
Multi-Objective Optimization
The agent's analysis is grounded in a multi-objective optimization framework that considers three primary, often conflicting, vectors:
- Cost: The predicted monetary cost to execute the workload.
- Latency: The predicted time-based performance (TTFT and TPOT).
- Risk: A consolidated score representing a model's propensity for undesirable outcomes (e.g., hallucinations, PII leakage), weighted by the application's business impact.
The agent makes the trade-offs between these objectives explicit and quantifiable.
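As a rough illustration of how utility_weights could combine these vectors into a single score, consider the sketch below. The normalization constants and weighted-sum form are assumptions for illustration; the middleware's actual utility function is internal.

def utility_score(solution: dict, weights: dict) -> float:
    # Lower is better: a weighted sum of normalized cost, latency, and risk.
    # Normalization constants are illustrative placeholders.
    cost_term = solution["predicted_cost"] / 0.05             # rough USD scale of a typical request
    latency_term = solution["predicted_ttft_ms_p95"] / 1000   # seconds to first token
    risk_term = solution["predicted_risk_score"] / 10          # scale to 0..1
    return (weights.get("cost", 1.0) * cost_term +
            weights.get("latency", 1.0) * latency_term +
            weights.get("risk", 1.0) * risk_term)

With the Recipe 2 weights ({"cost": 3.0, "latency": 0.5, "risk": 1.0}), the cost term dominates, so cheaper solutions score better.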
The Pareto Front: A Set of Optimal Choices
Instead of returning a single "best" answer, the agent generates a Pareto front of solutions. A Pareto front is a set of non-dominated solutions, where no single option is better on one objective (e.g., cost) without being worse on another (e.g., latency). Each solution on the front represents a unique and efficient balance of the core objectives.
This empowers the developer to implement business logic that makes the final trade-off decision based on the context of the request.
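For example, a client could apply its own rule to the returned front, such as choosing the cheapest solution that still meets an interactive latency budget. The 1,000 ms budget below is an arbitrary example.

def cheapest_within_latency_budget(pareto_solutions: list, ttft_budget_ms: float = 1000.0):
    # Keep only solutions that meet the interactivity requirement...
    fast_enough = [s for s in pareto_solutions if s["predicted_ttft_ms_p95"] <= ttft_budget_ms]
    if not fast_enough:
        return None  # caller falls back to its default routing
    # ...then take the cheapest of those.
    return min(fast_enough, key=lambda s: s["predicted_cost"])

If no solution meets the budget, the client falls back to its default routing, as described in Recipe 5.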
Decoupled Architecture: Optimization-as-a-Service
The agent's decoupled architecture separates the decision logic from the final execution. It functions as an external "brain" that provides a recommendation, while your application remains the "hands" that executes the final call to the inference provider. This provides two key benefits:
- Agility: Add, remove, or test new models in the agent's supply catalog without any client-side code changes.
- Centralized Control: Manage optimization strategies and governance rules centrally, ensuring consistency across all applications.
The Self-Improving Loop
The agent is a continuously learning system. After your application executes a workload, the resulting performance data (actual cost, latency, quality metrics) can be fed back into the system. This feedback updates the statistical profiles of the models in the supply catalog, ensuring that predictions become more accurate over time and automatically adapt to real-world phenomena like model performance drift.
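The exact feedback mechanism is beyond the scope of this guide, but conceptually the data your application would collect per executed workload looks like the sketch below; the field names are illustrative.

# Illustrative post-execution feedback record (field names are placeholders).
feedback_record = {
    "decision_id": "b4b3b2b1-a0a9-8765-4321-fedcba987654",   # from the decide response
    "supply_id": "a1b2c3d4-e5f6-7890-1234-llama3-70b-awq",   # the supply actually used
    "actual_cost_usd": 0.0051,
    "actual_ttft_ms": 910.0,
    "actual_tpot_ms": 24.0,
    "quality_signal": "thumbs_up",   # any application-level quality metric
}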
Offline Analysis for Strategic Insights
The platform can also be used for offline analysis to evaluate "what-if" scenarios without impacting production traffic. This is ideal for strategic planning, such as:
- Evaluating New Models: Assess a new model against your existing models for a representative set of workloads to understand its performance and cost profile before integration.
- Budget Forecasting: Analyze the potential cost implications of launching a new AI feature.
- Performance Tuning: Explore the cost vs. latency trade-offs for a new application feature before it goes live.