System-Level AI Optimization: Optional Value Activation Method
1. Introduction
The agent analyzes historical workload patterns to provide strategic, system-level tuning recommendations. This guide details how to retrieve those recommendations and apply them as one-time configuration changes to your model serving infrastructure. This is in contrast to the Middleware, which performs real-time, per-request optimization.
The goal is to optimize the baseline performance and cost-efficiency of your entire system based on observed usage patterns. This process is ideal for making strategic adjustments to self-hosted models or updating the default parameters used by your applications.
2. How It Works: Offline Analysis and Recommendation
For real-time optimization, see the Middleware. For IDE integration, see the MCP.
The process involves two main steps: retrieving recommendations from the agent and then manually applying them to your infrastructure.
Step 1: Generating System-Wide Recommendations
The agent analyzes historical workload data (e.g., from the past 30 days) for the specified applications (app_id). It identifies patterns in that data and, based on those patterns, runs simulations against your configured supply options to find an improved system-level configuration. For example, if the dominant workloads for a self-hosted model are variations on short conversations, it might recommend a configuration that reduces the KV cache block size to improve memory efficiency and allow for higher concurrency.
Step 2: Applying the Optimal Configuration
The agent provides a new configuration snippet. You can then apply this change manually to your model serving environment (like vLLM or Hugging Face TGI) or update the default call parameters in your application code (for third-party APIs like OpenAI). This updates the baseline for all future requests.
3. API Reference: The system_recommendations Endpoint
Provides a list of system-level tuning recommendations based on historical workload analysis.
- Method: POST
- URL: /v2/analysis/system_recommendations
- Authentication: API key included in the X-API-Key HTTP header.
Request Body
A JSON object specifying the scope of the analysis.
| Field Name | Type | Required | Description |
|---|---|---|---|
| app_ids | Array of Strings | Yes | A list of app_id values to include in the workload analysis. |
| analysis_period_days | Integer | No | The number of past days of data to analyze. Defaults to 30. |
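For illustration, the request below asks for an analysis of two apps over the past 14 days. This is a minimal sketch using the Python requests library; the base URL is a placeholder for your deployment's host, and the app_id values are the example apps used later in this guide:

import requests

# Placeholders: substitute your deployment's base URL and API key.
BASE_URL = "https://api.example.com"
API_KEY = "YOUR_API_KEY"

response = requests.post(
    f"{BASE_URL}/v2/analysis/system_recommendations",
    headers={"X-API-Key": API_KEY},
    json={
        "app_ids": ["code-assistant", "chatbot-interactive"],
        "analysis_period_days": 14,  # optional; defaults to 30
    },
)
response.raise_for_status()
report = response.json()
print(report["analysis_id"])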
Response Body
A successful response (200 OK) contains a JSON object with a list of recommendations.

| Field Name | Type | Description |
|---|---|---|
| analysis_id | String (UUID) | A unique identifier for this analysis report. |
| recommendations | Array of SystemRecommendation | An array of tuning recommendations for your supply configurations. |
The SystemRecommendation Object

| Field Name | Type | Description |
|---|---|---|
| recommendation_id | String (UUID) | A unique identifier for the specific recommendation. |
| target_supply_id | String (UUID) | The supply_id of the model configuration this recommendation applies to. |
| analysis_summary | String | A human-readable summary of the finding and the reasoning behind the recommendation. |
| predicted_impact | Object | An object detailing the estimated system-wide impact, e.g., {"cost_change_percent": -15.2, "p95_latency_change_percent": 5.1}. |
| recommended_config_snippet | Object | A JSON snippet containing the specific configuration parameters that should be updated. |
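Continuing the request sketch above, the loop below shows one way to inspect the returned recommendations. The field names come from the tables in this section; the report variable is the parsed response body from the earlier example:

# 'report' is the parsed JSON body from the request example above.
for rec in report["recommendations"]:
    print(f"Recommendation {rec['recommendation_id']}")
    print(f"  Target supply:    {rec['target_supply_id']}")
    print(f"  Summary:          {rec['analysis_summary']}")
    print(f"  Predicted impact: {rec['predicted_impact']}")
    # recommended_config_snippet is what you apply to your serving stack or app code.
    print(f"  Config snippet:   {rec['recommended_config_snippet']}")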
4. Applying Recommendations: Practical Examples
Once you have a recommended_config_snippet from the API, the next step is to apply it. The method depends on your serving stack.
Example Scenario
Imagine you run the analysis for your code-assistant app and receive the following recommendation for a self-hosted Llama 3 model:
Sample API Response Snippet:
{
"recommendation_id": "rec-1a2b3c-speculative-decoding",
"target_supply_id": "a1b2c3d4-e5f6-7890-1234-llama3-70b-awq",
"analysis_summary": "Analysis of 'code-assistant' workloads shows a high percentage of token generation. Enabling speculative decoding with a smaller draft model is predicted to reduce overall latency by up to 20% with negligible impact on accuracy for this use case.",
"predicted_impact": {
"cost_change_percent": 0,
"p95_tpot_change_percent": -22.5
},
"recommended_config_snippet": {
"performance_config": {
"speculative_decoding": {
"draft_model_path": "/models/meta-llama/Llama-3-8B-Instruct",
"num_speculative_tokens": 5
}
}
}
}
How to Apply This Recommendation:
Option 1: Updating a vLLM Server
If your supply_id corresponds to a model served with vLLM, you would apply the recommended_config_snippet by adding or modifying the command-line arguments when you launch the server.
Before (Original vLLM launch command):
python -m vllm.entrypoints.openai.api_server \
--model /models/meta-llama/Llama-3-70B-Instruct \
--quantization awq \
--tensor-parallel-size 4 \
--port 8000
After (Applying the recommendation):
The speculative_decoding recommendation translates to the --speculative-model and --num-speculative-tokens arguments in vLLM.
python -m vllm.entrypoints.openai.api_server \
--model /models/meta-llama/Llama-3-70B-Instruct \
--quantization awq \
--tensor-parallel-size 4 \
--port 8000 \
--speculative-model /models/meta-llama/Llama-3-8B-Instruct \
--num-speculative-tokens 5
You would then restart your vLLM server with the new command to activate the optimization.
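If you launch vLLM from a script, a helper along these lines can build the extra arguments from the snippet. The mapping covers only the speculative_decoding parameters shown above and is an illustrative sketch, not an exhaustive translation layer:

def speculative_flags(snippet: dict) -> list[str]:
    """Translate the speculative_decoding section of a recommended_config_snippet
    into vLLM command-line arguments (illustrative mapping only)."""
    spec = snippet.get("performance_config", {}).get("speculative_decoding")
    if not spec:
        return []
    return [
        "--speculative-model", spec["draft_model_path"],
        "--num-speculative-tokens", str(spec["num_speculative_tokens"]),
    ]

# Using the snippet from the sample response above:
snippet = {
    "performance_config": {
        "speculative_decoding": {
            "draft_model_path": "/models/meta-llama/Llama-3-8B-Instruct",
            "num_speculative_tokens": 5,
        }
    }
}
print(" ".join(speculative_flags(snippet)))
# --speculative-model /models/meta-llama/Llama-3-8B-Instruct --num-speculative-tokens 5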
Option 2: Updating an Application using an OpenAI Client
Sometimes, the recommendation isn't for the infrastructure but for how you use it. Imagine a recommendation to change the default temperature for your chatbot-interactive app to reduce hallucinations.
Sample Recommendation:
{
"recommendation_id": "rec-4d5e6f-temperature-tuning",
"target_supply_id": "a1b2c3d4-e5f6-7890-1234-openai-gpt4-turbo",
"analysis_summary": "Workloads for 'chatbot-interactive' show a high rate of factual queries where lower temperature can reduce confabulation. Recommend reducing default temperature from 0.7 to 0.2.",
"predicted_impact": {
"risk_score_change_percent": -40.0
},
"recommended_config_snippet": {
"model_config": {
"parameters": {
"temperature": 0.2
}
}
}
}
You would apply this by changing the default parameters in your application code that calls the OpenAI API.
Before (Original Python code):
from openai import OpenAI
client = OpenAI()
def get_chat_response(prompt):
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7  # Original default
    )
    return response.choices[0].message.content
After (Applying the recommendation):
from openai import OpenAI
client = OpenAI()
def get_chat_response(prompt):
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2  # Updated default based on recommendation
    )
    return response.choices[0].message.content
This change ensures all future calls from this function use the optimized setting.
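If you expect to apply more recommendations of this kind, one option is to read the recommended parameters from a file instead of hardcoding them, so future recommended_config_snippet values can be dropped in without code changes. The file name below is a hypothetical choice, and the structure mirrors the model_config.parameters shape of the snippet above:

import json
from openai import OpenAI

client = OpenAI()

# Hypothetical file holding the latest recommended_config_snippet, e.g.
# {"model_config": {"parameters": {"temperature": 0.2}}}
with open("recommended_config.json") as f:
    _snippet = json.load(f)
_params = _snippet.get("model_config", {}).get("parameters", {})

def get_chat_response(prompt):
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=_params.get("temperature", 0.7),  # falls back to the previous default
    )
    return response.choices[0].message.content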