System-Level AI Optimization: Optional Value Activation Method

1. Introduction

The agent analyzes historical workload patterns to provide strategic, system-level tuning recommendations. This guide details how to retrieve those recommendations and apply them as one-time configuration changes to your model serving infrastructure. This contrasts with the Middleware, which handles real-time, per-request optimization.

The goal is to optimize the baseline performance and cost-efficiency of your entire system based on observed usage patterns. This process is ideal for making strategic adjustments to self-hosted models or updating the default parameters used by your applications.


2. How It Works: Offline Analysis and Recommendation

For real-time optimization, see the Middleware. For IDE integration, see the MCP.

The process involves two main steps: retrieving recommendations from the agent and then manually applying them to your infrastructure.

Step 1: Generating System-Wide Recommendations

The agent analyzes historical workload data (e.g., from the past 30 days) for the specified applications (app_id) and identifies patterns in that data. Based on these patterns, it runs simulations against your configured supply options to find an improved system-level configuration.

For example, in a self-hosted context where the dominant workloads are variations on short conversations, it might recommend a configuration that reduces the KV cache block size to improve memory efficiency and allow higher concurrency, as sketched below.
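
A hedged illustration of what such a recommendation could look like. The kv_cache key, the block size value, and the mapping to a vLLM flag are assumptions for illustration only; the actual snippet depends on your configured supply options.

Sample recommended_config_snippet (illustrative only):

{
  "performance_config": {
    "kv_cache": {
      "block_size": 8
    }
  }
}

For a model served with vLLM, a snippet like this would typically be applied via the --block-size launch argument; other serving stacks expose an equivalent memory-management setting.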

Step 2: Applying the Optimal Configuration

The agent provides a new configuration snippet. You can then apply this change manually to your model serving environment (like vLLM or Hugging Face TGI) or update the default call parameters in your application code (for third-party APIs like OpenAI). This updates the baseline for all future requests.


3. API Reference: The system_recommendations Endpoint

Provides a list of system-level tuning recommendations based on historical workload analysis.

  • Method: POST
  • URL: /v2/analysis/system_recommendations
  • Authentication: API key included in the X-API-Key HTTP header.

Request Body

A JSON object specifying the scope of the analysis.

  • app_ids (Array of Strings, required): A list of app_ids to include in the workload analysis.
  • analysis_period_days (Integer, optional): The number of past days of data to analyze. Defaults to 30.
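
To retrieve recommendations programmatically, the following is a minimal request sketch using Python's requests library. The base URL and the app_id values are placeholders; substitute the details of your own deployment.

import requests

API_KEY = "your-api-key"                 # sent in the X-API-Key header
BASE_URL = "https://agent.example.com"   # placeholder base URL

payload = {
    "app_ids": ["code-assistant", "chatbot-interactive"],
    "analysis_period_days": 30,          # optional; defaults to 30
}

response = requests.post(
    f"{BASE_URL}/v2/analysis/system_recommendations",
    headers={"X-API-Key": API_KEY},
    json=payload,
    timeout=60,
)
response.raise_for_status()
report = response.json()
print(report["analysis_id"], len(report["recommendations"]))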

Response Body

A successful response (200 OK) contains a JSON object with a list of recommendations.

  • analysis_id (String, UUID): A unique identifier for this analysis report.
  • recommendations (Array of SystemRecommendation): An array of tuning recommendations for your supply configurations.

The SystemRecommendation Object

  • recommendation_id (String, UUID): A unique identifier for the specific recommendation.
  • target_supply_id (String, UUID): The supply_id of the model configuration this recommendation applies to.
  • analysis_summary (String): A human-readable summary of the finding and the reasoning behind the recommendation.
  • predicted_impact (Object): The estimated system-wide impact, e.g., {"cost_change_percent": -15.2, "p95_latency_change_percent": 5.1}.
  • recommended_config_snippet (Object): A JSON snippet containing the specific configuration parameters that should be updated.
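
To see how these fields fit together, the following hedged sketch walks a parsed response (for example, the report dictionary from the request sketch above) and prints a digest of each recommendation. Field names follow the schema above; the output formatting is illustrative.

def summarize_recommendations(report: dict) -> None:
    """Print a human-readable digest of a system_recommendations response."""
    recs = report["recommendations"]
    print(f"Analysis {report['analysis_id']}: {len(recs)} recommendation(s)")
    for rec in recs:
        print(f"- {rec['recommendation_id']} (supply {rec['target_supply_id']})")
        print(f"  Summary: {rec['analysis_summary']}")
        # The keys in predicted_impact vary by recommendation,
        # e.g. cost_change_percent or p95_latency_change_percent.
        for metric, value in rec["predicted_impact"].items():
            print(f"  {metric}: {value:+.1f}%")
        print(f"  Config snippet: {rec['recommended_config_snippet']}")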

4. Applying Recommendations: Practical Examples

Once you have a recommended_config_snippet from the API, the next step is to apply it. The method depends on your serving stack.

Example Scenario

Imagine you run the analysis for your code-assistant app and receive the following recommendation for a self-hosted Llama 3 model:

Sample API Response Snippet:

{
  "recommendation_id": "rec-1a2b3c-speculative-decoding",
  "target_supply_id": "a1b2c3d4-e5f6-7890-1234-llama3-70b-awq",
  "analysis_summary": "Analysis of 'code-assistant' workloads shows a high percentage of token generation. Enabling speculative decoding with a smaller draft model is predicted to reduce overall latency by up to 20% with negligible impact on accuracy for this use case.",
  "predicted_impact": {
    "cost_change_percent": 0,
    "p95_tpot_change_percent": -22.5
  },
  "recommended_config_snippet": {
    "performance_config": {
      "speculative_decoding": {
        "draft_model_path": "/models/meta-llama/Llama-3-8B-Instruct",
        "num_speculative_tokens": 5
      }
    }
  }
}

How to Apply This Recommendation:

Option 1: Updating a vLLM Server

If your supply_id corresponds to a model served with vLLM, you would apply the recommended_config_snippet by adding or modifying the command-line arguments when you launch the server.

Before (Original vLLM launch command):

python -m vllm.entrypoints.openai.api_server \
    --model /models/meta-llama/Llama-3-70B-Instruct \
    --quantization awq \
    --tensor-parallel-size 4 \
    --port 8000

After (Applying the recommendation):
The speculative_decoding recommendation translates to the --speculative-model and --num-speculative-tokens arguments in vLLM.

python -m vllm.entrypoints.openai.api_server \
    --model /models/meta-llama/Llama-3-70B-Instruct \
    --quantization awq \
    --tensor-parallel-size 4 \
    --port 8000 \
    --speculative-model /models/meta-llama/Llama-3-8B-Instruct \
    --num-speculative-tokens 5

You would then restart your vLLM server with the new command to activate the optimization.
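
After the restart, a quick sanity check confirms the server is up and serving the expected model. A minimal sketch, assuming the server is reachable locally on port 8000 as in the command above; it queries the OpenAI-compatible /v1/models endpoint exposed by the vLLM API server.

import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])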


Option 2: Updating an Application using an OpenAI Client

Sometimes, the recommendation isn't for the infrastructure but for how you use it. Imagine a recommendation to change the default temperature for your chatbot-interactive app to reduce hallucinations.

Sample Recommendation:

{
  "recommendation_id": "rec-4d5e6f-temperature-tuning",
  "target_supply_id": "a1b2c3d4-e5f6-7890-1234-openai-gpt4-turbo",
  "analysis_summary": "Workloads for 'chatbot-interactive' show a high rate of factual queries where lower temperature can reduce confabulation. Recommend reducing default temperature from 0.7 to 0.2.",
  "predicted_impact": {
    "risk_score_change_percent": -40.0
  },
  "recommended_config_snippet": {
    "model_config": {
      "parameters": {
        "temperature": 0.2
      }
    }
  }
}

You would apply this by changing the default parameters in your application code that calls the OpenAI API.

Before (Original Python code):

from openai import OpenAI
client = OpenAI()

def get_chat_response(prompt):
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7 # Original default
    )
    return response.choices[0].message.content

After (Applying the recommendation):

from openai import OpenAI
client = OpenAI()

def get_chat_response(prompt):
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2 # Updated default based on recommendation
    )
    return response.choices[0].message.content

This change ensures all future calls from this function use the optimized setting.
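
If you expect to apply this kind of recommendation regularly, one option (not required by the agent) is to read the default from configuration rather than hard-coding it, so that a future recommendation becomes a configuration change instead of a code change. A minimal sketch; the environment variable name is arbitrary.

import os
from openai import OpenAI

client = OpenAI()

# Read the default temperature from configuration so a new recommendation
# can be rolled out without editing code.
DEFAULT_TEMPERATURE = float(os.environ.get("CHAT_DEFAULT_TEMPERATURE", "0.2"))

def get_chat_response(prompt):
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=DEFAULT_TEMPERATURE,
    )
    return response.choices[0].message.content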