System-Level AI Optimization: Optional Value Activation Method

1. Introduction

The agent analyzes historical workload patterns to provide strategic, system-level tuning recommendations. This guide details how to retrieve those recommendations and apply them as one-time configuration changes to your model serving infrastructure. This contrasts with the Middleware, which handles real-time, per-request optimization.

The goal is to optimize the baseline performance and cost-efficiency of your entire system based on observed usage patterns. This process is ideal for making strategic adjustments to self-hosted models or updating the default parameters used by your applications.


2. How It Works: Offline Analysis and Recommendation

For real-time optimization, see the Middleware. For IDE integration, see the MCP.

The process involves two main steps: retrieving recommendations from the agent and then manually applying them to your infrastructure.

Step 1: Generating System-Wide Recommendations

The agent analyzes historical workload data (e.g., from the past 30 days) for the specified applications (app_id) and identifies patterns in that data. Based on these patterns, it runs simulations against your configured supply options to find an improved system-level configuration.

For example, in a self-hosted context where the dominant workloads are variations on short conversations, it might recommend a configuration that reduces the KV cache block size to improve memory efficiency and allow higher concurrency, as sketched below.
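
A hedged illustration of what such a recommendation could look like. The kv_cache key, the block size value, and the mapping to a vLLM flag are assumptions for illustration only; the actual snippet depends on your configured supply options.

Sample recommended_config_snippet (illustrative only):

{
  "performance_config": {
    "kv_cache": {
      "block_size": 8
    }
  }
}

For a model served with vLLM, a snippet like this would typically be applied via the --block-size launch argument; other serving stacks expose an equivalent memory-management setting.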

Step 2: Applying the Optimal Configuration

The agent provides a new configuration snippet. You can then apply this change manually to your model serving environment (like vLLM or Hugging Face TGI) or update the default call parameters in your application code (for third-party APIs like OpenAI). This updates the baseline for all future requests.


3. API Reference: The system_recommendations Endpoint

Provides a list of system-level tuning recommendations based on historical workload analysis.

  • Method: POST
  • URL: /v2/analysis/system_recommendations
  • Authentication: API key included in the X-API-Key HTTP header.

Request Body

A JSON object specifying the scope of the analysis.

  • app_ids (Array of Strings, required): A list of app_ids to include in the workload analysis.
  • analysis_period_days (Integer, optional): The number of past days of data to analyze. Defaults to 30.
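
To retrieve recommendations programmatically, the following is a minimal request sketch using Python's requests library. The base URL and the app_id values are placeholders; substitute the details of your own deployment.

import requests

API_KEY = "your-api-key"                 # sent in the X-API-Key header
BASE_URL = "https://agent.example.com"   # placeholder base URL

payload = {
    "app_ids": ["code-assistant", "chatbot-interactive"],
    "analysis_period_days": 30,          # optional; defaults to 30
}

response = requests.post(
    f"{BASE_URL}/v2/analysis/system_recommendations",
    headers={"X-API-Key": API_KEY},
    json=payload,
    timeout=60,
)
response.raise_for_status()
report = response.json()
print(report["analysis_id"], len(report["recommendations"]))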

Response Body

A successful response (200 OK) contains a JSON object with a list of recommendations.

  • analysis_id (String, UUID): A unique identifier for this analysis report.
  • recommendations (Array of SystemRecommendation): An array of tuning recommendations for your supply configurations.

The SystemRecommendation Object

  • recommendation_id (String, UUID): A unique identifier for the specific recommendation.
  • target_supply_id (String, UUID): The supply_id of the model configuration this recommendation applies to.
  • analysis_summary (String): A human-readable summary of the finding and the reasoning behind the recommendation.
  • predicted_impact (Object): The estimated system-wide impact, e.g., {"cost_change_percent": -15.2, "p95_latency_change_percent": 5.1}.
  • recommended_config_snippet (Object): A JSON snippet containing the specific configuration parameters that should be updated.
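
To see how these fields fit together, the following hedged sketch walks a parsed response (for example, the report dictionary from the request sketch above) and prints a digest of each recommendation. Field names follow the schema above; the output formatting is illustrative.

def summarize_recommendations(report: dict) -> None:
    """Print a human-readable digest of a system_recommendations response."""
    recs = report["recommendations"]
    print(f"Analysis {report['analysis_id']}: {len(recs)} recommendation(s)")
    for rec in recs:
        print(f"- {rec['recommendation_id']} (supply {rec['target_supply_id']})")
        print(f"  Summary: {rec['analysis_summary']}")
        # The keys in predicted_impact vary by recommendation,
        # e.g. cost_change_percent or p95_latency_change_percent.
        for metric, value in rec["predicted_impact"].items():
            print(f"  {metric}: {value:+.1f}%")
        print(f"  Config snippet: {rec['recommended_config_snippet']}")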

4. Applying Recommendations: Practical Examples

Once you have a recommended_config_snippet from the API, the next step is to apply it. The method depends on your serving stack.

Example Scenario

Imagine you run the analysis for your code-assistant app and receive the following recommendation for a self-hosted Llama 3 model:

Sample API Response Snippet:

{
  "recommendation_id": "rec-1a2b3c-speculative-decoding",
  "target_supply_id": "a1b2c3d4-e5f6-7890-1234-llama3-70b-awq",
  "analysis_summary": "Analysis of 'code-assistant' workloads shows a high percentage of token generation. Enabling speculative decoding with a smaller draft model is predicted to reduce overall latency by up to 20% with negligible impact on accuracy for this use case.",
  "predicted_impact": {
    "cost_change_percent": 0,
    "p95_tpot_change_percent": -22.5
  },
  "recommended_config_snippet": {
    "performance_config": {
      "speculative_decoding": {
        "draft_model_path": "/models/meta-llama/Llama-3-8B-Instruct",
        "num_speculative_tokens": 5
      }
    }
  }
}

How to Apply This Recommendation:

Option 1: Updating a vLLM Server

If your supply_id corresponds to a model served with vLLM, you would apply the recommended_config_snippet by adding or modifying the command-line arguments when you launch the server.

Before (Original vLLM launch command):

python -m vllm.entrypoints.openai.api_server \
    --model /models/meta-llama/Llama-3-70B-Instruct \
    --quantization awq \
    --tensor-parallel-size 4 \
    --port 8000

After (Applying the recommendation):
The speculative_decoding recommendation translates to the --speculative-model and --num-speculative-tokens arguments in vLLM.

python -m vllm.entrypoints.openai.api_server \
    --model /models/meta-llama/Llama-3-70B-Instruct \
    --quantization awq \
    --tensor-parallel-size 4 \
    --port 8000 \
    --speculative-model /models/meta-llama/Llama-3-8B-Instruct \
    --num-speculative-tokens 5

You would then restart your vLLM server with the new command to activate the optimization.
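
After the restart, a quick sanity check confirms the server is up and serving the expected model. A minimal sketch, assuming the server is reachable locally on port 8000 as in the command above; it queries the OpenAI-compatible /v1/models endpoint exposed by the vLLM API server.

import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])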


Option 2: Updating an Application using an OpenAI Client

Sometimes, the recommendation isn't for the infrastructure but for how you use it. Imagine a recommendation to change the default temperature for your chatbot-interactive app to reduce hallucinations.

Sample Recommendation:

{
  "recommendation_id": "rec-4d5e6f-temperature-tuning",
  "target_supply_id": "a1b2c3d4-e5f6-7890-1234-openai-gpt4-turbo",
  "analysis_summary": "Workloads for 'chatbot-interactive' show a high rate of factual queries where lower temperature can reduce confabulation. Recommend reducing default temperature from 0.7 to 0.2.",
  "predicted_impact": {
    "risk_score_change_percent": -40.0
  },
  "recommended_config_snippet": {
    "model_config": {
      "parameters": {
        "temperature": 0.2
      }
    }
  }
}

You would apply this by changing the default parameters in your application code that calls the OpenAI API.

Before (Original Python code):

from openai import OpenAI
client = OpenAI()

def get_chat_response(prompt):
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7 # Original default
    )
    return response.choices[0].message.content

After (Applying the recommendation):

from openai import OpenAI
client = OpenAI()

def get_chat_response(prompt):
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2 # Updated default based on recommendation
    )
    return response.choices[0].message.content

This change ensures all future calls from this function use the optimized setting.
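
If you expect to apply this kind of recommendation regularly, one option (not required by the agent) is to read the default from configuration rather than hard-coding it, so that a future recommendation becomes a configuration change instead of a code change. A minimal sketch; the environment variable name is arbitrary.

import os
from openai import OpenAI

client = OpenAI()

# Read the default temperature from configuration so a new recommendation
# can be rolled out without editing code.
DEFAULT_TEMPERATURE = float(os.environ.get("CHAT_DEFAULT_TEMPERATURE", "0.2"))

def get_chat_response(prompt):
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=DEFAULT_TEMPERATURE,
    )
    return response.choices[0].message.content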