LanguageModelRateLimitingPlugin

2025-07-22

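Simulates token-based rate limiting for language model APIs by tracking prompt and completion token consumption within a configurable time window.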

Plugin instance definition

{
  "name": "LanguageModelRateLimitingPlugin",
  "enabled": true,
  "pluginPath": "~appFolder/plugins/DevProxy.Plugins.dll",
  "configSection": "languageModelRateLimitingPlugin",
  "urlsToWatch": [
    "https://api.openai.com/*",
    "http://localhost:11434/*"
  ]
}

Configuration example

{
  "languageModelRateLimitingPlugin": {
    "$schema": "https://raw.githubusercontent.com/dotnet/dev-proxy/main/schemas/v1.0.0/languagemodelratelimitingplugin.schema.json",
    "promptTokenLimit": 5000,
    "completionTokenLimit": 5000,
    "resetTimeWindowSeconds": 60,
    "whenLimitExceeded": "Throttle",
    "headerRetryAfter": "retry-after"
  }
}

Configuration properties

Property	Description	Default
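`promptTokenLimit`	Maximum number of prompt tokens allowed within the time window.	`5000`
`completionTokenLimit`	Maximum number of completion tokens allowed within the time window.	`5000`
`resetTimeWindowSeconds`	Length of the time window, in seconds, after which the token limits reset.	`60`
`whenLimitExceeded`	Response behavior when a token limit is exceeded. Can be `Throttle` or `Custom`.	`Throttle`
`headerRetryAfter`	Name of the HTTP header that carries the retry-after information.	`retry-after`
`customResponseFile`	Path to the file that contains the custom response, used when `whenLimitExceeded` is set to `Custom`.	`token-limit-response.json`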

Custom response configuration

When whenLimitExceeded is set to Custom, you can define a custom response in a separate JSON file:

{
  "$schema": "https://raw.githubusercontent.com/dotnet/dev-proxy/main/schemas/v1.0.0/languagemodelratelimitingplugin.customresponsefile.schema.json",
  "statusCode": 429,
  "headers": [
    {
      "name": "retry-after",
      "value": "@dynamic"
    },
    {
      "name": "content-type",
      "value": "application/json"
    }
  ],
  "body": {
    "error": {
      "message": "You have exceeded your token quota. Please wait before making additional requests.",
      "type": "insufficient_quota",
      "code": "token_quota_exceeded"
    }
  }
}

Custom response properties

Property	Description
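`statusCode`	HTTP status code to return when a token limit is exceeded.
`headers`	Array of HTTP headers to include in the response. Use `@dynamic` as the retry-after value to have the number of seconds until reset calculated automatically.
`body`	Response body object, serialized to JSON.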

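How it works

The LanguageModelRateLimitingPlugin works by:

Intercepting OpenAI API requests: Monitors POST requests to the configured URLs that contain OpenAI-compatible request bodies
Tracking token consumption: Parses responses to extract prompt_tokens and completion_tokens from the usage section
Enforcing limits: Maintains running totals of consumed tokens within the configured time window
Providing throttling responses: When a limit is exceeded, returns either the standard throttling response or a custom response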
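
The token-tracking step can be illustrated with a short Python sketch. It is not the plugin's implementation; it only shows one way to read prompt_tokens and completion_tokens from the usage section of an OpenAI-compatible response body. The function name extract_token_usage and the sample payload are assumptions made for illustration.

```python
import json
from typing import Tuple


def extract_token_usage(response_body: str) -> Tuple[int, int]:
    """Return (prompt_tokens, completion_tokens) from an OpenAI-style response.

    Hypothetical helper: bodies without a usable "usage" section count as zero.
    """
    try:
        payload = json.loads(response_body)
    except json.JSONDecodeError:
        return (0, 0)  # not JSON, so there is nothing to count

    usage = payload.get("usage") if isinstance(payload, dict) else None
    if not isinstance(usage, dict):
        return (0, 0)  # response carries no usage section

    # "or 0" also covers fields that are present but null
    prompt_tokens = int(usage.get("prompt_tokens") or 0)
    completion_tokens = int(usage.get("completion_tokens") or 0)
    return (prompt_tokens, completion_tokens)


# Illustrative response body; the numbers are made up.
sample_body = json.dumps(
    {"usage": {"prompt_tokens": 12, "completion_tokens": 30, "total_tokens": 42}}
)
```

Because usage figures arrive with the response, a tracker built this way only learns what a request cost after it completes, which is consistent with limits applying to subsequent requests.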

Supported request types

The plugin supports OpenAI completion and chat completion requests:

Completion requests: Requests with a prompt property
Chat completion requests: Requests with a messages property
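
As a rough illustration of the distinction above, the following Python sketch classifies a parsed request body by the property it carries. The function name classify_request and the returned labels are hypothetical, and the precedence given to messages when both properties are present is an assumption, not documented plugin behavior.

```python
from typing import Optional


def classify_request(body: dict) -> Optional[str]:
    """Classify an OpenAI-compatible request body (hypothetical helper).

    Returns "chat_completion", "completion", or None for unsupported bodies.
    """
    if "messages" in body:
        return "chat_completion"  # assumption: messages wins if both are present
    if "prompt" in body:
        return "completion"
    return None  # neither property present, so the request is not tracked here
```

For example, classify_request({"model": "some-model", "prompt": "Hello"}) yields "completion", while a body with neither property yields None.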

Token tracking

Token consumption is tracked separately for:

Prompt tokens: Input tokens consumed by the request
Completion tokens: Output tokens generated by the response

When either limit is exceeded, subsequent requests are throttled until the time window resets.

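Time window behavior

Token limits reset after the configured resetTimeWindowSeconds
The reset timer starts when the first request is processed
When the time window expires, both the prompt and completion token counters reset to their configured limits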
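
The window behavior described above can be sketched in Python as a pair of remaining-token counters with a reset deadline. This is a minimal model under stated assumptions, not the plugin's code: the class name TokenWindow, its methods, the injected clock, and the choice to treat a counter at or below zero as "exceeded" are all illustrative.

```python
import math
import time
from typing import Callable, Optional


class TokenWindow:
    """Hypothetical model of a token budget that resets on a fixed window."""

    def __init__(self, prompt_limit: int = 5000, completion_limit: int = 5000,
                 window_seconds: float = 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.prompt_limit = prompt_limit
        self.completion_limit = completion_limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the window can be tested without sleeping
        self.prompt_remaining = prompt_limit
        self.completion_remaining = completion_limit
        self.reset_at: Optional[float] = None  # timer starts on the first request

    def _roll(self) -> float:
        """Start the timer on first use and reset both counters once it expires."""
        now = self.clock()
        if self.reset_at is None or now >= self.reset_at:
            self.prompt_remaining = self.prompt_limit
            self.completion_remaining = self.completion_limit
            self.reset_at = now + self.window_seconds
        return now

    def is_throttled(self) -> bool:
        """Assumption: a counter at or below zero means the limit is exceeded."""
        self._roll()
        return self.prompt_remaining <= 0 or self.completion_remaining <= 0

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        """Subtract the usage reported by a completed response."""
        self._roll()
        self.prompt_remaining -= prompt_tokens
        self.completion_remaining -= completion_tokens

    def retry_after(self) -> int:
        """Whole seconds until the current window resets."""
        now = self._roll()
        return max(0, math.ceil(self.reset_at - now))
```

Passing a fake clock (for example a lambda that reads a mutable value) makes it possible to step through a window expiry deterministically instead of waiting for real time to pass.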

Default throttling response

When whenLimitExceeded is set to Throttle, the plugin returns a standard OpenAI-compatible error response:

{
  "error": {
    "message": "You exceeded your current quota, please check your plan and billing details.",
    "type": "insufficient_quota",
    "param": null,
    "code": "insufficient_quota"
  }
}

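The response includes:

HTTP status code: 429 Too Many Requests
retry-after header with the number of seconds until the token limits reset
CORS headers when the original request includes an Origin header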
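
On the client side, an application under test might use the status code and retry-after header to decide how long to back off. The Python sketch below is one possible approach; the function name seconds_to_wait and the one-second fallback are assumptions, and it only handles retry-after values expressed in seconds.

```python
from typing import Mapping


def seconds_to_wait(status_code: int, headers: Mapping[str, str],
                    fallback: float = 1.0) -> float:
    """Return how long to wait before retrying (hypothetical client helper).

    Non-429 responses need no wait. A missing or non-numeric retry-after
    value falls back to a fixed delay.
    """
    if status_code != 429:
        return 0.0

    # Header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("retry-after")
    try:
        return max(0.0, float(raw))
    except (TypeError, ValueError):
        return fallback  # header absent or not a number of seconds
```

If headerRetryAfter is configured with a different header name, the lookup key in this sketch would need to change to match.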

Use cases

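The LanguageModelRateLimitingPlugin is useful for:

Testing token-based rate limiting: Simulate how your application behaves when language model providers enforce token quotas
Development cost simulation: Understand token consumption patterns during development before hitting real API limits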
Resilience testing: Verify that your application properly handles token limit errors and implements appropriate retry logic
Local LLM testing: Test token limiting scenarios with local language models (such as Ollama) that don't enforce their own limits

Example scenarios

Scenario 1: Basic token limiting

{
  "languageModelRateLimitingPlugin": {
    "promptTokenLimit": 1000,
    "completionTokenLimit": 500,
    "resetTimeWindowSeconds": 300
  }
}

This configuration allows up to 1,000 prompt tokens and 500 completion tokens within a 5-minute window.

Scenario 2: Custom error responses

{
  "languageModelRateLimitingPlugin": {
    "promptTokenLimit": 2000,
    "completionTokenLimit": 1000,
    "resetTimeWindowSeconds": 60,
    "whenLimitExceeded": "Custom",
    "customResponseFile": "custom-token-error.json"
  }
}

This configuration uses a custom response file to provide specialized error messages when token limits are exceeded.

Next step

Test language model token limits