LanguageModelRateLimitingPlugin

2025-07-22

構成可能な時間枠内でプロンプトと完了トークンの消費量を追跡することで、言語モデル API のトークンベースのレート制限をシミュレートします。

プラグインインスタンスの定義

{
  "name": "LanguageModelRateLimitingPlugin",
  "enabled": true,
  "pluginPath": "~appFolder/plugins/DevProxy.Plugins.dll",
  "configSection": "languageModelRateLimitingPlugin",
  "urlsToWatch": [
    "https://api.openai.com/*",
    "http://localhost:11434/*"
  ]
}

Configuration example

{
  "languageModelRateLimitingPlugin": {
    "$schema": "https://raw.githubusercontent.com/dotnet/dev-proxy/main/schemas/v1.0.0/languagemodelratelimitingplugin.schema.json",
    "promptTokenLimit": 5000,
    "completionTokenLimit": 5000,
    "resetTimeWindowSeconds": 60,
    "whenLimitExceeded": "Throttle",
    "headerRetryAfter": "retry-after"
  }
}

Configuration properties

Property	Description	Default
`promptTokenLimit`	時間枠内で許可されるプロンプトトークンの最大数。	`5000`
`completionTokenLimit`	時間枠内で許可される完了トークンの最大数。	`5000`
`resetTimeWindowSeconds`	トークンのリセットを制限する秒単位の時間枠。	`60`
`whenLimitExceeded`	トークンの制限を超えた場合の応答動作。 `Throttle` または `Custom` を指定できます。	`Throttle`
`headerRetryAfter`	再試行後の情報を含める HTTP ヘッダーの名前。	`retry-after`
`customResponseFile`	`whenLimitExceeded`が `Custom` に設定されている場合のカスタム応答を含むファイルへのパス。	`token-limit-response.json`

カスタム応答の構成

whenLimitExceededが Custom に設定されている場合は、別の JSON ファイルでカスタム応答を定義できます。

{
  "$schema": "https://raw.githubusercontent.com/dotnet/dev-proxy/main/schemas/v1.0.0/languagemodelratelimitingplugin.customresponsefile.schema.json",
  "statusCode": 429,
  "headers": [
    {
      "name": "retry-after",
      "value": "@dynamic"
    },
    {
      "name": "content-type",
      "value": "application/json"
    }
  ],
  "body": {
    "error": {
      "message": "You have exceeded your token quota. Please wait before making additional requests.",
      "type": "insufficient_quota",
      "code": "token_quota_exceeded"
    }
  }
}

カスタム応答プロパティ

Property	Description
`statusCode`	トークンの制限を超えたときに返される HTTP 状態コード。
`headers`	応答に含める HTTP ヘッダーの配列。再試行後に `@dynamic` を使用して、リセットまでの秒数を自動的に計算します。
`body`	JSON にシリアル化される応答本文オブジェクト。

動作方法

LanguageModelRateLimitingPlugin は次の方法で動作します。

OpenAI API 要求のインターセプト: OpenAI と互換性のある要求本文を含む構成済み URL に対する POST 要求を監視します
トークン消費の追跡: 応答を解析して、使用状況セクションから prompt_tokens と completion_tokens を抽出します
Enforcing limits: Maintains running totals of consumed tokens within the configured time window
調整応答の提供: 制限を超えると、標準の調整応答またはカスタム応答のいずれかが返されます

サポートされている要求の種類

このプラグインでは、OpenAI 完了要求とチャット完了要求の両方がサポートされます。

Completion requests: Requests with a prompt property
チャット完了要求: messages プロパティを持つ要求

Token tracking

トークンの使用量は、次の目的で個別に追跡されます。

Prompt tokens: Input tokens consumed by the request
Completion tokens: Output tokens generated by the response

いずれかの制限を超えると、時間枠がリセットされるまで、後続の要求が調整されます。

時間枠の動作

構成後のトークン制限のリセット resetTimeWindowSeconds
リセットタイマーは、最初の要求が処理されたときに開始されます
時間枠が切れると、プロンプトトークンカウンターと完了トークンカウンターの両方が構成された制限にリセットされます

既定の調整応答

whenLimitExceededがThrottleに設定されている場合、プラグインは標準の OpenAI 互換エラー応答を返します。

{
  "error": {
    "message": "You exceeded your current quota, please check your plan and billing details.",
    "type": "insufficient_quota",
    "param": null,
    "code": "insufficient_quota"
  }
}

応答には次のものが含まれます。

HTTP 状態コード: 429 Too Many Requests
retry-after トークンの制限がリセットされるまでの秒数のヘッダー
元の要求に Origin ヘッダーが含まれている場合の CORS ヘッダー

Use cases

LanguageModelRateLimitingPlugin は、次の場合に役立ちます。

トークンベースのレート制限のテスト: 言語モデルプロバイダーがトークンクォータを適用する場合のアプリケーションの動作をシミュレートする
開発コストシミュレーション: 実際の API 制限に達する前に、開発中のトークン消費パターンを理解する
Resilience testing: Verify that your application properly handles token limit errors and implements appropriate retry logic
ローカル LLM テスト: 独自の制限を適用しないローカル言語モデル (Ollama など) を使用してトークン制限シナリオをテストする

Example scenarios

シナリオ 1: 基本的なトークン制限

{
  "languageModelRateLimitingPlugin": {
    "promptTokenLimit": 1000,
    "completionTokenLimit": 500,
    "resetTimeWindowSeconds": 300
  }
}

この構成では、5 分以内に最大 1,000 個のプロンプトトークンと 500 個の完了トークンを使用できます。

シナリオ 2: カスタムエラー応答

{
  "languageModelRateLimitingPlugin": {
    "promptTokenLimit": 2000,
    "completionTokenLimit": 1000,
    "resetTimeWindowSeconds": 60,
    "whenLimitExceeded": "Custom",
    "customResponseFile": "custom-token-error.json"
  }
}

この構成では、カスタム応答ファイルを使用して、トークンの制限を超えたときに特殊なエラーメッセージを提供します。

Next step

テスト言語モデルのトークン制限