User-defined limits for AI Gateway
Motivation
Important now: DevOps / IT professionals need to set quotas to prevent runaway SaaS LLM workloads (e.g. a UDF that calls an LLM per row inadvertently invoked on a huge dataframe) from exhausting a project’s budget during R&D. They don’t want to manage separate API keys across multiple vendor portals to accomplish this.
Important soon: As organizations begin to roll out production applications based on SaaS and OSS LLMs, they'll need to:
Ensure that production applications relying on hosted OSS LLMs remain available and that access is shared fairly.
Control costs for production applications that rely on SaaS LLMs, i.e. limit spend driven by end-user traffic.
Proposal
We propose to extend the MLflow AI Gateway API so that DevOps / IT professionals can set one or more limits on their AI Gateway Routes:
Setting limits will be optional, but AI Gateway docs will encourage it
Limits can be set on Routes for SaaS LLMs and OSS LLMs powered by MLflow Model Serving
Limits are defined / applied per-route
In the future, this can be extended so that DevOps / IT professionals can define limits on a per-user basis
Limits can be enforced on the number of requests
In the future, this can be extended to limits on the number of tokens (as defined by the LLM)
Limits are reset on a per-minute basis
In the future, this can be extended so that DevOps / IT professionals can choose a different renewal period (per second, per hour, etc.)
When a Route is queried, all of its defined limits are enforced. If a limit is exceeded, the request is rejected with a 429 response code.
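As a sketch of the enforcement model (illustrative only; the names and data structures below are not a proposed implementation), a per-route, per-minute request limit amounts to a fixed-window counter:

import time
from collections import defaultdict

class FixedWindowLimiter:
    """Counts calls per route within the current one-minute window."""

    def __init__(self, calls_per_minute: int):
        self.calls_per_minute = calls_per_minute
        self.counters = defaultdict(int)  # (route, window_start) -> call count

    def allow(self, route: str) -> bool:
        window = int(time.time() // 60)  # renewal period: one minute
        key = (route, window)
        if self.counters[key] >= self.calls_per_minute:
            return False  # the gateway responds with HTTP 429
        self.counters[key] += 1
        return True  # a production version would also evict expired windows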
Object & API definitions
We will introduce a LimitsConfiguration to each AI Gateway Route, which is a set of Limits. We will provide SetLimits and GetLimits REST APIs for CRUDing these limits.
We prefer a separate GetLimits API, rather than making the limits a property of the Route, because we may want to require elevated permissions for retrieving the limits.
Limit & Limits Configuration definition (proto syntax)
message LimitsConfiguration {
  repeated Limit limits = 1;
}

message Limit {
  // The number of calls
  oneof value {
    int64 calls = 1;
    // Later on, we can limit by # of tokens, etc.
  } [(validate_required = true)];

  required LimitRenewalPeriod renewal_period = 2;
}

enum LimitRenewalPeriod {
  // Renew the limit counter every minute
  MINUTE = 1;
  // <We can add more renewal options later>
}
The Limits Configuration can be created / updated / deleted via a SetLimits API call, for example:
Example: Limit creation with the MLflow Python client
mlflow.gateway.set_limits(
    route="dev-gpt-3.5-completions-route",
    limits=[
        {
            # Make at most 200 requests (i.e. spend at most ~ $2 based on
            # average request size) to GPT-3.5 per minute on this route
            "calls": 200,
            "renewal_period": "minute",
        }
    ],
)
(The Limits Configuration can also be specified as part of the existing CreateRoute API call)
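For illustration, that might look as follows (a hypothetical sketch: the create_route signature and model configuration shown here are illustrative, not part of this RFC; only the limits parameter is the proposed addition):

mlflow.gateway.create_route(
    name="dev-gpt-3.5-completions-route",
    route_type="llm/v1/completions",
    model={"name": "gpt-3.5-turbo", "provider": "openai"},
    # Proposed addition: attach a Limits Configuration at creation time
    limits=[{"calls": 200, "renewal_period": "minute"}],
)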
The Limits Configuration can be fetched via a GetLimits API call, for example:
Example: Getting a limit with the MLflow Python client
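A sketch, assuming a get_limits counterpart to set_limits whose return value mirrors the Limits Configuration above:

limits = mlflow.gateway.get_limits(route="dev-gpt-3.5-completions-route")
# Expected to return the route's limits, e.g.:
# [{"calls": 200, "renewal_period": "minute"}]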
Why limit on requests-per-minute (RPM)?
RPM limits have some nice properties for controlling R&D costs:
RPM limits are unlikely to break / interrupt workloads: When limits are reached, it's common for applications to retry for up to 60 seconds with exponential backoff, at which point the limit will have been renewed (see the retry sketch after this list). Limits over longer horizons are likely to lead to errors and user / workload lockouts.
Requests are intuitive: Requests are easier for data scientists / analysts to reason about than tokens. The number of requests is an integer multiple of the number of records being processed. The number of tokens varies widely based on the LLM and the size of the records.
RPM limits support all OSS LLMs: Many OSS LLM deployments don't produce token usage information for requests, making it difficult to enforce token-based limits. In contrast, the number of requests can always be measured, so RPM limits can always be enforced.
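As a sketch of that retry behavior (query_route is a placeholder for whatever call queries the Gateway and returns a response with a status_code), a standard backoff loop recovers from a per-minute limit within the 60-second renewal window:

import time

def query_with_backoff(query_route, max_wait_seconds=60):
    """Retry on HTTP 429 with exponential backoff; a per-minute limit
    renews within ~60 seconds, so the retry typically succeeds."""
    delay, waited = 1, 0
    while True:
        response = query_route()
        if response.status_code != 429:
            return response
        if waited + delay > max_wait_seconds:
            raise RuntimeError("limit still exceeded after backoff")
        time.sleep(delay)
        waited += delay
        delay *= 2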
Why not distinguish between "quotas" and "rate limits"?
Quotas are "longer-term" (minutes, hours, days, months, lifetime) limits on total # of calls meant to restrict project costs
Rate limits are "shorter-term" (seconds, minutes) limits on # of calls meant to protect application availability against traffic bursts (e.g. DDoS) and promote fair sharing of resources
The fields required to specify a quota and a rate limit are nearly identical. The only difference is that rate limits are typically enforced second-by-second or minute-by-minute, whereas quotas are enforced minute-by-minute, hour-by-hour and beyond. So, a "Limits" concept that covers both of these seems appropriate.
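Concretely, both fit the proposed Limit shape and differ only in renewal_period (the non-minute period below is hypothetical, since only per-minute renewal is proposed initially):

# A rate limit: protects availability against traffic bursts
rate_limit = {"calls": 100, "renewal_period": "minute"}
# A quota: restricts longer-term project cost ("month" is a hypothetical future period)
quota = {"calls": 50000, "renewal_period": "month"}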
Future work
Given the immediate importance of setting quotas during R&D to prevent runaway costs, we propose to begin with request-per-minute limits on a per-route basis. This will provide a great foundation for future extensions geared towards production applications, such as:
Token rate limits
Per-user rate limits
Rate limits over additional time horizons (second, hour, day, etc.)
We also acknowledge that the following capabilities may become important in the future:
Setting long-term (e.g. monthly) budgets for R&D
Alerting customers when a budget limit is reached
Querying how much of a quota has been used / how much is remaining
QoS networking for Routes, e.g. high priority and low priority traffic
Finally, usage tracking / reporting is another active topic of conversation that deserves its own RFC. We're actively investigating the requirements for this capability.