RFC: User-defined limits for AI Gateway. Feedback requested! #9456

Open
dbczumar opened this issue Aug 25, 2023 · 1 comment

Comments

@dbczumar
Collaborator

dbczumar commented Aug 25, 2023

User-defined limits for AI Gateway

Motivation

Important now: DevOps / IT professionals need to set quotas to prevent runaway SaaS LLM workloads (e.g. a UDF that calls an LLM per row inadvertently invoked on a huge dataframe) from exhausting a project’s budget during R&D. They don’t want to manage a bunch of different API keys in different vendor portals to accomplish this.

Important soon: As organizations begin to roll out production applications based on SaaS and OSS LLMs, they'll need to:

  • Ensure that production applications relying on hosted OSS LLMs remain available and that access is shared fairly.

  • Control costs for production applications that rely on SaaS LLMs, i.e. limit spend from end-user traffic.

Proposal

We propose to extend the MLflow AI Gateway API so that DevOps / IT professionals can set one or more limits on their AI Gateway Routes:

  • Setting limits will be optional, but AI Gateway docs will encourage it
  • Limits can be set on Routes for SaaS LLMs and OSS LLMs powered by MLflow Model Serving
  • Limits are defined / applied per-route
    • In the future, this can be extended so that DevOps / IT professionals can define limits on a per-user basis
  • Limits can be enforced on the number of requests
    • In the future, this can be extended to limits on the number of tokens (as defined by the LLM)
  • Limits are reset on a per-minute basis
    • In the future, this can be extended so that DevOps / IT professionals can choose a different renewal period (per second, per hour, etc.)
  • When a Route is queried, all of its defined limits are enforced. If a limit is exceeded, the request is rejected with a 429 response code.
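
The enforcement behavior described in the list above can be sketched as a fixed-window counter. This is only an illustration under stated assumptions, not the gateway's implementation: `RouteLimiter`, its in-memory storage, and the choice of a fixed (rather than sliding) window are all hypothetical, and the RFC does not specify how counters are stored or renewed internally.

```python
import time
from typing import Callable, Dict, List, Tuple

WINDOW_SECONDS = 60  # the RFC resets limits on a per-minute basis


class RouteLimiter:
    """Hypothetical in-memory, fixed-window request counter keyed by route."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        # route name -> list of "calls" limits defined on that route
        self._limits: Dict[str, List[int]] = {}
        # route name -> (window index, requests admitted in that window)
        self._counts: Dict[str, Tuple[int, int]] = {}

    def set_limits(self, route: str, calls_limits: List[int]) -> None:
        self._limits[route] = list(calls_limits)

    def check(self, route: str) -> int:
        """Return the status code the request would receive: 200 or 429."""
        window = int(self._clock() // WINDOW_SECONDS)
        last_window, count = self._counts.get(route, (window, 0))
        if last_window != window:
            count = 0  # a new minute has started, so the counter renews
        # Every limit defined on the route must hold for the request to pass.
        if any(count + 1 > limit for limit in self._limits.get(route, [])):
            self._counts[route] = (window, count)
            return 429
        self._counts[route] = (window, count + 1)
        return 200
```

A route with no limits always passes, which matches limits being optional. Rejected requests are not counted against the limit in this sketch; whether they should be is a design choice the RFC leaves open.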

Object & API definitions

We will add a LimitsConfiguration, which is a set of Limits, to each AI Gateway Route. We will provide SetLimits and GetLimits REST APIs for creating, reading, updating, and deleting these limits.

We prefer a separate GetLimits API, rather than making the limits a property of the Route, because we may want to require elevated permissions for retrieving the limits.

Limit & Limits Configuration definition (proto syntax)

message LimitsConfiguration {
    repeated Limit limits = 1;
}

message Limit {
    // The quantity being limited. Exactly one value must be set
    // (validate_required = true).
    oneof value {
        // The maximum number of calls (requests) per renewal period
        int64 calls = 1;
        // Later on, we can limit by # of tokens, etc.
    }
    required LimitRenewalPeriod renewal_period = 2;
}

enum LimitRenewalPeriod {
    // Renew the call counter every minute
    MINUTE = 1;
    // <We can add more renewal options later>
}
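
For readers who think in Python rather than proto, the definitions above could be mirrored roughly as follows. This is a sketch, not part of the proposal: `Limit.from_dict` and `parse_limits_configuration` are hypothetical helper names, and rejecting negative `calls` values is an assumption the RFC does not state.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, List


class LimitRenewalPeriod(Enum):
    MINUTE = "minute"  # the only renewal period in the initial proposal


@dataclass(frozen=True)
class Limit:
    calls: int
    renewal_period: LimitRenewalPeriod

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "Limit":
        """Validate one limit dict (hypothetical helper, not an MLflow API)."""
        if "calls" not in raw:
            raise ValueError("a limit must set a value, e.g. 'calls'")
        calls = raw["calls"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(calls, bool) or not isinstance(calls, int) or calls < 0:
            raise ValueError("'calls' must be a non-negative integer")
        if "renewal_period" not in raw:
            raise ValueError("'renewal_period' is required")
        # Enum lookup raises ValueError for unsupported periods such as "hour".
        period = LimitRenewalPeriod(raw["renewal_period"])
        return cls(calls=calls, renewal_period=period)


def parse_limits_configuration(raw_limits: List[Dict[str, Any]]) -> List[Limit]:
    """A LimitsConfiguration is just a list of validated Limits."""
    return [Limit.from_dict(raw) for raw in raw_limits]
```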

The Limits Configuration can be created / updated / deleted via a SetLimits API call, for example:

Example: Limit creation with the MLflow Python client

import mlflow.gateway

mlflow.gateway.set_limits(
    route="dev-gpt-3.5-completions-route",
    limits=[
        {
            # Make at most 200 requests (i.e. spend at most ~ $2 based on
            # average request size) to GPT-3.5 per data scientist per minute
            "calls": 200,
            "renewal_period": "minute",
        }
    ],
)

(The Limits Configuration can also be specified as part of the existing CreateRoute API call)

The Limits Configuration can be fetched via a GetLimits API call, for example:

Example: Getting a limit with the MLflow Python client

limits = mlflow.gateway.get_limits(
    route="dev-gpt-3.5-completions-route",
)

assert limits == [
    {
        "calls": 200,
        "renewal_period": "minute",
    }
]

Why limit on requests-per-minute (RPM)?

RPM limits have some nice properties for controlling R&D costs:

  • RPM limits are unlikely to break / interrupt workloads: When limits are reached, it's common for applications to retry for up to 60 seconds with exponential backoff, at which point the limit will have been renewed. Limits over longer horizons are likely to lead to errors and user / workload lockouts.

  • Requests are intuitive: Requests are easier for data scientists / analysts to reason about than tokens. The number of requests is an integer multiple of the number of records being processed. The number of tokens varies widely based on the LLM and the size of the records.

  • RPM limits support all OSS LLMs: Many OSS LLM deployments don't produce token usage information for requests, making it difficult to enforce token-based limits. In contrast, the number of requests can always be measured, so RPM limits can always be enforced.
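
The retry behavior mentioned in the first point above might look like the following on the client side. It is a minimal sketch under assumptions: `call_with_backoff` and the `send` callable (which returns a status code and a body) are hypothetical, the 1-second starting delay is arbitrary, and jitter is omitted for brevity.

```python
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_wait_seconds: float = 60.0,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[int, str]:
    """Retry `send` on 429 with exponential backoff for up to `max_wait_seconds`."""
    deadline = clock() + max_wait_seconds
    delay = base_delay
    while True:
        status, body = send()
        if status != 429:
            return status, body
        remaining = deadline - clock()
        if remaining <= 0:
            # Out of retry budget: surface the 429 to the caller.
            return status, body
        # Never sleep past the deadline, then double the delay for next time.
        sleep(min(delay, remaining))
        delay *= 2
```

With a per-minute renewal period, a 60-second retry budget spans at least one counter renewal, which is why a request rejected for exceeding the limit would typically succeed before the budget runs out, provided other traffic has not consumed the renewed limit first.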

Why not distinguish between "quotas" and "rate limits"?

  • Quotas are "longer-term" (minutes, hours, days, months, lifetime) limits on total # of calls meant to restrict project costs
  • Rate limits are "shorter-term" (seconds, minutes) limits on # of calls meant to protect application availability against traffic bursts (e.g. DDoS) and promote fair sharing of resources

The fields required to specify a quota and a rate limit are nearly identical. The only difference is that rate limits are typically enforced second-by-second or minute-by-minute, whereas quotas are enforced minute-by-minute, hour-by-hour and beyond. So, a "Limits" concept that covers both of these seems appropriate.
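
To illustrate why one concept can cover both, the sketch below shows a single window calculation in which only the renewal period changes. Only `MINUTE` is part of this proposal; the other periods and the `window_index` / `same_window` helpers are hypothetical.

```python
from enum import Enum


class RenewalPeriod(Enum):
    # Values are window lengths in seconds. Only MINUTE is in the RFC today.
    SECOND = 1
    MINUTE = 60
    HOUR = 3600
    DAY = 86400


def window_index(timestamp: float, period: RenewalPeriod) -> int:
    """Identify which renewal window a timestamp (in seconds) falls into."""
    return int(timestamp // period.value)


def same_window(a: float, b: float, period: RenewalPeriod) -> bool:
    """Two requests count against the same limit iff they share a window."""
    return window_index(a, period) == window_index(b, period)
```

A "rate limit" would use a short period such as `SECOND` and a "quota" a long one such as `DAY`, while the counting logic stays the same.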

Future work

Given the immediate importance of setting quotas during R&D to prevent runaway costs, we propose to begin with requests-per-minute limits on a per-route basis. This will provide a foundation for future extensions geared towards production applications, such as:

  • Token rate limits
  • Per-user rate limits
  • Rate limits over additional time horizons (second, hour, day, etc.)

We also acknowledge that the following capabilities may become important in the future:

  • Setting long-term (e.g. monthly) budgets for R&D
    • Alerting customers when a budget limit is reached
    • Querying how much of a quota has been used / how much is remaining
  • QoS networking for Routes, e.g. high priority and low priority traffic

Finally, usage tracking / reporting is another active topic of conversation that deserves its own RFC. We're actively investigating the requirements for this capability.

dbczumar changed the title from "RFC: User-defined limits for AI Gateway Routes. Feedback requested!" to "RFC: User-defined limits for AI Gateway. Feedback requested!" on Aug 25, 2023
dbczumar pinned this issue on Aug 25, 2023
@mlflow-automation
Collaborator

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.
