Open
Labels
- feature: A request for a proper, new feature.
- no-scrub: Exclude from "scrubbing" exercises, e.g., for fundamental issues that don’t need periodic check-in.
- oncall: pt2
- triaged: This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Description
🚀 The feature, motivation and pitch
Background
Handling PyTorch compile issues and reproducing them on minimal, isolated code is currently quite labor-intensive. This affects both:
- Users and developers trying to isolate and reproduce errors.
- Triagers or compiler team members working with third-party compiled code, especially for public OSS models.
The complexity increases significantly when compiling full models or a chain of high-level def functions. Often, a single error is hidden within a chain of errors, which complicates error reporting and resolution.
Proposal
Enhanced Error Isolation and Reporting:
- Isolate Failed Function: Implement a mechanism to exactly isolate the function where the compilation failed. This will allow users to report the specific function causing the issue without additional effort.
- Record Fake Inputs: Automatically record fake inputs to facilitate error reproduction without the need for users to fully reproduce their dataset setup. This ensures that developers and triagers can recreate the issue reliably with minimal setup.
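To make the error-isolation idea concrete, here is a minimal, library-agnostic sketch in plain Python. It is not how the PyTorch compiler works internally, and every name in it (`IsolatedFailure`, `describe`, `isolate`, `TensorLike`) is hypothetical. It only illustrates the shape of the report being asked for: the innermost function in a chain that failed, plus metadata about its inputs (type, shape, dtype) rather than the real data.

```python
import functools
import json


class IsolatedFailure(RuntimeError):
    """Hypothetical error type naming the innermost failing function."""

    def __init__(self, func_name, input_specs, cause):
        super().__init__(f"{func_name} failed: {cause!r}")
        self.func_name = func_name
        self.input_specs = input_specs


def describe(value):
    """Record metadata only (type, shape, dtype), never the data itself."""
    spec = {"type": type(value).__name__}
    for attr in ("shape", "dtype"):
        found = getattr(value, attr, None)
        if found is not None:
            spec[attr] = list(found) if attr == "shape" else str(found)
    return spec


def isolate(func):
    """Wrap func so the first failure in a call chain is attributed to it."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except IsolatedFailure:
            raise  # a function deeper in the chain already claimed the failure
        except Exception as exc:
            specs = {
                "args": [describe(a) for a in args],
                "kwargs": {k: describe(v) for k, v in kwargs.items()},
            }
            raise IsolatedFailure(func.__name__, specs, exc) from exc

    return wrapper


class TensorLike:
    """Stand-in for a tensor (an assumption); only its metadata matters here."""

    shape = (2, 3)
    dtype = "float32"


@isolate
def inner(x):
    raise ValueError("unsupported operation")


@isolate
def outer(x):
    return inner(x)


try:
    outer(TensorLike())
except IsolatedFailure as err:
    # The report names `inner`, not `outer`, and carries no real data.
    report = json.dumps({"function": err.func_name, "inputs": err.input_specs})
    print(report)
```

Because the report contains only input metadata, a triager could in principle rebuild placeholder inputs of the same shape and dtype without access to the user's dataset, which is the point of the "fake inputs" request.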
Performance Canary Mode:
- Store Baseline Info: Introduce a mode where running an uncompiled model stores baseline performance data (e.g., memory usage, speed) on disk.
- Automatic Regression Detection: When running the compiled model, automatically compare current performance against the stored baseline. If there are regressions in memory usage or speed, users should be warned.
- Simplified Reporting: In case of performance regressions, provide an easy and straightforward way for users to report these issues.
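The canary workflow above could look roughly like the following sketch, written with only the Python standard library. All function names, the JSON file layout, and the 10% tolerance are assumptions made for illustration, not an existing PyTorch API. Note also that `tracemalloc` only sees Python-level allocations, so a real implementation would need a device-aware memory measurement instead.

```python
import json
import tempfile
import time
import tracemalloc
import warnings
from pathlib import Path


def measure(fn, *args):
    """Run fn once and return wall time and peak Python-level memory."""
    tracemalloc.start()
    start = time.perf_counter()
    fn(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"seconds": elapsed, "peak_bytes": peak}


def store_baseline(path, metrics):
    """Persist metrics from the uncompiled run to disk."""
    Path(path).write_text(json.dumps(metrics))


def find_regressions(baseline, current, tolerance=0.10):
    """List metrics that got worse by more than the tolerance (hypothetical 10%)."""
    regressions = []
    for key, base in baseline.items():
        now = current.get(key)
        if now is not None and now > base * (1 + tolerance):
            regressions.append(f"{key}: {base:.4g} -> {now:.4g}")
    return regressions


def check_against_baseline(path, current, tolerance=0.10):
    """Compare the compiled run against the stored baseline and warn on regressions."""
    baseline = json.loads(Path(path).read_text())
    found = find_regressions(baseline, current, tolerance)
    for message in found:
        warnings.warn(f"performance regression: {message}")
    return found


# Demo with fixed numbers so the outcome does not depend on the machine.
baseline_file = Path(tempfile.mkdtemp()) / "baseline.json"
store_baseline(baseline_file, {"seconds": 1.0, "peak_bytes": 100})
found = check_against_baseline(baseline_file, {"seconds": 1.05, "peak_bytes": 150})
print(found)
```

In practice the baseline would come from something like `measure(model_fn, inputs)` on the uncompiled model, and the list returned by `check_against_baseline` could be attached to a bug report as-is, which covers the "simplified reporting" item.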
Benefits
- For Users/Developers:
- Simplifies the process of isolating and reporting compile errors.
- Enhances reproducibility by automatically recording necessary inputs.
- For Triagers/Compiler Team:
- Provides clearer insights into the specific functions causing issues.
- Facilitates quicker diagnosis and resolution of performance regressions.
/cc @chauhang @penguinwu @ezyang @msaroufim @bdhirsh @anijain2305
Alternatives
No response
Additional context
No response