[ML-47076] Support custom frequencies #162
Conversation
Can we reduce the number of changes to the function interface and instead compute the frequency string in the caller?
@@ -41,7 +41,9 @@ def _prophet_fit_predict(params: Dict[str, Any], history_pd: pd.DataFrame,
                         horizon: int, frequency: str, cutoffs: List[pd.Timestamp],
                         interval_width: int, primary_metric: str,
                         country_holidays: Optional[str] = None,
                         regressors = None, **prophet_kwargs) -> Dict[str, Any]:
                         regressors = None,
                         frequency_quantity: int = 1,
put frequency_quantity closer to frequency_unit (renamed)
I intentionally added it at the end because I want frequency_quantity to have a default value of 1 for backward compatibility. With a default value, it cannot be placed before any argument that lacks one. Even if it were the first among the arguments with default values, it would still be far from frequency_unit, so I put it at the end.
Can we instead pass in Frequency as a dataclass?
@dataclass
class Frequency:
"""Class to specify forecast frequency."""
frequency_unit: str
frequency_value: int = 1
OK, the problem is that for some functions I want frequency_quantity to have a default value of 1. These functions are the ones used externally, in the trial notebook. If a customer developed against an old trial notebook, we do not want to break those function calls.
For other functions that are only called inside this repo, I recommend having no default values so bugs surface earlier; with a default value, some bugs may be silently ignored.
Besides, I believe we are about to refactor this repo and the core library code, since we are now fairly sure we will go black-box. So I would suggest deferring any complex data structures for now.
Is this code shared between serverless and classic? If so, then it makes sense. If not, why worry about the trial notebook?
Oh, I was wrong; you are right. This code is only used by serverless.
Let me clarify the tradeoffs of introducing a new dataclass.
Pros: the code is cleaner. (But I doubt how much it improves readability; without the dataclass it is still pretty readable.)
Cons:
- Need to change a lot of code and its related tests, including some code that is not otherwise touched in this PR.
- The dataclass forces the fields to either all have default values or all lack them, but we do not actually need default values now. Either way, it makes the code less flexible.
I would still prefer not to use such a dataclass. wdyt?
It's not just cleaner code (in terms of readability) but better overall maintainability and testability.
Think about the following:
- The current argument list is already long. Consumers of this function need to read many argument definitions to understand how to use it.
- frequency_unit and frequency_quantity form a single logical concept; specifying only one of them doesn't make sense.
- Easy to validate: you can add a validation function inside the dataclass, so every time a Frequency object is passed in, it can be validated immediately.
- Easy to test: you only need to test the Frequency object instead of two separate values. You can have a simple check like def is_valid_frequency(freq: Frequency) -> bool. In the future, if you want to support other quantities, e.g. 45min or 3d, you don't need to add checks all over the place.
- Having a default value in it is not a bad thing. With a single object, you can change the default value in one place and it applies everywhere.
Anyway, given the urgency of this task, I don't want to make this a blocker. But I suggest you create a follow-up JIRA in the EE epic to refactor this code.
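A minimal sketch of the proposed dataclass with construction-time validation. The field names come from the thread; the set of supported units and the validation rules are assumptions for illustration, not the repo's actual logic.

```python
from dataclasses import dataclass


@dataclass
class Frequency:
    """Class to specify forecast frequency, e.g. 15 minutes or 3 days."""
    frequency_unit: str
    frequency_value: int = 1

    def __post_init__(self):
        # Validate once at construction so every consumer receives a
        # checked object instead of re-validating two loose arguments.
        if self.frequency_unit not in {"min", "H", "D", "W", "M"}:
            raise ValueError(f"Unsupported frequency unit: {self.frequency_unit}")
        if self.frequency_value < 1:
            raise ValueError("frequency_value must be a positive integer")


# Usage: a 15-minute frequency, and the default of one unit.
fifteen_min = Frequency("min", 15)
daily = Frequency("D")
```

Because the default lives on the dataclass field, changing it later is a one-line edit that applies to every call site.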
Please rename frequency to frequency_unit and add more tests
Thanks for the review. To clarify, I am working on adding tests; the commits I pushed were to fix the existing tests. Will let you know once I have added them.
freq_alias = PERIOD_ALIAS_MAP[OFFSET_ALIAS_MAP[freq]]
return (end_time.to_period(freq_alias) - start_time.to_period(freq_alias)).n
freq_alias = PERIOD_ALIAS_MAP[OFFSET_ALIAS_MAP[frequency_unit]]
return (end_time.to_period(freq_alias) - start_time.to_period(freq_alias)).n // frequency_quantity
What if it's not divisible and we just get the floor value? Is that intended?
Yes, it is intended. If it is not divisible, we get the floor value, and the later consistency check uses this floor value to detect that the series is not consistent with the frequency.
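A minimal sketch of the floor behavior under discussion, with the alias lookup inlined (the helper name is illustrative, not the repo's actual function):

```python
import pandas as pd


def periods_between(start_time: pd.Timestamp, end_time: pd.Timestamp,
                    freq_alias: str, frequency_quantity: int) -> int:
    # Period subtraction yields the span in whole units; floor division by
    # frequency_quantity intentionally rounds down. A non-divisible span
    # produces a floored count that a later reconstruction check can flag
    # as inconsistent with the declared frequency.
    return (end_time.to_period(freq_alias)
            - start_time.to_period(freq_alias)).n // frequency_quantity


start = pd.Timestamp("2024-01-01 00:00")
end = pd.Timestamp("2024-01-01 00:25")  # 25 minutes: not divisible by 10
periods = periods_between(start, end, "min", 10)  # floor(25 / 10) = 2
```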
Please add a comment here.
runtime/tests/automl_runtime/forecast/pmdarima/training_test.py
I suggest using a dataclass to represent frequency_unit and frequency_value as one entity instead of treating them as separate arguments.
nit: please run lint on your code. There are many indentation issues and multiple arguments on one line.
**DATE_OFFSET_KEYWORD_MAP[OFFSET_ALIAS_MAP[freq]]
) * periods == pd.to_datetime(start_time)
**DATE_OFFSET_KEYWORD_MAP[OFFSET_ALIAS_MAP[frequency_unit]]
) * periods * frequency_quantity == pd.to_datetime(start_time)
It's still not clear to me what this == check is doing here. Please also add parentheses to make the operator precedence clearer.
On the left of this ==, it subtracts the period offset from the end time; on the right is the start time. The variable diff holds True or False.
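A minimal sketch of that comparison with the offset lookup inlined (a one-minute unit and a 15-minute frequency_quantity are assumed values for illustration). Stepping back from the end time by periods * frequency_quantity units should land exactly on the start time when the series is consistent:

```python
import pandas as pd

start_time = pd.Timestamp("2024-01-01 00:00")
end_time = pd.Timestamp("2024-01-01 01:00")
periods, frequency_quantity = 4, 15  # four periods of 15 minutes

# Multiplying a DateOffset scales its step count, so this subtracts
# 4 * 15 = 60 one-minute steps. Parentheses make the precedence explicit:
# the subtraction happens first, then the equality comparison.
diff = (end_time - pd.DateOffset(minutes=1) * periods * frequency_quantity) == start_time
```

Here diff is True; a 25-minute gap with the same settings would make it False, which is how the floored period count from the earlier division gets caught.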
Overall, looks good. But the code still needs refactoring. Please add comments to two places I mentioned and create a JIRA for follow-up refactoring.
This PR adds support for custom frequencies in AutoML. It adds frequency_quantity to the Estimator and Model classes for all three algorithms. Please review it together with the companion PR, https://github.com/databricks-eng/universe/pull/911199.
To test this PR, I manually tested all possible combinations, stored under the folder, including:
- 5 min - single time series
- 5 min - multi time series
- 10 min - single time series
- 10 min - multi time series
- 15 min - single time series
- 15 min - multi time series
- 30 min - single time series
- 30 min - multi time series