
[ML-47076] Support custom frequencies by Lanz-db · Pull Request #162 · databricks/automl

[ML-47076] Support custom frequencies #162

Merged · 11 commits merged Feb 12, 2025
Conversation


@Lanz-db Lanz-db commented Feb 11, 2025

This PR adds support for custom frequencies in AutoML. It adds a frequency_quantity parameter to the Estimator and Model classes for all three algorithms. Please review it together with the companion PR, https://github.com/databricks-eng/universe/pull/911199.

To test this PR

  • unit tests
  • manual e2e tests

To manually test, I

  1. installed the core library wheel and the automl_runtime wheel on an MLR 14.3 LTS compute
  2. commented out the line %pip install --force-reinstall databricks-automl-runtime==0.2.20.5 in pip_install.jinja and pip_install_deepar.jinja so the new automl_runtime wheel is used
  3. ran the driver notebooks
    I tested all possible combinations, stored under the folder, including
    5 min - single time series
    5 min - multi time series
    10 min - single time series
    10 min - multi time series
    15 min - single time series
    15 min - multi time series
    30 min - single time series
    30 min - multi time series

@Lanz-db Lanz-db changed the title [DO NOT MERGE][ML-47076] Prototyping custom frequencies [ML-47076] Prototyping custom frequencies Feb 11, 2025
@Lanz-db Lanz-db changed the title [ML-47076] Prototyping custom frequencies [ML-47076] Support custom frequencies Feb 11, 2025

@apeforest apeforest left a comment


Can we reduce the number of changes to the function interface, but calculate frequency string from the caller?

@Lanz-db Lanz-db requested a review from apeforest February 11, 2025 23:26
@@ -41,7 +41,9 @@ def _prophet_fit_predict(params: Dict[str, Any], history_pd: pd.DataFrame,
                          horizon: int, frequency: str, cutoffs: List[pd.Timestamp],
                          interval_width: int, primary_metric: str,
                          country_holidays: Optional[str] = None,
-                         regressors = None, **prophet_kwargs) -> Dict[str, Any]:
+                         regressors = None,
+                         frequency_quantity: int = 1,


put frequency_quantity closer to frequency_unit (renamed)

Author


I intentionally added it at the end because I want frequency_quantity to have a default value of 1 to keep the signature backward compatible. A parameter with a default value cannot be placed before any parameter without one. Even if it were the first among the defaulted parameters, it would still be far from frequency_unit, so I put it at the end.
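A minimal illustration of the constraint being described (the signature below is hypothetical, not the PR's actual function):

```python
# A defaulted parameter cannot precede a non-defaulted one:
# def fit(frequency_unit, frequency_quantity=1, horizon): ...  # SyntaxError

# Appending the defaulted parameter keeps every existing positional
# call site working unchanged:
def fit(frequency_unit: str, horizon: int, frequency_quantity: int = 1) -> str:
    return f"{frequency_quantity}{frequency_unit} x {horizon}"

print(fit("min", 7))     # old call sites still work -> "1min x 7"
print(fit("min", 7, 5))  # new callers can pass the quantity -> "5min x 7"
```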


Can we instead pass in Frequency as a dataclass?

@dataclass
class Frequency:
    """Class to specify forecast frequency."""
    frequency_unit: str
    frequency_value: int = 1

Author


OK, the problem is that for some functions I want frequency_quantity to have a default value of 1. These are the functions used outside, in the trial notebook. If a customer developed on top of an old trial notebook, we do not want to break those function calls.

For functions that are only called inside this repo, I recommend having no default values so bugs surface earlier. With a default value, some bugs may be silently ignored.

Besides, I believe we are about to refactor this repo and the core library code, since we are now fairly sure we will go black-box. So I would suggest deferring the introduction of any complex data structures for now.


Is this code shared between serverless and classic? If so, then it makes sense. If not, why worry about the trial notebook?

Author


Oh, I was wrong; you are right. This code is only used by serverless.
Let me clarify the tradeoffs of introducing a new dataclass.
Pros: the code is cleaner. (Though I doubt how much it would improve readability; without the dataclass it is still quite readable.)
Cons:

  1. It requires changing a lot of code and the related tests, including some code that this PR does not otherwise touch.
  2. The dataclass forces its fields to either all have default values or all lack them, but we do not actually need default values now. Either way, it makes the code less flexible.

I would still prefer not to use such a dataclass. wdyt?


@apeforest apeforest Feb 12, 2025


It's not just cleaner code (in terms of readability) but better overall maintainability and testability.

Think about the following:

  1. The current argument list is already long. Consumers of this function need to read many argument definitions to understand how to use it.
  2. frequency_unit and frequency_quantity form a single logical concept. Specifying only one of them doesn't make sense.
  3. Easy to validate: you can add a validation function inside the dataclass, so every Frequency object passed in can be validated immediately.
  4. Easy to test: you only need to test the Frequency object instead of two separate values. You can have a simple test like def is_valid_frequency(freq: Frequency) -> bool. In the future, if you want to support other quantities, e.g. 45min or 3d, you don't need to add checks all over the place.
  5. Having a default value in it is not a bad thing. With a single object, you change the default value once and it applies everywhere.

Anyway, given the urgency of this task, I don't want to make this a blocker. But I suggest you create a follow-up JIRA in the EE epic to refactor this code.
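A sketch of the suggested pattern, combining the reviewer's dataclass with construction-time validation. The set of allowed unit/quantity combinations below is illustrative only; the product's real constraints may differ.

```python
from dataclasses import dataclass

# Illustrative allowed combinations (an assumption, not the product's
# actual validation table).
_VALID_QUANTITIES = {"min": {1, 5, 10, 15, 30}, "H": {1}, "d": {1}}

@dataclass
class Frequency:
    """Class to specify forecast frequency."""
    frequency_unit: str
    frequency_quantity: int = 1

    def __post_init__(self) -> None:
        # Validate once at construction, so every consumer receives a
        # checked object instead of re-validating two loose arguments.
        allowed = _VALID_QUANTITIES.get(self.frequency_unit)
        if allowed is None or self.frequency_quantity not in allowed:
            raise ValueError(
                f"Unsupported frequency: {self.frequency_quantity}{self.frequency_unit}"
            )

freq = Frequency("min", 15)  # ok
# Frequency("min", 45)       # would raise ValueError until 45min is supported
```

Supporting a new quantity like 45min then means editing one table rather than hunting down checks across the codebase, which is the maintainability point made above.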



@apeforest apeforest left a comment


Please rename frequency to frequency_unit and add more tests


Lanz-db commented Feb 12, 2025

> Please rename frequency to frequency_unit and add more tests

Thanks for the review. To clarify, I am working on adding tests; the commits I pushed were to fix the existing tests. I will let you know once I have added the new tests.

-    freq_alias = PERIOD_ALIAS_MAP[OFFSET_ALIAS_MAP[freq]]
-    return (end_time.to_period(freq_alias) - start_time.to_period(freq_alias)).n
+    freq_alias = PERIOD_ALIAS_MAP[OFFSET_ALIAS_MAP[frequency_unit]]
+    return (end_time.to_period(freq_alias) - start_time.to_period(freq_alias)).n // frequency_quantity


What if it's not divisible and we just get the floor value? Is that intended?

Author


Yes, it is intended. If the span is not divisible, we get the floor value, and a later consistency check uses that floor value to detect that the frequency is not consistent.
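The flooring behavior can be seen in a small standalone sketch. periods_between below is a hypothetical simplification of the diff's helper: it takes the frequency unit as a raw pandas alias instead of going through PERIOD_ALIAS_MAP / OFFSET_ALIAS_MAP.

```python
import pandas as pd

def periods_between(start_time: str, end_time: str,
                    frequency_unit: str, frequency_quantity: int) -> int:
    # Period subtraction counts single-unit steps; integer division by
    # frequency_quantity floors when the span is not evenly divisible.
    start = pd.Timestamp(start_time).to_period(frequency_unit)
    end = pd.Timestamp(end_time).to_period(frequency_unit)
    return (end - start).n // frequency_quantity

# 25 minutes at a 10-minute frequency: 25 // 10 floors to 2
print(periods_between("2025-01-01 00:00", "2025-01-01 00:25", "min", 10))  # 2
# 30 minutes divides evenly: 3
print(periods_between("2025-01-01 00:00", "2025-01-01 00:30", "min", 10))  # 3
```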


Please add a comment here.


@apeforest apeforest left a comment


I suggest using a dataclass to represent frequency_unit and frequency_value as one entity instead of treating them as separate arguments.

nit: please run lint on your code. There are many indentation issues and multiple arguments on one line.

@Lanz-db Lanz-db requested a review from apeforest February 12, 2025 07:21
-        **DATE_OFFSET_KEYWORD_MAP[OFFSET_ALIAS_MAP[freq]]
-    ) * periods == pd.to_datetime(start_time)
+        **DATE_OFFSET_KEYWORD_MAP[OFFSET_ALIAS_MAP[frequency_unit]]
+    ) * periods * frequency_quantity == pd.to_datetime(start_time)

@apeforest apeforest Feb 12, 2025


Still not clear to me. What is this == check doing here? Please also add parentheses to make the operator precedence clearer.

Author


On the left of the ==, it calculates the end time minus the offset span; on the right is the start time. The resulting variable diff is True or False.
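A hedged sketch of the check being discussed: stepping back periods * frequency_quantity offsets from end_time should land exactly on start_time when the series is evenly spaced. The minute-based offset below stands in for the DATE_OFFSET_KEYWORD_MAP lookup, and the variable values are made up for illustration.

```python
import pandas as pd

start_time = pd.to_datetime("2025-01-01 00:00")
end_time = pd.to_datetime("2025-01-01 00:30")
periods = 3
frequency_quantity = 10

# Stands in for pd.DateOffset(**DATE_OFFSET_KEYWORD_MAP[...]) in the diff.
offset = pd.DateOffset(minutes=1)

# 3 periods of 10 minutes each: 00:30 - 30min == 00:00, so diff is True.
diff = end_time - offset * (periods * frequency_quantity) == start_time
print(diff)  # True
```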

Copy link

@apeforest apeforest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks good, but the code still needs refactoring. Please add comments to the two places I mentioned and create a JIRA for the follow-up refactoring.

@Lanz-db Lanz-db merged commit 0734ce3 into branch-0.2.20.5 Feb 12, 2025
5 checks passed
@Lanz-db Lanz-db deleted the lan-zhang_data/freq_lib branch February 12, 2025 19:31