
[ML-47076] Support custom frequencies by Lanz-db · Pull Request #162 · databricks/automl

[ML-47076] Support custom frequencies #162

Merged · 11 commits merged Feb 12, 2025
Conversation


@Lanz-db Lanz-db commented Feb 11, 2025

This PR adds support for custom frequencies in AutoML. It adds a frequency_quantity parameter to the Estimator and Model classes for all three algorithms. Please review it together with the companion PR, https://github.com/databricks-eng/universe/pull/911199.

To test this PR

  • unit tests
  • manual e2e tests

To manually test, I

  1. installed the core library wheel and the automl_runtime wheel on an MLR 14.3 LTS compute
  2. commented out the line %pip install --force-reinstall databricks-automl-runtime==0.2.20.5 in pip_install.jinja and pip_install_deepar.jinja so the new automl_runtime wheel is used
  3. ran the driver notebooks
    I tested all possible combinations, stored under the folder, including
    5 min - single time series
    5 min - multi time series
    10 min - single time series
    10 min - multi time series
    15 min - single time series
    15 min - multi time series
    30 min - single time series
    30 min - multi time series

@Lanz-db Lanz-db changed the title [DO NOT MERGE][ML-47076] Prototyping custom frequencies [ML-47076] Prototyping custom frequencies Feb 11, 2025
@Lanz-db Lanz-db changed the title [ML-47076] Prototyping custom frequencies [ML-47076] Support custom frequencies Feb 11, 2025

@apeforest apeforest left a comment


Can we reduce the number of changes to the function interface, but calculate frequency string from the caller?

@Lanz-db Lanz-db requested a review from apeforest February 11, 2025 23:26
@@ -41,7 +41,9 @@ def _prophet_fit_predict(params: Dict[str, Any], history_pd: pd.DataFrame,
                          horizon: int, frequency: str, cutoffs: List[pd.Timestamp],
                          interval_width: int, primary_metric: str,
                          country_holidays: Optional[str] = None,
-                         regressors = None, **prophet_kwargs) -> Dict[str, Any]:
+                         regressors = None,
+                         frequency_quantity: int = 1,


put frequency_quantity closer to frequency_unit (renamed)

Author


I intentionally added it at the end because I want frequency_quantity to have a default value of 1 to keep the signature backward compatible. A parameter with a default value cannot be placed before any parameter without one. Even if it were the first among the defaulted parameters, it would still be far from frequency_unit, so I put it at the end.
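A minimal illustration of the constraint being described (the signature below is hypothetical, not the PR's actual function):

```python
# A defaulted parameter cannot precede a non-defaulted one:
# def fit(frequency_unit, frequency_quantity=1, horizon): ...  # SyntaxError

# Appending the defaulted parameter keeps every existing positional
# call site working unchanged:
def fit(frequency_unit: str, horizon: int, frequency_quantity: int = 1) -> str:
    return f"{frequency_quantity}{frequency_unit} x {horizon}"

print(fit("min", 7))     # old call sites still work -> "1min x 7"
print(fit("min", 7, 5))  # new callers can pass the quantity -> "5min x 7"
```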


Can we instead pass in Frequency as a dataclass?

@dataclass
class Frequency:
    """Class to specify forecast frequency."""
    frequency_unit: str
    frequency_value: int = 1

Author


OK, the problem is that for some functions I want frequency_quantity to have a default value of 1. These are the functions used outside, in the trial notebook. If a customer developed on top of an old trial notebook, we do not want to break those function calls.

For functions that are only called inside this repo, I recommend having no default values so bugs surface earlier. With a default value, some bugs may be silently ignored.

Besides, I believe we are about to refactor this repo and the core library code, since we are now fairly sure we will go black-box. So I would suggest deferring the introduction of any complex data structures for now.


Is this code shared between serverless and classic? If so, then it makes sense. If not, why worry about the trial notebook?

Author


Oh, I was wrong; you are right. This code is only used by serverless.
Let me clarify the tradeoffs of introducing a new dataclass.
Pros: the code is cleaner. (Though I doubt how much it would improve readability; without the dataclass it is still quite readable.)
Cons:

  1. It requires changing a lot of code and the related tests, including some code that this PR does not otherwise touch.
  2. The dataclass forces its fields to either all have default values or all lack them, but we do not actually need default values now. Either way, it makes the code less flexible.

I would still prefer not to use such a dataclass. wdyt?


@apeforest apeforest Feb 12, 2025


It's not just cleaner code (in terms of readability) but better overall maintainability and testability.

Think about the following:

  1. The current argument list is already long. Consumers of this function need to read many argument definitions to understand how to use it.
  2. frequency_unit and frequency_quantity form a single logical concept. Specifying only one of them doesn't make sense.
  3. Easy to validate: you can add a validation function inside the dataclass, so every Frequency object passed in can be validated immediately.
  4. Easy to test: you only need to test the Frequency object instead of two separate values. You can have a simple test like def is_valid_frequency(freq: Frequency) -> bool. In the future, if you want to support other quantities, e.g. 45min or 3d, you don't need to add checks all over the place.
  5. Having a default value in it is not a bad thing. With a single object, you change the default value once and it applies everywhere.

Anyway, given the urgency of this task, I don't want to make this a blocker. But I suggest you create a follow-up JIRA in the EE epic to refactor this code.
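A sketch of the suggested pattern, combining the reviewer's dataclass with construction-time validation. The set of allowed unit/quantity combinations below is illustrative only; the product's real constraints may differ.

```python
from dataclasses import dataclass

# Illustrative allowed combinations (an assumption, not the product's
# actual validation table).
_VALID_QUANTITIES = {"min": {1, 5, 10, 15, 30}, "H": {1}, "d": {1}}

@dataclass
class Frequency:
    """Class to specify forecast frequency."""
    frequency_unit: str
    frequency_quantity: int = 1

    def __post_init__(self) -> None:
        # Validate once at construction, so every consumer receives a
        # checked object instead of re-validating two loose arguments.
        allowed = _VALID_QUANTITIES.get(self.frequency_unit)
        if allowed is None or self.frequency_quantity not in allowed:
            raise ValueError(
                f"Unsupported frequency: {self.frequency_quantity}{self.frequency_unit}"
            )

freq = Frequency("min", 15)  # ok
# Frequency("min", 45)       # would raise ValueError until 45min is supported
```

Supporting a new quantity like 45min then means editing one table rather than hunting down checks across the codebase, which is the maintainability point made above.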



@apeforest apeforest left a comment


Please rename frequency to frequency_unit and add more tests


Lanz-db commented Feb 12, 2025

> Please rename frequency to frequency_unit and add more tests

Thanks for the review. To clarify, I am working on adding tests; the commits I pushed were to fix the existing tests. I will let you know once I have added the new tests.

-    freq_alias = PERIOD_ALIAS_MAP[OFFSET_ALIAS_MAP[freq]]
-    return (end_time.to_period(freq_alias) - start_time.to_period(freq_alias)).n
+    freq_alias = PERIOD_ALIAS_MAP[OFFSET_ALIAS_MAP[frequency_unit]]
+    return (end_time.to_period(freq_alias) - start_time.to_period(freq_alias)).n // frequency_quantity


What if it's not divisible and we just get the floor value? Is that intended?

Author


Yes, it is intended. If the span is not divisible, we get the floor value, and a later consistency check uses that floor value to detect that the frequency is not consistent.
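The flooring behavior can be seen in a small standalone sketch. periods_between below is a hypothetical simplification of the diff's helper: it takes the frequency unit as a raw pandas alias instead of going through PERIOD_ALIAS_MAP / OFFSET_ALIAS_MAP.

```python
import pandas as pd

def periods_between(start_time: str, end_time: str,
                    frequency_unit: str, frequency_quantity: int) -> int:
    # Period subtraction counts single-unit steps; integer division by
    # frequency_quantity floors when the span is not evenly divisible.
    start = pd.Timestamp(start_time).to_period(frequency_unit)
    end = pd.Timestamp(end_time).to_period(frequency_unit)
    return (end - start).n // frequency_quantity

# 25 minutes at a 10-minute frequency: 25 // 10 floors to 2
print(periods_between("2025-01-01 00:00", "2025-01-01 00:25", "min", 10))  # 2
# 30 minutes divides evenly: 3
print(periods_between("2025-01-01 00:00", "2025-01-01 00:30", "min", 10))  # 3
```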


Please add a comment here.


@apeforest apeforest left a comment


I suggest using a dataclass to represent frequency_unit and frequency_value as one entity instead of treating them as separate arguments.

nit: please run lint on your code. There are many indentation issues and multiple arguments on one line.

@Lanz-db Lanz-db requested a review from apeforest February 12, 2025 07:21
-        **DATE_OFFSET_KEYWORD_MAP[OFFSET_ALIAS_MAP[freq]]
-    ) * periods == pd.to_datetime(start_time)
+        **DATE_OFFSET_KEYWORD_MAP[OFFSET_ALIAS_MAP[frequency_unit]]
+    ) * periods * frequency_quantity == pd.to_datetime(start_time)

@apeforest apeforest Feb 12, 2025


Still not clear to me. What is this == check doing here? Please also add parentheses to make the operator precedence clearer.

Author


On the left of the ==, it calculates the end time minus the offset span; on the right is the start time. The resulting variable diff is True or False.
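A hedged sketch of the check being discussed: stepping back periods * frequency_quantity offsets from end_time should land exactly on start_time when the series is evenly spaced. The minute-based offset below stands in for the DATE_OFFSET_KEYWORD_MAP lookup, and the variable values are made up for illustration.

```python
import pandas as pd

start_time = pd.to_datetime("2025-01-01 00:00")
end_time = pd.to_datetime("2025-01-01 00:30")
periods = 3
frequency_quantity = 10

# Stands in for pd.DateOffset(**DATE_OFFSET_KEYWORD_MAP[...]) in the diff.
offset = pd.DateOffset(minutes=1)

# 3 periods of 10 minutes each: 00:30 - 30min == 00:00, so diff is True.
diff = end_time - offset * (periods * frequency_quantity) == start_time
print(diff)  # True
```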

Copy link

@apeforest apeforest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks good, but the code still needs refactoring. Please add comments to the two places I mentioned and create a JIRA for the follow-up refactoring.

@Lanz-db Lanz-db merged commit 0734ce3 into branch-0.2.20.5 Feb 12, 2025
5 checks passed
@Lanz-db Lanz-db deleted the lan-zhang_data/freq_lib branch February 12, 2025 19:31