Fix performance bug from cftime import #5640

Merged

lusewell
Contributor

@lusewell lusewell commented Jul 28, 2021

No functional change; this just removes a terrible performance bug when cftime isn't installed. Previously, every call to .sel would search the whole Python path trying to import cftime, which led to programs of mine spending 10% of their time doing just this against a slow filesystem.

Tests all pass locally.

@lusewell
Contributor Author

I'd also like to append this to tag 14.1 and make tag 14.2 if possible - would this be ok?

@TomNicholas
Member

Thanks for the suggestion, @lusewell.

I'm a bit confused as to how exactly this improves performance though - you've moved the location of the import cftime statement to the top of the file, but I was under the impression that python doesn't ever import a module more than once, because after the first time it's a fast hash lookup. So surely in both cases we only look for the existence of cftime once? Perhaps I've misunderstood though?

@dcherian
Contributor

I guess it always tries importing if the module doesn't exist and so that's a slowdown?

@TomNicholas
Member

TomNicholas commented Jul 28, 2021

But in both cases we always check for the existence of cftime via import cftime, so if python is clever enough to remember that cftime doesn't exist the second time it's asked to import it, then where is the opportunity for speedup?

Hopefully @lusewell can enlighten us 😅

@github-actions
Contributor

github-actions bot commented Jul 28, 2021

Unit Test Results

         6 files  ±0           6 suites  ±0   57m 49s ⏱️ ±0s
16 227 tests ±0  14 492 ✔️ ±0  1 735 💤 ±0  0 ±0 
90 558 runs  ±0  82 384 ✔️ ±0  8 174 💤 ±0  0 ±0 

Results for commit 4340909. ± Comparison against base commit 4340909.

♻️ This comment has been updated with latest results.

@dcherian
Contributor

doesn't look like it?
image
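
The question raised above (does Python remember that a module is missing?) can be checked with a small sketch. It assumes a made-up module name, `hypothetical_missing_module_for_demo`, and a hypothetical `CountingFinder` placed at the front of `sys.meta_path` to count how often the import system is asked to locate that name. Successful imports are cached in `sys.modules`, but a failed import leaves no entry there, so each new attempt goes back through the finders.

```python
import importlib
import sys

# Hypothetical module name, chosen so that it is certainly not installed.
MISSING = "hypothetical_missing_module_for_demo"


class CountingFinder:
    """Meta path finder that only counts lookups for one module name."""

    def __init__(self, target):
        self.target = target
        self.calls = 0

    def find_spec(self, fullname, path=None, target=None):
        if fullname == self.target:
            self.calls += 1
        return None  # never claims the module; defers to the regular finders


def count_failed_lookups(attempts):
    """Try to import MISSING `attempts` times and count finder lookups."""
    finder = CountingFinder(MISSING)
    sys.meta_path.insert(0, finder)
    try:
        for _ in range(attempts):
            try:
                importlib.import_module(MISSING)
            except ImportError:
                pass
    finally:
        sys.meta_path.remove(finder)
    return finder.calls


lookups = count_failed_lookups(3)
print(lookups, MISSING in sys.modules)
```

With three attempts, the finder should be consulted three times and the name should not appear in `sys.modules`. That is consistent with the slowdown described in this PR: on a slow filesystem, each of those repeated searches involves filesystem checks along the path.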

Member

@spencerkclark spencerkclark left a comment

Thanks for noting this @lusewell. Based on @TomNicholas's comment and @dcherian's investigation, I'm curious whether we can make this PR a little more targeted.

Comment on lines 225 to +227
def assert_all_valid_date_type(data):
    import cftime
    if cftime is None:
        raise ModuleNotFoundError("No module named 'cftime'")
Member

I think this may be the only place where this import could be attempted regularly without cftime installed. Would only making a change here fix the performance issue?

Contributor Author

I'd have thought it's better to avoid a pattern with these performance characteristics in general, which is why I changed it for all instances rather than just the one that was causing me issues. I think it's pretty clearly a low-risk change, so I thought it was more a question of what's a better pattern to follow going forward, rather than what currently falls in the critical path.

Member

I guess my thinking was, at least in the case of the cftime_offsets.py module, if it is a performance issue to attempt importing cftime when it is not installed, why attempt it -- even once -- if we know we will never need to? Note that we follow this current pattern in more than just the cftime_offsets.py and cftimeindex.py modules. There are places in times.py where we import cftime within functions, as well as numerous places in the tests (again though, these are places where I do not think we would see material impacts on performance). I'm open to discussing this more, however.

One other place where I do think this new pattern could positively impact performance is linked below; this is another example of where we might attempt importing cftime regularly when it is not installed. It would be great if you could modify the import logic there to be more performance-friendly too.

xarray/xarray/plot/utils.py

Lines 625 to 630 in 35d798a

try:
    import cftime
    cftime_datetime = [cftime.datetime]
except ImportError:
    cftime_datetime = []
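
One way the snippet above could be made cheaper, sketched under assumptions rather than taken from the merged change, is to resolve the optional types once at module import and reuse the result. The names `CFTIME_DATETIME_TYPES` and `is_cftime_datetime` are hypothetical and do not come from xarray.

```python
# Resolve the optional dependency once, when the module is first imported.
try:
    import cftime

    CFTIME_DATETIME_TYPES = (cftime.datetime,)
except ImportError:
    # cftime is not installed: an empty tuple means "no cftime types".
    CFTIME_DATETIME_TYPES = ()


def is_cftime_datetime(value):
    """Hypothetical helper: True only if value is a cftime datetime."""
    # isinstance with an empty tuple is always False, so no import is retried.
    return isinstance(value, CFTIME_DATETIME_TYPES)
```

Calling `is_cftime_datetime` in a hot loop then costs only an `isinstance` check, whether or not cftime is installed.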

@lusewell
Contributor Author

So I've found another instance of this which causes a performance issue - this one with groupby.

@lusewell
Contributor Author

@spencerkclark

RE performance.

It's only a performance issue to attempt to import cftime repeatedly. Having it fail once in the top-level import is not a big problem. The issue comes when it happens thousands of times, every time you try to .sel or .isel, which then adds up to a huge performance hit. Given xarray takes a while to import anyway, the marginal cost of searching the full Python path 3 times at import is minimal - it's only an issue when done repeatedly.
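
A minimal sketch of the pattern being argued for here: attempt the import once at module level, record failure as `None`, and have functions check that sentinel instead of re-running `import cftime`. The error message matches the one in the diff quoted earlier in this thread; the helper name `require_cftime` is hypothetical.

```python
# Attempt the optional import exactly once, at module import time.
try:
    import cftime
except ImportError:
    cftime = None  # sentinel: remembered for the lifetime of the process


def require_cftime():
    """Hypothetical helper: return the cftime module or raise the usual error."""
    if cftime is None:
        # A cheap identity check replaces a repeated search of the Python path.
        raise ModuleNotFoundError("No module named 'cftime'")
    return cftime
```

Code on a hot path, such as something called on every .sel, would then pay for one failed path search per process rather than one per call.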

I've fixed this for some other cases I've found that were causing me slowness - would you like me to change anything else before this can be merged?

@spencerkclark
Member

Thanks for catching that additional spot.

It's only a performance issue to attempt to import cftime repeatedly. Having it fail once in the top-level import is not a big problem.

Yes, understood. I just prefer that we are consistent across the code base -- either we use this pattern only where absolutely necessary or we use it everywhere. In light of that do you mind introducing this pattern in times.py as well? There are two places where cftime is imported within functions there too:

import cftime

import cftime

I think we don't have to worry about the tests, because they already follow this pattern to an extent; in building the requires_cftime decorator, a cftime import is only attempted once.

After that, just fix the linting error and add a what's new entry, and I think this should be ready to go from my perspective.

@lusewell lusewell force-pushed the bugfix/fix-performance-bug-from-cftime-import branch from 4502168 to 8619ad9 Compare September 13, 2021 07:38
@lusewell
Contributor Author

Fixed other usages and added to whats-new.rst

@max-sixty max-sixty added the plan to merge Final call for comments label Sep 13, 2021
Member

@spencerkclark spencerkclark left a comment

Thanks for your patience @lusewell. This looks good to me. Regarding your request:

I'd also like to append this to tag 14.1 and make tag 14.2 if possible - would this be ok?

It sounds like you would like us to backport this change? I am not an expert on doing this -- perhaps others in @pydata/xarray can weigh in.

@dcherian
Contributor

Thanks @lusewell and @spencerkclark

Unfortunately we don't do backports.

@dcherian dcherian merged commit 4340909 into pydata:main Sep 29, 2021
snowman2 pushed a commit to snowman2/xarray that referenced this pull request Feb 9, 2022
Co-authored-by: Luke Sewell <lukeddsewell@gmail.com>
5 participants