Fixed #31169 -- Adapted the parallel test runner to use spawn. #15421
Conversation
Thanks for picking this up @smithdc1.
I've left a bunch of comments.
second_db = sqlite3.connect(worker_db, uri=True)
source_db.backup(second_db)
source_db.close()
self.connection.settings_dict["NAME"] = worker_db
self.connection.connect()
second_db.close()
So this section could do with a bit of thought. Looking at the Python sqlite docs, it seems we should do:
Suggested change:

second_db = sqlite3.connect(worker_db, uri=True)
with second_db:
    source_db.backup(second_db)
source_db.close()
# Re-open connection to in-memory database before closing copy connection.
self.connection.settings_dict["NAME"] = worker_db
self.connection.connect()
second_db.close()
Not really sure what the context manager is for 🤷🏻♂️
Can we also rename second_db to target_db? While we're at it, perhaps we should also change the name of worker_db as it is just the connection string.
Other thoughts related to .backup() that might be worth exploring:
- It takes a sleep argument with a default of 0.25. Maybe we can reduce this to speed up setup time?
- It takes a progress argument accepting a callable. Perhaps we can pass verbosity to setup_worker_connection() and display progress information? (See the sketch below.)
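For reference, a minimal standalone sketch of those two arguments (the databases, page size, and print-based reporting here are illustrative only, not Django's code):

import sqlite3

source_db = sqlite3.connect(":memory:")
source_db.execute("CREATE TABLE example (id INTEGER)")
target_db = sqlite3.connect(":memory:")

def report_progress(status, remaining, total):
    # Called after each batch of pages has been copied.
    print(f"Copied {total - remaining} of {total} pages...")

# sleep is the pause between batches (default 0.250s); progress receives
# (status, remaining, total) after each batch. pages=1 copies one page per
# batch so the callback fires more than once.
source_db.backup(target_db, pages=1, sleep=0.01, progress=report_progress)

source_db.close()
target_db.close()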
I'm not sure what the with block is for either. I read the PR earlier and that didn't make it any clearer for me, as the implementation is written in C.
The second example shows it written more like it is here, but I agree that it is a little untidy.
I've not yet tested Nick's proposal, which I'd judge as more readable than the current attempt.
I tested Nick's proposals and got a handful of test failures. I'm not entirely sure why, so I've just progressed with the renaming suggestions for now.
This still leaves the question about changing the other arguments to backup().
FWIW, the with ... usage is documented as "Connection objects can be used as context managers that automatically commit or rollback transactions. In the event of an exception, the transaction is rolled back; otherwise, the transaction is committed", which is presumably useful and necessary when the data pages to be backed up are locked for writing etc. (hence also, the sleep ...)?
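For what it's worth, a tiny standalone illustration of that documented commit/rollback behaviour (not taken from this patch):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x)")

with conn:  # Committed because the block succeeds.
    conn.execute("INSERT INTO t VALUES (1)")

try:
    with conn:  # Rolled back because the block raises.
        conn.execute("INSERT INTO t VALUES (2)")
        raise RuntimeError("boom")
except RuntimeError:
    pass

print(conn.execute("SELECT COUNT(*) FROM t").fetchone())  # (1,)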
I looked at this again. When using the context manager I again saw test failures (different to those before) and they were intermittent. I hadn't seen that with the current patch.
However given they are intermittent I'm not sure if it's an issue with the context manager or this patch.
Hey @smithdc1 — thanks for picking this up. This is quite exciting: First run, hitting an error (quickly 🙂) on macOS:
I shall have a little dig-in to that. 👍
sqlite3.OperationalError: no such table: backends_object

I was seeing something similar on the Pi with the previous patch. This is why I introduced the extra logic in … Maybe that step isn't quite right for macOS? 🤷 Building on Nick's comments, something doesn't quite seem right at the moment, but I find it hard to describe. It seems that we're 'cloning' a file for spawn (fork gets its own in-memory copy) but then copying (converting the various files back to in-memory) / reopening again (fork?) in each process in 'setup_worker_connection'.
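To make the "copying back to in-memory" step more concrete, here is a rough sketch of the idea under spawn (the file names, URI, and table are illustrative, not the actual ones used in the patch):

import sqlite3

# Parent process: write a small "template" database to disk before spawning
# workers (illustration only).
template = sqlite3.connect("template_1.sqlite3")
template.execute("CREATE TABLE IF NOT EXISTS example (id INTEGER)")
template.commit()
template.close()

# Spawned worker: restore the on-disk copy into a shared in-memory database
# and point the worker's connection at that URI.
worker_uri = "file:memorydb_test_1?mode=memory&cache=shared"
source_db = sqlite3.connect("template_1.sqlite3")
target_db = sqlite3.connect(worker_uri, uri=True)
source_db.backup(target_db)
source_db.close()

# target_db (or another connection to the same URI) must stay open, otherwise
# the shared in-memory database is discarded.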
OK, I'm seeing two intermittent errors (full output omitted), plus a shutdown "7 leaked semaphores" issue occasionally. The traceback just ends there. It gets blocked at shutdown and I have to hit ^C (see the …) to get it to exit. I also saw an …
Hi @carltongibson, thanks for looking at this. Thank you for highlighting those test failures. I'll have a look and see if I can get them to reproduce on Windows, as I don't have macOS. On the final, semaphore, issue I found this, but I'm not sure if it is related to the issue shown in the traceback above.
Force-pushed from 3cbadd3 to d9e5bcf (Compare)
Ahh the best type of error. 😄 Running on Python 3.10, it passes more than it fails. I've just run three times; only on the third do I hit the failure. It's also significantly faster, so — if it's stable on Windows — I'm half-inclined to say let's have it and fix the isolation issues as we can. (We've had extended test failure issues on macOS for an age... until we commit to running CI on macOS I don't see that changing.) However... let me have another play.
I can't quite tell (before lunch at least 🥪) if @ngnpope's comments are all resolved? 🤔
Force-pushed from ff09295 to 4dcda38 (Compare)
Not quite -- I've pushed a couple of small edits, squashed and rebased. I've also been through and re-marked resolved/not resolved as I think this patch now stands. There are three comments that I've not yet had a chance to think about.
Force-pushed from 4dcda38 to ec82706 (Compare)
Am I right in thinking that Django's PR runners run the tests with …?
Force-pushed from ec82706 to 2a105f3 (Compare)
So I have seen this, but on Linux, and like you say it was intermittent. I think it's to do with this comment, so I've now reset the connection settings, which was missing for both …
Carlton -- with my latest amends, does this fix the test failures you were seeing?
I therefore think this patch had a regression in it which wasn't being caught, as CI doesn't run in parallel. While I've done some more testing with long run times on my devices, I can't do enough "reps" to prove this is stable. It would therefore be useful, I think, if this could be tested a little bit more widely before merging? I'm getting there slowly, just one more of Nick's comments to investigate.
I think this is ready for review again. I'd appreciate your feedback to understand if the previous batch of test failures are now resolved. There is one outstanding question about the use of the context manager when migrating the DB back to memory. As your timings are c. 7x-8x quicker than mine (I think I'm now memory constrained), I wondered if you could help test the reliability of this patch with/without the suggested context manager? 🙏
Thanks @smithdc1. Let me give it another run. 👍
OK… so…
macOS.
Running with --parallel=1 works without issue, which is the status quo ante, since that's all that works on main.
With this patch runtime is ≈38s vs 246s on main.
It occasionally fails, maybe 1 in 4 runs? Here I can re-run it (twice if need be) before the --parallel=1 version has finished.
Issues
The main error I'm seeing is this one:
test_database_writes (servers.tests.LiveServerDatabase) failed:
<HTTPError 500: 'Internal Server Error'>
Unfortunately, the exception it raised cannot be pickled, making it impossible
for the parallel test runner to handle it cleanly.
Here's the error encountered while trying to pickle the exception:
TypeError("cannot pickle '_io.BufferedReader' object")
This leads to a problem on shutdown:
(snipped)
multiprocessing.pool.RemoteTraceback:
...
TypeError: cannot pickle '_io.BufferedReader' object
...
Exception ignored in: <function Pool.__del__ at 0x109c2dea0>
Previously this was freezing, requiring a ^C, but it's exiting cleanly now.
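For context on that pickling error, a small standalone illustration (not from the test suite) of an object failing in exactly this way:

import pickle

class BadError(Exception):
    # An exception that captures an open file handle can't cross the
    # multiprocessing boundary: _io.BufferedReader has no pickle support.
    def __init__(self, stream):
        super().__init__("boom")
        self.stream = stream

with open(__file__, "rb") as stream:
    try:
        pickle.dumps(BadError(stream))
    except TypeError as exc:
        print(exc)  # cannot pickle '_io.BufferedReader' object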
Then, this morning, I've seen these two, one time each (over quite a few runs):
FAIL: test_delete_signals (signals.tests.SignalTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
yield
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/unittest/case.py", line 594, in run
self._callTearDown()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/unittest/case.py", line 552, in _callTearDown
self.tearDown()
File "/Users/carlton/Projects/Django/django/tests/signals/tests.py", line 32, in tearDown
self.assertEqual(self.pre_signals, post_signals)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/unittest/case.py", line 837, in assertEqual
assertion_func(first, second, msg=msg)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/unittest/case.py", line 1054, in assertTupleEqual
self.assertSequenceEqual(tuple1, tuple2, msg, seq_type=tuple)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/unittest/case.py", line 1025, in assertSequenceEqual
self.fail(msg)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/unittest/case.py", line 667, in fail
raise self.failureException(msg)
AssertionError: Tuples differ: (1, 0, 2, 1) != (1, 0, 2, 0)
First differing element 3:
1
0
- (1, 0, 2, 1)
? ^
+ (1, 0, 2, 0)
? ^
and
======================================================================
FAIL: test_database_sharing_in_threads (backends.sqlite.tests.ThreadSharing)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
yield
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/unittest/case.py", line 591, in run
self._callTestMethod(testMethod)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
method()
File "/Users/carlton/Projects/Django/django/tests/backends/sqlite/tests.py", line 271, in test_database_sharing_in_threads
self.assertEqual(Object.objects.count(), 2)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/unittest/case.py", line 837, in assertEqual
assertion_func(first, second, msg=msg)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/unittest/case.py", line 830, in _baseAssertEqual
raise self.failureException(msg)
AssertionError: 1 != 2
# This was printed to stderr:
Exception in thread Thread-1 (create_object):
Traceback (most recent call last):
File "/Users/carlton/Projects/Django/django/django/db/backends/utils.py", line 89, in _execute
return self.cursor.execute(sql, params)
File "/Users/carlton/Projects/Django/django/django/db/backends/sqlite3/base.py", line 357, in execute
return Database.Cursor.execute(self, query, params)
sqlite3.OperationalError: no such table: backends_object
I think if we could work out the issue with test_database_writes (servers.tests.LiveServerDatabase) that would be most of it. (Obviously why the other two come up would be worth resolving too.)
The issue with test_database_writes (servers.tests.LiveServerDatabase) doesn't reproduce running just servers.tests.LiveServerDatabase or the servers tests.
I'm minded to take this and work on the edges in a much happier (faster) land.
I need to look at this still. I tested already with disk vs in-memory SQLite, which made no difference to the behaviour.
I'm not seeing any difference with the context manager approach (discussed above). By the time we reach the end of the …
@smithdc1 — what were the errors you saw? (Similar to those I'm seeing, or others?)
Last time I tried it I had a file access error (so something different). But I only saw it the once, so it's hard to tell if it is related to this approach.
It breaks. I see 5 failures running the postgres tests in parallel without this change, see gist. Looking at the previous PR, there was some discussion here and here, but nothing on the why from my reading. Assuming we're happy to keep it, I think all comments are now accounted for. I've also rebased for the hook that was pulled out into a separate PR earlier today.
Hi all, I've been trying to replicate some of the test failures seen here on Windows this morning. I tried using … With this command/setup I see repeated test failures (not so when running the whole test suite). The tests pass when running in a single process with …
Full output from this morning's logs here.
Hey @smithdc1 — interesting. Typically, I'm not seeing that failure running the same here 🤔 (at e445604). I think we should try to divide the issues into:
1. Regressions introduced by this patch.
2. Pre-existing test isolation issues that running in parallel merely surfaces.
I can well believe we'll find plenty of 2. But do we think there are any 1s left? If not, I think we can probably go for it and resolve the 2s with time and visibility, once folks are running the test suite in parallel in more environments. (Or so I might argue.)
I think that's difficult to tell, as we need this patch to find the isolation issues, especially as they seem hard to replicate. I was thinking that if we are happy this patch doesn't create a regression in the following scenarios then we should progress.
People using spawn can always revert to …
@smithdc1 Yes, that's more or less my reasoning too. (Plus it's so much faster that I'd take the failures whilst we track 'em down as a trade-off without regret.)
@smithdc1 Thanks 👍 I squashed commits, rebased, and pushed small edits.
Force-pushed from 29dd40c to f667201 (Compare)
@carltongibson Can you check this with selenium tests?
Question: Could this be considered an extended support upgrade for 3.2? I ask because 3.2 is the last version of Django to support Python 3.7, and 3.7 is the last version of Python that can run parallel tests on macOS by setting the …
@hamstap85 This should be part of Django 4.1. It doesn't qualify for a backport to Django 3.2, which is now in extended support and only receives security and data loss fixes anyway.
As it stands, Selenium tests are all skipped unless … I'd suggest taking getting that working as a follow-up. Maybe an error if …
@carltongibson thanks for clarifying 👍
I've not had a chance to test it on Windows yet, but maybe something like this?

diff --git a/tests/runtests.py b/tests/runtests.py
index 06755688ea..48840b0042 100755
--- a/tests/runtests.py
+++ b/tests/runtests.py
@@ -3,6 +3,7 @@ import argparse
 import atexit
 import copy
 import gc
+import multiprocessing
 import os
 import shutil
 import socket
@@ -683,6 +684,16 @@ if __name__ == "__main__":
     options = parser.parse_args()
 
+    if (
+        options.selenium
+        and options.parallel != 1
+        and multiprocessing.get_start_method() != "fork"
+    ):
+        raise ValueError(
+            "You cannot use --selenium with parallel tests; "
+            "pass --parallel=1 to use it."
+        )
+
     using_selenium_hub = options.selenium and options.selenium_hub
     if options.selenium_hub and not options.selenium:
         parser.error(
@smithdc1 I think that would be just right. Do you have the capacity to add that, and a test? — I think then we can get this in. 😜
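For what such a test might look like, a rough self-contained sketch, assuming the check were factored into an importable helper (validate_selenium_parallel is a hypothetical name here, not what the patch uses):

import multiprocessing
import unittest
from unittest import mock

def validate_selenium_parallel(selenium, parallel):
    # Hypothetical standalone version of the runtests.py check above.
    if selenium and parallel != 1 and multiprocessing.get_start_method() != "fork":
        raise ValueError(
            "You cannot use --selenium with parallel tests; "
            "pass --parallel=1 to use it."
        )

class SeleniumParallelCheckTests(unittest.TestCase):
    def test_rejected_under_spawn(self):
        with mock.patch.object(multiprocessing, "get_start_method", return_value="spawn"):
            with self.assertRaises(ValueError):
                validate_selenium_parallel(selenium=True, parallel=4)

    def test_allowed_under_fork(self):
        with mock.patch.object(multiprocessing, "get_start_method", return_value="fork"):
            validate_selenium_parallel(selenium=True, parallel=4)  # No error raised.

if __name__ == "__main__":
    unittest.main()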
Force-pushed from 795208f to 26cd048 (Compare)
@smithdc1 — I added a check for the … Just now discussing the …
Thank you for pushing this along. I was still contemplating how to add a test 😄
Co-authored-by: Valz <ahmadahussein0@gmail.com>
Co-authored-by: Nick Pope <nick@nickpope.me.uk>
Force-pushed from 26cd048 to a34c15a (Compare)
@smithdc1 @carltongibson This seems to have caused some issues:
I stopped this build on 70432 workers 🤯 See logs.
This is probably related to the fact that we use TEST_RUNNER = 'xmlrunner.extra.djangotestrunner.XMLTestRunner' on Jenkins. However, folks can use different test runners and it shouldn't crash.
A draft PR with a potential fix is at #15520.
A bit late, but I think this deserves a release note! At the least, projects that have patched to allow fork on macOS will be able to remove the patch. Possibly there are other workarounds and test scripts users will want to update now they can run tests in parallel on all platforms!
Good call. @adamchainz How about #15599?
Ticket: #31169
Previous PR: #12646
This is my attempt at rebasing the previous PR and accommodating the comments. There was a lot of discussion on the previous PR, so maybe I've not captured everything as yet; likely I've also not interpreted it all correctly.
I've had to make some small changes so that this passes locally for me on both Windows and Linux, but I'm interested to see what the test suites here have to say. I've also changed get_max_test_processes() so the test suite will now run in parallel by default when spawn is in play.
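Roughly, the idea is that the default degree of parallelism no longer needs to be clamped to 1 on platforms where spawn is the start method. A simplified sketch of that logic (an illustration only, not the actual implementation; the DJANGO_TEST_PROCESSES override shown is an assumption here):

import multiprocessing
import os

def get_max_test_processes():
    # Simplified sketch: with spawn supported, parallelism can default to the
    # CPU count on every platform, with an optional environment override.
    try:
        return int(os.environ["DJANGO_TEST_PROCESSES"])
    except KeyError:
        return multiprocessing.cpu_count()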