Deep Dive: Our workers' based synchronous stack #2317
WebReflection started this conversation in Show and tell.
Have you ever wondered how it is possible that we don't block the main thread (the visited page) while executing remote code that can synchronously read or write any window reference? ... well, you are about to find out, at least at a high level 👍
Worker, Proxy, SharedArrayBuffer & Atomics
These are the ingredients that make it all happen, and the reason why, for UI-intensive tasks, things might not feel as fast as when everything just runs on the main thread. The issue with the main-thread approach, though, is that if your code has a `while True:` loop, takes a long time to bootstrap, or asks for an input, there's not much we can do beyond indirectly crashing the browser or its tab, blocking interactivity entirely, or "stopping the world" for a window prompt that must be answered or dismissed.

The solution to all of this is to use workers, which operate in a separate thread, be that a hyper-thread or a whole different CPU core.
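As a minimal sketch of why workers help (the file names are just illustrative), a blocking loop that would freeze the page on the main thread only keeps its own thread busy inside a worker:

```js
// main.js — spawn the worker: the page stays interactive no matter
// what the worker does.
const worker = new Worker('./worker.js');
worker.postMessage('start');

// worker.js — the equivalent of a `while True:` loop; it blocks this
// worker thread only, not the visited page.
addEventListener('message', () => {
  for (;;) { /* busy forever, yet the tab remains responsive */ }
});
```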
The roundtrip's Tango
In a worker we type, as an example, `window.location.href` to retrieve the current page:

- the `window` Proxy intercepts the need to access its `location` field
- it uses `postMessage` to reach the main thread, attaching what the proxy needs to know as details, such as the SharedArrayBuffer, then it waits synchronously for a notify operation (via `Atomics.notify`)
- the main thread resolves `window` and creates a unique identifier of that `location` object as result
- it stores the `length` of the serialized result at index `1` of the int32 array in charge of viewing the SharedArrayBuffer
- it notifies at index `0` with a positive integer that there is a `length` known to grab content
- the worker wakes up, reads the `length` of the binary result and, this time, it `postMessage`s a new SharedArrayBuffer with enough room to store that binary data + 4 bytes to wait for the next notification
- the main thread writes the binary data starting at byte `4`
- it notifies at index `0` that the operation has been completed
- the worker reads, from byte `4` to `4 + length` as previously communicated, the binary content that was returned

The same dance then repeats to resolve `href` out of that `location` reference: at the end of this convoluted orchestration we'll have the string value representing the current location `href`.
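To make that dance more concrete, here is a minimal sketch of the first roundtrip (the one communicating the `length`), assuming a hypothetical `serialize` helper; this is not the actual PyScript code:

```js
// worker side — ask for `window.location` and block until notified
const sab = new SharedArrayBuffer(8);     // [0] = wake-up flag, [1] = length
const i32 = new Int32Array(sab);
postMessage({ type: 'get', path: ['location'], sab });
Atomics.wait(i32, 0, 0);                  // sleep while index 0 is still 0
const length = i32[1];                    // byte length of the serialized result

// main thread side — resolve the request, store the length, wake the worker
onmessage = ({ data: { path, sab } }) => {
  const i32 = new Int32Array(sab);
  const result = serialize(path.reduce((obj, key) => obj[key], window));
  i32[1] = result.byteLength;             // tell the worker how big the payload is
  Atomics.store(i32, 0, 1);               // flip the flag ...
  Atomics.notify(i32, 0);                 // ... and wake the waiting worker
};
```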
... is this madness? Well, somehow ... yes, but this dance is basically the same one Pyodide or MicroPython or any FFI usually performs: things are mapped bi-directionally, hooks in the Garbage Collector are created to avoid caching all possible references too heavily, and nothing is usually strongly referenced, because these programming languages are strongly dynamic.
Why not map or cache everything ahead of time, then? That's a lovely question, and the simple answer is that we don't really know ahead of time what users are asking for; we can only guarantee that whatever they have asked for produces a meaningful result, as fast as it can be.
Our polyfills' role
The SharedArrayBuffer primitive has, unfortunately, historic reasons to not always be available unless special headers are in place, provided by the server (namely the cross-origin isolation headers `Cross-Origin-Opener-Policy` and `Cross-Origin-Embedder-Policy`) or by a special Service Worker. We orchestrated 2 variants that solve the issue, in a way or another, but both variants add some low to high overhead.
When neither variant is fully available, you can see an error from PyScript in devtools stating as much. This does not mean that PyScript won't work, though: it means that the whole magic provided by SharedArrayBuffer and Atomics.wait cannot practically be used, so it's not possible to simulate synchronous code at that point.
There is still room for improvement!
Nowadays, both ArrayBuffer and SharedArrayBuffer instances can grow in size over time; this wasn't true 2.5 years ago, when we first sketched this whole worker/main dance.
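For reference, a growable SharedArrayBuffer looks like this (the sizes are just illustrative):

```js
// Start tiny, grow in place on demand up to a fixed maximum: no need to
// allocate and postMessage a brand new buffer when more room is needed.
const sab = new SharedArrayBuffer(64 * 1024, {
  maxByteLength: 4 * 1024 * 1024, // e.g. up to 4 MiB
});

console.log(sab.growable);   // true
sab.grow(128 * 1024);        // same buffer, now twice as big
```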
On top of that, there are better primitives to deal with binary data, such as `DataView`, which helps avoid duplicating the amount of RAM needed to serialize or deserialize, while TypedArrays used as views do a wonderful job at being thin abstraction layers to deal with, dropping any unnecessary bloat.
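As a quick sketch of what "views, not copies" means here:

```js
// Both a DataView and a TypedArray read and write the underlying shared
// memory directly: no slice, no copy, no extra RAM.
const sab = new SharedArrayBuffer(64 * 1024);
const header = new DataView(sab, 0, 4);    // first 4 bytes: the header
const payload = new Uint8Array(sab, 4);    // the rest: the payload

header.setInt32(0, payload.length);        // write the length in place
console.log(header.getInt32(0));           // read it back, still zero copies
```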
So let's see how our initial plan/dance can be improved now, keeping the `window.location.href` example as reference, from a worker:

- a single `SharedArrayBuffer` is created, one that can grow up to a few megabytes but starts as tiny as possible (64K or something similar; the upper bound should rarely be reached, or needed at all)
- the `window` Proxy intercepts the need to access its `location` field
- it uses `postMessage` to reach the main thread, attaching what the proxy needs to know as details, always the same SharedArrayBuffer, then it waits synchronously for a notify operation (via `Atomics.notify`)
- the main thread resolves `window` and:
  - writes the result into the SharedArrayBuffer, from byte `4`, keeping the ability to grow it on demand behind the scenes
  - notifies at index `0` with a positive integer that everything is ready
- the worker reads the result back from byte `4`
Done 🥳 ... or better, there is no need anymore to:

- `postMessage` a brand new SharedArrayBuffer per operation
- dance twice to have a length and a content

In a few words, what required 7 steps (ask / serialize / binary / length → retrieve / binary / deserialize) and 2X the RAM could instead use 3 steps (ask / binary serialization → binary deserialization), reducing complexity and code bloat too, while improving performance by quite some margin. A sketch of this improved handshake follows.
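Here is a minimal sketch of that single-buffer roundtrip, again assuming hypothetical `serialize`/`deserialize` helpers rather than the actual PyScript internals:

```js
// worker side — `sab` is the single growable buffer, shared once at startup
const i32 = new Int32Array(sab, 0, 1);
postMessage({ type: 'get', path: ['location', 'href'] });
Atomics.wait(i32, 0, 0);                            // one single synchronous wait
const length = Atomics.load(i32, 0);                // the "ready" int doubles as length
const value = deserialize(new Uint8Array(sab, 4, length));

// main thread side — serialize once, write in place, notify once
onmessage = ({ data: { path } }) => {
  const bytes = serialize(path.reduce((obj, key) => obj[key], window));
  if (sab.byteLength < 4 + bytes.length) sab.grow(4 + bytes.length);
  new Uint8Array(sab, 4).set(bytes);                // payload starts at byte 4
  const i32 = new Int32Array(sab, 0, 1);
  Atomics.store(i32, 0, bytes.length);              // positive integer: ready + length
  Atomics.notify(i32, 0);
};
```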
An extra detail ...
I need to figure out if using a SharedWorker to create a unique SharedArrayBuffer, usable across all main pages and their workers, would work. The goal is a predictable amount of RAM needed to orchestrate anywhere from one to dozens of tabs running PyScript, each with one to many workers. That also requires the Web Locks API, but if it can be done for SQLite I believe it can be done for our project too ... still surfing the edge of modern APIs, but hey, we just want the best by all means 😇
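For context, the Web Locks API part would look roughly like this (the lock name is hypothetical):

```js
// Only one tab/worker at a time enters the callback for this lock name,
// which is what would make coordinating a single shared buffer safe.
navigator.locks.request('pyscript-shared-buffer', async () => {
  // exclusive section: read/write the shared state without races
});
```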
We are close but not there yet
If you have followed recent community calls, you are probably bored by me demoing and benchmarking all the things, but here is an update on our current state.
Are there unknowns?
At the logical level, I expect performance to improve: that's a natural consequence of removing intermediate steps and halving the time it takes to ask for and retrieve data. On the other hand, I cannot concretely measure improvements until this has been done ... I mean, I could hack something together, but that still wouldn't provide real-world numbers, so maybe I should not focus on that.
Last, but not least, it's unclear if/how I can polyfill this new stack in sabayon too, but I also expect that dance to be even faster, because it won't need more than a single synchronous XMLHttpRequest to retrieve data, as opposed to `2` of them plus the whole roundtrip currently done to figure out who asked for what (multiple tabs using PyScript, as an example).

Conclusion
I am very glad I've managed to track down all the performance issues that were hidden either in implementations (the serializer we currently use) or in logic (usage that hadn't caught up with the most modern APIs), and that I'm now able to "draw" on the board what the current goal is. I hope everything will work as planned, and that it won't take too long to have a release showcasing performance that is finally reasonable enough to stop using the main thread. Please follow this space to hear about further progress, and don't be afraid to ask anything you'd like about these topics 👋