-
In theory Shadow could one day support sharding simulations across multiple machines. We had planned on building this feature for this reason - that renting or acquiring multiple smaller machines may be cheaper and more feasible than one huge one - but ultimately prioritized other features instead. It's a big feature that would probably need some kind of funding to get accomplished. Does mininet or other emulators support sharding across machines like this? Or maybe you're talking about not using emulation at all and just running on a real private network of 5k machines? Trying to orchestrate 5k physical machines sounds painful to me, though maybe with the right tools (puppet?) it could be manageable. Depending on the emulator (or if you don't use an emulator at all) you may lose other properties that Shadow provides:
Maybe - though it sounds like you'd be "modeling" whatever CPU speed and parallelism happens to be on the physical machines you're renting. You'd either want to rent machines beefy enough that whatever you're doing doesn't become CPU-bound (in which case your results won't be so different from Shadow's model of "CPU time is free"), or machines that accurately reflect how powerful you expect the "real" machines to be, which may be more expensive than "the cheapest Digital Ocean instance". As with sharding, modeling CPU time is a feature that could potentially be added to Shadow. I think modeling CPU time for a single-CPU machine wouldn't be terribly difficult, but it would add some complexity and performance overhead. Accurately modeling CPU time of a multi-core machine would be much more difficult, though, especially while preserving determinism.
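To make the single-CPU case a bit more concrete, here's a toy sketch of the general idea. This is not Shadow code; the class, the cost model, and the numbers are invented purely for illustration:

```python
# Hypothetical sketch of charging modeled CPU time to a per-host virtual clock.
# NOT Shadow's implementation: the names, cost model, and numbers are made up.

class SimHost:
    def __init__(self, name, cpu_ns_per_unit):
        self.name = name
        self.clock_ns = 0                        # virtual (simulated) time for this host
        self.cpu_ns_per_unit = cpu_ns_per_unit   # modeled cost of one "work unit"

    def run(self, work_units):
        """Advance the host's virtual clock by the modeled CPU cost of the work."""
        self.clock_ns += work_units * self.cpu_ns_per_unit

    def wait_network(self, latency_ns):
        """Network delays also advance virtual time, independent of wall-clock time."""
        self.clock_ns += latency_ns

# A "slow" and a "fast" host executing the same workload diverge in virtual time,
# even though the physical machine running the simulation is the same in both cases.
slow = SimHost("slow", cpu_ns_per_unit=500)
fast = SimHost("fast", cpu_ns_per_unit=100)
for host in (slow, fast):
    host.run(work_units=10_000)
    host.wait_network(latency_ns=2_000_000)
print(slow.clock_ns, fast.clock_ns)  # 7000000 vs 3000000 virtual nanoseconds
```

A multi-core model would additionally have to decide how concurrent threads contend for and get scheduled onto the modeled cores, which is where preserving determinism gets much harder.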
-
I spot a few problems with this analysis.
There are situations where emulation does beat Shadow, the big ones being: you have access to lots of small machines but not a large one, you're running software incompatible with Shadow, or you're trying to study a system bottlenecked on compute rather than the network (in which case accurate network properties matter less). But because emulators are subject to more physical constraints, and don't simulate CPU time, the idea of "it's real, therefore more accurate" doesn't actually pan out for network-constrained experiments.
-
I'm potentially a big fan of the Shadow approach because it is deterministic, i.e., if something weird happens in the network, anyone can reproduce and investigate exactly that case. However, there are some difficulties:
-
@sporksmith and @jtracey make great points. Let me expand on a couple. Let's think abstractly about scalability. Scalability is the property of a system to handle a growing amount of work. We want adding more work (i.e., virtual host density, network/packet density, process/compute density) to not introduce a departure from realism or from the intended results. The statement referenced from Shadow's README claims that the architecture implemented in Shadow is fundamentally more scalable than emulators, because it controls time and thus decouples an experiment from the hardware resources required to run it.

Think about this in the limit, as the work grows to very large numbers. Shadow can accommodate larger and larger experiments by making use of the existing CPUs on the machine and taking more real time to run a correct experiment, whereas an emulator will eventually max out the CPU resources and be forced to compress the workload; sharing CPU cycles across too many consumers causes unintended artifacts like dropped packets, unresponsive processes/sockets, etc. This graph from the architecture paper demonstrates the point:
The "unknown computational threshold" part means you don't know exactly when such problems arise, meaning you will have to run an emulation while very much underutilizing your hardware to make sure that a possibly bursty workload doesn't lead to slowdowns or introduce artifacts. You already argue that Shadow is cheaper, but I think your calculation is off quite a bit by how much cheaper. For Shadow, cloud services are a crazy way to spend money if you intend on running more than a couple of one-off experiments, because as mentioned you would be far better off buying your own beefy machine and paying only the electricity cost. I think you may have trouble splitting that beefy machine into 5000 nodes in order to run an emulator, so you'd need many machines and that will lead to additional cost in order to admin them. Or you do as you suggest and run in a cloud to avoid the admin costs, and then you have to take on increased node management costs to facilitate the experiment. I've never used puppet or ansible for this, but I've used other distributed management tools and it never quite goes as smoothly as promised, and any small issue that arises with your scripts can quickly snowball into a huge time investment. There are certainly scenarios where emulation can make sense. But most of the time this has to do with a lack of implemented features in the current version of Shadow and not some sort of fundamental weakness with the architecture. We just need the developer time to continue extending the capabilities to enable Shadow to reach its full potential (including CPU modeling, distributed simulation, full syscall support, etc.) and over time the arguments for Shadow will only get stronger. :) |
-
As mentioned in the Shadow doc:
I have a doubt about the scalability issue attributed to the emulation approach.
Recently, I ran Shadow to simulate a network of 5,000 libp2p nodes. It took 200 GB of RAM, and renting that kind of machine is quite costly: Digital Ocean, for example, charges $1,008/mo for one.
Now let's look at the emulation approach. What I need to do is spin up 5,000 machines, but each machine can be small. Take the cheapest Digital Ocean instance at $4/mo; 5,000 instances cost $20,000/mo. Also note that there is no time distortion or bandwidth issue, because each node has its own CPU, and if you spin up all the machines in the same data center you get 1 Gbps connections between them anyway.
The cost per month looks very different, but you don't need a month for a simulation; it can take only one day. So using Shadow costs $1,008/30 ≈ $33.60 per day, while the emulation approach costs $20,000/30 ≈ $666 per day.
$666 isn't expensive, right? And you get the real environment, with real CPU time as well, while Shadow can't simulate CPU time.
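For clarity, here is the arithmetic above spelled out (using the figures quoted in this thread; the one-day runtime for both approaches is the assumption being made here):

```python
# Per-day cost comparison using the figures quoted above.
shadow_monthly = 1_008          # $/mo for one large (~200 GB RAM) machine
emulation_monthly = 5_000 * 4   # $/mo for 5,000 instances at $4/mo each

days_per_month = 30
shadow_per_day = shadow_monthly / days_per_month        # ≈ $33.60
emulation_per_day = emulation_monthly / days_per_month  # ≈ $666.67

print(f"Shadow:    ${shadow_per_day:.2f}/day")
print(f"Emulation: ${emulation_per_day:.2f}/day "
      f"(~{emulation_per_day / shadow_per_day:.0f}x more)")
```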
I'm curious to know what your opinion is on this.
It seems to me that Shadow provides scalability only to some extent. If you are willing to pay a bit more, you can just get a real environment rather than a simulated environment.