-
In theory Shadow could one day support sharding simulations across multiple machines. We had planned on building this feature for this reason - that renting or acquiring multiple smaller machines may be cheaper and more feasible than one huge one - but ultimately prioritized other features instead. It's a big feature that would probably need some kind of funding to get accomplished. Does mininet or other emulators support sharding across machines like this? Or maybe you're talking about not using emulation at all and just running on a real private network of 5k machines? Trying to orchestrate 5k physical machines sounds painful to me, though maybe with the right tools (puppet?) it could be manageable. Depending on the emulator (or if you don't use an emulator at all) you may lose other properties that Shadow provides:
Maybe - though it sounds like you'd be "modeling" whatever CPU speed and parallelism happens to be on the physical machines you're renting. You'd either want to rent machines beefy enough that whatever you're doing doesn't become CPU-bound (in which case your results won't be so different from Shadow's model of "CPU time is free"), or machines that accurately reflect how powerful you expect the "real" machines to be, which may be more expensive than "the cheapest Digital Ocean instance". As with sharding, modeling CPU time is a feature that could potentially be added to Shadow. I think modeling CPU time for a single-CPU machine wouldn't be terribly difficult, but it would add some complexity and performance overhead. Accurately modeling CPU time of a multi-core machine would be much more difficult, though, especially while preserving determinism.
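To make the single-CPU case a bit more concrete, here's a toy sketch of the general idea. This is not Shadow code; the class, the cost model, and the numbers are invented purely for illustration:

```python
# Hypothetical sketch of charging modeled CPU time to a per-host virtual clock.
# NOT Shadow's implementation: the names, cost model, and numbers are made up.

class SimHost:
    def __init__(self, name, cpu_ns_per_unit):
        self.name = name
        self.clock_ns = 0                        # virtual (simulated) time for this host
        self.cpu_ns_per_unit = cpu_ns_per_unit   # modeled cost of one "work unit"

    def run(self, work_units):
        """Advance the host's virtual clock by the modeled CPU cost of the work."""
        self.clock_ns += work_units * self.cpu_ns_per_unit

    def wait_network(self, latency_ns):
        """Network delays also advance virtual time, independent of wall-clock time."""
        self.clock_ns += latency_ns

# A "slow" and a "fast" host executing the same workload diverge in virtual time,
# even though the physical machine running the simulation is the same in both cases.
slow = SimHost("slow", cpu_ns_per_unit=500)
fast = SimHost("fast", cpu_ns_per_unit=100)
for host in (slow, fast):
    host.run(work_units=10_000)
    host.wait_network(latency_ns=2_000_000)
print(slow.clock_ns, fast.clock_ns)  # 7000000 vs 3000000 virtual nanoseconds
```

A multi-core model would additionally have to decide how concurrent threads contend for and get scheduled onto the modeled cores, which is where preserving determinism gets much harder.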
-
I spot a few problems with this analysis.
There are situations where emulation does beat Shadow, the big ones being: you have access to lots of small machines but not a large one, you're running software incompatible with Shadow, or you're trying to study a system bottlenecked on compute rather than the network (in which case accurate network properties matter less). But because emulators are subject to more physical constraints, and don't simulate CPU time, the idea of "it's real, therefore more accurate" doesn't actually pan out for network-constrained experiments.
-
I'm potentially a big fan of the Shadow approach because it is deterministic, i.e., if something weird happens in the network, anyone can reproduce and investigate exactly that case. However, there are some difficulties:
-
@sporksmith and @jtracey make great points. Let me expand on a couple. Let's think abstractly about scalability. Scalability is the property of a system to handle a growing amount of work. We want adding more work (i.e., virtual host density, network/packet density, process/compute density) to not introduce a departure from realism or from the intended results. The statement referenced from Shadow's README claims that the architecture implemented in Shadow is fundamentally more scalable than emulators, because it controls time and thus decouples an experiment from the hardware resources required to run it.

Think about this in the limit, as the work grows to very large numbers. Shadow can accommodate larger and larger experiments by making use of the existing CPUs on the machine and taking more real time to run a correct experiment, whereas an emulator will eventually max out the CPU resources and be forced to compress the workload; sharing CPU cycles across too many consumers causes unintended artifacts like dropped packets, unresponsive processes/sockets, etc. This graph from the architecture paper demonstrates the point:
The "unknown computational threshold" part means you don't know exactly when such problems arise, meaning you will have to run an emulation while very much underutilizing your hardware to make sure that a possibly bursty workload doesn't lead to slowdowns or introduce artifacts. You already argue that Shadow is cheaper, but I think your calculation is off quite a bit by how much cheaper. For Shadow, cloud services are a crazy way to spend money if you intend on running more than a couple of one-off experiments, because as mentioned you would be far better off buying your own beefy machine and paying only the electricity cost. I think you may have trouble splitting that beefy machine into 5000 nodes in order to run an emulator, so you'd need many machines and that will lead to additional cost in order to admin them. Or you do as you suggest and run in a cloud to avoid the admin costs, and then you have to take on increased node management costs to facilitate the experiment. I've never used puppet or ansible for this, but I've used other distributed management tools and it never quite goes as smoothly as promised, and any small issue that arises with your scripts can quickly snowball into a huge time investment. There are certainly scenarios where emulation can make sense. But most of the time this has to do with a lack of implemented features in the current version of Shadow and not some sort of fundamental weakness with the architecture. We just need the developer time to continue extending the capabilities to enable Shadow to reach its full potential (including CPU modeling, distributed simulation, full syscall support, etc.) and over time the arguments for Shadow will only get stronger. :) |
-
As mentioned in the Shadow doc:
I have a doubt about the scalability issue attributed to the emulation approach.
Recently, I ran Shadow to simulate a network of 5,000 libp2p nodes. It took 200 GB of RAM, and renting that kind of machine is quite costly: Digital Ocean, for example, charges $1,008/mo for one.
Now let's look at the emulation approach. What I need to do is spin up 5,000 machines, but each machine can be small. Take the cheapest Digital Ocean instance at $4/mo; 5,000 instances cost $20,000/mo. Also note that there is no time distortion or bandwidth issue, because each node has its own CPU, and if you spin up all the machines in the same data center you get 1 Gbps connections between them anyway.
The cost per month looks very different, but you don't need a month for a simulation; it can take only one day. So using Shadow costs $1,008/30 ≈ $33.60 per day, while the emulation approach costs $20,000/30 ≈ $666 per day.
$666 isn't expensive, right? And you get the real environment, with real CPU time as well, while Shadow can't simulate CPU time.
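For clarity, here is the arithmetic above spelled out (using the figures quoted in this thread; the one-day runtime for both approaches is the assumption being made here):

```python
# Per-day cost comparison using the figures quoted above.
shadow_monthly = 1_008          # $/mo for one large (~200 GB RAM) machine
emulation_monthly = 5_000 * 4   # $/mo for 5,000 instances at $4/mo each

days_per_month = 30
shadow_per_day = shadow_monthly / days_per_month        # ≈ $33.60
emulation_per_day = emulation_monthly / days_per_month  # ≈ $666.67

print(f"Shadow:    ${shadow_per_day:.2f}/day")
print(f"Emulation: ${emulation_per_day:.2f}/day "
      f"(~{emulation_per_day / shadow_per_day:.0f}x more)")
```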
I'm curious to know what your opinion is on this.
It seems to me that Shadow provides scalability only to some extent. If you are willing to pay a bit more, you can just get a real environment rather than a simulated environment.