The New Age of Data-Intensive Applications
In his book Designing Data-Intensive Applications, Martin Kleppmann suggests that all data applications follow a similar pattern. Their goal is to read data,
run some transformation on it, and store the result somewhere, all to have a faster way to read that data later.
Lately, I've been seeing more and more data applications use object storage (e.g. AWS S3, Azure Blob Store, Google Cloud Storage) instead of the traditional file
system to store data, claiming it to be a much cheaper solution than the old alternatives.
In this post, we'll explore the benefits and drawbacks of this architecture, with 3 real-world examples:
Quickwit - A cheap log search engine as an alternative to Elasticsearch.
WarpStream - A cheap distributed log as an alternative to Kafka.
Neon - Serverless Postgres, a sort of alternative to AWS Aurora.
Choosing between the file system and object storage is critical to do before you write even a single line of code, as they have different APIs, performance characteristics,
costs, and deployment operations. The resulting architecture will turn out to be vastly different.
I hope this can serve as a guide to deciding whether object storage is the correct approach for your next data application.
Object storage is a service that works like a key-value database and is built to store huge amounts of unstructured data, called blobs ("blob" stands for "binary
large object"), very cheaply.
These services usually store data on the cheapest hardware available, using HDDs instead of SSDs (at least until SSDs become the cheaper option).
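At its core, the API is just writing and reading whole blobs by key, roughly:
PutObject(bucket, key, data)
GetObject(bucket, key)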
But you can also do more things, like listing files under a specific prefix:
ListObjects(bucket, prefix)
One of the limitations of the API, compared to file systems, is that there's no way to partially overwrite a file; you can only replace the entire object.
AWS S3, the most popular object storage, doesn't even have a MoveObject request; you must CopyObject(new) -> DeleteObject(old).
Also, in S3 there are no transaction guarantees other than read-after-write consistency (a GetObject after a successful PutObject will always return the data you just wrote). We'll soon see
how you can achieve ACID transactions over the different object storages.
With S3, you mostly pay for the storage itself and for requests; you don't pay much for the compute it takes AWS to keep S3 running. That's the magic of these object storage services. They allow for a pattern known
as the separation of storage and compute.
In a traditional storage solution (for example Elasticsearch), when a node runs out of disk space, the cluster scales out by adding another node. Thus, you pay for the accumulated CPU
time of all the nodes running just to hold your data.
What if most of the time, the data just sits there, accumulating dust, almost never to be queried? You pay for wasted CPU time.
This is why in the big data analytics world, we see products like Snowflake, Delta Lake and Apache Iceberg (we'll expand on these later) being so popular lately. It's
mainly because of costs.
Be aware that you also pay for network egress (data going outside the data center). On AWS specifically, it will cost ~$53 per TB, which, depending on
your workload, can be a deal breaker. As long as you stay in the same AZ (availability zone) though, you shouldn't pay for egress.
There's a pretty new service by Cloudflare called R2 which provides the same API as S3, without the egress costs.
"Why not use a mounted file system like EBS?", the reason is, again, cost. S3 is much cheaper in comparison (~8x cheaper per replica, 3 replicas will cost ~24x
more). There's also the simplicity of not needing to deal with resizing the volume. For a more thorough explanation I'll link WarpStream's Cloud Disks are (Really!)
Expensive blog post.
Stateless
The separation of storage and compute also means your service is stateless, allowing for simple scalability and operation:
You can add nodes / remove nodes by monitoring CPU / RAM / network usage.
Pay only for what you use.
Do so quickly. No need to synchronize the state with the cluster.
Scale to zero with AWS Lambda and pay nothing if there's usually no workload.
Or better yet, run on a cheap serverless edge solution like Cloudflare Workers.
Ability to break the monolith into different services.
For example a service for the write path, and a service for the read path.
You can restart pods in case of a bug, and know that they will start with a clean state.
Throw away that StatefulSet in k8s, and deploy a simple Deployment, just like your regular stateless web server.
Reliability
S3 is designed to provide 99.999999999% durability and 99.99% availability of objects over a given year, but all object storages have similar guarantees.
They are designed to withstand bit flips from cosmic rays and the occasional earthquake that destroys a data center.
But can they protect against human error? Probably not 🙃
They remove the need to manage replicas of data in your system, which is a very complex problem.
One of the ways they achieve this is by using something called erasure coding. It's a technique that breaks an object into X chunks and distributes the
storage of these chunks across different data centers. The beauty is that you only need any Y of the chunks to reconstruct the object, where Y < X.
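Real implementations use Reed-Solomon-style codes; as a toy illustration of why any Y out of X chunks are enough, here's a minimal Rust sketch with a single XOR parity chunk, which lets you rebuild any one lost chunk (so X = 4, Y = 3 in this example):

```rust
// XOR all chunks together to produce a parity chunk (all chunks must be equal size).
fn xor_parity(chunks: &[Vec<u8>]) -> Vec<u8> {
    let mut parity = vec![0u8; chunks[0].len()];
    for chunk in chunks {
        for (p, b) in parity.iter_mut().zip(chunk) {
            *p ^= *b;
        }
    }
    parity
}

fn main() {
    // Three equally sized data chunks of some object, plus one parity chunk.
    let data = vec![b"my d".to_vec(), b"og a".to_vec(), b"te f".to_vec()];
    let parity = xor_parity(&data);

    // Say the data center holding chunk 1 is gone: reconstruct it by XOR-ing
    // the surviving data chunks with the parity chunk.
    let recovered = xor_parity(&[data[0].clone(), data[2].clone(), parity]);
    assert_eq!(recovered, data[1]);
    println!("recovered: {}", String::from_utf8_lossy(&recovered));
}
```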
Performance
As object storages are designed to be cheap and durable, this comes at the cost of performance, specifically latency.
When running a GetObject request to download a blob, you can expect the median latency to be ~15ms, with P90 at ~60ms. Although these numbers have gotten better
over time and will continue to slowly improve, the latency of an NVMe SSD is 20-100μs, roughly 1000x faster.
The throughput is also not amazing by default, being somewhere around 50MB/s (while NVMe 5.0 can reach 12GB/s), but there's a trick to reach SSD-like throughput:
run multiple GetObject requests in parallel. For example, getting 20 blob files in parallel will give you about 1GB/s of throughput.
This trick works even when you want to download one big file. In S3, for example, there's a Range header you can provide to GetObject, where you specify the byte
offset and size to download. Split the download into chunks, and fire multiple GetObject requests concurrently: a bit of added complexity for the benefit of better
throughput.
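Here's a rough sketch of that chunked download, assuming the AWS SDK for Rust (aws-sdk-s3), tokio and anyhow; the 8 MiB chunk size is my own arbitrary choice, and the object size is assumed to be known up front (e.g. from a HeadObject request):

```rust
use aws_sdk_s3::Client;

const CHUNK_SIZE: i64 = 8 * 1024 * 1024; // 8 MiB per ranged GetObject

// Download one big object as many concurrent ranged GetObject requests.
// `size` is the total object size in bytes.
async fn ranged_download(
    client: &Client,
    bucket: &str,
    key: &str,
    size: i64,
) -> anyhow::Result<Vec<u8>> {
    let mut tasks = Vec::new();
    let mut offset: i64 = 0;
    while offset < size {
        let end = (offset + CHUNK_SIZE - 1).min(size - 1);
        // Each request asks for one byte range via the Range header.
        let request = client
            .get_object()
            .bucket(bucket)
            .key(key)
            .range(format!("bytes={offset}-{end}"))
            .send();
        tasks.push(tokio::spawn(async move {
            let response = request.await?;
            Ok::<_, anyhow::Error>(response.body.collect().await?.into_bytes())
        }));
        offset = end + 1;
    }

    // Await the chunks and stitch them back together in order.
    let mut object = Vec::with_capacity(size as usize);
    for task in tasks {
        object.extend_from_slice(&task.await??);
    }
    Ok(object)
}
```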
Usually, the cloud providers also offer a more expensive but lower-latency option. For example, AWS has S3 Express, which sits somewhere between regular S3 and EBS
in price, and it allows for tiering strategies without changing much of the architecture and code.
For example, if most user reads hit new data and you write to an immutable log, like an LSM tree, you can first write into the more expensive tier, and
then on compaction write to the cheaper one. Access to new data will be fast, without paying that much more, as most of the time the data sits in cold storage.
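Here's a hand-wavy sketch of that tiering flow; put_hot, get_hot, delete_hot and put_cold stand in for PutObject / GetObject / DeleteObject calls against the two tiers, and none of these names are a real API:

```rust
// New segments land in the expensive, low-latency tier (e.g. S3 Express).
fn write_new_segment(put_hot: impl Fn(&str, Vec<u8>), id: u64, data: Vec<u8>) {
    put_hot(&format!("hot/segment-{id}.seg"), data);
}

// Compaction merges many small hot segments into one big object in the cheap tier.
fn compact(
    get_hot: impl Fn(&str) -> Vec<u8>,
    delete_hot: impl Fn(&str),
    put_cold: impl Fn(&str, Vec<u8>),
    hot_keys: &[String],
    compacted_id: u64,
) {
    let mut merged = Vec::new();
    for key in hot_keys {
        merged.extend_from_slice(&get_hot(key));
    }
    put_cold(&format!("cold/segment-{compacted_id}.seg"), merged);

    // Only after the cold copy is durable, drop the hot copies.
    for key in hot_keys {
        delete_hot(key);
    }
}
```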
Be wary of rate limits though. S3, for example, states it supports 3,500 write requests per second and 5,500 read requests per second. Just remember that the rate limits
are applied per prefix, so storing data under different prefixes allows you to reach higher aggregate rate limits.
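A tiny sketch of spreading keys across prefixes (the 256-way split and key layout are arbitrary choices for illustration, not anything S3 requires):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Prepend a hash-derived shard prefix so writes spread over many prefixes,
// and the per-prefix rate limits add up.
fn sharded_key(key: &str) -> String {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    let shard = hasher.finish() % 256;
    // e.g. "a7/events/2024-06-01.json" instead of "events/2024-06-01.json"
    format!("{shard:02x}/{key}")
}

fn main() {
    println!("{}", sharded_key("events/2024-06-01.json"));
}
```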
Finally, ListObjects requests are notoriously slow, mostly because object storages are flat and not hierarchical. Prefixes are called prefixes and not directories
because that's exactly what they are: a prefix of the key (remember how I said object storages are similar to K/V DBs?). To accommodate that, you should not store a
bunch of small blobs, but a few big ones. I can't tell you the best absolute size to go for; experiment and benchmark for your use case.
Mostly cloud-based
Almost all object storages are services provided as part of the cloud. If you want your data to sit inside your company's internal servers (on-premise), it gets a bit more
complex, but it's definitely doable.
A popular solution for having an On-Prem object storage is to deploy MinIO using k8s or OpenShift.
MinIO strives to provide an API compatible with S3, but it has some differences. For example, in S3, a file and a directory can have the same name, while it is not
supported in MinIO.
That's why when writing automated tests for your service, you should consider using LocalStack's S3 instead of MinIO.
Both MinIO and LocalStack have testcontainers modules, greatly simplifying the setup of your tests.
ACID transactions?
If you have no idea what ACID is, you can go read my Database Fundamentals post.
Storing data in object storages while guaranteeing ACID transactions is possible, but it has to be carefully designed.
This is not a novel problem anymore; let's look at how open-source solutions have solved it.
Delta Lake (highly recommended white paper linked) is an open-source ACID table storage layer over cloud object storages, developed at Databricks. Think of it as
adding the ability to run SQL over data stored in object storages.
In chapter 3.2 of the white paper, they state that both Google Cloud Storage and Azure Blob Store support an atomic put-if-absent operation, so they simply use that
as the atomicity primitive. S3 is trickier, as it doesn't support any atomic put-if-absent / atomic rename operation, so you need to roll your own coordination service that uses
some concurrency primitive like locks, and route all S3 write requests through it.
A very clever business move by Databricks. If you write to S3 with Spark running in Databricks, the writes automatically go through a coordination
service implemented by them.
In Delta Lake version 1.2, they've included a way to use DynamoDB as the coordination service.
This method of using a database that already implements ACID transactions is common, as it also improves the performance when listing files.
The biggest disadvantage of this approach is that the availability and durability guarantees of your application are only as good as the worst guarantees your
different services provide. If you run a self-hosted Postgres as the coordination service, and the node crashes for any reason, it can mean you don't have access to the data anymore, or at least
not transactional and efficient access to it, depending on your architecture.
Iceberg, a competitor to Delta Lake developed by Netflix that is quickly becoming the industry standard, has an open-source coordination service called
Nessie, which also supports git-like branching on your data (very cool 😎).
Snowflake uses FoundationDB.
I would really like it if one day AWS added an IfMatch header that checks, right before the end of a PutObject request, whether the ETag has changed, and fails
the request if it has. I mean, there's already one in GetObject...
It would allow you to implement optimistic concurrency control right over the object storage by:
GetObject the current metadata file and remember its ETag.
PutObject the updated metadata file, passing the remembered ETag in IfMatch.
If the request fails on the IfMatch, repeat from the beginning.
This will be less efficient in most cases than using Postgres, as you would need to upload a whole metadata file for each change, but it's much simpler when you don't
need speed.
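To make that loop concrete, here's a minimal sketch against a hypothetical client; get_object_with_etag and put_object_if_match are made-up names, not a real SDK API:

```rust
// Returned by the hypothetical put when the IfMatch precondition fails.
struct PreconditionFailed;

fn update_metadata(
    get_object_with_etag: impl Fn(&str) -> (Vec<u8>, String),
    put_object_if_match: impl Fn(&str, Vec<u8>, &str) -> Result<(), PreconditionFailed>,
    key: &str,
    apply_change: impl Fn(Vec<u8>) -> Vec<u8>,
) {
    loop {
        // GetObject the current metadata file and remember its ETag.
        let (current, etag) = get_object_with_etag(key);

        // PutObject the whole updated file, but only if nobody else changed it
        // in the meantime (IfMatch on the remembered ETag).
        match put_object_if_match(key, apply_change(current), &etag) {
            Ok(()) => return,
            // Someone else won the race; retry with the fresh version.
            Err(PreconditionFailed) => continue,
        }
    }
}
```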
Tony from the future here: AWS has just announced conditional writes, really exciting. Do you think this post had an influence? Probably not 🙃
Implementation tips
It used to be that you would need to roll your own abstraction over object storages.
Since Apache OpenDAL was introduced, working with all the different object storages has become much simpler, as it provides a single unified API:
```rust
fn main() -> opendal::Result<()> {
    // `builder` configures whichever backend you pick (S3, GCS, Azure Blob, Fs, ...).
    let builder = opendal::services::Memory::default();

    let op = opendal::Operator::new(builder)?
        .layer(opendal::layers::LoggingLayer::default())
        .finish();

    Ok(())
}
```
OpenDAL also supports the local file system with opendal::services::Fs, allowing you to run your object-storage-native app without relying on an object storage. This can
be great for testing, for example. However, don't expect it to be as optimized as an app designed to run on the file system from the start.
Finally, because object storages don't allow partial writes, you should use immutable data structures like the LSM tree, where files, once written, are only ever read or
deleted, never modified.
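As a rough sketch of that write pattern (every name here, SegmentWriter included, is made up, and real LSM implementations use proper segment formats and indexes):

```rust
use std::collections::BTreeMap;

// Buffer writes in memory, and once the buffer is big enough, flush it as a
// new immutable, sorted segment under a fresh key. `write_blob` stands in for
// a PutObject call.
struct SegmentWriter<F: FnMut(String, Vec<u8>)> {
    buffer: BTreeMap<String, Vec<u8>>, // sorted in memory, like an LSM memtable
    buffered_bytes: usize,
    flush_threshold: usize,
    next_segment_id: u64,
    write_blob: F,
}

impl<F: FnMut(String, Vec<u8>)> SegmentWriter<F> {
    fn put(&mut self, key: String, value: Vec<u8>) {
        self.buffered_bytes += key.len() + value.len();
        self.buffer.insert(key, value);
        if self.buffered_bytes >= self.flush_threshold {
            self.flush();
        }
    }

    fn flush(&mut self) {
        // Serialize the sorted buffer into one blob; this toy version just
        // concatenates entries.
        let mut blob = Vec::new();
        for (k, v) in std::mem::take(&mut self.buffer) {
            blob.extend_from_slice(k.as_bytes());
            blob.push(b'\0');
            blob.extend_from_slice(&v);
            blob.push(b'\n');
        }
        // Each flush creates a brand-new object; existing ones are never modified.
        (self.write_blob)(format!("segments/{:08}.seg", self.next_segment_id), blob);
        self.next_segment_id += 1;
        self.buffered_bytes = 0;
    }
}
```

Compaction later merges several of these segments into bigger ones, again by writing brand-new objects rather than modifying existing ones.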
Real-world examples
Ok, we're done with the theory; let's look at some real-world data applications that have explicitly decided to use an object storage.
We'll look at what they gained, and what they lost in the process.
Quickwit
Quickwit is a cheap log search engine. It's open source (AGPL license) and written in Rust using tantivy (MIT license), a fast text search engine library similar to Apache Lucene (Elasticsearch's search engine).
Tantivy and Lucene are libraries that receive text, tokenize it, and write it into a data structure called an inverted index.
Let's say you provide them with the following two strings: "My dog ate my food!" and "My cat likes my dog". Here's the resulting inverted index:
| Word  | Documents |
|-------|-----------|
| my    | 0, 1      |
| dog   | 0, 1      |
| ate   | 0         |
| food  | 0         |
| cat   | 1         |
| likes | 1         |
The tokenizer may also stem words, converting "changing", "changed" and "change" into "chang", so searching for "change" will find "My dog is changing". The
inverted index may also store how many times a word comes up in each document, to rank more relevant results higher in a search (the algorithm typically used is BM25).
There's more to it, but I think you get the idea.
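Here's a minimal sketch of building such an index (lowercasing and splitting on non-alphanumeric characters only; real libraries like tantivy and Lucene also stem, store term frequencies for BM25, and use far more compact structures):

```rust
use std::collections::{BTreeMap, BTreeSet};

// Map each word to the set of document ids that contain it.
fn build_inverted_index(docs: &[&str]) -> BTreeMap<String, BTreeSet<usize>> {
    let mut index: BTreeMap<String, BTreeSet<usize>> = BTreeMap::new();
    for (doc_id, doc) in docs.iter().enumerate() {
        for word in doc
            .split(|c: char| !c.is_alphanumeric())
            .filter(|w| !w.is_empty())
        {
            index.entry(word.to_lowercase()).or_default().insert(doc_id);
        }
    }
    index
}

fn main() {
    let docs = ["My dog ate my food!", "My cat likes my dog"];
    let index = build_inverted_index(&docs);
    // Searching is now a lookup: which documents contain "dog"?
    println!("{:?}", index.get("dog")); // Some({0, 1})
}
```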
Image inspired by Quickwit 101 - Architecture of a distributed search engine on object storage.
Quickwit is much cheaper than Elasticsearch, roughly 10x cheaper (depending on the workload, of course), and you can control which nodes, and how many of them, are in the
indexing and searching clusters, tuning it to match your read / write workload.
As we've already discussed, each round trip to the object storage takes 1000x more time than a modern SSD. Quickwit has built a few measures to lower the latency, but latency remains its main drawback.
There is also another, more minor issue I found: no monitoring and alerting system. Minor, because it can be implemented in the future.
The bottom line is: if you don't need consistent sub 200ms search times, and you don't need an alerting system, then Quickwit is probably a good fit for you.
For most use cases, the drawbacks are so minor compared to the advantages, I truly think this is the future of log search engines.
After learning about Quickwit, I got hyped and started implementing something like it myself, using tantivy and OpenDAL: toshokan 😛
WarpStream
WarpStream is a cheap distributed log and streaming platform with an API compatible with Kafka. Or in simpler words: "Kafka but on an object storage".
WarpStream is not open source though. If you're from WarpStream (now Confluent?), please understand that I don't want support, I want to read the code when stuff doesn't work.
Its design differs from Kafka in a few ways:
No leader / followers.
Max latency starts at 250ms, as the WarpStream agents (the stateless service) buffer records in memory, and flush after 250ms have passed. This is only the
default and can be modified, but lowering the time to flush will mean it's less cost efficient (more PUT / GET requests to S3).
The WarpStream devs understand S3's drawbacks well; they have implemented multiple nice tricks to design around them:
Getting good throughput on S3 by distributing written records to multiple agents, and letting them write to S3 in parallel.
Data locality for reads. Each agent is elected to own specific split files. When an agent receives a request for a split file it doesn't own, it redirects the request to
the owner agent, which caches these files in memory. This is especially useful as the most common pattern in a stream is to read from the end, meaning most
read requests will want the latest file, which is most likely to be cached in memory.
Data locality for historical reads. Split files are combined, sorted and compacted to allow for better efficiency when reading old historical records serially one
after another.
Can be configured to write new data to S3 Express, which is the most likely data to be read in a stream, and write old data (after compaction) to standard S3.
As you can probably already guess, WarpStream is ~5-10x cheaper than Kafka, and much simpler to operate as it's stateless.
Other than being new and still mostly unproven, it has a pretty big problem. Try to guess what it is 😊
Latency.
Can they improve it? Maybe. But probably not near the latency of Kafka.
So when does it fit? Mostly in high-throughput workloads, where you don't care about a second of latency, and where you have enough throughput to start worrying about costs. For example,
streaming security logs (e.g. AWS CloudTrail) into Quickwit to be searched by security analysts.
Neon
Neon is an open-source (Apache license) serverless Postgres.
They took Postgres and made it work with an architecture that stores the actual data in an object storage instead of local disk.
Postgres stores transaction logs in a data structure called the WAL (Write-Ahead Log). Neon streams log entries from this WAL to a service they call Safekeeper,
using the native Postgres replication protocol. Safekeeper nodes provide durability and fault tolerance using a custom-made Paxos implementation, where the Postgres nodes are the
proposers and the safekeepers are the acceptors (verified by this TLA+ spec).
Once logs are accepted by the safekeepers, they are streamed to the next service, called the page server. The page server behaves like an LSM tree: it buffers logs
until they reach 1GB in size, and then flushes them as a new immutable file into the object storage. Of course, just like with a usual LSM tree, you can query these
logs even while they are buffered.
All read requests go directly to the page server with a page id and an LSN (Log Sequence Number). The LSN is a monotonically increasing number that identifies a
specific entry in the WAL. So you know what that means, right?
Neon is an event source of Postgres' WAL! It has history, meaning you can have time-traveling queries and copy-on-write branches of your data. Or in other words: "git
branching for your data".
That said, be careful not to treat it as a general-purpose distributed database. For example, JOIN queries are not distributed; they run on one of the stateless
Postgres services. Neon is more similar to a single-writer, multiple-read-replicas kind of architecture.
I don't know whether I can recommend this one as a replacement for your usual OLTP workloads, as these must be super quick. It looks promising, but I'd have to
play around with it more.
Conclusion
Ok, hopefully you've learned about object storages and when they might be a good or bad fit, by examining how they work at a high level and by looking
at 3 real solutions already running in the wild.
Think a bit, which of the 3 did you like the most? Why?
Object storage solutions can definitely be market-disrupting when applied to the right problem.
Don't sleep on them: for your next open-source database startup, think about whether they might be the right fit!