Don't Choose Your Global Database From a Diagram

Recently, the team at Kode invited me to spend an afternoon with them thinking through their data architecture. If you don't know them, Kode is the hardware startup behind Kode Dot, an all-in-one pocket device for makers with its own operating system, kodeOS, and its own app catalog, all open source. They already operate in the United States and Europe, and few hardware startups get to run a fleet across two continents this early.

The Kode device on a workbench, surrounded by sensors, cables and electronic components

On the table was a question that sounds like the ones that matter: Spanner? Multi-region Aurora? DynamoDB Global Tables? Bigtable? Four powerful names, a pile of hours reading vendor documentation, and a fleet of devices spread across two continents that has to be served well.

We spent a good while turning it over. And to be clear, it wasn't a question thrown out lightly: we sat down to go through the options together, to narrow them down and to read the trade-offs of each one carefully. But the most useful conversation we had wasn't choosing one of the four. It was realizing that we didn't yet have the data to choose any of them.

That's what I want to tell you about today.

The urge: "we're global, let's build the global thing"

There's a trap that's very easy to fall into when your product crosses borders: confusing "I have users on two continents" with "I need a globally consistent database."

They're different things. The first is a fact about your business. The second is a wildly expensive engineering decision, and not just on the invoice. A piece like Spanner costs you money, yes, but above all it costs you in operational and cognitive overhead: a consistency model your team has to understand, monitor and debug when something goes wrong at three in the morning.

And that urge has fine print: it leads you to pick the most expensive, heaviest model without a single data point about what your product actually needs. You decide from the gut feeling of "let's go all in" instead of from what measuring would tell you.

I'm telling you this as someone it's happened to. I've fallen in love more than once with a piece of infrastructure for the spectacular problem it promised to solve, not for the problem in front of me. It's not pleasant to admit, but there it is.

Every vendor sells you their worst case, not yours

Spanner exists because Google had a strong-consistency problem at planetary scale. DynamoDB Global Tables exists for multi-region, multi-active writes with eventual replication. Bigtable exists for brutal time-series throughput. Aurora Global Database exists for local reads and disaster recovery across regions while keeping a single write primary. They're all excellent pieces solving real problems.

But those problems are theirs, or their biggest customers'. The marketing page describes the edge case that justifies the tool, not your case. And if you choose by reading the marketing page, you end up sizing your architecture for a problem you probably don't have.

The question isn't "which of these is best?" It's "which of these problems is mine?" And to answer that you need to look at your own data.

The question almost nobody asks: does your data partition on its own?

In a fleet of IoT devices, data almost always partitions naturally by device and by region. A device in Texas and one in Valencia rarely need a consistent transaction between them. Each one reports its telemetry, receives its commands, stores its state.

If your data is already naturally partitioned, a huge part of the value of a global database disappears. Strong global consistency solves the problem of coordinating writes that compete for the same piece of data from two places at once. If that almost never happens in your domain, you're paying for a guarantee you don't use.

I don't want to pretend this is always the case. There are products where the data really is global and shared (a balance, a single inventory, a collaborative document). But it's worth looking closely before taking it for granted.
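One way to look closely is to take a sample of your write log and count how many keys are actually written from more than one region. This is a minimal sketch, assuming a hypothetical log shaped as (key, region) pairs; the key names and the sample data are invented for illustration.

```python
from collections import defaultdict


def cross_region_write_ratio(writes):
    """Return the fraction of keys that were written from more than one region.

    `writes` is an iterable of (key, region) pairs. That shape is an
    assumption; adapt it to whatever your logs actually contain.
    """
    regions_by_key = defaultdict(set)
    for key, region in writes:
        regions_by_key[key].add(region)
    if not regions_by_key:
        return 0.0
    contested = sum(1 for regions in regions_by_key.values() if len(regions) > 1)
    return contested / len(regions_by_key)


# Illustrative data only: three device keys written locally and one
# account key written from both regions.
sample_writes = [
    ("device:tx-001/state", "us"),
    ("device:tx-001/state", "us"),
    ("device:vlc-042/state", "eu"),
    ("device:vlc-007/state", "eu"),
    ("account:alice/settings", "us"),
    ("account:alice/settings", "eu"),
]

ratio = cross_region_write_ratio(sample_writes)
print(f"{ratio:.0%} of keys see writes from more than one region")
```

If that ratio stays close to zero over a representative period, that is a hint (not proof) that strong global consistency would be guarding against conflicts you rarely have.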

There's no single answer, there's a profile per use case

And here, for me, is the crux: you don't have one database problem, you have several, and each one tolerates different things.

Think about it by use case. Registering a device, the app catalog of its operating system, the telemetry going up, the commands or OTA updates coming down. Each with its own tolerance for latency and consistency.

Telemetry tolerates eventual replication and a couple of hundred milliseconds of extra latency without anyone losing sleep. An interactive command, where the user taps something and waits for a response, does not. Cramming all of that into a single global decision ("which database do we use?") is exactly the mistake: you're averaging requirements that have nothing in common.

When you separate by use case, the question stops being "Spanner or DynamoDB" and becomes "what does each flow need." Which is a much better question.
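To make that split explicit, it can help to write one profile per flow instead of one requirement for "the database." The sketch below is only an illustration: the class, the field names and every number are placeholders I made up, not Kode's real budgets. The only things taken from the text are the direction of the differences (telemetry is tolerant, the interactive command is not) and the catalog being fine with slightly stale replicas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowProfile:
    name: str
    p95_budget_ms: int   # placeholder latency budget, to be replaced by measured needs
    eventual_ok: bool    # True if eventual replication is acceptable for this flow


# Hypothetical values for illustration only.
PROFILES = (
    FlowProfile("device_registration", p95_budget_ms=1000, eventual_ok=False),
    FlowProfile("app_catalog", p95_budget_ms=500, eventual_ok=True),
    FlowProfile("telemetry", p95_budget_ms=2000, eventual_ok=True),
    FlowProfile("interactive_command", p95_budget_ms=150, eventual_ok=False),
)


def strictest_flow(profiles):
    """Return the flow with the tightest latency budget."""
    return min(profiles, key=lambda p: p.p95_budget_ms)


def flows_needing_fresh_reads(profiles):
    """Return the names of flows that cannot live with eventual replication."""
    return [p.name for p in profiles if not p.eventual_ok]


print(strictest_flow(PROFILES).name)
print(flows_needing_fresh_reads(PROFILES))
```

Once the table exists, the conversation shifts from the product as a whole to the one or two rows that are actually demanding.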

One region, but properly instrumented

The proposal that came out of the conversation with Kode was to start with a single region. It takes more judgment to defend that plan when you have global ambition than to sign off on Spanner's budget, and they got it right away. But I want to be precise about what that means, because "start with one region" gets confused with "build a toy MVP and we'll see."

It's not that. It's building one region well, and instrumenting it to measure what matters: p50, p95 and p99 latencies for each use case, seen from real clients on both continents. You want to know what happens to a device in Europe talking to a backend in the United States, and to know it from the numbers you see, not from what you think is going to happen.
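As a rough idea of what that instrumentation boils down to, here is a minimal sketch that groups latency samples by use case and client region and computes p50, p95 and p99 with Python's standard library. The sample shape (use_case, client_region, latency_ms) and the synthetic data are assumptions for the example; in practice you would feed it measurements reported by real clients, or let your metrics system do the equivalent.

```python
from collections import defaultdict
from statistics import quantiles


def latency_percentiles(samples):
    """Group samples by (use_case, client_region) and return p50/p95/p99.

    `samples` is an iterable of (use_case, client_region, latency_ms)
    tuples, a hypothetical shape for client-side measurements.
    """
    grouped = defaultdict(list)
    for use_case, region, latency_ms in samples:
        grouped[(use_case, region)].append(latency_ms)

    report = {}
    for key, values in grouped.items():
        if len(values) < 2:
            continue  # statistics.quantiles needs at least two data points
        cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
        report[key] = {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
    return report


# Synthetic latencies (0..100 ms), only to show the call.
synthetic = [("interactive_command", "eu", float(ms)) for ms in range(101)]
report = latency_percentiles(synthetic)
print(report[("interactive_command", "eu")])
```

Keeping the region of the client in the key is the point: an aggregate p95 across both continents would hide exactly the transatlantic difference you are trying to see.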

What you discover that way, you couldn't have known from the drawing. Maybe it turns out telemetry handles the transatlantic hop just fine and the only flow that hurts is the interactive command. Or the other way around. The diagram won't tell you.

And this isn't theory I made up for the post. At Mercadona Tech, where I work as a Staff Engineer, we process on the order of 25,000 orders a day, and the architecture decisions that actually changed something didn't come from a pretty whiteboard. They came from looking at numbers in production and discovering that the bottleneck was where nobody would have put it on the initial diagram.

Why the answer is usually a hybrid

With that data, the reasonable architecture is almost never "one magic piece for everything." It's a hybrid: a regional MQTT broker for traffic that is local and hot, a partitioned store where data lives in its own region, and eventual replication for the global, cold data you only need for aggregates and analytics.

And the app catalog is the opposite case to the partitioned store: it's written in one place and read everywhere. A read replica per region gives you local latency without fighting consistency, because a catalog that updates once a week handles its replicas running a few seconds behind without breaking a sweat.

Hybrid sounds like "they don't dare to choose." It's the opposite: it's removing complexity where it adds nothing and putting it where the user feels the difference.

Regional isn't free either

And careful, the regional approach I just defended isn't free either. The moment you start drawing a partitioned store by region, with one instance per region instead of a global database, its own problems show up.

The first is routing: every time a request comes in, someone has to know which region that user belongs to in order to send it to the right instance. That's one more piece to maintain, and one more that can fail.

The second is uglier: what happens when a user changes country? Their data no longer belongs in the region where it lives. You have to migrate it from one region to another without losing anything or leaving it half done, and decide what happens during the window while it's moving. It's not impossible, but it's real work, and it's exactly the kind of complexity you don't see on the diagram and do suffer in production.
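To give those two costs a shape, here is a small sketch of a region directory: it answers "which region holds this owner's data?" and tracks an open migration. Everything in it is hypothetical, including the class name and the policy of pausing writes while data is moving, which is just one of several possible answers to the window problem. The actual copy and verification of the data are outside the sketch.

```python
class RegionDirectory:
    """Tracks which region holds each owner's data (hypothetical design)."""

    def __init__(self, regions):
        self._regions = set(regions)
        self._home = {}       # owner_id -> region currently holding the data
        self._moving_to = {}  # owner_id -> target region while a migration is open

    def assign(self, owner_id, region):
        self._check(region)
        self._home[owner_id] = region

    def region_for(self, owner_id):
        """Region a request should be routed to; raises KeyError if unknown."""
        return self._home[owner_id]

    def accepts_writes(self, owner_id):
        """Assumed policy: writes are paused while the owner's data is moving."""
        return owner_id in self._home and owner_id not in self._moving_to

    def begin_migration(self, owner_id, target_region):
        self._check(target_region)
        if owner_id not in self._home:
            raise KeyError(owner_id)
        if target_region == self._home[owner_id]:
            raise ValueError("owner already lives in that region")
        self._moving_to[owner_id] = target_region

    def complete_migration(self, owner_id):
        """Flip the home region; call only after the copy has been verified."""
        self._home[owner_id] = self._moving_to.pop(owner_id)

    def _check(self, region):
        if region not in self._regions:
            raise ValueError(f"unknown region: {region}")


directory = RegionDirectory({"us", "eu"})
directory.assign("user-123", "us")
directory.begin_migration("user-123", "eu")
print(directory.region_for("user-123"), directory.accepts_writes("user-123"))
directory.complete_migration("user-123")
print(directory.region_for("user-123"), directory.accepts_writes("user-123"))
```

Even in a toy like this, the directory itself becomes a component that every request depends on, which is the "one more piece that can fail" from the first point.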

I bring it up because it almost never comes up in these conversations. There's no free option. There are costs you see coming and decide to pay, and costs you run into late, when it's no longer the time to fix them. And again, what makes the difference is how much you'd measured before choosing.

The migration path is what separates done-right-simple from naive-simple

And someone will say: "sure, you start simple and then you get stuck with an architecture that doesn't scale." It's the right objection, and you have to take it seriously.

The trap of starting simple is real. The difference between doing it well and doing it naively is one single thing: having the migration trigger written down in advance. Which concrete metric, which threshold, which move sets off the jump to multi-region. For example: "if the p95 of the interactive command in Europe stays above X for a sustained period Y, we move that specific flow to a second region." And that agreement can't live in anyone's memory: write it down in an ADR (an ADR generator helps if you don't have a template handy).
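A trigger like that is simple enough to express as code next to the ADR, so nobody has to argue later about whether it fired. This is a minimal sketch under my own assumptions: the class and field names are hypothetical, the threshold and window count are placeholders for the X and Y above, and "sustained" is interpreted as a number of consecutive measurement windows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigrationTrigger:
    flow: str
    region: str
    p95_threshold_ms: float  # the "X" agreed in the ADR (placeholder value below)
    sustained_windows: int   # the "Y", as consecutive measurement windows

    def fired(self, p95_by_window):
        """Return True if p95 stayed above the threshold for enough consecutive windows."""
        streak = 0
        for p95 in p95_by_window:
            streak = streak + 1 if p95 > self.p95_threshold_ms else 0
            if streak >= self.sustained_windows:
                return True
        return False


# Placeholder numbers, not a recommendation.
trigger = MigrationTrigger(
    flow="interactive_command",
    region="eu",
    p95_threshold_ms=300.0,
    sustained_windows=3,
)

print(trigger.fired([310.0, 200.0, 320.0, 330.0]))  # one good window resets the streak
print(trigger.fired([310.0, 320.0, 330.0]))
```

Requiring consecutive windows rather than a single spike is a deliberate choice here: one bad hour shouldn't set off a multi-region migration.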

When you have that written down, starting with one region isn't technical debt. It's an option you exercise the day the data calls for it, on the specific subsystem that calls for it, and not on everything at once.

And there's another side that also has to be put on the table: doing everything right from the start assumes you already know what "right" is. Migrating a specific subsystem when you have data usually comes out cheaper than operating, paying for and maintaining a global architecture for two years that you might not have needed. But not always. If you know from minute zero that you're going to need the heavy stuff, forcing "start simple" would be just as dogmatic as the opposite.

What I take away

I don't have a universal rule to give you. Be wary of anyone who does.

What I do have is a question I always ask, at Mercadona Tech and in conversations like the one with Kode, before making a big architecture decision: what data am I missing to decide this well, and what does it cost to get it? Almost always the data is far cheaper than the decision. A region instrumented for a few weeks costs a fraction of what it costs to pick the wrong global database and live with it.

Today any LLM rattles off the differences between Spanner and DynamoDB in ten seconds. That's stopped being the scarce thing. What's scarce is knowing which question to ask before choosing, and having the discipline to measure instead of assume. Code, and infrastructure, are a means. What you're designing is the product.

The diagram doesn't know your latency. Your users do.

A question for you: how many of your team's big architecture decisions were made looking at real numbers, and how many were made looking at a diagram and a vendor's page? I don't ask it with judgment. I ask because I think the answer says a lot about where the real cost is.
