Death By a Thousand Microservices

The software industry is learning once again that complexity kills

By: Andrei Taranchenko (LinkedIn)

Created: 10 Sep 2023

Updated: 11 Feb 2024

The Church of Complexity

There is a pretty well-known sketch in which an engineer is explaining to the project manager how an overly complicated maze of microservices works in order to get a user’s birthday - and fails to do so anyway. The scene accurately describes the absurdity of the state of the current tech culture. We laugh, and yet bringing this up in a serious conversation is tantamount to professional heresy, rendering you borderline un-hirable.

How did we get here? How did our aim become not addressing the task at hand but instead setting a pile of cash on fire by solving problems we don’t have?

Trigger warning: Some people understandably got salty when I name-checked JavaScript and NodeJS as a source of the problem, but my point really was more about the dangers of hermetically sealed software ecosystems that seem hell-bent on re-learning the lessons that we had just finished learning. We ran into the complexity wall before and reset - otherwise we'd still be using CORBA and SOAP. These air-tight developer bubbles are a wrecking ball on the entire industry, and it takes about a full decade to swing.

The perfect storm

There are a few events in recent history that may have contributed to the current state of things. First, a whole army of developers writing JavaScript for the browser started self-identifying as “full-stack”, diving into server development and asynchronous code. JavaScript is JavaScript, right? What difference does it make what you create using it - user interfaces, servers, games, or embedded systems? Right? Node was still kind of a learning project of one person, and the early JavaScript was a deeply problematic choice for server development. Pointing this out to still-green server-side developers usually resulted in a lot of huffing and puffing. This is all they knew, after all. The world outside of Node effectively did not exist, the Node way was the only way, and so this was the genesis of the stubborn, dogmatic thinking that we are dealing with to this day.

And then, a steady stream of FAANG veterans started merging into the river of startups, mentoring the newly-minted and highly impressionable young JavaScript server-side engineers. The apostles of the Church of Complexity would assertively claim that “how they did things over at Google” was unquestionable and correct - even if it made no sense with the given context and size. What do you mean you don’t have a separate User Preferences Service? That just will not scale, bro!

But, it’s easy to blame the veterans and the newcomers for all of this. What else was happening? Oh yeah - easy money.

What do you do when you are flush with venture capital? You don’t go for revenue, surely! On more than one occasion I received an email from management, asking everyone to be in the office, tidy up their desks and look busy, as a clowder of Patagonia vests was about to be paraded through the space. Investors needed to see explosive growth, but not in profitability, no. They just needed to see how quickly the company could hire ultra-expensive software engineers to do … something.

And now that you have these developers, what do you do with them? Well, they could build a simpler system that is easier to grow and maintain, or they could conjure up a monstrous constellation of “microservices” that no one really understands. Microservices - the new way of writing scalable software! Are we just going to pretend that the concept of “distributed systems” never existed? (Let’s skip the whole parsing of nuances about microservices not being real distributed systems).

Back in the days when the tech industry was not such a bloated farce, distributed systems were respected, feared, and generally avoided - reserved only as the weapon of last resort for particularly gnarly problems. Everything with a distributed system becomes more challenging and time-consuming - development, debugging, deployment, testing, resilience. But I don’t know - maybe it’s all super easy now because toooollling.
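
To put a rough number on just the resilience part, here is a back-of-the-envelope sketch in Python. It assumes that a request must pass through every hop, that hops fail independently, and that each one is up 99.9% of the time. Those figures are illustrative assumptions, not measurements from any real system.

```python
def chain_availability(per_service: float, hops: int) -> float:
    """Availability of a request that needs every one of `hops` services to be up.

    Assumes independent failures, which is optimistic: real outages tend to correlate.
    """
    return per_service ** hops


HOURS_PER_YEAR = 24 * 365

for hops in (1, 5, 20):
    availability = chain_availability(0.999, hops)  # 0.999 is an assumed figure
    downtime_hours = (1 - availability) * HOURS_PER_YEAR
    print(f"{hops:>2} hops: {availability:.2%} available, "
          f"about {downtime_hours:.0f} hours/year of failed requests")
```

Under those assumptions, a request that touches 20 services sees roughly 98% availability even though every individual service claims three nines. An in-process function call does not pay this tax.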

There is no standard tooling for microservices-based development - there is no common framework. Working on distributed systems has gotten only marginally easier in the 2020s. The Dockers and the Kuberneteses of the world did not magically take away the inherent complexity of a distributed setup.

I love referring to this summary of 5 years of startup audits, as it is packed with common-sense conclusions:

… the startups we audited that are now doing the best usually had an almost brazenly ‘Keep It Simple’ approach to engineering. Cleverness for cleverness sake was abhorred. On the flip side, the companies where we were like ”woah, these folks are smart as hell” for the most part kind of faded.

Generally, the major foot-gun that got a lot of places in trouble was the premature move to microservices, architectures that relied on distributed computing, and messaging-heavy designs.

Literally - “complexity kills”.

The audit revealed an interesting pattern, where many startups experienced a sort of collective imposter syndrome while building straightforward, simple, performant systems. There is a stigma attached to not starting out with microservices on day one - no matter the problem. “Everyone is doing microservices, yet we have a single Django monolith maintained by just a few engineers, and a MySQL instance - what are we doing wrong?” The answer is almost always “nothing”.

Likewise, seasoned engineers very often experience hesitation and inadequacy in today’s tech world, and the good news is that, no - it’s probably not you. It’s common for teams to pretend that they are doing “web scale”, hiding behind libraries, ORMs, and cache - confident in their expertise (they crushed that Leetcode!), yet they may not even be aware of database indexing basics. You are operating in a sea of unjustified overconfidence, waste, and Dunning-Kruger, so who is really the imposter here?
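
For anyone who wants to check their own indexing basics, here is a minimal sketch using Python's built-in sqlite3 module. The `users` table, its columns, and the index name are made up for illustration. The point is only that the same lookup goes from a full table scan to an index search once the index exists.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable detail strings from EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the last column holds the plan description


conn = sqlite3.connect(":memory:")
# Hypothetical schema, invented for this example.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, birthday TEXT)")
conn.executemany(
    "INSERT INTO users (email, birthday) VALUES (?, ?)",
    [(f"user{i}@example.com", "1990-01-01") for i in range(10_000)],
)

lookup = "SELECT birthday FROM users WHERE email = ?"
args = ("user42@example.com",)

plan_before = query_plan(conn, lookup, args)  # no index on email: every row is scanned
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = query_plan(conn, lookup, args)   # the planner can now search the index

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan differs between SQLite versions, but the first plan should mention a scan of `users` and the second should name `idx_users_email`. Other databases expose the same information through their own EXPLAIN commands.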

There is nothing wrong with a monolith

The idea that you cannot grow without a system that looks like the infamous slide of Afghanistan war strategy is a myth.

Dropbox, Twitter, Netflix, Facebook, GitHub, Instagram, Shopify, StackOverflow - these companies and others started out as monolithic code bases. Many have a monolith at their core to this day. StackOverflow makes it a point of pride how little hardware they need to run the massive site. Shopify is still a Rails monolith, leveraging the tried and true Resque to process billions of tasks.

WhatsApp went supernova with their Erlang monolith and a relatively small team. How?

WhatsApp consciously keeps the engineering staff small to only about 50 engineers.

Individual engineering teams are also small, consisting of 1 - 3 engineers and teams are each given a great deal of autonomy.

In terms of servers, WhatsApp prefers to use a smaller number of servers and vertically scale each server to the highest extent possible.

Instagram was acquired for billions - with a crew of 12.

And do you imagine Threads as an effort involving a whole Meta campus? Nope. They followed the Instagram model, and this is the entire Threads team:

[Image: the Threads team. Credit: Substack - The Pragmatic Engineer]

Perhaps claiming that your particular problem domain requires a massively complicated distributed system and an open office stuffed to the gills with turbo-geniuses is just crossing over into arrogance rather than brilliance?

Don’t solve problems you don’t have

It’s a simple question - what problem are you solving? Is it scale? How do you know how to break it all up for scale and performance? Do you have enough data to show what needs to be a separate service and why? Distributed systems are built for size and resilience. Can your system scale and be resilient at the same time? What happens if one of the services goes down or comes to a crawl? Just scale it up? What about the other services that are going to get hit with traffic? Did you war-game the endless permutations of things that can and will go wrong? Is there backpressure? Circuit breakers? Queues? Jitter? Sensible timeouts on every endpoint? Are there fool-proof guards to make sure a simple change does not bring everything down? The knobs you need to be aware of and tune are endless, and they are all specific to your system’s particular signature of usage and load.
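
To make just two of those knobs concrete, below is a minimal sketch of a circuit breaker and of exponential backoff with full jitter. The class and function names, thresholds, and defaults are all hypothetical and picked for illustration. A production version would also have to deal with concurrency, metrics, and per-endpoint timeouts.

```python
import random
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call a dependency it considers unhealthy."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to fail fast before a trial call
        self.clock = clock                # injectable so the logic can be tested
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: allow one trial call. A single failure re-opens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the breaker fully
        return result


def backoff_delays(attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield 'full jitter' retry delays: random in [0, min(cap, base * 2**n))."""
    for n in range(attempts):
        yield rng() * min(cap, base * 2 ** n)


if __name__ == "__main__":
    breaker = CircuitBreaker(max_failures=2, reset_after=10.0)

    def flaky():
        raise ConnectionError("downstream service is down")

    for delay in backoff_delays(4):
        try:
            breaker.call(flaky)
        except CircuitOpenError as exc:
            print("breaker open:", exc)
            break
        except ConnectionError:
            print(f"call failed; a real client would sleep {delay:.3f}s before retrying")
```

Even in this toy form there are already four numbers to tune (`max_failures`, `reset_after`, `base`, `cap`), and the right values depend on the traffic pattern of each caller and each dependency.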

The truth is that most companies will never reach the massive size that will actually require building a true distributed system. Cosplaying Amazon and Google - without their scale, expertise, and endless resources - is very likely just an egregious waste of money and time. Religiously following all the steps from an article called “Ten morning habits of very successful people” is not going to make you a billionaire.

The only thing harder than a distributed system is a BAD distributed system.

“But each team… but separate… but API”

Trying to shove a distributed topology into your company’s structure is a noble effort, but it almost always backfires. It’s a common approach to break up a problem into smaller pieces and then solve those one by one. So, the thinking goes, if you break up one service into multiple ones, everything becomes easier.

The theory is sweet and elegant - each microservice is being maintained rigorously by a dedicated team, walled off behind a beautiful, backward-compatible, versioned API. In fact, this is so solid that you rarely even have to communicate with that team - as if the microservice was maintained by a 3rd party vendor. It’s simple!

If that doesn’t sound familiar, that’s because this rarely happens. In reality, our Slack channels are flooded with messages from teams communicating about releases, bugs, configuration updates, breaking changes, and PSAs. Everyone needs to be on top of everything, all the time. And if that wasn’t great, it’s normal for one already-slammed team to half-ass multiple microservices instead of doing a great job on a single one, often changing ownership as people come and go.

In order to win the race, we don’t build one good race car - we build a fleet of shitty golf carts.

What you lose

There are multiple pitfalls to building with microservices, and often that minefield is either not fully appreciated or simply ignored. Teams spend months writing highly customized tooling and learning lessons not related at all to the core product. Here are just some often overlooked aspects…

Say goodbye to DRY

After decades of teaching developers to write Don’t Repeat Yourself code, it seems we just stopped talking about it altogether. Microservices by default are not DRY, with every service stuffed with redundant boilerplate. Very often the overhead of such “plumbing” is so heavy, and the size of the microservices is so small, that the average instance of a service has more “service” than “product”. So what about the common code that can be factored out?

Have a common library?
  How does the common library get updated? Keep different versions everywhere?
  Force updates regularly, creating dozens of pull requests across all repositories?
  Keep it all in a monorepo? That comes with its own set of problems.
Allow for some code duplication?
  Forget it, each team gets to reinvent the wheel every time.

Each company going this route faces these choices, and there are no good “ergonomic” options - you have to choose your version of the pain.

Developer ergonomics will crater

“Developer ergonomics” is the friction, the amount of effort a developer must go through in order to get something done, be it working on a new feature or resolving a bug.

With microservices, an engineer has to have a mental map of the entire system in order to know what services to bring up for any particular task, what teams to talk to, whom to talk to, and what about. The “you have to know everything before doing anything” principle. How do you keep on top of it? Spotify, a multi-billion dollar company, probably spent non-negligible internal resources to build Backstage, software for cataloging its endless systems and services.

This should at least give you a clue that this game is not for everyone, and the price of the ride is high. So what about the tooooling? The Not Spotifies of the world are left with MacGyvering their own solutions, the robustness and portability of which you can probably guess.

And how many teams actually streamline the process of starting a YASS - “yet another stupid service”? This includes:

Developer privileges in GitHub/GitLab
Default environment variables and configuration
CI/CD
Code quality checkers
Code review settings
Branch rules and protections
Monitoring and observability
Test harness
Infrastructure-as-code

And of course, multiply this list by the number of programming languages used throughout the company. Maybe you have a usable template or a runbook? Maybe a frictionless, one-click system to launch a new service from scratch? It takes months to iron out all the kinks with this kind of automation. So, you can either work on your product, or you can be working on toooooling.

Integration tests - LOL

As if the everyday microservices grind was not enough, you also forfeit the peace of mind offered by solid integration tests. Your single-service and unit tests are passing, but are your critical paths still intact after each commit? Who is in charge of the overall integration test suite, in Postman or wherever else? Is there one?

Integration testing a distributed setup is a nearly-impossible problem, so we pretty much gave up on that and replaced it with another one - Observability. Just like “microservices” are the new “distributed systems”, “observability” is the new “debugging in production”. Surely, you are not writing real software if you are not doing… observability!

Observability has become its own sector, and you will pay a pretty penny for it, in both money and developer time. It doesn’t come as plug-and-pay either - you need to understand and implement canary releases, feature flags, etc. Who is doing that? One already overwhelmed engineer?

As you can see, breaking up your problem does not make solving it easier - all you get is another set of even harder problems.

No, a monolith does not mean “better code”

All these arguments often get interpreted as if the suggestion here is that monoliths are “good code” and microservices are “most likely bad code”. The latter is probably true, but never have I suggested that monolithic code is good by default. The world runs on mediocre monoliths, written by rushed teams, or just mediocre ones. Something is slow? Slap more CPU juice and memory on it, and you just bought yourself another couple of years. I often wondered how my own unremarkable code was running so well in production - and then I saw the system specs.

Distributed systems, on the other hand, are unforgiving of cut corners, bad decisions, and overlooked failure modes. You need to be on top of your game all the time or you will get penalized.

What about just “services”?

Why do your services need to be “micro”? What’s wrong with just services? Some startups have gone as far as creating a service for each function, and yes, “isn’t that just like Lambda” is a valid question. This gives you an idea of how far gone this unchecked cargo cult is.

So what do we do? Starting with a monolith is one obvious choice. A pattern that could also work in many instances is “trunk & branches”, where the main “meat and potatoes” monolith is helped by “branch” services. A branch service can be one that takes care of a clearly-identifiable and separately-scalable load. A CPU-hungry Image-Resizing Service makes way more sense than a User Registration Service. Or do you get so many registrations per second that it requires independent horizontal scaling?
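
One way to sanity-check whether a load really is “separately-scalable” is plain arithmetic. The sketch below is a rough estimate only. The request rates, per-request CPU costs, and the 60% utilization target are invented numbers standing in for measurements you would take from your own system.

```python
import math


def cores_needed(requests_per_second, cpu_seconds_per_request, target_utilization=0.6):
    """Rough core count: offered CPU load divided by the utilization you will run at."""
    offered_load = requests_per_second * cpu_seconds_per_request  # CPU-seconds per second
    return math.ceil(offered_load / target_utilization)


# Hypothetical workloads; replace with measured numbers.
registration = cores_needed(requests_per_second=2, cpu_seconds_per_request=0.005)
image_resize = cores_needed(requests_per_second=50, cpu_seconds_per_request=0.4)

print(f"user registration: {registration} core(s)")
print(f"image resizing:    {image_resize} core(s)")
```

With these made-up inputs, registration fits in a sliver of one core and has no business being its own deployment, while image resizing needs dozens of cores and would otherwise starve the monolith. If your real numbers look different, so should your conclusion.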

The pendulum is swinging back

The hype, however, seems to be dying down. The VC cash faucet is tightening, and so the businesses have been market-corrected into exercising common-sense decisions, recognizing that perhaps splurging on web-scale architectures when they don’t have web-scale problems is not sustainable.

Ultimately, when faced with the need to travel from New York to Philadelphia, you have two options. You can either attempt to construct a highly intricate spaceship for an orbital descent to your destination, or you can simply purchase an Amtrak train ticket for a 90-minute ride. That is the problem at hand.
