Death By a Thousand Microservices
The software industry is learning once again that complexity kills
By: Andrei Taranchenko (LinkedIn)
Created: 10 Sep 2023
Updated: 11 Feb 2024

The Church of Complexity

There is a pretty well-known sketch in which an engineer is explaining to the project manager how an overly complicated maze of microservices works in order to get a user’s birthday - and fails to do so anyway. The scene accurately describes the absurdity of the state of the current tech culture. We laugh, and yet bringing this up in a serious conversation is tantamount to professional heresy, rendering you borderline un-hirable.

How did we get here? How did our aim become not addressing the task at hand but instead setting a pile of cash on fire by solving problems we don’t have?

The perfect storm

There are a few events in recent history that may have contributed to the current state of things. First, a whole army of developers writing JavaScript for the browser started self-identifying as “full-stack”, diving into server development and asynchronous code. JavaScript is JavaScript, right? What difference does it make what you create using it - user interfaces, servers, games, or embedded systems? Right? Node was still kind of a learning project of one person, and the early JavaScript was a deeply problematic choice for server development. Pointing this out to still-green server-side developers usually resulted in a lot of huffing and puffing. This is all they knew, after all. The world outside of Node effectively did not exist, the Node way was the only way, and so this was the genesis of the stubborn, dogmatic thinking that we are dealing with to this day.


And then, a steady stream of FAANG veterans started merging into the river of startups, mentoring the newly-minted and highly impressionable young JavaScript server-side engineers. The apostles of the Church of Complexity would assertively claim that “how they did things over at Google” was unquestionable and correct - even if it made no sense with the given context and size. What do you mean you don’t have a separate User Preferences Service? That just will not scale, bro!

But, it’s easy to blame the veterans and the newcomers for all of this. What else was happening? Oh yeah - easy money.

What do you do when you are flush with venture capital? You don’t go for revenue, surely! On more than one occasion I received an email from management, asking everyone to be in the office, tidy up their desks and look busy, as a clouder of Patagonia vests was about to be paraded through the space. Investors needed to see explosive growth, but not in profitability, no. They just needed to see how quickly the company could hire ultra-expensive software engineers to do … something.

And now that you have these developers, what do you do with them? Well, they could build a simpler system that is easier to grow and maintain, or they could conjure up a monstrous constellation of “microservices” that no one really understands. Microservices - the new way of writing scalable software! Are we just going to pretend that the concept of “distributed systems” never existed? (Let’s skip the whole parsing of nuances about microservices not being real distributed systems).

Back in the days when the tech industry was not such a bloated farce, distributed systems were respected, feared, and generally avoided - reserved only as the weapon of last resort for particularly gnarly problems. Everything with a distributed system becomes more challenging and time-consuming - development, debugging, deployment, testing, resilience. But I don’t know - maybe it’s all super easy now because toooollling.

There is no standard tooling for microservices-based development - there is no common framework. Working on distributed systems has gotten only marginally easier in the 2020s. The Dockers and the Kuberneteses of the world did not magically take away the inherent complexity of a distributed setup.

I love referring to this summary of 5 years of startup audits, as it is packed with common-sense conclusions:

… the startups we audited that are now doing the best usually had an almost brazenly ‘Keep It Simple’ approach to engineering. Cleverness for cleverness sake was abhorred. On the flip side, the companies where we were like ”woah, these folks are smart as hell” for the most part kind of faded.

Generally, the major foot-gun that got a lot of places in trouble was the premature move to microservices, architectures that relied on distributed computing, and messaging-heavy designs.

Literally - “complexity kills”.

The audit revealed an interesting pattern, where many startups experienced a sort of collective imposter syndrome while building straightforward, simple, performant systems. There is a stigma attached to not starting out with microservices on day one - no matter the problem. “Everyone is doing microservices, yet we have a single Django monolith maintained by just a few engineers, and a MySQL instance - what are we doing wrong?” The answer is almost always “nothing”.

Likewise, seasoned engineers very often feel hesitant and inadequate in today’s tech world, and the good news is that, no - it’s probably not you. It’s common for teams to pretend they are doing “web scale”, hiding behind libraries, ORMs, and caches - confident in their expertise (they crushed that Leetcode!), yet they may not even be aware of database indexing basics. You are operating in a sea of unjustified overconfidence, waste, and Dunning-Kruger, so who is really the imposter here?
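
For a sense of what “indexing basics” means in practice, here is a minimal sketch using Python's built-in sqlite3 module. The `users` table, the `idx_users_email` index, and the sample rows are made up for illustration and do not come from any system discussed here. It asks SQLite for its query plan before and after adding an index on the column being looked up.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan for a query as one readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(10_000)],
)

lookup = "SELECT * FROM users WHERE email = ?"
args = ("user9999@example.com",)

plan_before = query_plan(conn, lookup, args)  # no index on email yet
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = query_plan(conn, lookup, args)   # same query, index available

print("before:", plan_before)
print("after: ", plan_after)
```

On a typical SQLite build the first plan reports a scan of the whole table and the second a search using `idx_users_email`, though the exact wording differs between SQLite versions. No cache, ORM, or extra service is involved in that difference.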

There is nothing wrong with a monolith

The idea that you cannot grow without a system that looks like the infamous slide of Afghanistan war strategy is a myth.

Dropbox, Twitter, Netflix, Facebook, GitHub, Instagram, Shopify, StackOverflow - these companies and others started out as monolithic code bases. Many have a monolith at their core to this day. StackOverflow makes it a point of pride how little hardware they need to run the massive site. Shopify is still a Rails monolith, leveraging the tried and true Resque to process billions of tasks.

WhatsApp went supernova with their Erlang monolith and a relatively small team. How?

WhatsApp consciously keeps the engineering staff small to only about 50 engineers.

Individual engineering teams are also small, consisting of 1 - 3 engineers and teams are each given a great deal of autonomy.

In terms of servers, WhatsApp prefers to use a smaller number of servers and vertically scale each server to the highest extent possible.

Instagram was acquired for billions - with a crew of 12.

And do you imagine Threads as an effort involving a whole Meta campus? Nope. They followed the Instagram model, and this is the entire Threads team:

Credit: Substack - The Pragmatic Engineer

Perhaps claiming that your particular problem domain requires a massively complicated distributed system and an open office stuffed to the gills with turbo-geniuses is just crossing over into arrogance rather than brilliance?

Don’t solve problems you don’t have

It’s a simple question - what problem are you solving? Is it scale? How do you know how to break it all up for scale and performance? Do you have enough data to show what needs to be a separate service and why? Distributed systems are built for size and resilience. Can your system scale and be resilient at the same time? What happens if one of the services goes down or comes to a crawl? Just scale it up? What about the other services that are going to get hit with traffic? Did you war-game the endless permutations of things that can and will go wrong? Is there backpressure? Circuit breakers? Queues? Jitter? Sensible timeouts on every endpoint? Are there fool-proof guards to make sure a simple change does not bring everything down? The knobs you need to be aware of and tune are endless, and they are all specific to your system’s particular signature of usage and load.
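
To make a couple of those knobs concrete, here is a minimal sketch of a circuit breaker and of retry delays with jitter. Every name (`CircuitBreaker`, `CircuitOpenError`, `backoff_delays`) and every default (three failures, a 30-second cool-down) is a hypothetical choice for illustration, not a recommendation and not any particular library's API.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call a failing dependency."""


class CircuitBreaker:
    """Hypothetical breaker: fail fast after repeated errors, retry after a cool-down."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: allow one trial call ("half-open").
            # A single failure on this trial re-opens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def backoff_delays(attempts, base=0.1, cap=5.0, rng=random.random):
    """Exponential backoff with full jitter: each delay is drawn from
    [0, min(cap, base * 2**n)) so that retrying clients do not stampede in sync."""
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]
```

Even this toy version has four numbers to tune (failure threshold, cool-down, base delay, cap), and the right values depend on the traffic pattern of each caller and each dependency. Multiply that by every edge between services.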

The truth is that most companies will never reach the massive size that will actually require building a true distributed system. Cosplaying Amazon and Google - without their scale, expertise, and endless resources - is very likely just an egregious waste of money and time. Religiously following all the steps from an article called “Ten morning habits of very successful people” is not going to make you a billionaire.

The only thing harder than a distributed system is a BAD distributed system.


“But each team… but separate… but API”

Trying to shove a distributed topology into your company’s structure is a noble effort, but it almost always backfires. It’s a common approach to break up a problem into smaller pieces and then solve those one by one. So, the thinking goes, if you break up one service into multiple ones, everything becomes easier.

The theory is sweet and elegant - each microservice is being maintained rigorously by a dedicated team, walled off behind a beautiful, backward-compatible, versioned API. In fact, this is so solid that you rarely even have to communicate with that team - as if the microservice was maintained by a 3rd party vendor. It’s simple!

If that doesn’t sound familiar, that’s because this rarely happens. In reality, our Slack channels are flooded with messages from teams communicating about releases, bugs, configuration updates, breaking changes, and PSAs. Everyone needs to be on top of everything, all the time. And if that wasn’t great, it’s normal for one already-slammed team to half-ass multiple microservices instead of doing a great job on a single one, often changing ownership as people come and go.

In order to win the race, we don’t build one good race car - we build a fleet of shitty golf carts.


What you lose

There are multiple pitfalls to building with microservices, and often that minefield is either not fully appreciated or simply ignored. Teams spend months writing highly customized tooling and learning lessons not related at all to the core product. Here are just some often overlooked aspects…

Say goodbye to DRY

After decades of teaching developers to write Don’t Repeat Yourself code, it seems we just stopped talking about it altogether. Microservices by default are not DRY, with every service stuffed with redundant boilerplate. Very often the overhead of such “plumbing” is so heavy, and the size of the microservices is so small, that the average instance of a service has more “service” than “product”. So what about the common code that can be factored out?

  • Have a common library?
  • How does the common library get updated? Keep different versions everywhere?
  • Force updates regularly, creating dozens of pull requests across all repositories?
  • Keep it all in a monorepo? That comes with its own set of problems.
  • Allow for some code duplication?
  • Forget it, each team gets to reinvent the wheel every time.

Each company going this route faces these choices, and there are no good “ergonomic” options - you have to choose your version of the pain.

Developer ergonomics will crater

“Developer ergonomics” is the friction, the amount of effort a developer must go through in order to get something done, be it working on a new feature or resolving a bug.

With microservices, an engineer has to have a mental map of the entire system in order to know what services to bring up for any particular task, what teams to talk to, whom to talk to, and what about. The “you have to know everything before doing anything” principle. How do you keep on top of it? Spotify, a multi-billion dollar company, probably spent non-negligible internal resources to build Backstage, software for cataloging its endless systems and services.

This should at least give you a clue that this game is not for everyone, and the price of the ride is high. So what about the tooooling? The Not Spotifies of the world are left with MacGyvering their own solutions, the robustness and portability of which you can probably guess.

And how many teams actually streamline the process of starting a YASS - “yet another stupid service”? This includes:

  • Developer privileges in GitHub/GitLab
  • Default environment variables and configuration
  • CI/CD
  • Code quality checkers
  • Code review settings
  • Branch rules and protections
  • Monitoring and observability
  • Test harness
  • Infrastructure-as-code

And of course, multiply this list by the number of programming languages used throughout the company. Maybe you have a usable template or a runbook? Maybe a frictionless, one-click system to launch a new service from scratch? It takes months to iron out all the kinks with this kind of automation. So, you can either work on your product, or you can be working on toooooling.

Integration tests - LOL

As if the everyday microservices grind was not enough, you also forfeit the peace of mind offered by solid integration tests. Your single-service and unit tests are passing, but are your critical paths still intact after each commit? Who is in charge of the overall integration test suite, in Postman or wherever else? Is there one?

Service tests

Integration testing a distributed setup is a nearly-impossible problem, so we pretty much gave up on that and replaced it with another one - Observability. Just like “microservices” are the new “distributed systems”, “observability” is the new “debugging in production”. Surely, you are not writing real software if you are not doing…. observability!

Observability has become its own sector, and you will pay a pretty penny for it, in both money and developer time. It doesn’t come as plug-and-play either - you need to understand and implement canary releases, feature flags, etc. Who is doing that? One already overwhelmed engineer?
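
As a rough illustration of the smallest building block involved, here is a sketch of a percentage-based feature flag, the kind of check that can gate a canary-style rollout. The function name `in_rollout`, the flag name, and the hashing scheme are assumptions made for this example, not the API of any real feature-flag product.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Hypothetical helper: deterministically place a user in one of 100
    buckets per flag and enable the flag for the first `percent` buckets."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


# The same user always gets the same answer for a given flag and percentage,
# and raising the percentage only ever adds users, never removes them.
if in_rollout("new-checkout", "user-42", percent=10):
    print("serving the new code path")
else:
    print("serving the old code path")
```

The check itself is a few lines. What is left out is the expensive part: storing and distributing flag state to every service, comparing error rates between the two groups, and deciding who rolls back and when.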

As you can see, breaking up your problem does not make solving it easier - all you get is another set of even harder problems.

No, a monolith does not mean “better code”

All these arguments often get interpreted as if the suggestion here is that monoliths are “good code” and microservices are “most likely bad code”. The latter is probably true, but never have I suggested that monolithic code is good by default. The world runs on mediocre monoliths, written by rushed teams, or just mediocre ones. Something is slow? Slap more CPU juice and memory on it, and you just bought yourself another couple of years. I often wondered how my own unremarkable code was running so well in production - and then I saw the system specs.

Distributed systems, on the other hand, are unforgiving of cut corners, bad decisions, and overlooked failure modes. You need to be on top of your game all the time or you will get penalized.

What about just “services”?

Why do your services need to be “micro”? What’s wrong with just services? Some startups have gone as far as creating a service for each function, and yes, “isn’t that just like Lambda” is a valid question. This gives you an idea of how far gone this unchecked cargo cult is.

So what do we do? Starting with a monolith is one obvious choice. A pattern that could also work in many instances is “trunk & branches”, where the main “meat and potatoes” monolith is helped by “branch” services. A branch service can be one that takes care of a clearly-identifiable and separately-scalable load. A CPU-hungry Image-Resizing Service makes way more sense than a User Registration Service. Or do you get so many registrations per second that it requires independent horizontal scaling?


The pendulum is swinging back

The hype, however, seems to be dying down. The VC cash faucet is tightening, and so the businesses have been market-corrected into exercising common-sense decisions, recognizing that perhaps splurging on web-scale architectures when they don’t have web-scale problems is not sustainable.


Ultimately, when faced with the need to travel from New York to Philadelphia, you have two options. You can either attempt to construct a highly intricate spaceship for an orbital descent to your destination, or you can simply purchase an Amtrak train ticket for a 90-minute ride. That is the problem at hand.

Your database skills are not ‘good to have’

Additional reading & listening

How to recover from microservices

You want modules, not microservices

XML is the future

Gasp! You might not need microservices

Podcast: How we keep Stack Overflow’s codebase clean and modern

Goodbye Microservices: From 100s of problem children to 1 superstar

It’s the future
