A useful software architecture concept is the distinction between “control plane” and “data plane”. If you are in an organization that treats servers like cattle, or spend any time counting 9s, this distinction is probably already useful to you!
Control planes are services that tend to be complex but have lower reliability stakes. Data planes tend to be simpler to operate, but their failures are the sweat-and-tear-inducing severe kind.
Even if you don’t use this jargon you may already have an intuition for this. Compare your reaction to the following two EC2 outages:
- Scenario 1: AWS has a capacity problem and spinning up new EC2 instances has stopped globally. This is a control plane failure.
- Scenario 2: Previously healthy running EC2 instances have stopped globally. This is a data plane failure.
You probably feel much more heartburn about Scenario 2.
Your company may not be the size of AWS, nor sell a product that is EC2-shaped, but you can still benefit from this “control plane” distinction. A useful way to apply this concept is with deploying code changes. If you work at a tech company, you might deploy code sometimes?
The process of spinning up hosts/containers/services from scratch with new¹ code is your control plane. Existing services that are chugging along are your data plane.
Your code deployment process almost certainly has a step to make sure the changes you are rolling out look reasonable before considering that new deployment healthy and progressing a rollout. You can get a lot of reliability mileage by constraining certain operations to the deployment “control plane”:
- Does your service connect to a database? Have it loudly crash in `main` if it fails to get a database connection.
- Does your service rely on loading API keys from a secret store? Have it loudly crash in `main` if it fails to load those secrets.
- Does your service parse a configuration file? Have it loudly crash in `main` if there is a parse or validation error in that file.
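The checks above can be sketched as startup code that refuses to come up healthy. Everything here is illustrative (the `CONFIG_PATH` and `API_KEY` environment variables, the `service.ini` file name, the commented-out database call), not a real API — a minimal sketch of the "crash loudly in `main`" pattern:

```python
import configparser
import os
import sys


def main() -> int:
    # Startup checks run before the service ever takes traffic, so a
    # failure here crashes the new deployment, not production.
    config_path = os.environ.get("CONFIG_PATH", "service.ini")

    # Check 1: the config file must exist and parse cleanly.
    parser = configparser.ConfigParser()
    if not parser.read(config_path):
        print(f"fatal: could not read config at {config_path}", file=sys.stderr)
        return 1

    # Check 2: required secrets must be present.
    if "API_KEY" not in os.environ:
        print("fatal: API_KEY missing from environment/secret store", file=sys.stderr)
        return 1

    # Check 3 (same pattern): acquire a database connection and raise
    # on failure, e.g. db = connect_db(parser["db"]["url"])

    return 0


# In a real service you would end with: sys.exit(main())
```

The key design choice is that nothing is deferred: a bad config or missing secret becomes a nonzero exit at deploy time instead of a runtime error an hour later.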
With this approach, entire classes of failures won’t make their way into production, since big noisy crashes on service start tend to get noticed quickly.
This approach does have some limits. You cannot model all possible failures this way. Your database may degrade independently of any code change. You might feel comfortable testing a small S3 file upload on every deploy, but coupling your deploy’s success to a less robust external API may sound painful or expensive. It’s a knob you can tune as you see fit.
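That tuning knob could be sketched as a list of startup checks where each check declares whether it blocks the deploy. All names here (`StartupCheck`, the example check names and lambdas) are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StartupCheck:
    name: str
    run: Callable[[], bool]  # returns True on success
    required: bool = True    # the knob: fail the deploy, or merely warn


def run_startup_checks(checks: list[StartupCheck]) -> bool:
    """Run all checks; return False only if a *required* check failed."""
    ok = True
    for check in checks:
        if check.run():
            continue
        if check.required:
            print(f"fatal: {check.name} failed")
            ok = False
        else:
            print(f"warn: {check.name} failed (non-blocking)")
    return ok


# Example: database connectivity blocks the deploy, while a probe
# against a flaky external API is demoted to a warning.
checks = [
    StartupCheck("db-connect", run=lambda: True),
    StartupCheck("external-api-probe", run=lambda: False, required=False),
]
```

Demoting a check from `required=True` to `required=False` is exactly the trade-off described above: you keep the signal without letting an unreliable dependency veto every rollout.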
¹ Or even rolling back to an older revision, if you are in a scenario where you do not have running instances of that revision, as you might with blue-green deploys. ↩︎