On Platform Engineering

The history of software is a story of increasing complexity. As more software got written, and more infrastructure was built to run that software, more people with distinct roles were needed to write, build, deploy, configure and monitor the systems. For decades, the IT industry has managed this complexity by separating concerns and forming abstractions:

  • Form specialist teams to focus on specific aspects of the software development lifecycle.
  • Buy, instead of make, products and services to replace what those specialist teams were doing.

We went from employing developers and sysadmins to front-end and back-end software engineers, build, DevOps and infrastructure engineers. We went from building all code ourselves on top of the operating system, to using third-party libraries and SaaS services, IaaS and PaaS infrastructure, and a myriad of home-grown or purchased tools to assist across all phases of the software development process.

In the last few years, complexity has exploded, especially in the DevOps space, and the question increasingly is "who can we expect to put all of these parts together?" Of course, various specialty areas could be developed within the scope of what a "DevOps Engineer" does (not that DevOps should be a role, see anti-type B in DevOps topologies). The other option is to create a self-service platform, reduce tool choices and try to "shift left" more responsibilities out to development teams - that's one definition of Platform Engineering. Contrast that with what ChatGPT comes up with in response to the prompt "Write a blog post on Platform Engineering", which sounds rather like traditional Ops and DevOps disciplines:

Platform engineering is the practice of building and maintaining the infrastructure and tools that support the development and deployment of software applications. This includes not only the hardware and software components of the platform, but also the processes and practices that enable teams to build, test, and deploy their applications efficiently and reliably.

Platform engineering is important because it provides the foundation for a successful software development effort. Without a well-designed and well-maintained platform, teams may struggle to deliver their applications on time and with high quality. A platform that is difficult to use or unreliable can lead to frustration and delays, which can ultimately impact the business.

Platform engineering involves a range of activities, including designing and implementing the platform architecture, configuring and maintaining the underlying infrastructure, and developing and deploying the tools and processes that support the software development life cycle. This can include everything from version control systems and continuous integration tools, to deployment pipelines and monitoring systems.

Platform engineers are responsible for ensuring that the platform is scalable, reliable, and secure, and that it supports the needs of the development teams. This can involve working closely with development teams to understand their requirements and to identify and implement the best solutions.

In addition to technical skills, platform engineers also need strong communication and collaboration skills. They must be able to work effectively with a range of stakeholders, including development teams, operations teams, and business leaders. They must also be able to effectively manage and prioritize their work, and to adapt to changing business needs and technology trends.

In conclusion, platform engineering is a critical discipline that plays a key role in the success of any software development effort. By designing and maintaining a robust and scalable platform, platform engineers enable teams to deliver high-quality applications efficiently and reliably, which can ultimately drive business value.

Hm, isn't this exactly what an SRE team or the System Team in SAFe supposedly should be doing? The most important point about Platform Engineering is missing, in my view:

Designing and running the platform as a product, with development teams as the internal customer, providing self-service capabilities and infrastructure automation instead of bespoke services.

Based on the State of DevOps report, using self-service platforms is statistically likely to make teams more successful. But what are the limitations of Platform Engineering, and what makes a good platform?

A personal history of DevOps platforms

Working for a software company a decade ago, I was responsible for an early form of DevOps or Platform team (called "Engineering Solutions" at the time). In the span of 4 years, we went from hiring "Build Engineers" to cope with a huge monolithic codebase in 2012 to pushing various initiatives for developer self-service and dev/ops collaboration, both based on AWS. My first goal was to help the Germany-based development organization work more efficiently. My second goal was to reduce friction between development and the US-based operations teams.

Starting with an overhaul of the build system, my team went on to replace the CI/CD infrastructure, automate the setup of development environments, enforce all-green tests before integrating new code, automate the "baking" of deployment images, and develop tools for deploying and managing test environments on AWS and for running tests on those environments. We then joined forces with the operations team to close the loop all the way to automated deployment and updates of production environments.

What we learned is that automating DevOps work (in the sense of everything that is needed to get software into production) needs to be approached with a product mindset. To keep from drowning in the sea of small tasks that need to be done, we needed to understand in which direction the whole organization needed us to go, and we needed to invest in developing internal software applications - sometimes with high up-front effort, but with great incremental returns on that investment.

Fast-forward a few months: I had switched sides and was responsible for IT Operations teams at a different company. Now it was time to focus on what happens after software is running in production! Keeping large-scale distributed applications running is repetitive, time-consuming and often mentally draining for the Sysops, who are usually vastly outnumbered by software developers. To make "Day 2 operations" more effective, I was looking for a platform to provide a "Sliceable Stack", meaning a set of tools for the whole lifecycle of software components, from code repository through CI/CD and deployment into test and production environments all the way to monitoring and incident resolution, with access to operational data (logs, metrics, traces) and infrastructure cost data. The "sliceable stack" would allow cross-functional teams to take more control of their components in production while focusing only on their "slice" of the stack (and, especially for external teams, not be able to see other teams' slices).
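To make the "slicing" idea a bit more concrete, here is a minimal sketch in Python. It is not taken from any real platform: the Resource type, the slice_for function and the example teams and resource names are all hypothetical, and it assumes that every item in the stack is tagged with exactly one owning team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """One item in the stack (hypothetical model, for illustration only)."""
    name: str
    kind: str        # e.g. "repository", "pipeline", "log-stream", "cost-report"
    owner_team: str  # the team whose slice this resource belongs to


def slice_for(team: str, stack: list[Resource]) -> list[Resource]:
    """Return only the resources in the given team's slice.

    Resources owned by other teams are filtered out, so they stay invisible.
    """
    return [r for r in stack if r.owner_team == team]


# Made-up example data covering several lifecycle phases.
STACK = [
    Resource("checkout-service", "repository", "payments"),
    Resource("checkout-deploy", "pipeline", "payments"),
    Resource("checkout-logs", "log-stream", "payments"),
    Resource("search-service", "repository", "search"),
    Resource("search-costs", "cost-report", "search"),
]

if __name__ == "__main__":
    for resource in slice_for("payments", STACK):
        print(resource.kind, resource.name)
```

A real platform would have to apply the same filter consistently across every tool in the chain (repositories, pipelines, dashboards, cost reports), which is exactly the part that individual tools did not offer at the time.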

Only the largest companies would be able to afford building such a platform from scratch, so such a platform should be bought instead of made. Unfortunately, suitable platforms weren't available to buy yet: the individual tools available at the time didn't really allow "slicing", and integrated platforms like GitLab and Azure DevOps were not as comprehensive as they are today. Without the resources to build our own platform, we had to accept the fact that much of the interaction with software development continued to be ticket-based in both directions.

Recently, I've worked with an organization that employs SAFe and uses GitLab extensively. They were quite successful in standardizing software development processes, from coding standards to container builds and documentation. They also standardized on Kubernetes as the common deployment platform on the software development side, but hit resistance when trying to carry this approach into production environments (many separate ones, operated on-premises for customers, with limited infrastructure resources): it turned out that their platform approach for deployment was not a good match for the real infrastructure underneath.

Creating a platform involves creating abstractions, and the goal is to reduce complexity. However, there is a risk in making things seem simpler than they actually are.

Choosing the right abstractions

People have limited short-term memory: at any given point in time, there is only a certain number of moving parts and their interactions that we can keep in memory. We use abstractions to work around this problem. But abstractions create their own set of problems, even in the original separation of software development and operations:

  • Developers would think of infrastructure as one or more computers running their software, with hazy concepts of networks, storage and databases underneath, but they don't need to worry about the specifics.
  • Operations people would think of software as deployment packages, basically a set of binary files that needs to be copied to computers and configured.

This model led to the challenges of scalability, stability and release cadence that then led to the rise of DevOps:

  • If developers don't understand the operating environment and its limitations, it will be very challenging to scale the application and resolve incidents (and keep demand for infrastructure resources at a reasonable level). At the very least, developers need feedback from the production environment in the form of logs and operational metrics.
  • If operations engineers think of an application release as something that can't be trusted and requires extensive periods of testing and validation before deploying to production, it will be very challenging to release frequently.

The quote "Everything should be made as simple as possible, but not simpler" (commonly, and perhaps incorrectly, attributed to Albert Einstein) is true also for infrastructure abstractions - using an unsuitable mental model will create follow-up problems that are hard to solve. For instance, the traditional database-driven development model implicitly assumes that databases can be infinitely scaled - that mismatch usually becomes apparent very soon after going live in production, when it becomes clear that database infrastructure is, in fact, resource-constrained and costly.

What about IaaS and PaaS?

In order to use a good abstraction of infrastructure, why not just use IaaS and PaaS cloud APIs directly? After all, IaaS and PaaS had a large role in even making DevOps possible, by providing suitable abstractions of infrastructure that allowed software developers to understand enough (but not too much) about the behavior of the infrastructure and automate it without running into too many issues.

But cloud APIs are designed to accommodate a large range of use cases, and give a lot of choice in selecting and configuring services, both of which lead to complexity. In addition, many organizations want to keep their applications cloud-agnostic. Hence the need to constrain choices, and hide large parts of the actual operating environment behind yet another layer of abstraction.

Kubernetes was supposed to be the platform to offer a simple deployment path and make this "multi-cloud" vision attainable. But Kubernetes environments are far from simple and standardized. The number of components and deployment options in the Kubernetes ecosystem has skyrocketed in recent years, so organizations need to standardize and limit choices here as well. And for most non-trivial applications, Kubernetes is just part of the stack - load balancers and WAFs, managed databases, event buses, will typically

The burden of choice

Choosing between alternatives creates cognitive load, and there is evidence that people are generally happier with their choices when they have fewer alternatives to choose from. It sounds counterintuitive to restrict teams' freedom in choosing the best tools for the job ("best of breed"), but more efficiency for the whole organization necessarily means less choice for individual people and teams. Platform Engineering (along with good architecture planning!) can help in preselecting sensible tools, processes and infrastructure services, and make heavy-handed governance all but unnecessary.

When the same solutions are (mostly) used for the same problems across the organization, it's easier for people to move between teams, domains and roles - shared tools and abstractions become a shared body of knowledge. In addition, a limited canon of platform choices can make it easier for developers and operations people to find common ground.

I have found that incident management (especially outside of business hours, through on-call duty) is the main reason why the "you build it, you run it" approach doesn't work for most software development teams. That means that a dedicated Ops or SRE team needs to handle incidents, and their readiness and ability to take over that responsibility depends on how well they understand the application and the components it uses.

More platform, less process

A good platform choice can reduce process, since less effort needs to be spent on chasing quality and compliance problems - "it should be easy to do the right things, and hard to do the wrong things". If your platform enforces the use of containers for all application components, that will go a long way towards enforcing a policy for being "cloud-agnostic". If your platform includes a tool for automated vulnerability checks, it's likely that you won't need a heavy and ineffective manual process for doing those checks.
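As a rough illustration of "easy to do the right things, hard to do the wrong things", the two examples above could be expressed as automated checks that a platform runs before accepting a deployment. This is only a sketch under assumptions: the Component type, its fields and the policy_violations function are hypothetical, and the vulnerability count is assumed to come from whatever scanner the platform integrates.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical description of a deployable application component."""
    name: str
    packaging: str                 # e.g. "container", "vm-image", "zip"
    critical_vulnerabilities: int  # assumed to be reported by a scanner


def policy_violations(component: Component) -> list[str]:
    """Return human-readable policy violations; an empty list means 'accepted'."""
    violations = []
    # Policy 1: everything ships as a container (supports being "cloud-agnostic").
    if component.packaging != "container":
        violations.append(f"{component.name}: must be packaged as a container")
    # Policy 2: no known critical vulnerabilities at deployment time.
    if component.critical_vulnerabilities > 0:
        violations.append(
            f"{component.name}: {component.critical_vulnerabilities} "
            "critical vulnerabilities found"
        )
    return violations


if __name__ == "__main__":
    for candidate in (
        Component("billing-api", "container", 0),
        Component("legacy-batch", "zip", 2),
    ):
        problems = policy_violations(candidate)
        print(candidate.name, "accepted" if not problems else problems)
```

The point of the sketch is that the check runs automatically on every deployment, so nobody has to remember or manually audit the policy.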

Lastly, a lot of "DevOps" work today is repetitive - setting up CI/CD pipelines, provisioning clusters and environments. It's not unreasonable to assume that much of this work is already being done via copy-and-paste (from public sources or pre-existing projects in the organization). If the same code is being copied over and over again, why not make it a reusable component inside a platform?
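One way to picture such a reusable component is a single function that generates the standard pipeline from a few parameters, instead of every team copying and adapting the same definition. The sketch below is hypothetical throughout: the standard_pipeline function, the stage names and the shell commands (including the "deploy" and "scan-image" commands) are made up for illustration and do not correspond to any specific CI/CD product.

```python
# Build commands per language; illustrative values, not a recommendation.
BUILD_COMMANDS = {
    "python": "pip install -r requirements.txt",
    "go": "go build ./...",
}


def standard_pipeline(service: str, language: str, deploy_env: str = "test") -> dict:
    """Generate the organization's standard pipeline for one service.

    Teams supply three parameters; the stages themselves are maintained
    in one place by the platform team.
    """
    if language not in BUILD_COMMANDS:
        raise ValueError(f"unsupported language: {language}")
    return {
        "service": service,
        "stages": [
            {"name": "build", "run": BUILD_COMMANDS[language]},
            {"name": "test", "run": "make test"},
            {"name": "scan", "run": f"scan-image {service}"},  # hypothetical CLI
            {"name": "deploy", "run": f"deploy {service} --env {deploy_env}"},  # hypothetical CLI
        ],
    }


if __name__ == "__main__":
    pipeline = standard_pipeline("checkout-service", "go")
    for stage in pipeline["stages"]:
        print(stage["name"], "->", stage["run"])
```

A change to the standard, such as adding a new stage, is then made once in the platform instead of being copied into every project.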

Make or buy platforms?

For most organizations, it doesn't make sense to invest in building their own platforms:

  • Development process and operating environment are not competitive differentiators for most companies - they will find themselves unable to significantly outperform their competition in this area; the goal is being "good enough".
  • Integrated platform solutions are increasingly available for purchase.

While it makes economic sense to buy a platform solution, the problem is the existing variety and complexity of tools, infrastructures and applications - in all but a greenfield situation, organizations will find it hard to find a ready-made solution that fits their needs.

The reality on the ground is that more or less every organization already has some kind of internal platform, consisting of various open-source tools and purchased software or services, stitched together and customized to varying degrees for each application. In addition, everyone has legacy environments, legacy applications, and the established processes for dealing with both. So committing to a standardized platform approach initially means an increase in complexity (by adding more tools and processes) instead of reduction, and reducing the number of tools comes with high migration costs.

Platform choices

Plenty of venture capital has been put into developing solutions in the DevOps space in recent years, and that means that there are more than enough choices even for those who want to limit their organization's choices:

  • Platforms originating from source code management - GitHub, GitLab, Azure DevOps
  • Platforms originating from CI/CD - CloudBees, Harness.io
  • Platforms originating from container deployment and management - OpenShift, Cloud Foundry
  • Platforms originating from Application Release Automation (ARA) - Digital.ai (formerly XebiaLabs), Octopus Deploy
  • Developer portals - backstage.io with its vast plugin ecosystem, also available as a hosted offering (roadie.io), and commercial alternatives like Humanitec or OpsLevel

Photo credits: Image generated by DALL-E