The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win by Gene Kim, Kevin Behr & George Spafford

“Precisely!” I hear Erik say. “You even used the term I like most for it: unplanned work. Firefighting is vividly descriptive, but ‘unplanned work’ is even better. It might even be better to call it ‘anti-work,’ since it further highlights its destructive and avoidable nature. “Unlike the other categories of work, unplanned work is recovery work, which almost always takes you away from your goals. That’s why it’s so important to know where your unplanned work is coming from.

…you need to know what matters to the achievement of the business objectives, whether it’s projects, operations, strategy, compliance with laws and regulations, security, or whatever.” He continues, “Remember, outcomes are what matter—not the process, not controls, or, for that matter, what work you complete.

Erik interrupts. “Well put, Bill. You’ve just described ‘technical debt’ that is not being paid down. It comes from taking shortcuts, which may make sense in the short-term. But like financial debt, the compounding interest costs grow over time. If an organization doesn’t pay down its technical debt, every calorie in the organization can be spent just paying interest, in the form of unplanned work.

Unplanned work has another side effect. When you spend all your time firefighting, there’s little time or energy left for planning. When all you do is react, there’s not enough time to do the hard mental work of figuring out whether you can accept new work. So, more projects are crammed onto the plate, with fewer cycles available to each one, which means more bad multitasking, more escalations from poor code, which mean more shortcuts. As Bill said, ‘around and around we go.’ It’s the IT capacity death spiral.

He took me to Allie, the Manufacturing Resource Planning Coordinator, and asked her how she decides on whether to accept a new order.” I flip back to my notes. “She said that she would first look at the order and then look at the bill of materials and routings. Based on that, she would look at the loadings of the relevant work centers in the plant and then decide whether accepting the order would jeopardize any existing commitments.
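
Allie's procedure reads almost like pseudocode. Here is a minimal sketch of the same check, assuming invented shapes for the order, bill of materials, routings, and work-center loadings (the novel describes only the steps, not any system):

```python
# A rough sketch of Allie's order-acceptance check. All data shapes and names
# here are hypothetical; the novel only describes the steps she follows.

def can_accept(order, bill_of_materials, routings, current_loads, capacities):
    """Return True if accepting the order would not jeopardize existing commitments."""
    # 1. Look at the order and expand it into the parts it requires.
    parts_needed = [(part, qty_per_unit * order["quantity"])
                    for part, qty_per_unit in bill_of_materials[order["product"]]]

    # 2. Use the routings to translate those parts into hours at each work center.
    extra_load = {}
    for part, qty in parts_needed:
        for work_center, hours_per_unit in routings[part]:
            extra_load[work_center] = extra_load.get(work_center, 0.0) + hours_per_unit * qty

    # 3. Check the loading of every relevant work center against its capacity.
    return all(current_loads[wc] + hours <= capacities[wc]
               for wc, hours in extra_load.items())
```

The same shape applies to accepting IT work: before saying yes, translate the request into demand on each work center and compare it with what those work centers have already committed to.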

The Third Way is all about ensuring that we’re continually putting tension into the system, so that we’re continually reinforcing habits and improving something. Resilience engineering tells us that we should routinely inject faults into the system, doing them frequently, to make them less painful. “Mike Rother says that it almost doesn’t matter what you improve, as long as you’re improving something. Why? Because if you are not improving, entropy guarantees that you are actually getting worse, which ensures that there is no path to zero errors, zero work-related accidents, and zero loss.” Suddenly, it all seems so obvious. I feel like I need to call Patty right away to tell her to start the monitoring project immediately.

“Good point,” Wes grunts. “You know, it’s odd. So many of these problems we’ve been facing are caused by decisions we made. We have met the enemy. And he is us.

CFO GOALS

Health of company:
- Revenue
- Market share
- Average order size
- Profitability
- Return on assets

Health of Finance:
- Order to cash cycle
- Accounts receivable
- Accurate and timely financial reporting
- Borrowing costs

Are we competitive?
- Understanding customer needs and wants: Do we know what to build?
- Product portfolio: Do we have the right products?
- R&D effectiveness: Can we build it effectively?
- Time to market: Can we ship it soon enough to matter?
- Sales pipeline: Can we convert products to interested prospects?

Are we effective?
- Customer on-time delivery: Are customers getting what we promised them?
- Customer retention: Are we gaining or losing customers?
- Sales forecast accuracy: Can we factor this into our sales planning process?

W. Edwards Deming called this ‘appreciation for the system.’ When it comes to IT, you face two difficulties: On the one hand, in Dick’s second slide, you now see that there are organizational commitments that IT is responsible for helping uphold and protect that no one has verbalized precisely yet. On the other hand, John has discovered that some IT controls he holds near and dear aren’t needed, because other parts of the organization are adequately mitigating those risks. “This is all about scoping what really matters inside of IT. And like when Mr. Sphere told everyone in Flatland, you must leave the realm of IT to discover where the business relies on IT to achieve its goals.” I hear him continue, “Your mission is twofold: You must find where you’ve under-scoped IT—where certain portions of the processes and technology you manage actively jeopardize the achievement of business goals—as codified by Dick’s measurements. And secondly, John must find where he’s over-scoped IT, such as all those SOX-404 IT controls that weren’t necessary to detect material errors in the financial statements.

Go talk to the business process owners for the objectives on Dick’s second slide. Find out what their exact roles are, what business processes underpin their goals, and then get from them the top list of things that jeopardize those goals. “You must understand the value chains required to achieve each of Dick’s goals, including the ones that aren’t so visible, like those in IT.

You’ll be ready for your meeting with Dick when you’ve built out the value chains, linking his objectives to how IT jeopardizes them. Assemble concrete examples of how IT issues have jeopardized those goals in the past. Make sure you’re prepared.

I’d like three weeks with each of the business process owners on that spreadsheet. We need to get the business risks posed by IT better defined and agreed upon and then propose to you a way to integrate those risks into leading indicators of performance. Our goal is not just to improve business performance but to get earlier indicators of whether we’re going to achieve our objectives or not, so we can take appropriate action.

….master the Second Way, creating constant feedback loops from IT Operations back into Development, designing quality into the product at the earliest stages. To do that, you can’t have nine-month-long releases. You need much faster feedback. “You’ll never hit the target you’re aiming at if you can fire the cannon only once every nine months. Stop thinking about Civil War era cannons. Think antiaircraft guns.

In any system of work, the theoretical ideal is single-piece flow, which maximizes throughput and minimizes variance. You get there by continually reducing batch sizes. “You’re doing the exact opposite by lengthening the Phoenix release intervals and increasing the number of features in each release. You’ve even lost the ability to control variance from one release to the next.”
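
A toy calculation (all numbers invented for illustration) shows why shrinking batch sizes shortens lead time and tightens feedback:

```python
# Ten items, three steps, one minute of work per item at each step (invented numbers).
items, steps, minutes_per_item = 10, 3, 1

# Batch-and-queue: the whole batch finishes a step before any of it moves on.
first_done_batch = (steps - 1) * items * minutes_per_item + minutes_per_item   # 21 minutes
all_done_batch = steps * items * minutes_per_item                              # 30 minutes

# Single-piece flow: each item advances as soon as a step finishes it.
first_done_flow = steps * minutes_per_item                                     # 3 minutes
all_done_flow = steps * minutes_per_item + (items - 1) * minutes_per_item      # 12 minutes

print(first_done_batch, all_done_batch, first_done_flow, all_done_flow)
```

The first piece of feedback arrives in 3 minutes instead of 21, and the whole batch completes in 12 minutes instead of 30; lengthening release intervals pushes in exactly the opposite direction.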

The First Way is all about controlling the flow of work from Development to IT Operations. You’ve improved flow by freezing and throttling the project releases, but your batch sizes are still way too large. The deployment failure on Friday is proof. You still have way too much WIP.

An inevitable consequence of long release cycles is that you’ll never hit the internal rate of return targets, once you factor in the cost of labor. You must have faster cycle times.

He automated the build and deployment process, recognizing that infrastructure could be treated as code, just like the application that Development ships. That enabled him to create a one-step environment creation and deploy procedure, just like we figured out a way to do one-step painting and curing.
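
A minimal sketch of what a “one-step” environment build and deploy might look like, using hypothetical scripts (build.sh, provision_environment.sh, and the rest are invented stand-ins, not anything from the book); the point is simply that the whole sequence is a single command, lives in version control, and halts on the first failure:

```python
# A hypothetical one-step environment creation and deploy procedure (names invented).
# Because the infrastructure steps are code, they run the same way every time.
import subprocess

def run(step):
    print(f"==> {step}")
    subprocess.run([step], check=True)       # stop immediately if any step fails

def build_and_deploy():
    run("./build.sh")                        # compile and package the application
    run("./provision_environment.sh")        # create the environment from code: hosts, config, dependencies
    run("./smoke_test.sh")                   # verify the environment before releasing into it
    run("./deploy.sh")                       # push the packaged build into the new environment

if __name__ == "__main__":
    build_and_deploy()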

Stop focusing on the deployment target rate. Business agility is not just about raw speed. It’s about how good you are at detecting and responding to changes in the market and being able to take larger and more calculated risks. It’s about continual experimentation, like Scott Cook did at Intuit, where they did over forty experiments during the peak tax filing season to figure out how to maximize customer conversion rates. 

If you can’t out-experiment and beat your competitors in time to market and agility, you are sunk. Features are always a gamble. If you’re lucky, ten percent will get the desired benefits. So the faster you can get those features to market and test them, the better off you’ll be. Incidentally, you also pay back the business faster for the use of capital, which means the business starts making money faster, too.

The competitive advantage this capability creates is enormous, enabling faster feature time to market, increased customer satisfaction, market share, employee productivity, and happiness, as well as allowing organizations to win in the marketplace. Why? Because technology has become the dominant value creation process and an increasingly important (and often the primary) means of customer acquisition within most organizations. In contrast, organizations that require weeks or months to deploy software are at a significant disadvantage in the marketplace.

One of the hallmarks of high-performers in any field is that they always “accelerate from the rest of the herd.” In other words, the best always get better. This constant and relentless improvement in performance is happening in the DevOps space, too. In 2009, ten deploys per day was considered fast. Now that is considered merely average. In 2012, Amazon went on record stating that they were doing, on average, 23,000 deploys per day.

During the 1980s, there was a very well-known core, chronic conflict in manufacturing:
- Protect sales commitments
- Control manufacturing costs

In order to protect sales commitments, the product sales force wanted lots of inventory on hand, so that customers could always get products when they wanted them. However, in order to reduce costs, plant managers wanted to reduce inventory levels and work in process (WIP). Because one can’t simultaneously increase and decrease the inventory levels at the plant, sales managers and plant managers were locked in a chronic conflict. They were able to break the conflict by adopting Lean principles, such as reducing batch sizes, reducing work in process, and shortening and amplifying feedback loops. This resulted in dramatic increases in plant productivity, product quality, and customer satisfaction. In the 1980s, average order lead times were six weeks, with less than seventy percent of orders being shipped on time. By 2005, average product lead times had dropped to less than three weeks, with more than ninety-five percent of orders being shipped on time. Organizations that were not able to replicate these performance breakthroughs lost market share, if they did not go out of business entirely.

And of course, DevOps extends and builds upon the practices of “infrastructure as code” pioneered by Dr. Mark Burgess, as well as continuous integration and continuous deployment (pioneered by Jez Humble and David Farley), which is a prerequisite to achieving fast deployment flow. DevOps also benefits from an astounding convergence of philosophical management movements, such as Lean Startup, Innovation Culture, Toyota Kata, Rugged Computing, and the Velocity community. All of these mutually reinforce each other, creating the conditions of a powerful coalition of forces that can accelerate DevOps adoption.

The First Way is about the left-to-right flow of work from Development to IT Operations to the customer. In order to maximize flow, we need small batch sizes and intervals of work, never passing defects to downstream work centers, and to constantly optimize for the global goals (as opposed to local goals such as Dev feature completion rates, Test find/fix ratios, or Ops availability measures). The necessary practices include continuous build, integration, and deployment, creating environments on demand, limiting work in process, and building systems and organizations that are safe to change.
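
Of those practices, limiting work in process is the easiest to picture mechanically. A toy sketch (the limit of three and the task names are invented) is just a gate that refuses to start new work until something finishes:

```python
# A toy WIP-limit gate (limit and task names invented for illustration).
from collections import deque

WIP_LIMIT = 3
in_progress = set()
backlog = deque(["task-A", "task-B", "task-C", "task-D", "task-E"])

def pull_next():
    """Start new work only if it keeps us at or under the WIP limit."""
    if len(in_progress) >= WIP_LIMIT or not backlog:
        return None                          # finish something before starting more
    task = backlog.popleft()
    in_progress.add(task)
    return task

def finish(task):
    in_progress.discard(task)                # completing work is what frees up capacity
```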

For the Second Way, the necessary practices include “stopping the production line” when our builds and tests fail in the deployment pipeline; constantly elevating the improvement of daily work over daily work; creating fast automated test suites to ensure that code is always in a potentially deployable state; creating shared goals and shared pain between Development and IT Operations; and creating pervasive production telemetry so that everyone can see whether code and environments are operating as designed and whether customer goals are being met.

The Third Way is about creating a culture that fosters two things: continual experimentation, which requires taking risks and learning from success and failure, and understanding that repetition and practice is the prerequisite to mastery. Experimentation and risk taking are what enable us to relentlessly improve our system of work, which often requires us to do things very differently than how we’ve done it for decades. And when things go wrong, our constant repetition and daily practice is what allows us to have the skills and habits that enable us to retreat back to a place of safety and resume normal operations. 

…necessary practices include creating a culture of innovation and risk taking (as opposed to fear or mindless order taking) and high trust (as opposed to low trust, command-and-control).

…allocating at least twenty percent of Development and IT Operations cycles towards nonfunctional requirements, and constant reinforcement that improvements are encouraged and celebrated

In fact, we claim that DevOps is even more important for the horses than for the unicorns. After all, as Richard Foster states, “Of the Fortune 500 companies in 1955, 87% are gone. In 1958, the Fortune 500 tenure was 61 years; now it’s only 18 years.” We know that the downward spiral happens to every IT organization. However, most enterprise IT organizations will come up with countless reasons why they cannot adopt DevOps, or why it is not relevant for them.

One of the primary objections from horses is that all the unicorns (e.g., Google, Amazon, Twitter, Etsy) were born that way. In other words, unicorns were born doing DevOps. In actuality, virtually every DevOps unicorn was once a horse and had all the problems associated with being a horse. Amazon, up until 2001, ran on the OBIDOS content delivery system, which became so problematic and dangerous to maintain that Werner Vogels, Amazon CTO, transformed their entire organization and code to a service-oriented architecture. Twitter struggled to scale capacity on their front-end monolithic Ruby on Rails system in 2009, starting a multiyear project to progressively re-architect and replace it. LinkedIn, six months after their successful IPO in 2011, struggled with problematic deployments so painful that they launched Operation InVersion, a two-month feature freeze, allowing them to overhaul their compute environments, deployments, and architecture. Etsy, in 2009, according to Michael Rembetsy, “had to come to grips that they were living in a sea of their own engineering filth,” dealing with problematic software deployments and technical debt. They committed themselves to a cultural transformation. Facebook, in 2009, was at the breaking point for infrastructure operations, barely able to keep up with user growth; code deployments were becoming increasingly dangerous, and staff were continually firefighting.

Assuming that all work centers were ninety percent busy, the graph shows us that the average wait time at each work center is nine hours—and because the work had to go through seven work centers, the total wait time is seven times that: sixty-three hours. In other words, the total “% of value added time” (sometimes known as “touch time”) was only about 0.8% of the total lead time (thirty minutes divided by sixty-three hours). That means for more than 99% of our total lead time, the work was simply sitting in queue, waiting to be worked on (e.g., in a ticketing system, in an e-mail).
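
The arithmetic behind that graph is the rule of thumb the book uses, wait time = (% busy) / (% idle), expressed in units of the task time; checking the numbers:

```python
# Checking the queue arithmetic with the rule of thumb used above:
# wait time at a work center ~ (% busy) / (% idle), in units of the task time.
busy = 0.90
wait_per_center = busy / (1 - busy)              # 9 "hours" of queue time per work center
work_centers = 7
total_wait = wait_per_center * work_centers      # 63 hours spent waiting in queues

touch_time = 0.5                                 # thirty minutes of actual hands-on work
value_added_fraction = touch_time / total_wait   # about 0.008, i.e. roughly 0.8% of lead time
print(f"value-added time: {value_added_fraction:.1%}")
```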

In The Goal, Dr. Goldratt starts to describe the steps in the Theory of Constraints (TOC) methodology. Briefly, the five original TOC steps are:
1. Identify the constraint
2. Exploit the constraint
3. Subordinate all other activities to the constraint
4. Elevate the constraint to new levels
5. Find the next constraint
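
One crude way to read those steps as code (the work-center utilization figures are invented, though Brent is the novel's obvious stand-in for the constraint; the exploit, subordinate, and elevate actions are placeholders for real improvements):

```python
# A toy rendering of the five focusing steps (all numbers invented for illustration).
utilization = {"build": 0.55, "test": 0.70, "deploy": 0.60, "brent": 0.98}

# 1. Identify the constraint: the most heavily loaded resource.
constraint = max(utilization, key=utilization.get)       # -> "brent"

# 2. Exploit the constraint: keep it working only on the highest-value tasks.
# 3. Subordinate all other activities: release work no faster than the constraint can absorb it.
# 4. Elevate the constraint: add capacity, document, automate, or offload until it stops being the limit.
utilization[constraint] = 0.60                            # pretend the elevation landed

# 5. Find the next constraint and repeat, instead of stopping.
next_constraint = max(utilization, key=utilization.get)   # -> "test": the constraint has moved
```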

- Absence of trust: unwilling to be vulnerable within the group
- Fear of conflict: seeking artificial harmony over constructive passionate debate
- Lack of commitment: feigning buy-in for group decisions creates ambiguity throughout the organization
- Avoidance of accountability: ducking the responsibility to call peers on counterproductive behavior, which sets low standards
- Inattention to results: focusing on personal success, status, and ego before team success

It was probably one of the most important lessons in my life. It is now my aspiration in every domain of my life to never fear conflict, never be afraid to tell the truth, and never be afraid to say what I really think.

I’ve been in situations where I’ve observed leadership teams locked in chronic underperformance and strife because of the utter inability of the team members to trust each other. And when leaders don’t trust each other, then almost certainly, their respective teams won’t trust each other.

From my professional experience, the cost and true consequence of not being able to have candid discussions about problems that everyone knows about, but is unwilling to confront, are incredibly high. Tackling this problem requires overcoming some of our most ingrained and learned behaviors, but the rewards are worth it.

Kata impacts your organization by providing a systematic, scientific routine that can be applied to any problem or challenge, commonizing how the members of an organization develop solutions, and migrating managers toward the role of coach and mentor by having them practice coaching.

Patterns like those in the Netflix culture, such as relentless improvement and innovation, ruthless eradication of variance, and injecting faults into the production environment (embodied in tools such as the famous Chaos Monkey), are the perfect embodiment of the Improvement Kata that Mr. Rother describes.
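
The mechanism behind routine fault injection is almost embarrassingly simple. A hypothetical sketch in the spirit of Chaos Monkey (this is not Netflix's tool; the inventory, kill call, and health check are invented placeholders):

```python
# A hypothetical fault-injection loop in the spirit of Chaos Monkey (not the real tool).
import random
import time

def running_instances():
    return ["web-1", "web-2", "web-3", "worker-1"]        # placeholder inventory

def terminate(instance):
    print(f"injecting fault: terminating {instance}")     # a real tool would call the platform API here

def system_healthy():
    return True                                            # placeholder for real production telemetry

def chaos_loop(interval_seconds=3600):
    while True:
        terminate(random.choice(running_instances()))      # fail something on purpose, routinely
        time.sleep(interval_seconds)                        # give the system time to recover
        assert system_healthy(), "fault was not absorbed: fix the weakness it exposed"
```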

Continuous delivery is the perfect embodiment of the First, Second, and Third Ways, as it emphasizes small batch sizes (e.g., check into trunk daily), stopping the line when problems occur (e.g., no new work allowed when builds, tests, or deployments fail; elevating the integrity of the system of work over the work itself), and the need to continually build the validation tests necessary to either prevent failures in production or, at the very least, detect and correct them quickly (e.g., the transition from manual process reviews to automated tests, especially in the ITSM release, change, and configuration process areas).
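
A last sketch of the “stop the line” rule (stage names and statuses invented): when any gate in the pipeline is red, nothing new starts until it is green again.

```python
# A toy "stop the line" rule (stage names and statuses invented for illustration).
pipeline = {"build": "passed", "unit_tests": "failed", "deploy_to_test": "not_run"}

def line_is_stopped(pipeline):
    return any(result == "failed" for result in pipeline.values())

def start_new_feature(task):
    if line_is_stopped(pipeline):
        # Elevate the integrity of the system of work over the work itself.
        raise RuntimeError("pipeline is red: fix the failing stage before starting new work")
    print(f"starting {task}")
```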