DevOps Case Studies
All rights reserved. For information about permission to reproduce selections from this
book, write to Permissions, IT Revolution Press, LLC, 25 NW 23rd Pl, Suite 6314, Portland,
OR 97210.
First Edition
10 9 8 7 6 5 4 3 2 1
For further information about IT Revolution, these and other publications, special
discounts for bulk book purchases, or for information on booking authors for an event,
please visit our website at www.ITRevolution.com.
Table of Contents
Preface
Executive Summary
Giving Context
DevOps Roadmaps: Five Case Studies that Showcase the Adoption of Modern
Technology Practices
Case Study 1 - Retail DevOps: Rebuilding an Engineering Culture
Case Study 2 - Technology Changes in Government Agencies (A
Compilation of Cases): Lessons in Legacy and DevOps
Case Study 3 - Agile Implementation in a Large, Regulated Industry:
DevOps and Accelerating Delivery
Case Study 4 - DevOps and Moving to Agile at a Large Consumer Website:
Getting Faster Answers at Yahoo Answers
Case Study 5 - Real-Time Embedded Software: DevOps Practices for an
Unhappy Customer
Summary
Endnotes
Contributors
PREFACE
In May of this year, IT Revolution once again had the pleasure to host 50
technology leaders and thinkers from across the DevOps Enterprise com-
munity at the DevOps Enterprise Forum in Portland, Oregon. The Forum’s
ongoing goal is to create written guidance, gathered from the best experts
in these respective areas, for overcoming the top obstacles in the DevOps
Enterprise community.
• Leading Change: What are effective strategies and methods for lead-
ing change in large organizations?
The end results can be found on the Forum page of the IT Revolution website
(http://itrevolution.com/devops_enterprise_forum_guidance), and all of
the Forum papers, from both this year and last year, are free to the community.
Gene Kim
November 2016
Portland, Oregon
• architectural approaches
[Figure: Technical practices (visibility, performance monitoring, security/audit, incident management/change process, and documentation) resting on a foundation of cultural norms]
CULTURAL NORMS
While this document is not meant to delve too deeply into cultural norms, it is clear that any
successful transformation requires changing elements of culture in areas like trust,
collaboration, experimentation, and risk-taking. Technical practices can, in turn, enable
cultural change. For example, teams can only be truly self-empowered when they have the ability
to obtain the technology resources they need, ideally through self-service. In this
section we will explore the cultural elements we think are important in the
context of implementing modern technology practices.
[Figure: Technology practices enable cultural changes; cultural changes require new cultural norms]
Experimentation/Risk-Taking
A key element of a DevOps transformation is the ability to test new capabilities
or a business hypothesis,[FN1] and to get quick feedback on the value they
provide to end users. Technology practices support risk-taking by
providing safety nets for when things go wrong. In addition, a cultural norm
of supporting and encouraging experimentation and risk-taking is important.
Collaboration
Collaboration and trust are important elements of cross-skilled teams. For
example, utilizing Operations deputies who have earned trust and who
can be granted additional access rights to tools and environments reduces
the need to interact with Ops, thus speeding up team processes and sav-
ing both time and money. In addition, having the team expand into the
business area[FN2] can help them better understand the business value of
their products and aid them in designing the experiments discussed above.
Capabilities such as ChatOps are also an enabler in this area.
Continuous Learning
Investment in learning is essential to driving continuous improvement. This
can take many forms. Pairing, creating cross-functional teams, and doing rotations
are ways to achieve this while delivering capabilities. But leaders in this area
also set aside time for associates to invest in their development. This includes
peer-based learning, teaching katas, internal conferences, and collaboration
on continuous improvement (e.g., Google’s learning culture).
• defining what “success” looks like and what “done” is, which (1) re-
duces waste from developers doing too much and (2) ensures the
capability developed meets the acceptance criteria
A model line is one in which you experiment with a new practice. Using
model lines enables you to start innovating in a few areas and then ex-
pand once the model has been developed. This model can then be scaled
and sustained. For example, modeling a practice such as release readiness
would involve performing a tool/technology trade study, implementing
the release readiness in a model line area/application, looking at the results,
and iterating. The output of the model would be a set of reusable practices
and processes, as well as the approved tools and technologies needed to
implement the practice.[FN3]
Learning by Doing
Rather than waiting for change to drive action, it’s more effective to act to drive
change. The 70:20:10 Model for Learning and Development is a commonly
used formula within the training profession to describe the optimal sources of
learning by successful managers. It holds that individuals obtain roughly 70% of their
knowledge from on-the-job experiences, 20% from interactions with others, and 10%
from formal training.
[Figure: Technical practices — visibility, performance monitoring, security/audit, incident management/change process, documentation, architectural assessment, and architectural patterns]
Automation
Identify repeated manual work and apply automation. This goes beyond test
automation and should be applied to process automation (e.g., automation
of change with release and deployment), configuration, and orchestration. A
key concept for reducing lead times is the elimination of service-level agreements (SLAs).
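As a hedged sketch of process automation, the script below chains steps that a release engineer might otherwise run by hand; the commands are placeholders for whatever build, configuration, and deployment tooling an organization actually uses:

    import subprocess
    import sys

    # Placeholder commands -- substitute the real build, configuration, and deploy tooling.
    PIPELINE_STEPS = [
        ["make", "build"],
        ["make", "test"],
        ["make", "configure"],
        ["make", "deploy"],
    ]

    def run_pipeline(steps: list) -> None:
        """Run each step in order, stopping (and failing loudly) at the first error."""
        for cmd in steps:
            print("running:", " ".join(cmd))
            if subprocess.run(cmd).returncode != 0:
                sys.exit("step failed: " + " ".join(cmd))

    if __name__ == "__main__":
        run_pipeline(PIPELINE_STEPS)

The point of even a small wrapper like this is that the sequence becomes repeatable and self-documenting, rather than living in someone's head.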
Automated Testing
This needs to be treated like a software development discipline, in terms of
defining test architects and frameworks, applying configuration management,
and following design/reuse principles. Key testing components include test-driven
development (TDD), where failing tests are written first and then just enough
code is developed to make the tests pass, and acceptance test-driven development
(ATDD), where, for each feature, everyone agrees on the acceptance criteria and
an automated test is developed in the same iteration in which the story is being
developed. Other important components include test data management; infrastructure
and deployment test automation; code coverage measurement; code security and quality
analysis (static and dynamic) automation; and test results repository management.
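As a minimal, hypothetical sketch of the TDD rhythm described above (Python with pytest assumed as the test runner; the pricing function and acceptance criterion are invented for illustration):

    # test written first -- it fails until apply_discount() exists and behaves as agreed
    def test_discount_is_applied_to_order_total():
        # hypothetical acceptance criterion: a 10% discount on a 100.00 order yields 90.00
        assert apply_discount(total=100.00, rate=0.10) == 90.00

    # just enough production code to make the failing test pass
    def apply_discount(total: float, rate: float) -> float:
        """Return the order total after applying a fractional discount rate."""
        return round(total * (1 - rate), 2)

In ATDD the same discipline applies, except the assertion is written directly from the acceptance criteria the whole team agreed on, in the same iteration as the story.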
• canary deployments
• zero-downtime deployments
• continuous flow
Feature flags and dark launches decouple deployment from release by providing the
ability to turn capabilities in the code base on or off, so that a capability can be
enabled later in production, once the services it depends on are available. Feature
flags and dark launches also enable trunk-based development, which Jez Humble and
David Farley call "the most important technical practice in the agile canon."[FN6][FN7]
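A minimal sketch of a feature flag, assuming a hard-coded flag store that a real system would replace with a configuration service; the flag name and checkout functions are invented for illustration:

    # Hypothetical flag store; production code would read this from configuration or a flag service.
    FEATURE_FLAGS = {
        "new_checkout_flow": False,  # dark-launched: code is deployed to production but switched off
    }

    def is_enabled(flag: str) -> bool:
        """Return True if the named feature flag is currently switched on."""
        return FEATURE_FLAGS.get(flag, False)

    def legacy_checkout(cart: list) -> str:
        return f"legacy checkout of {len(cart)} items"

    def new_checkout(cart: list) -> str:
        return f"new checkout of {len(cart)} items"

    def checkout(cart: list) -> str:
        # The new path ships on trunk but is only exercised once the flag is flipped,
        # for example after the services it depends on are available in production.
        if is_enabled("new_checkout_flow"):
            return new_checkout(cart)
        return legacy_checkout(cart)

Because the switch lives in configuration rather than in a branch, the same build can be deployed everywhere and the release decision becomes a runtime decision.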
CD requires more than just automation. It also requires thinking about
and planning your work. Continuous flow, from release planning
through deployment, is enabled by smaller batch sizes, self-service on-demand
environments, and so on.
Ensure the hero metrics are clearly defined. For example, one metric could
be the percentage of defects assigned to each person on the team. So, if Brent
has 50% of the total defects of the team assigned to him, there’s likely an
issue. Another example would be metrics that show bottlenecks depending
on the same resource, so you might measure work in process as related to
a single person (e.g., the Kanban board reflects too many cards associated
with Brent).
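A small, illustrative sketch of how such signals might be computed from work-tracker data; the assignee lists and the 50% threshold are assumptions, not figures from the text:

    from collections import Counter

    # Illustrative tracker exports: each item records its assignee.
    defects = ["brent", "brent", "ana", "brent", "lee", "brent"]
    wip_cards = ["brent", "ana", "brent", "brent", "lee"]

    def concentration(items: list, threshold: float = 0.5) -> dict:
        """Return assignees holding more than `threshold` of the items."""
        counts = Counter(items)
        total = len(items)
        return {who: round(n / total, 2) for who, n in counts.items() if n / total > threshold}

    print("defect concentration:", concentration(defects))   # flags brent at ~0.67
    print("WIP concentration:", concentration(wip_cards))    # flags brent at 0.6

Any assignee who crosses the threshold on either measure is a candidate bottleneck worth discussing, rather than a person to blame.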
Visibility
Accelerated delivery is enhanced by visibility across the value stream. This
takes concepts, such as visual system management, that have been applied
to Agile teams and extends them across the entire lifecycle. The goal is transparency
through reporting/dashboards and production telemetry. There are many ways to
expose the status of work, depending on the tools in your pipeline, but the
general idea is to automate the collection and visualization of all the steps
(a minimal sketch of this kind of aggregation follows the list below):
• development status
• build status
• security scans
• static analysis
• tests status
• deployment status
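A minimal sketch of that aggregation, assuming each pipeline tool can be asked for a simple status string; the step names mirror the list above and the collectors are stand-ins for real tool APIs:

    # Stand-in collectors; real ones would query the CI server, scanner, test runner, and so on.
    def collect_statuses() -> dict:
        return {
            "development": "in progress",
            "build": "passing",
            "security scan": "passing",
            "static analysis": "2 warnings",
            "tests": "148/150 passing",
            "deployment": "staged",
        }

    def render_dashboard(statuses: dict) -> None:
        """Print a one-screen view of every step in the value stream."""
        for step, status in statuses.items():
            print(f"{step:<16} {status}")

    if __name__ == "__main__":
        render_dashboard(collect_statuses())

In practice the same data would feed a radiator or dashboard, but the principle is identical: collect automatically, show everything in one place.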
Performance Monitoring
Performance testing should be done early in the release cycle to ensure that
nonfunctional requirements are being met as the release is being designed and
developed, and are not left to be validated in the last iteration. Stories that
have the most impact on performance should be developed first, and automated
regression tests should be developed and run to ensure that nothing impacting
performance is introduced into the code base. Testing early also enables functional
monitoring, allows you to plan at design time how the service will be monitored,
and lets you begin monitoring in pre-production environments.
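One lightweight way to keep performance regressions out of the code base is to assert a latency budget in the automated regression suite; the operation and the 100 ms budget below are invented for illustration (pytest assumed):

    import time

    def search_catalog(query: str) -> list:
        """Stand-in for the operation whose latency matters to the nonfunctional requirements."""
        time.sleep(0.01)  # simulate work
        return [query]

    def test_search_meets_latency_budget():
        start = time.perf_counter()
        search_catalog("winter boots")
        elapsed = time.perf_counter() - start
        # Illustrative budget agreed as a nonfunctional requirement: 100 ms
        assert elapsed < 0.100, f"search took {elapsed:.3f}s, over the 100 ms budget"

Running a check like this on every build turns a nonfunctional requirement into a failing test long before the last iteration.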
Documentation
Documentation takes many forms and should be located nearest to where it
is used, not centralized in a locked cabinet that nobody can access. Design
documentation includes requirements, functional specifications, and technical
specifications for systems, application program interfaces (APIs), architecture, and so on.
Security static code analysis capabilities can also be built into the CI process,
along with other code quality analysis, to flag any critical violations that should
be treated as breaking the build and need to be fixed accordingly. This ensures
that the introduction of any vulnerability is immediately addressed at the source.
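A sketch of the break-the-build idea, assuming a scanner that emits findings as JSON; the security-scanner command and the report format are hypothetical, so substitute whatever static analysis tool the pipeline actually runs:

    import json
    import subprocess
    import sys

    def run_security_scan() -> list:
        """Run a (hypothetical) scanner that writes a JSON list of findings to stdout."""
        result = subprocess.run(
            ["security-scanner", "--format", "json", "src/"],  # hypothetical CLI
            capture_output=True, text=True, check=False,
        )
        return json.loads(result.stdout or "[]")

    def main() -> None:
        findings = run_security_scan()
        critical = [f for f in findings if f.get("severity") == "critical"]
        if critical:
            print(f"{len(critical)} critical security finding(s); failing the build.")
            sys.exit(1)  # non-zero exit marks the CI stage as broken
        print("No critical security findings.")

    if __name__ == "__main__":
        main()

Wiring this into the CI process means a critical vulnerability stops the pipeline the same way a failing unit test does, so it gets fixed at the source.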
Architectural Assessment
The goal of an assessment is to allow you to:
Architectural Patterns
There are several architectural patterns shown below that facilitate the goals
of DevOps.
Agile teams are often in wait states: waiting for work, waiting for environ-
ments, waiting for other teams to deliver a service or capability. By providing
continuous flow and smaller batch sizes, the Agile team can continue to be
productive. Likewise, having an architecture that can translate the smaller
batch size into decoupled, smaller changes in the system allows the Agile
team to more quickly move through design, development, and automated
testing and reduces the need for larger and more complex datasets in down-
stream environments. Self-service, on-demand environments (enabled by
utilizing standard architecture patterns) ensure environments are available
for security and performance testing as the product is being built. And integrating
security, audit, and monitoring capabilities into the pipeline, and demonstrating
them in earlier environments, ensures that feedback about any changes that need
to be made arrives early and is amplified.
But as the writers of this paper know, change is not only possible, it can
mean interesting and innovative things for the organization. Each of these
five case studies showcases modern technology practices and offers a unique
story detailing how the organization set about adopting them.
This study focuses on the different phases of driving DevOps and rebuilding
the engineering culture at a retailer. The retailer required a lot of technology
and thousands of applications to support its business, and it had begun
changing from a large legacy enterprise delivery machine to a more mod-
ern, nimble Agile technology organization. The organization had lost sight
of the importance of engineering years earlier, and there was a culture of
stopping changes whenever something broke. Because of this culture, IT
would freeze production whenever problems arose, which in turn forced them
to push changes in big batches. The IT organization was too complex, with
silos inside of silos. Server provisioning involved many teams, and there was
no end-to-end accountability. The system was far too complex, and over the
years they had built up a large amount of technology
debt. It is here that, driven by a commitment to customers and engineers,
the organization’s DevOps journey and the rebuilding of their engineering
culture began.
There are four stages in this journey: enabling and unleashing change agents,
cultivating and growing a grassroots movement, getting top-down align-
ment, and figuring out how to scale across the enterprise.
A few years ago, IT leaders realized that the organization needed to step up
its game in the multi-channel guest experience. That was a complicated and
difficult thing to do. Core data was locked in legacy systems, and it would
take three to six months to develop the integrations needed to get at core
data. There were multiple sources of truth. The dotcom was different from
It was then that IT leadership started talking about using Agile, not waterfall,
methods. Team members, not contractors, began doing the engineering
work, and team leaders insisted on full staff ownership, so that queuing and
waiting were decreased, enabling them to get control of the ecosystem. They
brought in more modern tools. Then, a couple of years in, they added tech-
nologies such as Cassandra and Kafka to give them scale and allow them to
keep up with volume. The retailer’s API platform was scaled as the speed of
new customer experiences grew. As a result of these changes, they now do
80 deployments per week. The net present value on investment is amazing.
Digital sales were up 42% last holiday season. At the peak of this year’s hol-
iday season, 450 stores will help fulfill orders. APIs and platforms continue
to scale. The organization is now investing heavily in tech because leaders
understand how essential it is to a successful business.
When the IT organization’s leaders saw the successes resulting from the
implementation of DevOps processes, they knew they needed to scale across
the entire organization. So, they started a grassroots movement, which in-
cluded beginning an internal DevOps program. Right away, over 100 people
showed interest in learning more about DevOps. The organization went
from nobody wanting to talk about DevOps because it seemed like only
Silicon Valley web companies were doing it, to increasing numbers of peo-
ple talking about the amazing successes resulting from it. They also started
getting engineers together to talk about infrastructure as code and began
to connect to the larger tech community by bringing in external speakers.
After all this, it became obvious that senior leadership buy-in was neces-
sary. The bottom-up phase needed to feed into top-down engagement.
To achieve this, they did a lot of momentum building. DevOps and Agile
became core pillars and goals, as well as part of the daily conversation, and
they did town hall huddle meetings. They also aligned with Thoughtworks’
CI/CD Maturity Model to baseline measure products. They set goals and
assigned DevOps champions to help drive practices within their spaces
and be champions within their teams. Despite these efforts, they still had
hundreds in middle management who had not yet been exposed to the
thinking. So, they decided to do an internal mini-DevOps enterprise sum-
mit. The retailer’s management became involved in these discussions, and
it energized a lot of people.
Once all of these other steps were in place, the next question was how to
actually scale across a large organization. First, they focused on structural
changes. They moved from a COBIT-based, highly-segmented model, to
a product and service model; simplified accountabilities and established
key practice areas across the organization; and shifted their delivery model
from waterfall and project-based to product-based using Scrum and Kanban
models. They also embarked on a tech modernization strategy. Because
they had a lot of legacy and tightly-coupled integrations, they needed key
strategic capabilities to move to a more modern tech architecture: API-based,
loosely coupled, lightweight tooling, self-service, and optimized for cloud-based
and CI/CD practices.
They also needed to change how people worked. They began this process
by converging the Agile and DevOps movements. What was once loosely
connected became more tightly connected. They then pulled in an effec-
tiveness organization to look at how they could scale learning. Traditional
training was combined with a focus on coaching and hands-on, immersive learning.
The IT organization was structured into nine service delivery teams, each
focused on a different application (email, finance, etc.) or technology (net-
working, data centers, security). These divisions were arbitrary and isolated;
PMs were responsible for figuring out which subset of teams were needed
for each project.
The process for each project was waterfall. It took a minimum of nine months
to gather requirements, design, develop, and finally deploy the result. This
waterfall was enforced through gate review meetings and a strong change
control process, which required prior gates to have occurred before up-
dates could flow between phases (e.g., before code could be moved from
Development to Stage or Stage to Production).
Historically, the central IT team was highly successful because they offered
reliability. The teams were fairly good at delivering results on time, though
with nine months being the minimum for a project, reliability did not mean
speed or efficiency.
The appearance of Amazon Web Services (AWS) meant that the IT division
had competition for the first time in their history. All the teams considered
AWS to be a threat. It was only a direct threat to the Datacenter team, however.
Even if clients worked around the IT department by using AWS, they still
required the services of the Security, Network, Development, and other teams.
The front end developers and middle-tier support team were enthusiastic
about being involved. They didn’t care where their code ran, and AWS was
a modern technology that looked good on their resumes. The Security team
was neutral. The leader of the Datacenter team actively tried to sabotage the effort.
The sabotage and other political moves created an “us versus them” situation.
People were picking sides, often based on who they thought would “win”
rather than what was best for the organization.
Second Attempt
Their first project was estimated to take eight months. However, the CIO
needed it done in four.
They were radically successful. After a six-month lead-in period, the team
started doing weekly incremental deployments to AWS. They were able to
launch in four months and stayed within budget. They also experienced a
high profile success during the project when there was a news event that
created the need for an ad-hoc website update. The organization wanted to
be able to quickly offer a simulcast feed of a news conference on their website.
The DevOps team was able to rearrange their priorities and respond to this
feature request in under a week and, once tested, deploy the update in 24
hours to the website using their automation tools. They were also able to
quickly engineer the website to scale elastically to support the high-traffic,
high-profile announcement.
This convinced a lot of people that this was the better way to work. Once they
realized that AWS wasn’t taking their jobs away (for example, the Security
team’s role wasn’t diminished no matter where code ran), more people got
on board.
Short-Lived Success
In support of their position, the legacy people pointed out that they had
never had a failed launch. Their waterfall-style projects took years, but their
launches were always a “success.” However, this was because any possible
The DevOps team understood that it was important to lean forward and
push things. It wouldn’t hurt if there were occasional mistakes. In fact, it
would help because they would then be able to learn from them. They could
go further using this new way, even with occasional rollbacks.
The legacy team was involved in launching a new mobile app. Press releases
had already gone out with the launch date; therefore, there could be no delays.
Production got the software delivery two days before the launch, leaving no
time for load testing. Response was larger than expected, and the system
became overloaded, making it unusable. Media criticism was huge. It was
a black eye for the client. However, the legacy team considered this a successful
launch because it was on time and met the requirements as specified.
Additionally, the system was technically “up” the entire time, even though end
users could not access it. The legacy team measured uptime based on their SLA,
which was entirely based on total server uptime and not end-user availability.
Around the same time, the DevOps team launched a new website that wasn’t
expected to receive a large amount of traffic. However, the PR department
learned of news that was going to go public in a few hours, which, based
on a similar situation, was going to increase the traffic by 70x. The news
The legacy team pointed to the DevOps team's scaling as a failure because a
correction had to be made after launch. What the legacy team didn't un-
derstand was that value added to the end user is more important than the
team meeting their milestones.
The legacy team’s process for handling such a situation would have been to con-
vene a change control board and find hardware that could be scavenged from
other projects and repurposed to add capacity. They would have reduced features
by replacing the main page with a static HTML page, substituting smaller or
lower-resolution images, and so on. The technical changes wouldn’t have started
for at least a day. In today’s 24-hour news cycle, the solution wouldn’t have been
implemented until well after the media had moved on to the next big story.
By contrast, the DevOps team didn't reduce functionality; they added it.
They did an early launch of video and other content that they found related to
the event. These were features they hadn’t planned on launching yet, but due
to their CI/CD methodology, they were confident in delivering them early.
Sadly, the CIO that had been supporting the DevOps efforts retired, and
the leader of the legacy team was selected to replace him. This was a major
step backwards.
• Transforming a new team every few months rather than waiting for
the DevOps experiment to be complete would have helped. This
would have built a community of expertise that would have snow-
balled into a larger, more sustainable success.
• Isolating the DevOps team let the other teams off the hook about
following the better practices. While it was good that the DevOps
team was successful, it created an "us versus them" situation. It would
have been better to make the other teams "fast followers," partnering
them with stakeholders sooner by including them in daily stand-ups,
involving them in Scrum or Kanban activities, and so on.
• They could have focused on rewarding not just the first team, but
all teams. The first team got the top cover and did it out of love.
The other teams took more personal risk because they weren’t “true
believers.” People can relate better to non-trailblazers winning the
award.
• Finally, they could have started all new projects and organizations
with DevOps practices. New-hires are more open to the practices
in general, and there is no transformation needed if they arrive and
find that DevOps exists already.
There were wait states at the beginning of the life cycle, waiting for work to
flow into the backlog of Agile teams, and wait states downstream waiting for
dependent work to be done by teams and for environments. Sixty percent
of the time spent on an initiative was prior to a story card getting into the
backlog of an Agile team. Once work left the final iteration, there were high
ceremony, manual practices leading to increased lead time for deployments.
This was due to disparate release and deployment practices, technologies,
and dependencies resulting from large release batch sizes.
Yahoo Answers was created in 2006 as a place to share knowledge on the web
and bring more knowledge to the Internet. It's basically a big game: people
compete to answer visitors' questions, and the most approved answers help
them work their way up to higher levels.
In 2009, their growth was flat, at around 140 million monthly visits. In ad-
dition, they had declining user engagement, flat revenue, and a contentious
team of employees. They used waterfall development with four to six week
cycles because there were quality issues in Operations and Development, and
people were obstructing releases. Fourteen months later, Yahoo Answers was
getting over 240 million monthly visits and over 20 million people answering
questions.[FN15] It was also available globally in twenty languages. It was a
very large-scale property and a significant part of Yahoo traffic. They were
able to grow traffic by 72%, user engagement was up 3x, and revenue was
up 2x. They had daily releases and better site performance, and they had
moved from a team of employees to a kick-ass team of owners.
So, what is the backstory of the transition? Yahoo Answers had an amaz-
ing team of four to five leaders across Engineering, Product, Design, and
Operations who helped transform the business. Everyone sat down together
and came to the conclusion that they could no longer run the business like
this. They all developed a plan. The first step was to get everybody closer to-
gether. When Jim Stoneham arrived in 2009 as VP of Communities, they had
people in London and France, while Jim was in the United States. The odd
Once they were all in the same place, they decided it was necessary to
focus on a few key metrics. Their old dashboard tracked every single
metric, meaning, of course, that nobody paid attention to anything. So,
they simplified. They asked customers what mattered. The responses they
received revealed that customers were primarily concerned with time
to first answer, time to best answer, upvotes per answer, answers/week/
person, second search rate, and trending down (negatively correlated).
Revenue was not a key metric, nor was pageviews; those would follow if the
other metrics were doing well.
The next step was to break work down into small units, each focused on a
key metric, which would have the effect of making each unit of work smaller.
To overcome this, they got everyone into a room and came up with a process
that would work for all of them as a team and would quickly drive experiments
so that people would own the quality. This involved all stakeholders.
Their new product process included weekly sprints; daily deploys (except
Fridays); reviewing metrics daily or more, which was key to moving from
a team of employees to a team of owners and helped create a cultural trans-
formation; and weekly iteration planning. This weekly planning kept up a
cadence of looking forward and backward by a week. The new process also
included monthly business reviews (all hands) during which they would take
five core metrics and revenue and look at the information together. It was
made up of extended Operations people and community managers, which
turned out to be a group of eighty people or so. It took them 60 to 90 days
to get the process working well.
[Figure: Transformation roadmap — Agile, Automation, CI, Metrics/Continuous Improvement, Full Agile Practice Stack]
As with many DevOps journeys and transformations, this one began with the
customer. The customer was providing constantly-changing requirements,
increasing demands, and wanting things done faster and faster, which trans-
lated to failed waterfall schedules and budgets, as well as increased pressure
on Development and Operations to do seemingly impossible things.
As a result of all of these difficulties, the project manager had been replaced
four times due to failing to meet schedule and budgets. The customer was
unhappy because the organization often wasn’t able to deliver what the
customer needed, and even when it was delivered, it wasn’t on time. The
Development and Test teams were exceedingly frustrated because no matter
how many hours they worked, they had zero automated tests and no auto-
mated builds, and there never seemed to be a path to success.
As all of these changes played out, there was a lot of pushback from the
teams developing requirements and testing the system. After two years, a
new Requirements team manager was hired who was open to change and
really listened to arguments for becoming “one team.” Almost three years in,
they managed to break down the siloes and become a cross-functional team.
Starting environment:
• frustrated customer
• 2K organization
Outcomes:
• The customer is now much happier and is an advocate for the team.
• The team lost some team members who just weren’t willing to change
or couldn’t get used to the new process.
We hope you have found these examples useful and that you are able to use
them as inspiration as you approach the idea of transforming your own
team and processes. When faced with new innovations, ever-changing envi-
ronments, and the need to make delivery of services much faster and more
efficient, all while remaining competitive and successful, implementing new
DevOps practices can seem a daunting endeavor. However, keep in mind
that there are three elements we have seen in all our examples that create a
successful environment for adopting DevOps practices: cultural norms to
support the transformation, modern technology practices, and architectural
approaches. As we have shown you, with these in place it is possible to adopt
DevOps practices one step at a time. Before you know it, you will have
created a map of your journey that you can share with others as proof that
DevOps practices can mean positive business outcomes, personal successes,
and fully functional teams who believe in the work they are doing as the
positive business outcomes keep piling up.
Authors
Other Contributors