I visited the YOW! conference in Sydney in December 2019. What better way to persist what I learned than in a blog post? This way I can look it up again when I have forgotten it (not “if”, but “when” - the half-life of things my brain can remember is usually not more than a day or two).
And of course, you can read it up, too, even if you weren’t there. The things that modern technology can do…
So read on to be inspired by the insights of some great speakers. There’s some interesting stuff in here!
Gene Kim on the Ideals of DevOps
In his talk “The Unicorn Project and the 5 Ideals”, Gene Kim talked about the business value of DevOps and the ideals that get us there.
He started by making clear that the value of DevOps is much higher than we would expect. For instance, high-performing teams deploy their software 208x more often than low-performing teams. Also,
- they have a 106x faster lead time (time from starting work to having it deployed to production),
- they have a 7x better deployment success rate,
- they have a 2.6x better Mean Time to Restore in case of a failed deployment,
- they spend 23% more time on developing features,
- and they are 2.2x more likely to refer their company as a great place to work.
Ideal 1: Locality and Simplicity
The organizational environment must support DevOps. We can measure it with the “Lunch Factor”: how many people do we have to talk to over lunch to get things done? Reduce the lunch factor to a minimum.
- Have fewer teams that can deploy more often independently.
- Have an architecture that allows a fast turnaround time.
- Developers must be empowered to work independently including that most integration tests must be run without an integrated staging environment (note: done right, contract testing might be a means to achieve that).
- Developers should not have to understand a big codebase.
- Developers must have the authority to deploy code to production.
- Developers must not have to wait for other business units to get things done for them (like enterprise architecture review, database schemas, …).
Ideal 2: Focus, Flow, and Joy
Developers should solve the business problem and not spend most of their time on configuration and infrastructure tasks.
- Put the infrastructure into a platform.
- Reduce the lead time between code check-in and feedback to a minimum to connect the cause to the effect; ideally, know within seconds if a feature works or not.
- Do trunk-based development to reduce the PR review and merge overhead.
Ideal 3: Improvement of Daily Work
We can only improve if we get feedback, so we should put as much feedback into our daily work as possible to get a little better each day.
- Toyota had a cord along the assembly line that is expected to be pulled by any worker that sees a problem (the Andon Cord. If it’s pulled, they have 55 seconds to fix the problem, otherwise, the assembly line stops. The cord is pulled something like 3500 times a day (talk about feedback).
- Big companies like Microsoft, Google, and Amazon only survived their technical debts because at some point they froze feature development.
- The build time of Nokia’s own operating system Symbian was 48 hours before they “pulled the cord” and went with another OS (didn’t help them in the long run, though…).
- Always take 20% off the development cycle to fix (or avoid) tech debt.
- Enable greatness by having some engineers solely concentrate on dev productivity.
- Have a virtual Andon Cord to fix things right when they happen.
Ideal 4: Psychological Safety
Only a psychologically safe environment allows for new ideas, innovation, and to learn from errors. An environment where you get bullied for each production error does not support a fast DevOps cycle, because everyone wants to be 100% certain that it works, which results in big deployments with slow processes and more things that can go wrong per deployment.
Ideal 5: Customer Focus
Finally, focus on the things that bring value to the customers instead of building functional silos and long internal decision chains within the organization.
- The Unicorn Project by Gene Kim
- Team of Teams by General Stanley McChrystal
- Flow by Mihaly Csikszenthmihalyi
- Transforming Nokia by Risto Siilasmaa
- Inspired by Marty Cagan
Aino Corry on Retrospective Antipatterns
Antipatterns are as useful as patterns if we learn from them. They might first be perceived as patterns and might only later be identified as antipatterns. Then, it’s time to refactor them so they can really become a pattern we should follow.
Aino Corry introduced 6 antipatterns in her talk “Retrospective Antipatterns” that she has learned about in her agile career.
Antipattern #1: Prime Directive Ignorance
The prime directive of retrospectives says that the participants must believe that everyone did the best job they could and that they must believe that a retro can really bring change.
If the participants or the facilitator don’t believe in that, a retrospective cannot have the effect it should have.
Antipattern #2: The Wheel of Fortune
If we collect problems and talk about solutions without identifying their root causes, we could just as well spin a wheel of fortune for the same effect. Better go through all the 5 steps of a retrospective:
- Set the stage - make everyone say something to warm them up.
- Gather data - look into the past to find out what we might want to change.
- Generate insights - find the root causes of things we want to change.
- Decide what to do.
- Close the retrospective.
Not part of the talk, but if you’re looking for inspiration about what to do in each of the above steps, have a look at the Retromat.
Antipattern #3: The Disillusioned Facilitator
If the facilitator doesn’t believe in the activities they are using, the retrospective cannot have the desired effect. Only use those activities that you believe can spark change in the group you are facilitating. Not every group likes the same retro activities.
Antipattern #4: In the Soup
If you hear things like “we can’t do anything about this” or “we just have to accept this” it’s a sign that people think their fate is controlled by outside factors.
A way out of this is to use the “in the soup” analogy. There are things we can completely control ourselves, there are things we can influence by nudging someone outside of the team and there are things we can’t do anything about - those are “in the soup” (Note: if you want to go deeper into this analogy, I recommend you read Stephen Covey’s book The 7 Habits of Highly Effective People - he calls it the “Circle of Influence”).
Draw three concentric circles on a whiteboard (things we can do something about, things we can influence, and the soup) and put solutions on post-its into the circle depending on our influence implementing it. This makes obvious which solutions we actually can do something about.
Antipattern #5: DIY Retrospectives
I obviously didn’t pay enough attention to this part of the talk since I can’t remember the reasoning behind this antipattern.
Antipattern #6: Disregard of Preparation
People coming unprepared is a general problem in meetings. This is especially true for remote meetings (Aino included this video in her talk - it paints a hilarious picture of video conferences).
Send an email the day before and another one 15 minutes before to make people aware of preparations.
If you don’t get the preparation you want from the participants, like everyone in a video conference to share their faces and concentrate on the meeting, make it embarrassing for them.
Todd Montgomery on Pride and Quality
In his talk “Level Up: Quality, Security, and Safety”, Todd Montgomery took a stance for taking pride in our work as software developers to increase software quality.
Our industry creates data breaches and other major incidents non-stop. That should give us a pause. Knight Capital fell to a programming error that sold stock orders high and sold them low and lost several hundred million within an hour.
Many people that are not in the software industry expect software to be clunky or not work as they would expect.
We’re the only industry that has EULA’s that absolve us of responsibility. Imagine that in other engineering disciplines!
Things that don’t help in creating secure and safe (= high-quality) software:
- Languages, frameworks, and methodologies don’t matter for software quality.
- The same is true for the recruiting process of developers and using the latest technologies like AI, ML, and Reactive.
- Code reviews and code coverage alone don’t create high-quality software, though they might have an effect if applied together (100% code coverage might even motivate developers to don’t implement any error handling anymore, because it’s so hard to test).
- Any kind of dogma holds us back and might cause harm.
Things that do help in creating secure and safe (= high-quality) software:
- Getting to the bottom of errors. Analyze the root causes, even if it’s hard!
- Using specs for communication.
- Early requirements analysis.
- Early domain expertise.
- A culture of accountability.
The thing that really helps in creating safe and secure software is taking pride in the work. As Steve Jobs said:
When you’re a carpenter making a beautiful chest of drawers, you’re not going to use a piece of plywood on the back, even though it faces the wall and nobody will ever see it. You’ll know it’s there, so you’re going to use a beautiful piece of wood on the back. For you to sleep well at night, the aesthetic, the quality, has to be carried all the way through.
Develop a taste for software development. Our sense of taste will then find out very quickly if something doesn’t taste good. Care about the work. This leads automatically to accountability and responsibility.
- Putt’s Law and the Successful Technocrat by Archibald Putt
Martin Thompson on Good-Mannered Protocols
In his talk “Interaction Protocols: It’s all about Good Manners”, Martin Thompson made a case for thinking deeply about the protocols we’re creating when building software. Think of little-endian vs. big-endian byte order. Little and big endianness originates from Gulliver’s Travels where the little-endians and the big-endians fight over which side to open an egg (I really did not know that! Probably because I watched the movie in German.). Which side to open an egg on is a protocol, and a good or a bad manner depending on who you ask.
When we think about how our components should interact, however, we talk about API design and not protocol design. But an API is a very narrow view of the interaction between components. It doesn’t include the sequence of things. To create robust interaction, we need to think about the sequence of things, and ask ourselves the following questions:
- What if things were called in a different order?
- What if things don’t arrive?
Studies show that 25% percent of production outages are caused by missing or buggy error handling.
Martin went on to discuss some interaction design best practices:
- Use a binary format instead of a text-based format for interaction protocols. The data centers of the world use more energy than the airline industry. Much of that comes from processing wasteful text-based data formats. It could be reduced by using a more efficient binary format.
- Synchronous interaction is also part of the energy waste. Instead of waiting for a response, think about using async to utilize the processing power we have.
- Batch the things we send over the line. We wouldn’t send someone to fetch things from the supermarket for us one-by-one. We would send someone to get all the things you need in one go.
- Beware of “snake oil protocols”. Protocols like 2PC/XA are broken. Guaranteed delivery doesn’t exist. Instead, implement a feedback and recovery mechanism if something doesn’t reach its destination. A computer should check if the other computer it talked to has understood what it wanted to say (we humans do it all the time when talking to someone else).
Edith Harbough on Feature Flagging
In her talk “Mistakes we made - Patterns & Anti-Patterns For Effective Feature Flagging”, Edith Harbaugh - CEO and co-founder of the Feature Flagging server LaunchDarkly - introduces some good and bad practices when working with feature flags.
Feature flags make it possible to reduce deployment cycle time. We cannot afford big “we need to make everything perfect”-type deployments anymore. Instead, we deploy code behind feature flags and release it to a fraction of the user base to test it.
Edith discussed the following feature flagging best practices:
- Use feature flags to create a kill switch for features. Simply turn it off if it doesn’t work and fix it on Monday (instead of on the weekend). No big rollbacks needed.
- We can use feature flags in a single branch instead of one branch for each feature. This enables trunk-based development.
- Use feature flags to create controlled rollouts. At first, only expose a fraction of the users to the new feature and progressively increase this fraction to make sure the feature works as expected.
- A side-effect of controlled rollouts is that we can let users opt-in to a beta test and only activate certain features for the beta testers without having to set up a whole new environment for the tests.
- Inversely, we can block a feature for certain users if we need to.
- With feature flags, we can test in production if we activate the features only for ourselves. “We always test in production - sometimes we’re just lucky enough to have tested before that”. Instead of trying (and failing) to reproduce errors in a staging environment, make the production environment observable enough to support error analysis.
- In subscription models, we can wrap features that are available for a certain subscription level with long-lived feature flags. This way we can easily move features between subscription levels.
- Feature flags easily allow the sunsetting of features that we no longer want to have in the system.
She mentioned the following antipatterns:
- Ambiguously named feature flags lead to misunderstandings and activation of deactivation of the wrong feature. Think long and deep when naming feature flags. Have a naming pattern.
- Overused feature flags, i.e. using a single feature flag for multiple things, can lead to feature flags with mixed responsibilities and unwanted side effects. We don’t understand what the feature flag does anymore.
- Overlapping feature flags might conflict with each other and may have unwanted side effects on each other.
- Dangerous feature flags are feature flags that wrap a very important feature without which many users couldn’t work with the software anymore. Either remove the feature flag completely or at least don’t make it too easy to disable (e.g. don’t put a button on a dashboard that allows to disable it - there’s a true story behind that).
- Leftover feature flags are feature flags that stay in the system and are not maintained anymore. They become technical debt. What’s worse: when the feature flag server isn’t available for some reason, the feature flags fall back to the default value we defined 2 years ago, which might not be the value we want anymore.
Troy Hunt on Security Breaches
In his keynote “Rise of the Breaches”, security researcher Troy Hunt told of big security breaches, shared some stories about how he and other security researchers have privately disclosed security issues before they could be exploited, and of the often naive behavior of the companies responsible for the security issues.
Troy is the creator of haveibeenpwned.com, where he collects data that has been made publicly available by security breaches so you can check if your email has been among the leaked accounts (Sit down before you paste your email address in there, you have almost certainly been pwned)!
Today, it’s as easy as never to search for and exploit security holes in software that is available via the internet. All it takes is a Google search for pages with an “id” parameter in the URL and a freely downloadable tool that exploits this parameter to inject SQL. If successful, the tool will automatically list all database tables and lets you pick which ones you want to copy. A 77 million-pound data breach in the UK was done by a 17-year-old. Everyone can do it! There are Youtube videos and tutorials freely available!
IoT makes things worse. Usually, you have an IoT device accompanied by an app. The app talks to a server on the internet, which then talks to the IoT device. By simply sniffing the traffic between the app and the server, you can find very embarrassing security holes:
Nissan had a companion app to its model LEAF that allowed, among other things, to control the heating of the car. By simply replacing your own serial number in the URL with that of another car, you could control the other car’s heating. From everywhere across the globe. And the serial number is printed on the windshield for everyone to see.
With the TicTocTrack smartwatch, TicToc has created a watch for kids that allows parents to track where their kids are and who is allowed to call them via the watch. Sadly, you can replace your own “family ID” in the URL with that of another family so you can see the locations of other children. You can also add yourself to the allowed caller list and call those children. They don’t even need to accept the call, because the watch accepts it automatically. The TicTockTrack has been banned in Germany but is still available in Australia and other countries (just in case you still need a Christmas present for your kids).
Confronted with security holes like these, companies often defend their behavior instead of quickly putting a stop to security issues.
Simon Brown on Software Design
In his talk “The lost art of software design”, Simon Brown made a case for doing a little up-front design instead of “just using a whiteboard” and getting started with development right away.
It’s crucial to make some decisions early in a project (programming language, microservices vs. monolith, …) so that the developers have some architectural guardrails to guide their work. But people hesitate to make such decisions in the design phase of a project, often due to misinterpreted Agile values. When writing a technical book alone, you want to have an outline before starting, so why not have an outline when building software with tens of people?
Going for an MVP is often used as an excuse for not doing any design work, since “the design will emerge eventually”. But doing no upfront design can lead to each iteration of the MVP to become a complete refactoring of the architecture.
Even the Agile Manifesto even promotes design work:
Continuous attention to technical excellence and good design enhances agility.
So good architecture is an enabler for agility, not an inhibitor.
Up-front design doesn’t need to be perfect, we don’t want to do BDUF (Big Design Up-Front). But instead of re-inventing the architecture with each iteration of an MVP (i.e. start with a skateboard and evolve it into a scooter, then a bike, and finally a car), we can begin with a “primitive whole” (i.e. start with a very small car without an engine, and evolve that car to its final state).
Evolutionary architecture is doable, but it’s very hard.
As a side effect, design work helps in creating estimates.
So, how much up-front design should we do? We should stop when
- we understand the architecture drivers
- we understand the quality attributes
- we understand the constraints
- we understand the significant design decisions to be made
- we can share a technical vision
- we’re comfortable with the risk.
- Building Evolutionary Architectures by Neal Ford and Patrick Kua
Sarah Wells on Mature Microservices
In her talk “Mature microservices and how to operate them”, Sarah Wells - responsible for operations and reliability at the Financial Times - shared her experience in managing microservices.
In a microservice environment, we never understand the whole picture because it’s too complex. We might not know how to access the database of a specific service to quickly fix a problem. Most problems you encounter are new.
So why not go back to monoliths? Because we want to be able to make cheap experiments. We want to quickly implement a certain feature and test it in production. We want to release many times a day.
Optimizing For Speed
In the old days, we did deployments into production by following a spreadsheet with >50 steps to be performed by managers, developers, and ops people. The Financial Times had all of 12 releases a year (note: I find that 12 times a year was still pretty good in the old days … I remember quarterly or even yearly deployments in my projects from back then, and, yes, we had that spreadsheet, too). Features we built took 8 weeks or more to get into production.
If we want to be able to get features out there fast and to do experiments, we need to optimize for speed:
- automated release pipelines
- automated testing as part of the pipeline
- continuous integration
- zero-downtime deployments
- test and deploy services independently
- little to none coordination between teams for deployments
- the person who built it is the person who deploys it
- no filling out forms or collecting permissions for deploying a change
To operate microservices successfully, DevOps is key. Only the developers really know what’s going on and can fix things.
Besides DevOps, other factors help:
- Make things like infrastructure (queues, databases, …) someone else’s (= your cloud provider’s) problem so that building stuff takes days instead of weeks.
- Bake resilience into your services because things will fail.
- Bake observability (logs, metrics) into your services because reproducing errors locally will be impossible.
- Don’t measure every metric you can think of. Request rates, error rates, and request durations for the main business use cases go a long way.
- In case of incidents mitigate now and investigate later instead of trying to fix the root cause directly.
- Automate things as soon as they get painful.
- Use a service mesh to take care of communication between services.
When People Move On
Naturally, people leave a team and new people join it. Even whole teams are dissolved. What happens to the services the team owned, then? In the DevOps world, services must be owned by a team.
The following actions have helped the Financial Times with changing people and teams:
- Either invest in a service or abandon it. Don’t let it live without an owner.
- Build a graph database to manage properties of the services like ownership and dependencies and to build support tools atop of that graph.
- Practice failures to gain trust.
- Use unique service codes to identify services across different supporting tools.
- A searchable runbook library.
- Keep the runbooks in near the sourcecode (i.e. in the repository) so you don’t have to search for them.
- Make a game of who has the best documentation.
- Automatic and manual reviews of the documentation.
- Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim
- Continuous Delivery by Jez Humble and David Farley