
Meaningful RCAs: Documenting the results

So far I’ve written a few blog posts around conducting RCAs, focusing on the people and the questions. However, what I’ve yet to touch on is the documentation side.

Much like the idea that the activity of creating a test plan is more important than the document itself, I have similar thoughts about the RCA. With this in mind, the most detailed document that I’d have is the collaboration board that I’ve used to facilitate the discussion. It captures our thoughts, discussion and key points.

[Image: a board created in Mural, containing several sections loosely related to the SDLC and a number of different sticky notes]
Example of an RCA, although this is all lorem ipsum text as I obviously can’t share a real one!

After the session I will then (as soon as possible) write up the overview. This captures the key findings from the RCA, explaining the nature of the problem, what we’ve learnt, any actions and so forth. It is shared with the team(s) on Slack for a first look before I share it more widely.

I also liked keeping a spreadsheet of my RCA findings. It would include the summary, a link to the board & tickets, and an overly simplified “category” (missed requirement, domain knowledge, coding error etc.).

This category is useful for metrics to help us understand patterns. It was useful when I was pushing to drive new initiatives because I could say “if we’d been using examples in refinement, we wouldn’t have had these massively complicated bugs”. If I’d had more time with my former employer, I’d have loved to explore a means of saving RCA summaries where I could tag the RCAs with different attributes to help demonstrate patterns.
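
To illustrate what I mean by using the category for metrics, here’s roughly how that tallying might look if the spreadsheet were exported to CSV. The file name and column name below are entirely made up; this is a quick sketch rather than anything I actually ran:

```python
# A minimal sketch of spotting patterns across RCAs from a CSV export.
# The file name ("rca_log.csv") and column name ("category") are
# hypothetical - adjust them to match your own spreadsheet.
import csv
from collections import Counter

def category_counts(path: str) -> Counter:
    """Tally the overly simplified category assigned to each RCA."""
    with open(path, newline="", encoding="utf-8") as f:
        return Counter(row["category"].strip().lower() for row in csv.DictReader(f))

if __name__ == "__main__":
    for category, count in category_counts("rca_log.csv").most_common():
        print(f"{category}: {count}")  # e.g. "missed requirement: 7"
```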

I had also dabbled with feeding this data to an AI agent (one where we’d got the legal protections in place so it wouldn’t feed back into the main models). This was quite neat… but that’s a topic for another day…

One final note: I am aware that most people would still prefer a more formal & structured documentation approach than mine. I get that. Some of the things recorded could, I guess, be useful. However, I’ve yet to experience a time where a two-page document was useful. I have found these RCA discussions really valuable, and consequently my documentation approach is similar to my retro approach: it is collaborating on & capturing a conversation.

If you’d like to read more on RCAs, check out the collection of my posts on the Meaningful RCAs page!


Meaningful RCAs: Structuring questions

I’ve already talked about how we need to unleash our inner toddler by asking “why”. But what questions do we ask?

Background

Before getting into the guts of the RCA I like to go through the background. This partly acts as a refresher for everyone, as it may have been a few weeks since the work, but it also helps guide me in my questioning.

This usually means sharing:

  • Links to the defect we’re RCAing & the original ticket
  • Links to PRs to fix the issue and, where possible, the original (“offending”) PR.

Then asking:

  • Can you describe the problematic behaviour? (i.e. what was actually wrong from a user’s point of view)
  • Can you describe the nature of the code fix?
  • What do you remember from working on the story?
    • How long did it take?
    • How many people were involved?

The Fix

Before learning more about why the issue came to be, let’s make sure that we’re confident in the fix. I like to ask two questions here:

  1. How resilient is the fix?
  2. Will we know if the behaviour regresses again? (i.e. did you add automated tests?)
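
To make that second question concrete, this is the kind of thing I’d hope the fix shipped with: an automated check tied to the defect. The ticket number, function and behaviour below are all made up purely for illustration:

```python
# A hypothetical example: the ticket number (BUG-1234), function and
# behaviour are invented. The point is that the fix ships with a test
# named after the defect, so we'll know if the behaviour regresses.

def apply_discount(total: float, code: str | None) -> float:
    """The (fictional) function that the fix for BUG-1234 touched."""
    if code is None:  # the original bug: a missing code crashed the checkout
        return total
    return total * 0.9  # flat 10% off for any code, purely for illustration

def test_bug_1234_missing_discount_code_does_not_crash():
    """BUG-1234: checkout crashed when no discount code was entered."""
    assert apply_discount(100.0, None) == 100.0

if __name__ == "__main__":
    test_bug_1234_missing_discount_code_does_not_crash()
    print("BUG-1234 regression check passed")
```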

Quality Engineering Throughout The SDLC

Now we get into the really important questions. This is where we go through the software development life cycle and think about what we did and whether there were opportunities to (realistically) catch the issue at each stage.

First of all, if this was an escape, let’s ask whether we could have caught it in production (e.g. monitoring), release testing or epic close-off testing. I wouldn’t advocate just asking “could we have caught it here?” but asking what the process is, what testing was performed, and whether this is in the scope of what we’d usually test.

We then move on to the story within the sprint, starting with the testing of the original story / bug. We’re trying to understand whether this was a brain fart (it happens) or whether it is just something that we wouldn’t usually consider testing. If not, why not?

Then we get more technical. We’re looking at the PR, starting with code review. I’ll be asking about the nature of the bug and whether it is something that we’d look for in review. I’d want to understand whether SMEs were involved and, if not, why not. Did they check the testing notes & automated tests in the code review? Code reviews aren’t ever going to catch everything, but it is good to discuss this process. It is a nice chance for people to talk about the value and role of a code review too.

I then concentrate on the developer’s testing. What had they covered through automated and hands-on tests? How much was iterative? As a former dev, I know all too well how even a well-intentioned developer who tests their work can let things slip through here (see dev BLISS).

Then we’re back to technical discussions on the code. This is where I hope the architect can ask a few questions, although other team members often chip in. This discussion is a great way for the team to learn from each other.

You might think that, now we’ve talked about the types of testing and the development challenges, we’d stop there. But no, we don’t!

The teams will have had planning and refinement when breaking down the story. We do test strategy and planning at epic and sometimes user story level. We think about the complexity of the code work with architectural studies before starting an epic. Let’s continue diving into these.

Again we’re asking what was done, whether this is a scenario that could have been caught (either in behaviour or in code), and tapping into what more we could have done. This helps us spread left.

A Parting Question

Near the start I asked about our confidence in catching this issue again. Unless we’re running out of time (which, unfortunately, is often), I like to ask a similar but slightly wider question: how confident are we that we won’t see a repeat of the issue? Not necessarily the same issue, but a similar one.

Summary Section

Finally I’ll have a summary section with actions, learnings and a summary of the RCA. It is often written up afterwards because, unsurprisingly, the hour I book for RCAs isn’t always enough to cover everything in this post! I’ll explain a little more on this in a separate post.

So in short…

We start off by discussing the background of the story to refresh ourselves and to get an idea of which threads are best to pull on as we go into things. We’ll also check we’re confident in the fix.

We then take our time going through the SDLC. We’re not just asking “could we have caught it?” or “why didn’t we catch it?” but looking at the actions, steps and processes to understand the answer to this.

I switched the ordering from starting with the first stages of the story to starting in prod, after advice from a great chap called Stu Ashman. I found this got us much more engagement in the testing and activities around post-release. You’ll also see how, through the different stages, we ask slightly different questions to consider more than “why didn’t we catch it?”.

We’re using every stage as a learning opportunity.

… and that makes for a meaningful RCA!


Meaningful RCAs: Involving the right people

I love collaboration and making exercises something that people can engage with. It is usually the discussion that matters more than what gets written on paper. For this to be successful, you need to have the right people in the (virtual) room.

As we’ve touched upon already, the RCA should cover all areas of the lifecycle of the work that introduced the defect. Consequently I’d invite:

  • At least one person involved in refinement
  • The developer for the original story/defect
  • The code reviewer for the original story/defect
  • The tester for the original story/defect
  • The developer who fixed the defect that we’re doing the RCA for
  • An architect, even if they had no prior involvement (arguably that’s better). Failing that, a team lead.
  • Optionally any other team members.

I would have liked to invite a PO to some but I never got quite that bold.

There are two things to highlight here.

The first is that we’re focusing on who was involved when the defect was introduced. We have insight from the person who understands the fix, but it is the processes, decisions and challenges around the original issue that we want to dig into.

Secondly, between the architect and myself we have a cracking blend of insight. There’s someone who can analyse the code, design and technical side and ask meaningful questions, while I can look at testing, process and ways of working.

For this to be successful you need all participants bought into the idea of it being a safe space where no blame is placed. I’ve written about this previously.


Meaningful RCAs: Asking Why

I did my first RCA back in 2019. We looked at each stage of development and used a “5 whys” approach to tap into why the issue wasn’t caught at that stage. We maintained this approach over the next few years, and when I switched to my more coaching/leadership role in 2023, I started running them with the teams that I supported. We actually used Google Sheets or Excel for these. I’d have my list of questions to ask; I’d ask a question, then ask “why” about the answer. For example:

  • “During refinement did we call out our support for different browsers?”
    • Nope, we should have
      • “Why wasn’t it raised? Tell me about your refinement”
      • Well we look at the high level requirements and turn that into ACs. We don’t look at more non-functional aspects.
        • “Why not?”
        • That is for the story’s test plan I guess.
          • “Why does this only impact testing and why isn’t that part of the estimate?”
          • … and so on

The point I’m trying to make in the fictional example above is that rather than just taking that first answer, you tap into it more to understand the real reason why the issue happened.

To help with asking questions, I like to lean on my ignorance, even if I have actually done my homework. It is more powerful to get someone to explain the problem or articulate what they did than for you to say that X & Y were done but not Z. This is why, despite having read the ticket several times by now and dived through Slack conversations, I’ll ask who worked on it. I’ll ask whether everything went OK. I’ll ask whether topics like this came up in review.

My point here is that I’m not running the RCA just to have an answer to “why did this bug come to be?” but so that we can discuss it and learn from it. This is what makes an RCA meaningful.

In time I’ve adapted and improved my questions, including the order that I ask them… but I’ll cover more in a subsequent post.


Meaningful RCAs: Process not individuals

I’m currently writing about RCAs in a series of posts. If you aren’t familiar with them, check out my Meaningful RCAs page first.

One of the key things that I would emphasise before every RCA is that there needs to be either zero blame or full org blame. As I’ve written before, it takes a team to ship a bug; the reason why we have refinement, unit tests, static analysis tools, code review and hands-on testing is to catch such issues. If one person can truly be responsible for shipping a bug, that is your problem.

So why is it important that we have an open, safe space and a zero-blame culture? Well, if we want to truly tap into what happened and understand the chain of decisions that led to a mistake, we need to be able to talk about them. If people fear repercussions, whether disciplinary or, more likely, just the social awkwardness of having their mistakes discussed, how can they be expected to be open and honest?

However, mistakes are part of software development and they will always occur. Tools & processes help us reduce the likelihood. We have our ways of working to help us build software correctly, but they will sometimes have gaps or fail us. Sometimes that is fine. Shit happens. Bugs will happen. Sometimes we just need to accept this. But other times we can understand why something failed and improve it.

To help with this I like to share a little blurb at the start. I’ll joke about it taking a team to ship a bug. I’ll make the point that this isn’t about blaming anyone and that mistakes will happen. It is that mix of people, process and tools. Fixing the fallibility of human nature isn’t possible, but fixing processes & improving tool usage is.

I also have about 17 years’ experience in a variety of different roles, and experience of screwing up in a variety of new and exciting ways. I’m not ashamed of having made many mistakes throughout my career (it is natural), and if sharing that makes it a more open & safe space, then I’m happy to share it.

So when you’re running your RCA, or in the invite, share that it is zero blame. Share that it is the process that we are interested in, not individuals.


Meaningful RCAs: Why they are valuable

Software development is tricky. It can involve complicated tools, languages, domain spaces and a variety of teams & roles. The world in which we work is constantly changing and evolving, and we need to adapt. On top of that, knowing best working practices isn’t instinctive like shark hunting or bird migration patterns. We need to learn. Just because your ways of working were fine on one project, it doesn’t mean they’ll work on the next or in a new scenario.

We all know about the idea of learning from mistakes.

“The only real mistake is the one from which we learn nothing” — Henry Ford 

“Mistakes are the portals of discovery” — James Joyce

Within the world of software, arguably the best way to learn from our mistakes is through root cause analysis of escalations or high-priority / worrying defects. Whilst some people may have done RCAs that conclude “missing test case” or “incorrect null handling”, meaningful RCAs are about learning why we made the mistakes and what we can do to avoid them… because this won’t be the last time that we miss test cases or incorrectly handle null.

Over my next few blog posts I’ll share how we can go about running meaningful RCAs to improve our ways of working, helping us avoid being bitten twice and build better quality software.

For more on RCAs, you can find a summary of what I’ve shared on my Meaningful RCA page.