The Best Laid Plans: Why you should never do the big rewrite, even using Agile.


tldr: this article looks at a project I joined where we were seemingly doing everything 'right': the best practices, the best architecture, the best paradigms, the best technology. The result, however, was that we under-delivered, delivered late, and didn't deliver anything of great value; from the outside, the perception was of a slow team who evangelised Agile practices yet wasn't able to deliver valuable, working software in a timely manner.

So, what went wrong?

In this article I'll reflect on the approach we took and attempt to answer that question.

Introduction

It was my first day on the team and I was already writing production-ready code! Yes, that's right. I joined the team in the morning, had my customary on-boarding ("this is what we do, this is what you'll be doing, here's your desk, here's your laptop, there are the toilets and the fire exits!") and then joined my new colleagues who were practising mob programming at the time. Within the hour I was the driver writing production Java code that the navigators were requesting. I didn't quite understand what the code was for, but I was writing it! The use of mob programming was just one of the many modern software development practices that the team had adopted or were in the process of adopting at the time I joined. The tech lead on the project was very much a follower of Agile, Lean and XP practices. During the first five or six months of the project this person had been the sole developer and principal architect, and they had determined a clear architecture, employing practices such as domain driven design, event driven data replication, emergent design, and loosely coupled micro-services.

It was really inspiring, and at first glance it appeared to be my dream software engineering project: we'd been given the go-ahead to build a new architecture that would replace an old monolithic system using the best software engineering practices and the most up to date architectural ideas, and we were following the organisation's desired reference architecture for new software systems; for me, all boxes were ticked.

Over the next few months, the tech lead took our motley crew of four to five developers and coached us and guided us through designing, building, and implementing the new distributed software platform. I was up on most of the practices, but not being a Java developer myself, there was a lot for me to learn technically; others were seasoned Java developers clambering up a steep XP learning curve. Because we were doing everything ourselves, not only were we learning micro-service-based architectures, a domain-driven design approach, event-driven models for replication in decoupled architectures, faking third-party systems, and writing contract tests, but we also needed to do DevOps: AWS, network design, Terraform, Python & Fabric, Docker, and building our own CI/CD server.

In short, a lot of stuff to learn, and do.

It all went really well for the first four or five months following my joining the team: we built our AWS-hosted services, we built our CI/CD environment (using ThoughtWorks' GoCD), we worked our way through building our architecture with micro-services, we built our local virtual development environment, and we did it all pairing, mobbing, using TDD, and ensuring that we implemented only what we needed to implement. It was tough going. There was a lot to build. There was a lot to discover, and a lot to decide; none of us had ever done anything quite like it before.

Then, suddenly, the messages from the product folks started to change. Whereas it had always been "no, you guys are fine, carry on doing the great stuff you're doing, we love it!", suddenly the messages became "it's all very well all this emergent Agile malarkey, but when is the software going to be delivered?"!

When we heard this, I looked at what we'd done and what was left to do: we'd achieved a great deal, but we were a long way off finishing. We'd spent so much time on learning, on architecture design, on supporting infrastructure, and on working out the best design that we hadn't delivered any working production features by that point. I think this is what had started to make the product guys nervous.

At this point, our development team started to panic: We pivoted between various architectural approaches and lost time doing it; we made the product folks even more nervous, got even more freaked out, pivoted some more, slowed down some more, panicked a bit more, changed tack a bit more, made people more nervous ... I'm sure you get the picture.

Eventually, after 18 months, we did deliver: something. It was late, and it lacked so many features that it couldn't replace the existing functionality it was meant to. It did go live, but users could only get to it via a subtle link off a page that hardly any users visited. The result stank: it was demoralising, and it did a lot of damage to the development team, as well as to the reputation of Agile practices within the development team, with the product team, and with the wider management.

The tech lead left, and we moved on to a different project: adopting and looking to improve the existing platform, which is arguably what we should have done in the first place.

So, what did go wrong?

I've split my analysis into the various areas that I think require consideration before anyone takes on such a project...

Emergent Design

It was a core principle of ours not to build too much up front: not to over-engineer functionality that we didn't know we would need, and not to do defensive programming.

I think that this is an excellent approach, but I think we did too much of it in places. There were certain bits of functionality that we knew we needed to do. These were the must dos. For example, our tool needed to not show data that was flagged as private. We knew this up front, it was a must do, but we didn't start to implement it until very late in the day.

The result of postponing the must dos was that we ran into some unforeseen issues that we hadn't originally engineered for and had to undo some work we'd already done, and perhaps shouldn't have done in the first place. In addition, we of course had to spend some time re-familiarising ourselves with areas of the code we hadn't touched for months.

In hindsight, as software engineers working with the product owner, we should have split deliverables into must dos, should dos, and could dos, and architected for the must dos, not left them until late in the day to implement.

That said, the upside was a truly lean codebase. There was zero non-utilised production code when we went live. There was no unnecessary defensive coding (such as checking for invalid integers when the consumers of a method would never send invalid integer values). As we de-scoped the original story map, delivering what really was the minimum viable product, we found that there was virtually nothing we had to remove, because we had resisted the tendency to add something you think you'll need just because you're already in there.
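To make that concrete, here's a small, hypothetical Java sketch (illustrative only, not code from the project) of the kind of guard clause we deliberately left out when a method's only callers were our own, already-validated code:

    import java.util.List;

    // Hypothetical sketch of the lean style we aimed for: no defensive checks
    // for inputs that our own callers had already validated.
    class SearchService {

        interface Repository {
            List<String> findByTerm(String term, int pageSize);
        }

        private final Repository repository;

        SearchService(Repository repository) {
            this.repository = repository;
        }

        List<String> search(String term, int pageSize) {
            // A defensive version would have opened with something like:
            //   if (pageSize <= 0) throw new IllegalArgumentException("pageSize must be positive");
            // but pageSize only ever arrived from our own, already-validated
            // request handling, so we left the check (and the tests for it) out.
            return repository.findByTerm(term, pageSize);
        }
    }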

Getting the balance right is the key. I'm quite sure that hindsight makes it much easier to see what was a must do and what wasn't. When I joined the project, there was a whole story map of very high-level epic stories and it wasn't clear what the critical path was. We didn't do a whole lot of in-depth specification up front, as is the Agile way, but I do feel that we didn't quite do enough.

Mob Programming

I first heard about Woody Zuill's mob programming from the man himself at Agile on the Beach in 2015. I think it's an excellent way for an Agile team to work, though perhaps only for certain things, rather than all of the time. One of the main issues we had with mobbing, in my opinion, was one of perception. The offices we were in had very few work areas set up for Agile/XP software development, and we sat very visibly in a space among people working in a very traditional manner (developers working alone at their terminals). We had our work projected onto a large screen, with us sat on a sofa looking up at it. Everyone who walked past, including the bean counters, could see five developers all sitting there apparently writing Java code at the pace of a snail, debating whether to use functional or imperative programming approaches and how best to name a particular method or variable. I'm not sure it made the best of impressions, especially amongst the more Agile-suspicious staff who walked past our little area.

My advice to you: if your company is struggling to buy into mobbing, then don't do it in public: book a meeting room and do it there, behind closed doors. If it works and you deliver great software, then shout about it, and then work to get those new mobbing spaces installed in your building. If you screw up, then no-one will have noticed you were ever mobbing in the first place, just having a lot of meetings, and meetings are acceptable, right?

Another issue was one of trying to engineer new solutions en masse when most of us were unsure of what we were doing. There were a few stronger personalities in the team who had strong opinions on the technology stack, or the architecture, that we ought to be using, and this quite often led to the whole team being dragged into a debate. When the pressure was put on and we started panicking and pivoting, we did a lot of mob wheel-spinning, and I found this really painful to be involved in.

I feel that mobbing is great for when everyone is on the same page, at roughly the same level, and the path is clear. It's bad for indecisiveness, for wheel spinning, for mob googling; mob cluelessness sucks!

We did adopt a mob programming charter (or rules of engagement if you prefer). This document was a good idea. We didn't stick to it rigidly, but it did prove to be a good document to help us stay on track.

Lastly, mob programming was used by the tech lead to on-board five developers over the course of several months. It was probably the only way of doing this, but it was excruciatingly slow, quite frustrating, and might well have looked inefficient to management.

Project (and Perception) Management

We had one enormous user story that took us nearly a year to complete. It was broken into sub-tasks, but essentially we'd do stand-up each morning and still the story would be sat in the "In Development" column. There was no visibility of what was going on, so no one cottoned on to how much work we were doing from a learning and architecture point of view; all they could see was how little progress we were really making from the product viewpoint.

The story went "I can do this simple thing that I can already do on the existing system", but the devil in the detail was hundreds of tasks around building a micro-services architecture; writing testing frameworks, let alone tests; writing a continuous integration and continuous deployment pipeline from scratch; writing a deployment framework from scratch in Python; designing a real-time event-driven replication system built on AWS; oh yes, and scripting the entire AWS architecture in Terraform.

The perception amongst our peers in architecture within the organisation was that we rocked! We were doing great stuff. We were doing everything by the architect's dream book, doing exactly what the technical strategists wanted, and were doing lots of stuff other developers on other projects only dreamed of doing. This was a great feeling, lots of people were inspired, we got great feedback, and follow-up explanations and knowledge-shares have proved useful and fruitful for all concerned.

All well and good, but the product owner felt we were delivering not a lot, and from a user-story perspective we weren't. The problem here I think was one of perception.

We'd created an issue by deciding early on that we wanted only user stories driving the development. This meant that in the first year we completed hundreds of technical stories to deliver one single user story.

This is where the perception management issues came in: we weren't representing ourselves and all the work we were doing, and this lack of visibility disempowered our stakeholders. Had we shown all the work we were doing, the stakeholders might have started to question things sooner, and this might have saved the project; at the very least we could have been more in it together, and we could all have made the decisions to stick, stop or pivot together. Knowledge, or rather awareness, is a dangerous but also empowering thing.

Lone Chief Architect Designs in Silo

The majority of my fellow developers and I joined the team six months into the project. Our tech lead had been on the project since the start. It had been they that had agreed to do the rewrite, they that had agreed what part of the system to start with, and they that had single-handedly planned and built a spike of the architecture that we were to use.

I must be careful at this point not to lay the blame solely on their shoulders. It wasn't their fault. The outcome was the result of their having worked solo during the project kick-off. There was no one to question their decisions, no one to temper the desire to do everything right, or to use the latest and greatest of everything, with questions such as "what are we trying to deliver here?", "is this the best approach to take?", and even "is this really the thing we ought to be doing at all?".

It's not good to work in a silo. And it goes against everything Agile stands for. Don't do it, and don't put anybody else in the position of having to do it. Period.

When Doing Everything Is Doing Too Much

As I've mentioned a few times, we were a cross-functional team doing everything we needed to do, including our own DevOps.

The DevOps movement's aim of unifying software development (dev) and system operations (sysops) is laudable and I wholeheartedly agree with the movement away from these silos.

I've pretty much always been in both camps: developer and sysops.

Obviously that's all good, but if you contract in a team of Java developers, you cannot reasonably expect them to know anything about operations, operating systems, networks, orchestration, etc. This means you'll need to teach them about all this stuff, and work with them to ensure that it's done right and that it meets the team's standards.

This process takes time. You need to factor that in.

Micro-services Architecture

It's a great idea: micro-services allow scaling, they allow massively decoupled systems, they allow different teams to work on separate parts of the system with contract tests as their agreements, and they allow those teams to implement software in the language that is the best fit for the task at hand; or maybe the one that's the new thing all the cool kids are coding in. But be warned: distributed, micro-service-based architectures can be a lot more involved to implement than our old friend the monolith, you know, the uncool one you no longer want to invite to parties.
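To give a flavour of what "contract tests as their agreements" means in practice, here's a minimal, hand-rolled sketch in Java (hypothetical: the URL and field names are invented, and a real team would more likely reach for a dedicated tool such as Pact). The consumer team lists the response fields it depends on, and the provider team runs the check against its service before deploying.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;

    // Hand-rolled consumer-driven contract check (illustrative only): fails if
    // the provider's response no longer contains the fields the consumer uses.
    public class SearchContractCheck {

        private static final String PROVIDER_URL = "http://localhost:8080/search?term=test";
        private static final List<String> REQUIRED_FIELDS = List.of("\"results\"", "\"totalHits\"");

        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create(PROVIDER_URL)).GET().build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

            if (response.statusCode() != 200) {
                throw new AssertionError("Expected 200 from provider, got " + response.statusCode());
            }
            for (String field : REQUIRED_FIELDS) {
                if (!response.body().contains(field)) {
                    throw new AssertionError("Provider response no longer contains: " + field);
                }
            }
            System.out.println("Provider still honours the consumer's contract.");
        }
    }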

Love Legacy Code

And all of the above brings me to the biggest elephant in the room: why were we doing the big rebuild at all?

At the time of joining the project I was giving a conference talk about how we should be wary of the big rewrite and should be working instead on engaging with and improving our legacy code (video, slides).

Somehow I didn't clock this at first, but after a while of being on the project it became clear to me that there was actually another software delivery team involved! This team was working on the monolithic platform, had done for many years, and was delivering user value on a daily basis.

We were trying to rebuild the whole thing in a bubble!

As time went on it became apparent to me that we'd also chosen to rebuild a relatively low-value part of the system: a site search tool. It was not a piece of functionality that people used all that much, and they had a reasonable alternative to the current creaky old system: Google.

The existing monolith was very tightly coupled, with business logic strewn across front-end code, back-end code, and stored database procedures. The search tool was complex architecturally, and I can see why we might have chosen it, but building it in a bubble (silo) separate from the rest of the monolith meant that we weren't engaging with the real integration issues we'd most likely face in the future.

We weren't employing any of the patterns for improving legacy systems: we weren't adding test coverage to the existing monolith, weren't working out integration points so that we could start to build out new features and replace old ones, and weren't using strangler patterns to gently lift functionality out of the old system and into the new.

This meant that we were never going to get it done! Zeno's paradox.
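For contrast, a strangler approach grows the new system around the old one, routing traffic feature by feature from the monolith to the new services as each piece is proven. Here is a minimal, hypothetical Java sketch of that routing decision (nothing like this existed on our project; the paths and backend names are invented):

    import java.util.Set;

    // Hypothetical strangler-pattern sketch: a thin routing layer in front of
    // the monolith sends migrated features to the new platform and everything
    // else to the legacy system; features move across one path prefix at a time.
    public class StranglerRouter {

        // Paths migrated to the new platform so far (illustrative).
        private final Set<String> migratedPrefixes = Set.of("/search");

        public String backendFor(String requestPath) {
            for (String prefix : migratedPrefixes) {
                if (requestPath.startsWith(prefix)) {
                    return "new-platform";
                }
            }
            return "legacy-monolith";
        }

        public static void main(String[] args) {
            StranglerRouter router = new StranglerRouter();
            System.out.println("/search/books -> " + router.backendFor("/search/books"));
            System.out.println("/checkout -> " + router.backendFor("/checkout"));
        }
    }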

Do You Speak My Language?

One of the driving factors for the rebuild work was that the majority of the code was written in ColdFusion, a rapid web-application development framework created in 1995. ColdFusion was not one of the company's strategic languages at the time, and therefore it was seen as legacy. But was it really? Let's take a look at ColdFusion:

ColdFusion was originally developed by Allaire in 1995; I had my first taste of it back in 1996 whilst working with erstwhile Internet pioneer DC Creative (via my own company Psand). It is a proprietary rapid-development web-application framework now developed by Adobe Inc., which bought Macromedia, which had in turn bought Allaire. It uses an SGML-compatible, tag-oriented mark-up language called CFML, and the CFScript extension to CFML permits more JavaScript-like coding styles, such as:

  
component {
    public void function foo() {
        WriteOutput("Method foo() called<br/>");
    }

    public function getString() {
        var x = "hello";
        return x;
    }
}

Ultimately the problem turned out not to be ColdFusion, but rather the way the ColdFusion had been written over the years. Some parts were good, but others lacked structure, there was a lack of consistency, and of course there were no tests! In fact the single most problematic part of the codebase was written in none other than Java. There were some 80+ Java "scripts" that ran various automation tasks, written in very old-school, very unstructured Java, including one core dependency of several thousand lines that relied on an Oracle stored procedure, also of several thousand lines, with the domain logic spread liberally across both the Java and the Oracle, and across a number of other Java programs.

You get the picture I hope.

In short, let's not get too hung up on the programming language (as did one person I interviewed the other day who, when we mentioned that some of the codebase was in Scala, started frothing at the mouth!) and let's focus instead on the fact that good code is good code (even in a language such as COBOL) and bad code is horrible in any language.

As programmers we need to be polyglot, but mostly we need to focus on improving our code.

The Blame Game

And the result?

The result was that we under-delivered, and we delivered late. It's not an exaggeration to say that the result was the worst I've experienced in my career as a software engineer, certainly from my own personal point of view. When the tech lead left, I took over the role, and found myself in the unenviable position of having to defend decisions that I really didn't agree with.

It's my problem, and it was a personal choice to accept the role.

The perception was pretty much that we were a team that couldn't deliver, and maybe worse still, that the problem was the Agile and XP practices we used. We had a workshop on Scrum thrust upon us, and another on story writing. Our colleagues felt that, for future delivery, we needed to stop pair programming, perhaps be pragmatic about testing, and really question quality: a regrettable outcome.

The Day After the Storm

Once the software went live, we all lost interest in it. Partly because of the low value of the feature it implemented, and partly because I think everyone just wanted to move on from the experience.

One of the downsides of this was that no-one was really interested in maintaining it, or in improving it. The lovely CI/CD pipelines stopped working, the contract tests started failing, but there was never going to be a moment when we would get around to fixing them; there just wasn't the desire to do any more work on it; it was a damp squib.

Conclusion

In summary, here are my lessons learned:

- Don't do the big rewrite: engage with, and improve, your legacy code instead.
- Don't let a lone architect design in a silo, and don't put anyone in the position of having to.
- Work with the product owner to identify the must dos, and architect for them early rather than leaving them until late in the day.
- Make all of your work visible, technical stories included; perception matters as much as progress.
- Mobbing is great when everyone is on the same page and the path is clear; it's painful for collective indecision, and think carefully about how publicly you do it.
- If your cross-functional team is doing its own DevOps, factor in the time it takes to learn it.
- Distributed, micro-service-based architectures are a lot more involved to implement than a monolith.
- Good code is good code, and bad code is horrible, in any language; focus on improving the code, not replacing the language.

(c) 2018-2022 Mike Harris. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".