Over-engineering is a developer’s cry for help

As always when I write about “anti-patterns”, or “things not to do” – I’m speaking from experience.
In the case of over-engineering, not only did I use to do this “bad thing” – I still sometimes do it today – knowing full well it’s a “bad thing” to do!
For me, this has proven a tough habit to break.

This is partly because I wasn’t clear, until recently, on what “over-engineering” actually is.
I used to think that “over-engineering” is the over-application of “engineering”. Like:

  • Creating a 7-layer architecture for a CRUD app
  • Using Redux for a website with fewer than 5 pages
  • Using Kubernetes when you’re not Google 😉

But that’s not over-engineering. That’s just bad / overly-complex / resume-driven engineering.
Over-engineering, as I now understand it, is:

Building functionality that is not required.

(yes, “required” can be a bit blurry. For simplicity’s sake, we can define “required” as “appears on the work ticket we’re currently working on”)

On the face of it, this looks pretty simple. You’d have to be pretty dumb to work on something that nobody’s asked you to do. Right?

Well, what about “future proofing” and over-generalizing? Have you ever done those? I have!
These are, actually, cases of building functionality that is not required.

As usual, XKCD explains it best:

How many times have you implemented a system to support any arbitrary condiment, when the customer just wanted some damn salt?

Some other examples of over-engineering, from personal experience:

  • Recently I added a new string field to a data model. We just had to read it from a form, and later show it in the UI.
    A colleague suggested we write code to normalize the values in that field (e.g. downcase, trim whitespaces), as well as create an index in the DB for it. This is in case we’d need to filter or sort reports by that field in the future.
  • I was tasked with building a very simple survey tool. All the questions asked would be “yes/no” questions.
    I implemented a system that can process any text answer.
  • When calling out to a 3rd-party API, we wanted to retry 2 times when an error occurs, in case the error is transient.
    We’ve implemented a configurable system that retries X number of times.
  • When a user submits an identifier of an OWASP top security risk (e.g. A03), we needed to show some information from OWASP about that risk.
    I was already planning a real-time API client or web-scraper, and how we can cache data even between requests to improve performance and resource utilization. That way we’ll always have up-to-date data in real time.
    My colleague just scraped the OWASP website into a JSON file, and set himself a reminder for next year to check the new top 10 list.

Also, have another look at the examples of “bad engineering” I gave above. In some contexts, they too can be considered over-engineering:
Building extra application layers “in case” we need to add more logic. Using Kubernetes “in case” we need to handle web-scale load. Etc.

Why is this a problem?

Some of the above “over-engineering” is not very complicated or difficult to do. Downcasing some strings, or reading from a configuration file. Is it really such a problem?

Keep in mind, though, that the cost of initial implementation is not the largest cost associated with writing code.

Any code that’s written requires maintenance.
Every developer looking at that code for the first time needs to figure out what it does, and why. How confusing is it when the answer to “why?” is “no reason”!
Every existing functionality needs to be preserved. So from now until forever, we need to be careful to not break it as we change the code around it. We need to regression-test it every time we make changes.

Furthermore – when we guess at additional functionality, we guess that it may be required along some axis X.
And we write code based on some assumptions about X.

However, the actual functionality in the future may be along axis Y, which is different to X.
Now we’ve painted ourselves into a corner, designing our software to handle change in the wrong direction.

(For example – we designed a system to configure the number of retries for an API call.
But what if, in the future, we still want to only retry twice, BUT, we need to control the amount of time between retries? The “extra” code to check how many times to retry may make the implementation of configurable wait times more difficult)
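
To make this concrete, here’s a minimal sketch (hypothetical names, TypeScript-flavoured; not code from the actual project) of the two versions:

// What the ticket asked for: call the API, retry twice on failure.
async function fetchWithRetry<T>(call: () => Promise<T>): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < 3; attempt++) { // 1 attempt + 2 retries
    try {
      return await call();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// The over-engineered version: a configurable retry count nobody asked for.
// If the real future requirement turns out to be "wait between retries",
// this extra configuration surface is what we'll have to work around.
async function fetchWithConfiguredRetries<T>(
  call: () => Promise<T>,
  config: { maxRetries: number }, // speculative "axis X"
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
    try {
      return await call();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}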

Why do we do it?

Hopefully I’ve convinced you (or, you already knew) that implementing unwanted functionality is a bad idea.
So why would developers who are otherwise smart, skilled and reasonable, engage in over-engineering? Making their code harder to work with, in order to build functionality that nobody wants?

Are they crazy?

No. Quite the opposite – we over-engineer because we’re smart.

Here’s a very logical thought process, whose conclusion is over-engineering:

  1. It’s difficult to understand and / or to change this piece of code. AND / OR
  2. It’s difficult / time consuming to validate that this piece of code is working as intended.
  3. I currently have to change this piece of code, and validate that it works as intended (e.g. because I’m extending this functionality).
  4. This is a difficult / time consuming process due to the above.
  5. If I have to do this again in the future, this will result in more difficulty and time wasted.

Conclusion: As long as I’m here, I might as well make some further changes that may be needed in the future, even if they’re not needed now.
This will save the overhead of having to understand and / or validate this code again in the future.

Statistically, this seems like a sound conclusion.
Let’s say that I’m investing a further 1 hour of work on those not (yet) necessary features.
And let’s say that this 1 hour now can save me 3 hours in the future, if my guess about the future requirements is correct.
In that case, even if my guesses are only 34% correct, I still come out ahead.
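(Spelled out: the expected saving is 0.34 × 3 = 1.02 hours, slightly more than the 1 hour invested, so anything above a one-in-three hit rate looks like a net win.)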
I like dem odds!

However, if we look a bit closer, the above logic is flawed.
Specifically, the assumptions that we begin with:

The code is difficult to understand and / or change and / or validate.

But this is not due to some act of god, or a law of nature.
The code is difficult to understand because we wrote it this way.
It’s difficult to validate because we wrote it this way (or didn’t write good automated tests).

That’s why I claim that over-engineering is a cry for help:
If we start from the result (“we are over-engineering”) we can work our way back to the reason (“the code is difficult to work with”).

If we do that, we can understand the reason behind the need to over-engineer. Then we can hopefully deal with it, rather than doing the poor-person’s optimization of over-engineering.

Making our code safer and easier to change will make over-engineering unnecessary,
which will help to keep our code simpler,
which means it’ll be safer and easier to change,
which will make over-engineering unnecessary…
rinse and repeat.

On the other hand:
Over-engineering means that our code will be more complex (because we’re adding extra code and functionality),
which means it will be harder to understand and to change,
which will create an incentive to over-engineer,
which will make our code more complex…
rinse and repeat.

Which of these cycles would you rather be on?

P.S.

I don’t want to ignore the fact that writing simple, easy-to-understand, well-tested code is hard. It’s a skill that I still haven’t mastered, 15 years in.

But if we don’t face this difficult challenge now, we’ll end up facing the impossible challenge of changing an over-engineered mess.

Hey managers – this is why we don’t believe you when you say that you “care about quality”

Let’s start this with a short exercise:
Think of all the places you’ve worked at, where you heard “we care about code quality”, “we’re willing to invest the time to achieve high quality”, “we want to tackle tech debt”, or similar.

Now think of all the places you’ve worked at where you’ve felt like people actually cared about code quality, tech debt etc.
What’s the relationship between these two groups?

Naively, we may guess that they are the exact same group.
In practice, for me, the first group consists of 4-5 workplaces, and the second one of 1 workplace.
I think that this is a pretty representative result.

This raises the question: why is it that we hear from our CTOs, directors, and CEOs about their “commitment” to quality, but we so rarely see the results?
Are they all liars?

In my experience, no. I believe these people when they say that they do want that rainbows and unicorns utopia.
I also think that they don’t understand how to change the culture of their organization.
Let me explain:

What is culture, anyway?

Culture is how an organization behaves. It’s “how we do things ’round here”.
Some examples are:

  • “we pair program”.
  • “we don’t write automated tests”.
  • “we don’t deploy on a Friday”.
  • “we work overtime to meet deadlines”.
  • “developers only do the coding; after that, it’s the ops team’s responsibility”.

The majority of these “rules” aren’t written down anywhere. But everyone still knows them.

Equally important is what culture is not.
Culture is not what is printed on posters, press releases, or talked about at company all-hands.
For example, a company says it “enjoys a reputation for fairness and honesty”. In reality, it committed one of the biggest frauds in history.
Or, a company whose motto was “don’t be evil” turns out to do some… pretty evil things.

From the above, it’s clear to see where our well-intentioned managers are going wrong. They’re trying to change how we do things (== culture) by talking.
But that’s precisely what culture is not. It’s what we do, not what we say.

How does culture get set?

If culture is not set by what the top brass say, then how is it determined?
There are a few elements that impact culture:

1. Incentives

Meaning: which types of behaviours get rewarded, and which get “punished” (or are received negatively).
This also includes priorities – what are employees incentivised to prioritise?
Some examples:

  • At Enron, management said that they’re committed to a high ethical standard. But employees who stepped on their customers, colleagues, and ethics, were hailed as “top earners”.
  • On the contrary, employees who did not “play dirty” were disciplined for not generating enough profits.

This behaviour taught employees the expected way to do business.

  • At a software company, a CTO declares that they want to invest in automated testing.
  • However, they don’t prioritise a project for building a testing framework for their product. They don’t push back deadlines to allow testing activities.
  • When pull requests with untested code are merged, the CTO doesn’t comment on them.
  • When a project runs late, the CTO calls a meeting to discuss the situation.

This behaviour shows the developers how important testing really is.

2. Established practice

“That’s how we do it because that’s the way we’ve always done it”.
When we enter a new organization, we tend to follow the lead of the more tenured folk.
We do things in the same way that they do it. This is especially true for more junior employees, who may not have previous experience to compare against.

This is illustrated by the monkeys’ parable:

4 monkeys in a room. In the center of the room is a tall pole with a bunch of bananas suspended from the top.
One of the four monkeys scampers up the pole and grabs the bananas. Just as he does, he is hit with a torrent of cold water from an overhead shower. He runs like hell back down the pole without the bananas.
Eventually, the other three try it with the same outcome. Finally, they just sit and don’t even try again. To hell with the damn bananas.
But then, they remove one of the four monkeys and replace him with a new one. The new monkey enters the room, spots the bananas and decides to go for it. Just as he is about to scamper up the pole, the other three reach out and drag him back down. After a while, he gets the message. There is something wrong, bad or evil that happens if you go after those bananas.
So, they kept replacing an existing monkey with a new one and each time, none of the new monkeys ever made it to the top. They each got the same message. Don’t climb that pole.
None of them knew exactly why they shouldn’t climb the pole, they just knew not to. They all respected the well established precedent. EVEN AFTER THE SHOWER WAS REMOVED!

(source: Competing for the Future by Gary Hamel and C. K. Prahalad (1996))

Some human examples:

  • “At my previous company, we wrote a test for every method. But here, we don’t write tests”.
  • “Every morning, we give a status update to ourselves. We don’t know what purpose it serves”.
  • “Every two weeks, we try to guess how much work we can do in the next two weeks.”

3. Ability

It may seem obvious, but employees can’t do things that they don’t know how to do, or maybe don’t even know about.
For example:

  • A team can’t engage in continuous integration if they don’t know what that is.
  • A developer can’t practice TDD if they weren’t trained in it.
  • It’s hard to refactor when you don’t know the common code smells, and their fixes.

You may notice that these three factors are not isolated from each other. Each one influences the others. For example –

  • Incentives lead to certain behaviours. These behaviours become established practice, even if the incentives are removed.
    (as seen in the monkeys’ parable – the (dis)incentive of the shower was removed, but the established practice remained)
  • Our ability determines what practices we can, or can’t, establish.
  • Conversely – even if we have the ability to do something, it may decay and deteriorate if it’s not the established practice. (For example – we become worse at TDD due to lack of practice.)
  • We may provide incentives that encourage employees to gain certain abilities. For example – provide free training.

In practice

Going back to my own experience – all those companies that talked about quality, but churned out bugs:
Can these situations be explained by the above theory? Let’s see.

1. Incentives

  • When a feature was running late, developers faced questions and pressure.
  • Testing or refactoring activities were cut to meet deadlines.
  • Even when a “technical excellence” initiative was announced, teams were expected to deliver the same number of features as they had before.
  • Developers who hurriedly fixed production bugs in a hacky way were praised for their quick problem fixing.

On the other hand –

  • When a production bug occurred, the developer who caused it did not face any questions or pressure to learn from that experience, and make sure it doesn’t happen again.
  • Teams who prioritised “technical excellence” activities faced pushback from “the business”, for “delaying features”.
  • Developers who took time to write automated tests, or to improve code structure, were, at best, tolerated. But not praised.

2. Established practice

Many organizations I worked for adapted to the incentives described above. The resulting practices became the norm, never to be questioned:

  • We don’t have time to write automated tests.
    (Corollary – we have no problem approving untested code to go into production)
  • We don’t have time to refactor.
    (Corollary – we have no problem approving spaghetti code to go into production)
  • We need a week of manual testing before releasing a new version.
  • After every release, we release a series of patches, to fix bugs in production.
  • We routinely work evenings and weekends to fix urgent bugs.

There have been several cases where I tried to challenge those established practices. I proposed writing automated tests, working in smaller batch sizes, and more.
In some cases I encountered skepticism, or even outright hostility.
Even when the reaction to my ideas was positive, a change in behaviour didn’t necessarily follow:
“You do it like this, which is better. But I’m used to working like that”.
Old habits die hard.

3. Ability

I rarely worked with colleagues who practised TDD (or any form of automated testing), refactoring, etc. Even senior members of the team.
Hell, I’ve been a senior member of the team who didn’t practise TDD and refactoring.
Many of my colleagues were willing to try these things, but didn’t know where to start.

Some more examples –

  • At a previous job, the organization was willing to prioritise automated testing. I was asked to lead a project to create an automated tests suite for the product.
    However, I had virtually no experience with automated tests, and ended up doing a pretty crappy job of it.
  • Only a single organization I’ve ever worked for offered any training on automated testing or refactoring.

Conclusion

There are several interconnected factors that determine how an organization functions.
Counter to intuition, these are not related to what managers are saying. They’re related to what managers are doing – what behaviour they praise, what behaviour they raise questions and concerns over, what skills they teach, etc.
We respond to these factors, even though most of us aren’t consciously aware of them.

For employees, being unaware of these factors is no big deal.
For managers who try to effect change in their organization, it can make or break a “digital transformation” project.

Responding to “Are bugs and slow delivery ok?”: The blog post that I’ve hated the most, ever

A few days ago I came across a blog post, by Valentina Cupać, titled “Are bugs and slow delivery ok?”.

The title itself was enough to irritate me. What do you mean? That’s been an answered question for decades! Surely this is nothing more than clickbait.

Reading through the post, my irritation grew into real anger.
In the post (which is short and well-written, well worth the read), Cupać asserts that, indeed, bugs and slow delivery are OK.

Based on her experience, most companies can afford to ship buggy products, slowly.
Bugs have low impact on the business. The speed of fixing them doesn’t matter much.
Making developers miserable due to a constant stream of “urgent” bug fixes is fine.

As a proponent of such noble ideas as TDD, trunk-based development, refactoring, etc, you can see why Cupać’s ideas would be hateful to me. It’s the exact opposite of everything I believe in!

But there was one thing that I hated most about her blog post:

She was right.

Cupać says that she came to her conclusion based on her personal experience.
That made me reflect on my personal experience, in the last 15 years of working in software.

I’ve seen (and written) some terrible-quality code. Really bad stuff. Untested, most of it.
In nearly every place I’ve worked at.
I’ve seen enormous amounts of time wasted with testing for, or fixing, bugs.

You know what I haven’t seen? Not once in 15 years?
A company going under.

I’ve seen expensive and unnecessary re-writes, once technical bankruptcy was inevitably declared.
I’ve seen developers burn out by being regularly pushed to work all hours to fix bugs.
I’ve seen my product colleagues scramble to soothe, explain, and yes – sometimes downright lie to customers about why features are late.

But none of that caused these companies to go out of business.

And why is that?
I think the answer lies in this uncomfortable truth:

Software doesn’t make or break an organization.

Even a software organization.

There are many ways for software products to survive, and many of them don’t depend on quality:

  • Some software is built for an internal audience, who are captive customers. They have to accept whatever sh*t we developers deliver to them.
  • External customers can be so bound to a specific vendor that they may as well be captive. (This can be due to contracts, or because of deep integration with the client’s systems, or sometimes even downright corruption.)
  • The competition may be just as bad, suffering from the same bugginess and slowness problems.
  • A strong sales and customer care team can compensate for a lack of features.
  • And many more considerations, completely unrelated to software.

Once I woke up to the truth of Cupać’s arguments, I realised that I was the victim of dogmatic thinking.
I was so convinced that “high quality” (whatever that is) is the right thing, that I refused to consider the evidence before me.
Like an Orwellian sheep, I bleated “High quality good! Low quality bad!” without questioning.

That’s not to say that I now think that “High quality good! Low quality better!”.
As many commenters on Cupać’s post observed, while high quality isn’t necessary for success, it improves the chances of it.
Even if bugs and long wait times are acceptable, no bugs and short wait times represent a competitive advantage.

I have now arrived at a more refined view of technical quality, where high quality is nice to have, but is not the be-all and end-all.
Organizations can be justified in choosing not to invest in quality. And they’re not “wrong” for doing so.

Up until now, I assumed that every organization should aspire to high quality of software delivery. And if they don’t have that currently – then surely they’re working towards it.
And I’ve been disappointed many times when companies turned out to not share in my dogma.

This new insight will fundamentally change the way I view software organizations.
In particular, it will help me consider which organizations align well with my preferences on quality (yes, they are preferences, not absolute truths), and which don’t.

So thank you, Valentina, for your thought-provoking post.
I loved to hate it.

In America, developers write tests to “cover” their production code. In TDD Russia, developers write production code to cover their tests

Recently, an enterprising engineer on my team added a code coverage check to our CI pipeline. On every pull request, a bot posts a comment like this:

Test coverage report
Total statements added in this PR: X
Total statements added covered by tests: Y
Coverage % in this PR : Z%

It’s a neat little reminder for developers to make sure our testing is up to par.

After a while, I noticed that many of my PRs have 100% test coverage*.
Ever since then, many (read: zero) colleagues have approached me and, with eyes filled with wonderment, asked:
“Wow uselessdev, how are you able to produce such consistently thorough tests?”

Which got me thinking – how am I able to produce such consistently thorough tests? I never set out to achieve 100% coverage. I only learn about it after the fact, when the bot calculates the coverage percentages.

In fact, it’s not a hard question to answer – it’s because of the way I work.
Whenever possible, I try to practice TDD:

  1. I think of a use-case that the software needs to support.
  2. I write an automated test that verifies it.
  3. Inevitably the test fails.
  4. I write the minimal amount of code to make the test pass.
  5. Repeat.

Have you spotted the answer? It all hinges on one word – “minimal”.

Another way to put it –
I don’t write code unless there’s a red test that requires it.

If you think about it, I’m not writing tests to cover production code.
Rather, it’s the other way round – I write production code to “cover” test cases.
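
As a made-up illustration (assuming a vitest-style test runner; none of this is from our actual codebase):

import { describe, it, expect } from "vitest";

// 1. The red test comes first. It describes one behaviour: recording a yes/no answer.
describe("Survey", () => {
  it("stores a 'yes' answer against a question", () => {
    const survey = new Survey(["Do you like salt?"]);
    survey.recordAnswer(0, true);
    expect(survey.answers(0)).toEqual([true]);
  });
});

// 2. Then comes the minimal production code that makes the test pass:
//    yes/no answers only, because no test demanded anything more.
class Survey {
  private readonly results: boolean[][];

  constructor(questions: string[]) {
    this.results = questions.map(() => []);
  }

  recordAnswer(questionIndex: number, answer: boolean): void {
    this.results[questionIndex].push(answer);
  }

  answers(questionIndex: number): boolean[] {
    return this.results[questionIndex];
  }
}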

This doesn’t only have the nice effect of achieving high code coverage. It also helps me avoid a common pitfall that all developers are tempted by: writing speculative code.

Speculative code

I define “speculative code” as “code that was written, but was never executed, by the developer”.

This may sound confusing. Even if a developer doesn’t write a test for every line of code, surely they click around in their browser, or execute it in another way, to verify that it works as intended.

That’s generally true, but we also often write some code “just in case”.

Take this piece of code, for example:

var user = usersRepository.find(userId);
var dataFromSome3rdParty = ThirdPartySDK.getData(userId);

This code works as intended.

“Well actually”, says a knowledgeable reviewer, “ThirdPartySDK is just a random place on the internet, which is out of our control. What happens if it’s down? Or has a bug?”

So the diligent developer adds some error handling:

var user = usersRepository.find(userId);
try {
    var dataFromSome3rdParty = ThirdPartySDK.getData(userId);
} catch (Exception err) {
    logger.error(err, "Cannot find data for " + userId);
    return "Sorry, we weren't able to find 3rd party data for " + user.emailAddress;
}

And everybody goes home happy.

Only, on our local machine, the sandbox environment of ThirdPartySDK always returns successfully. So the error handling code gets shipped without ever being executed.

Another example of speculative code:

var email = request.body?.user?.emailAddress;
var isValid = validateEmail(email);

Can you spot the code that the developer has never executed?
It’s pretty subtle. It’s those ?s (aka safe navigation operator).

Whenever a developer ran this code on their machine, `body` had a `user` property, and `user` had an `emailAddress` property. They never “used” the safe navigation functionality.

OK, but what’s so bad about some extra code for added safety?

Yes, we don’t expect these edge-case situations to happen, but better safe than sorry, right?

Well, any line of code that we write has the potential to have a bug hiding in it.
And if we never run that line of code, we have no chance to discover that bug.

And, it just so happens, that both of the code samples above have a bug in them.
Have you spotted them? Go back and have a look.

In the first example, if userId doesn’t exist in our system, then usersRepository.find will return null. And the 3rd-party SDK will throw a NotFoundOnThirdPartyError.

And our error-handling code will try to read user.emailAddress in order to provide a meaningful message to the user. Oops!

Errors that happen inside error-handling code are a special type of hell.

In the second example, the developer never saw what happens if either the `user` or `emailAddress` property is missing.

What actually happens is that email will evaluate to null. And validateEmail may, or may not, blow up when given a null argument.

Back to TDD

So, if we want to avoid shipping code that we never tested, we must test all the code we write.

When thinking “what if X happens”, we should make sure that we actually see what happens when X happens.

It could be using an automated test. You may choose to keep the test, or delete it before committing the code.
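
For example, here’s a sketch of such a test for the first code sample above (assuming a vitest-style runner, and that the snippet lives in a hypothetical getThirdPartyData function):

import { it, expect, vi } from "vitest";
// Hypothetical modules containing the snippet above and the SDK it calls.
import { getThirdPartyData } from "./third-party-data";
import { ThirdPartySDK } from "./third-party-sdk";

it("shows a friendly message when the 3rd party is down", async () => {
  // Force the rare condition, instead of assuming the sandbox always succeeds.
  vi.spyOn(ThirdPartySDK, "getData").mockRejectedValue(new Error("service unavailable"));

  const message = await getThirdPartyData("existing-user-42"); // an id assumed to exist in test data

  expect(message).toContain("Sorry, we weren't able to find 3rd party data");
});

// A second test, using a userId that doesn't exist, is what would expose the hidden bug:
// usersRepository.find returns null, so the catch block's user.emailAddress blows up.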

It doesn’t have to be an automated test, though. It’s possible to “generate” test cases by fiddling with a function’s input, or with your own code, to simulate a “rare” scenario.

The idea is to make sure not to ship untested code. There are many ways to do it, but TDD is the most thorough one.

TDD makes us cover our tests with production code, and helps us avoid speculative, untested code.

*Footnote: the usefulness of test coverage

Whether test coverage is a good, or useful, metric, is beyond the scope of this article.
I just want to clarify here – I don’t think that test coverage should be a target.
It can’t tell us if we’re doing a good job testing our code, but it can tell us if we’re doing a bad job of it.
Meaning – 80%, 90% or 100% test coverage doesn’t guarantee that our code is well tested.
However, 50% test coverage does guarantee that our code is not well tested.

If I were a CTO… My manifesto for running a tech business

This post is a documentation of the way I, personally, think that a successful software organization could be structured and run.

I previously wrote about the way a successful individual developer could work.
This post is taking a much broader view of the entire tech organization.
You’ll find that the principles and practices described here are similar or identical to, or enable, the ones mentioned in that post.
As always, the opinions here are influenced by well-known best practices (DevOps, agile), and by my own ~15 years of experience in different software organizations.
It’s also important to mention that these opinions are not informed by actual experience of holding a senior leadership role. So this post is quite one-sided in favour of the “internal” tech organization, without much (or any) consideration of the CTO’s role as part of the wider management team.

This is meant to serve as a living document – I’m sure it’ll change based on readers’ feedback, and my own learnings and observations.

Goals (The “Why?”)

I don’t currently serve, nor do I ever intend to serve, as a CEO, CTO, or any other ‘chief’.
So this isn’t an instruction manual for my future self.
I also don’t presume to be a “consultant” or “executive coach” (yet?).

I’ll never have the authority to actually implement all the items in this list. But I think it’s still valuable to put in writing, for a few reasons:

  1. To put my own thoughts in order – writing down this list will force me to articulate my “philosophy”, and to clarify it to myself. Clarifying my values and priorities is a valuable exercise. Especially for times such as job searching, when I consider whether a company is a good fit for me.
  2. To use in my own little domain – even though I’ll never be a “top dog”, I may have the opportunity to lead a team again. Some of the principles and practices outlined here can be implemented even at a small scale.
  3. To influence others – I hope to influence the thoughts and actions of my employers in the direction I believe is right. Even if this document itself is not enough to affect change, it could serve as a starting point for a conversation.

Principles

These are a high-level overview of “how we win” – if we succeed in the below, then we will win as a team.
They are outcomes, or metrics, rather than concrete actions and steps (see “Practices” below for a breakdown of principles into actionable items).

You won’t find anything ground-breaking here; as I mentioned, this is built on top of well-established philosophies.

Generative culture

(sometimes referred to as “Westrum organizational culture”).

This is an organizational culture that is goal- and mission-driven, fosters collaboration, encourages risk taking, and implements novel ideas.
It is informed by the belief that employees are internally motivated.
Meaning – everyone wants to do a good job. There’s no need to “force” the workers to do a good job. (This is known as “Theory Y”.)
If we espouse this theory, then there’s no need for management to overly supervise, check up on, or impose limitations on employees. Rather, management should give them the necessary tools, knowledge, and training to succeed.
Some concrete examples of this may be:

  1. Team autonomy in what they do – projects and tasks are decided on by the people who do the work. Management is responsible for priorities, vision, and “big picture”. Not the everyday work.
  2. Team autonomy in how they work – teams are free to choose how they go about achieving their goals. Scrum / kanban / waterfall / anarchy.. whatever gets good, consistent results.
  3. Trust – no “code owners” who must approve every change, no requirement for X number of “approvals”. We employ grown-ups – they won’t start riffing on main, committing bugs and spaghetti code, just because they can. We trust them to be responsible, and to come up with quality mechanisms that work for them, without forcing anything on them.
  4. Failure (such as a bug, production outage, miscommunication with a customer) does not lead to punishment. After all, the person(s) who made the mistake had the best of intentions. This means that the system they operated in allowed for that mistake to happen (even in the case where that system tasked them with doing a job that they’re not qualified for).
    Therefore, failure is an opportunity to learn and improve for the company, as well as for the individual.

Psychological safety

This is the belief that one will not be punished or humiliated for speaking up with ideas, questions, concerns, or mistakes, and the team is safe for interpersonal risk taking (definition by Dr. Amy Edmondson).
That’s a core aspect of a generative culture. In a generative culture, we rely on individuals, not “leadership”, to come up with ideas, initiatives and the execution that drive the company forward. This cannot be done if employees don’t feel safe expressing their opinions.

Some other required attributes of a generative organization include:

  • Autonomy
  • Trust
  • Continuous learning (and improvement)
  • Cross-team collaboration

The “Four Key Metrics”

DevOps Research and Assessment (DORA) has consistently found that excelling at these four metrics leads to excelling in business outcomes (profitability, market share, customer satisfaction, employee satisfaction, and more):

Delivery

Deployment Frequency – how often you put new code in front of customers
Lead Time for Changes – how long it takes from the first commit on a developer’s machine, until that code is in front of customers

Stability

Time to Restore Services – the time between introducing a failure (bug / outage), and resolving it.
Change Failure Rate – % of deployments that cause a failure in production.
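
As a rough illustration (made-up data shapes; not an official DORA tool), the four metrics boil down to simple arithmetic over deployment and incident records:

interface Deployment {
  deployedAt: Date;
  firstCommitAt: Date;     // earliest commit included in this deployment
  causedFailure: boolean;  // did this deployment trigger an incident?
}

interface Incident {
  startedAt: Date;
  resolvedAt: Date;
}

const HOURS = 1000 * 60 * 60;

// Assumes non-empty inputs; real tooling would handle edge cases and time windows.
function fourKeyMetrics(deploys: Deployment[], incidents: Incident[], periodDays: number) {
  const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;

  return {
    // Delivery
    deploymentFrequencyPerDay: deploys.length / periodDays,
    leadTimeForChangesHours: avg(
      deploys.map(d => (d.deployedAt.getTime() - d.firstCommitAt.getTime()) / HOURS),
    ),
    // Stability
    timeToRestoreHours: avg(
      incidents.map(i => (i.resolvedAt.getTime() - i.startedAt.getTime()) / HOURS),
    ),
    changeFailureRate: deploys.filter(d => d.causedFailure).length / deploys.length,
  };
}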

Moving these needles upwards requires being really good at quite a few behaviours and practices, as outlined below.
While they are not “principles” per se, they are very helpful high-level goals that we can use as guidance.

Feedback

We recognise that we are often wrong. But we don’t know that we’re wrong, or in what way.
Therefore, we solicit, we value, and we act on, feedback. We aim to get feedback, and act on it, as quickly as possible.
This has multiple manifestations –

  1. Feedback about whether our product meets customers’ needs
  2. Feedback about whether our software behaves as intended
  3. Feedback about the quality of our code
  4. Feedback about us, our processes, and tools

Practices

The above principles are nothing in their own right. Only daily behaviours and incentive structures can bring a principle to life.
Here are some concrete practices and processes that, I believe, help realise the above principles:

Organization

Product teams / reverse Conway manoeuvre

Conway’s law dictates that our software structure will reflect the company’s organizational structure.
So, if we want to create a software architecture of independent, loosely-coupled components, then we need to structure our organization in such a way.
This would look different in every problem domain. But generally, a team is assigned a cohesive, independent sub-domain of the company’s business. For example – a “loans” team, a “savings” team, a “mortgages” team, and so on.
Each team has the responsibility for, and the personnel / tools to, provide the best loans / savings / mortgage software product. Starting from ideation, up to maintaining a service in production.

The team may be asked to provide some big-picture outcome (e.g. “x% more savings account customers”, or “y% less churn for mortgage holders”). But the way they go about it is up to the team itself.

Realizes principles:

Generative culture / autonomy, trust – by making teams self-sufficient. They’re not dependent on anyone outside the team (e.g. QA team, ops team) to accomplish their goals.
Four key metrics / delivery – by removing dependencies and coupling, there’s less need for communication and coordination. Teams are free to work as quickly as they’d like.

Communication – Communities of practice

While teams are autonomous, none of them is an island. Teams still need to effectively work together, communicate about what they’re doing, coordinate, etc.
Additionally, learnings from one team (e.g. how to solve a specific problem) can be applicable to other teams.

The “standard” approach to these needs is a hierarchical one, traversing the organizational “tree”:
If team A needs to coordinate with team B, then it will go up through team A’s manager, who will go to team A’s director, who will go to their VP, who will go down to team B’s director, who will go to team B’s manager.
This approach is wasteful, and contradicts the principles of autonomy and theory Y.

An alternative would be to create structures where teams and individuals can communicate directly.
This could be communities of practice (e.g. “frontend devs”, “DB administrators”), technical all-hands (e.g. weekly open engineering meeting), or ad-hoc working groups. The details should be self-organized by the team(s) for whatever works for them. Management’s role is to allow the time and space (and encouragement) for these structures to emerge.

Realizes principles:

Generative culture / autonomy – even when coordination outside the team is necessary, the team has the autonomy to choose how to do that.
Generative culture / collaboration – allowing (and encouraging) direct communication, rather than hierarchical one, increases collaboration.
Generative culture / continuous learning – by providing opportunities for individuals and teams to learn from each other.

Management

We already touched on many things that management does not do – supervision, validation, tactical decision-making.
So what does management do, in our fairytale, rainbows-and-unicorns organization?
The main responsibilities of management are, generally, twofold:

  1. Provide context and “big picture” – making sure that everyone in the organization knows what the overall company goals and priorities are. So that they’re able to prioritise their own work accordingly.
    Making connections between different parts of the organization (e.g. “oh, you’re doing project X? well, Joanne from marketing is doing project Y which is related. You should talk!”)
  2. Reinforce the desired organizational culture – not by talking about it, but by consciously incentivizing desired behaviour
    some examples:
    • When a mistake / outage occurs, celebrate it as an opportunity to learn and improve, rather than playing blame games. Encourage teams to look at how the system / work processes can be improved to prevent the next issue.
    • Proactively reward employees who exemplify desired principles. Get rid of employees who don’t.
    • Back teams up when they need to invest resources in improving their processes, even in the face of external pressure (e.g. customer requests)
    • Back teams up even when they’re going in a different direction than what the people in management would’ve done
    • Share their own mistakes and vulnerabilities openly, to promote a culture of psychological safety

As you can see, the job of management is extremely important, but not large in volume.
This means that the organization requires fewer managers to function (for example, there’s no need for “directors” who have several teams under them, or “VPs” who have several directors under them, etc.)

Realizes principles:

Generative culture – this style of management allows individuals, not management, to generate value for the company (hence, “generative” culture).

Processes

As said earlier, each team can choose whatever workflow works for them.
There are a few common guidelines that are helpful across the board, though:

Relationship with the customer

Everyone on the team is expected to engage with, and learn from, customers.
Customers are not hidden away from developers behind product managers, business analysts etc.
This has the added benefit of removing menial tasks from product people (such as acting as a go-between for developers and customers, or writing down work tickets). They are free to focus on value-adding activities, such as customer behaviour analysis, market research, forward planning, etc.

Realizes principles:

Feedback – developers have access to direct feedback from customers about what works and what doesn’t.

Technical

There are many technical practices required to achieve the above principles (especially the “four metrics”). DORA has a comprehensive list of them. I’ll only mention the ones that I’ve found especially important or valuable:

Continuous integration and delivery

In order to understand as quickly as possible whether our software behaves as intended, we must integrate all changes as frequently as possible, and check whether the software indeed behaves as it should.

In order to understand as quickly as possible whether the changes we’ve made are useful to customers, we must put them in front of customers as quickly (and frequently) as possible.

The pursuit of continuous integration and delivery is beneficial in itself.
It forces us to improve in many aspects of our work – automated testing, configuration and source management (to maintain safety while going fast), loose coupling (to avoid teams being blocked) etc.

Delivery pipeline as a first-class citizen

If we can’t (safely) deploy our software, then our customers can’t benefit from anything that we do. In this case, there’s no point to any other activity (e.g. developing a new feature, or fixing a bug), as it will not get in front of customers.

However obvious this seems, it has profound implications. It means that any “blockage” of our deployment pipeline (bad configuration, flaky tests, even a significant slowing down of the pipeline) is as bad as a customer-facing outage. (Actually, it is a customer-facing outage. The customer does not get the functionality that they should.)

Realizes principles:

Feedback (at multiple levels)
The four key metrics / delivery

Automated tests / test-driven development

I’ve actually seen an organization that did great on delivery metrics (e.g. multiple deployments per day), without emphasizing automated tests. As expected, their stability metrics (e.g. number of bugs) were incredibly poor. And it was noticeable – this company lost multiple contracts because customers were dissatisfied with the quality of the software.
If we aim to be able to release frequently, with confidence, we must have a reliable test suite.
Writing tests-first also provides invaluable feedback about our software design.

Realizes principles:

Feedback – about whether our software behaves as intended
The four key metrics – all of them. Testing decreases the odds of introducing a bug (i.e. the change failure rate). But it also gives us the confidence to deploy rapidly, without lengthy manual verification.

Fearless refactoring

Many of us have an aversion to changing working code. Whether it’s because we don’t see the value (it’s working; so what if it’s hard to read?), or because we’re afraid of the consequences (i.e. introducing a bug).

However, if we aim for excellence in delivery and reliability, we can’t accept code that is hard (for us) to maintain.
Code that’s hard to maintain means slower speed (since it takes longer to change). It also jeopardizes our reliability (because it makes it easier to introduce a bug).

Therefore, we must encourage (and expect) developers to improve code that is difficult to understand or to change.
More than that – It’s also important to change code based on improved understanding of the problem:
We’ve all seen cases where code was built to accommodate use case X. But use case Y is what the customer actually needed. So the code implements use case X, with some hacks and workarounds to make it behave like Y.
This is another case of code that’s hard to understand and maintain, and must be changed.

There are many more valuable technical practices. But, I believe that the teams will find them for themselves, if they aim to improve on the principles and practices already mentioned.
For example –

  • observability and monitoring – a team will naturally invest in those areas if it aims to improve its reliability metrics
  • Change management, version control, deployment automation – a team will naturally invest in those areas if it aims to improve its delivery metrics

Conclusion

If you’ve read this far, and you are not my mother, then thank you very much for bearing with me.
(If you are my mother, then hi mum!)

You may be interested in reading some of DORA’s research, or the “Accelerate” book.
This post has turned out to be a sort of poor-person’s reader’s digest of the DORA materials…

My software development manifesto

This blog post details the ideal process I would like to follow when working as a software developer. It lists the activities I find most beneficial on an hourly, daily, and weekly basis.
Many of the systems and processes below I’ve followed myself, and found useful. Others I’ve only had the opportunity to read or hear about, but have not tried.

Like any ideal, it’s not always fully achievable, or even realistic at points. But it’s important for me to have a “north star” to aim towards, so I know which direction I’d like to move in.

The audience for this post is:

  1. Me: To clarify my own thoughts, and to refer back to when thinking about making changes in how I work.
  2. Colleagues, team-mates, managers: To articulate what my agenda is, what kind of changes I may propose to our working arrangements, and why.
  3. Anyone else: To gather feedback, suggestions, or hear about their own experiences with these patterns and practices.

I’ll organize the processes I like to follow into different timeframes, or “feedback loops”.
Knowing whether we’re on the right track or not as soon as possible is one of the most important things in our work. Therefore, quick, tight feedback loops are paramount.

You’ll notice a high degree of commonality between the different loops. Essentially, it’s the same process, only at different scales: make a small step, verify, put it out of your head, move to the next step. Frequently stop and evaluate whether we’re on the right track. Repeat.

Inner loop: Implementation. Time frame: minutes

This is the core software development loop – Make a small change, verify that it works. And another one, and another. Then commit to source control, repeat.
I start this loop by writing an automated test that describes and verifies the behaviour I’m implementing*. Then I’d write the minimal amount of unstructured, “hacky” code to make that test pass.
And then another one, and another one. Over and over.
This is a long series of very very small steps. (For example – running tests on every 1-2 lines of code changed, committing every 1-10 lines of code changed.)

I would defer any “refactoring” or “tidying up” until the last possible moment. Usually after I’ve implemented all the functionality in this particular area.
That may even take a few days (and a few PRs).
That’s because I’m always learning, as I implement more functionality. I’m learning about the business problem I’m solving. About the domain. About the details of the implementation.
I’d like to refactor once I have the maximum level of knowledge, and not before.

Personal experience: I found that the only way to verify every single small change, dozens of times an hour, is with automated tests. The alternatives (e.g. going to the browser and clicking around) are too cumbersome.
I love working in this way. I can make progress very quickly without context-switching between code and browser.
I can commit a small chunk of functionality, forget about it, and move on. Thus decreasing my cognitive load.
Additionally, automated tests give me a very high degree of confidence.
Ideally, I’d push code to production without ever opening a browser (well, maybe just once or twice..)

*A short appendix on testing:
I mentioned that I’d like to test the behaviour I’m trying to implement.
I don’t want to test (as is often the case with “isolated unit tests”) the implementation details (e.g. “class A calls class B’s method X with arguments 1, 2, 3”).
Testing the implementation doesn’t provide a high degree of confidence that the software behaves as intended.
It also hinders further changes to the software (I wrote a whole blog post about this).

My ideal tests would test the user-facing output of the service I’m working on (e.g. a JSON API response, or rendered HTML).
I would only fake modules that are outside of the system (e.g. database, 3rd party APIs, external systems).
But everything within the scope of the system behaves like it would in production. Thus, providing a high degree of confidence.
You can find much more detail in this life-changing (really!) conference talk that forever changed the way I practice TDD.
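
A rough sketch of what such a test could look like (hypothetical endpoint and module names, assuming an Express app tested with vitest and supertest):

import { it, expect, vi } from "vitest";
import request from "supertest";
import { app } from "./app"; // the real service, wired up as it would be in production

// Fake only what lives outside the system boundary: the 3rd-party API client.
// (vitest hoists vi.mock above the imports, so the fake is in place before the app loads.)
vi.mock("./third-party-client", () => ({
  getData: vi.fn().mockResolvedValue({ risk: "A03", name: "Injection" }),
}));

it("returns the OWASP risk details as JSON", async () => {
  // Exercise the behaviour through the same interface a real client would use.
  const response = await request(app).get("/risks/A03");

  expect(response.status).toBe(200);
  expect(response.body).toMatchObject({ risk: "A03", name: "Injection" });
});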

Second loop: Deployment. Time frame: hours / one day

I’ve now done a few hours of repeating the implementation loop. I should have some functionality that is somewhat useful to a user of the software.
At this point I’d like to put it in front of a customer, and verify that it actually achieves something useful.
In case the change is not that useful yet (for example – it’s implementing one step out of a multi-step process), I’d still like to test and deploy the code, behind a feature gate.

Before deploying, I’d get feedback on the quality of my work.
I’d ask any interested colleagues to review the code I wrote (in case I wasn’t pairing with them this whole time).
Pull / merge requests are standard in the industry, and are a convenient way to showcase changes. But an asynchronous review process is too slow – I’d like to get my changes reviewed and merged faster.
I’d want my teammates to provide feedback in a matter of minutes, rather than hours. And I’ll follow up with a synchronous, face to face conversation, if there’s any discussion to be had.
(In return, I will review my colleagues’ work as quickly as possible as well :))

If the changes are significant, sensitive, or change code that is used in many places, I may ask a teammate to manually verify them as well. or double-check for regressions in other areas.
I may ask a customer-minded colleague, such as a product person, or a designer, to have a look as well.

Once I’ve got my thumbs-up (hopefully in no more than an hour or two) I’ll merge my changes to the mainline branch.
The continuous delivery pipeline will pick that up automatically, package up the code, and run acceptance / smoke tests. After 30-60 minutes, this new version of the software will be in front of customers.

Personal experience: Working in this way meant that I could finish a small piece of work, put it out of my mind, and concentrate on the next one. That’s been immensely helpful in keeping me focussed, and reducing my cognitive load.
Additionally, it’s very helpful in case anything does go wrong in production. I know that the bug is likely related to the very small change I made recently.

Once I’ve finished a discrete piece of work, I need to figure out what to do next.
Getting feedback on our team’s work is the most important thing, so I’ll prioritise the tasks that are closest to achieving that.
Meaning – any task on the team that is closest to being shipped (and so, to getting feedback), is the most important task right now.
So I’ll focus on getting the most “advanced” task over the line. It may be by reviewing a colleague’s work, by helping them get unblocked, or simply by collaborating with them to make their development process faster.
Only if there isn’t a task in progress that I can move forward will I pick up the next most important task for the team, from our prioritised backlog.

Personal experience: The experience of a team working in this way was the same as the individual experience I described above.
As a team, we were able to finish a small piece of work, put it out of our minds, and concentrate on the next one.
We avoided ineffective ways of working, such as starting multiple things at once while waiting for reviews, or long-running development efforts that are harder to test and to review. We always had something working to show for our work, rather than multiple half-finished things.
Working in this way also helped the team collaborate more closely, focussing on the team’s goals.

Third loop: Development Iteration. Time frame: 1-2 weeks

We’ve now done a few days of repeating the deployment loop. We should have a feature or improvement that is rather useful to a user of the software.
The team would speak to users of the software, and hear their feedback on it. Preferably in person.
Even if the feature is not “generally available” yet, “demo-ing” the changes to customers is still valuable.

The feedback from customers, as well as our team’s plans, company goals, industry trends etc. will inform our plans and priorities for the next iteration. The team (collaboratively, not just “managers” or “product owners”) will create its prioritised backlog based on those.

This point in time is also a good opportunity for the team to reflect and improve.
Are we happy with the value we delivered during this iteration? Was it the right thing for the customer? Are we satisfied with the quality of it? The speed at which we delivered? It’s a good point to discuss how we can deliver more value, at higher quality, faster, in the future.
What’s stopping us from improving, and how can we remove those impediments?
We can use metrics, such as the DORA “4 key metrics” to inform that conversation.

We plan and prioritise actions to realise those improvements.
(Some examples of such actions: improvements to the speed and reliability of the CI / CD pipeline; improvements to the time it takes to execute tests locally; simplifying code that we found hard to work with; exploring different ways to engage with customers and get their input; improvement to our monitoring tools to enable speedier detection and mitigation of production errors.)

We can also create, and reflect on, time-bound “experiments” to the way we work, and see if they move the needle on the speed / quality of our delivery (examples of such experiments: pair on all development tasks; institute a weekly “refinement” meeting with the whole team; have a daily call with customers…).

Personal experience: I only have “anti-experiences” here, I’m afraid. I’ve worked in many “agile” flavours, including many forms of scrum and kanban. I haven’t found any one system to be inherently “better” than the others.
I did find common problems with all of them.

The issue with agile that I observed in ~100% of the teams I’ve been on, is this:
we use some process blindly, without understanding why, or what its value or intended outcome is. We’re not being agile – we’re just following some arbitrary process that doesn’t help us improve.

My ideal process would involve a team that understands what it is we’re trying to improve (e.g. speed / quality / product-market fit).
We understand how our current process is meant to serve that. We make changes that are designed to improve our outcomes.
In that case, it doesn’t matter if we meet every day to talk about our tasks, or if we play poker every 2 weeks, or whatever.

So, what do you think?

This list is incomplete; I can go on forever about larger feedback loops (e.g. a quarterly feedback loop), or go into more details on the specifics of the principles and processes. It’ll never end. I hope I’ve been able to capture the essence of what’s important (to me) in a software development process.


What’s your opinion? Are these the right things to be aspiring to? Are these achievable? What have I missed?
Let me know in the comments.

Stop assuming your future self is an idiot (an alternative to YAGNI)

I have been aware of, and even talking about, YAGNI (“You ain’t going to need it”), and the dangers of “future-proofing” for a long while. But not until recently have I actually applied this principle in earnest.

Trying to understand what took me so long, I took note of what makes other developers hesitant to apply this principle.

When I try to get others to practise YAGNI, I find the same reluctance that I myself have shown.
When I say to a team member (or to my younger self) “you ain’t going to need it”, the answer is always “yes, you’re probably right. But what if…??”.

And there’s no good way to answer that. I can’t prove that, in every possible future universe, we will never need this code.
Thus YAGNI fails to convince, and the redundant code stays.

I think I’ve been able to find a better argument, though.

My solution has been to play along with this thought experiment.

“OK, so what if we don’t put that ‘future-proof’ code there right now?
And suppose we do find out, in the future, that we do need to make that change?
Would that be such a disaster?
Or, if that happens, we can then make the change that you’re proposing now, right?
And even better – at that point, we’ll have more information and ability to make the right sort of change.”

I’ve had much better success with this line of argument. We realise that future us are better equipped to deal with this change than present us.
We “just” need to believe in ourselves.

…About that “just”

So why don’t we, by default, believe in future us?
Why don’t we believe that future us can make that change just as well as present us?

I actually already alluded to this in a previous post, about being scared of changing code:
Loss of context and lack of confidence are the main issues here.

Context

We know that at this point in time, when we’re well-versed with this part of the code, we can see a good (future) solution.
However, we’re not confident that we’ll see it in the future, when we may not remember everything about this area of the codebase.

It’s easy to see where this sentiment comes from. Many times when we read past code, we’re not confident that we 100% understand it. So why should the code we’re writing now be any different?

It’s taken me many years to have enough self confidence to counter that.
No, I am competent enough that when I do come back to this in the future, I will understand it well enough.

And I’ll make sure of that by leaving some clues for myself – clear names, easy to understand design, descriptive tests, etc.
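
As an illustration of the kind of “clue” I mean – a made-up example (Minitest, invented names, not from any real codebase) of a test whose name and body explain the *why* to a future reader:

```ruby
require "minitest/autorun"

# Hypothetical domain object, for illustration only.
class Invoice
  def initialize(total_pence:)
    @total_pence = total_pence
  end

  # Totals are stored in pence to avoid floating-point rounding errors.
  def total_pounds
    @total_pence / 100.0
  end
end

class InvoiceTest < Minitest::Test
  def test_totals_are_stored_in_pence_so_rounding_errors_cannot_creep_in
    assert_equal 12.34, Invoice.new(total_pence: 1234).total_pounds
  end
end
```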

(An important side point here is about continuity of knowledge.
The person, or team, that authored the original code would only need a reminder of what it does, and how.
But a different person / team will have much, much lower confidence in understanding the code.
The number of clues – good names, comments, tests, etc. – would have to be even higher for them.)

Confidence

By this I don’t mean self-confidence, but the confidence in our changes. That we won’t be breaking anything.

For that, a good test suite, good monitoring and remediation tools are required.
But especially needed is a high degree of psychological safety. The confidence that, if we do end up breaking something, we won’t be punished for it.

Conclusion

Saying “YAGNI” is often not convincing enough. Many people’s response is: “Well, we can’t know that for certain. And when we do ‘need it’, it’ll be too late!”.

I propose a more convincing argument – “We Can Always Change It Later”, or “WCACIL” (pronounced… er… however you want to pronounce it).

This argument needs to be supported by a framework that makes future change less scary:
Tests, documentation, simple design, and a safe environment.
And also, maybe a prod from more experienced team members who’ve done it before.

The best tool for the job is the tool you know how to use

A recurring cliche at tech companies is that they use the “right tool for the job”. This is meant to show that the company is pragmatic, and not dogmatic, about the technology they use. It’s supposed to be a “good thing”.

For example – “We’re predominantly a nodeJS shop, but we also have a few microservices in golang”. Or (worse), “We let each team decide the best technology for them”.

I don’t agree with that approach. There are benefits realised by using, say, golang in the right context. But they are dwarfed by some not-so-obvious problems.

A “bad tool” used well is better than a “good tool” used badly

In most cases, an organization has a deep understanding, experience, and tools in a specific technology.

Suppose a use case arises where that specific technology isn’t the best fit. There’s a better tool for that class of problems.

I contend that the existing tech stack would still perform better than an “optimal”, but less known, technology.

There are two sides to this argument –

1. The “bad” tool isn’t all that bad

Most tech stacks these days are extremely versatile.

You could write embedded systems in javascript, websites with rust, even IoT in ruby…

It wouldn’t work as well as the tools that are uniquely qualified for that context. But it can take you 80% of the way there. And, in 80% of cases – that’s good enough.

2. The “good” tool isn’t all that good

I mean – the tool probably is good. Your understanding of it is not.
How to use it, best practices, common pitfalls, tooling, ecosystem, and a million and one other things that are only learned through experience.

You would not realise the same value from using that tool as someone who’s proficient in it.

Even worse – you’ll likely make some beginner mistakes.

And you’ll make them when they have the most impact – right at the beginning, when the system’s architecture is being established.

After a few months, you’ll gain enough experience to realise the mistakes you’ve made. But by then it’ll be much harder, or even infeasible, to fix them.

There are some other issues with using a different technology than your main one:

Splitting the platform

Your organization has probably built (or bought) tooling around your main tech stack. These tools help your teams deliver faster, better, safer.

These tools will not be available for a new tech stack.

New tools, or ports of existing tools, will be required for the new tech stack.

The choice would be to either:
Invest the time and resources in (re)building (and maintaining) ports of the existing tools for that new technology, OR
Let the team using the new technology figure it out on their own.

In either case, this will result in a ton of extra work. Either for the platform / devX team (to build those tools), or for the product team (to solve boilerplate problems that have already been solved for the main tech stack).

Splitting the people

There’s a huge advantage to having a workforce all focused on a single tech stack. People can share knowledge, and even code, very easily. They can support each other. Onboarding one team member into a different team is much easier.

That means there’s a lot of flexibility, whereby people are able to move between teams. Maybe even on a temporary basis, if one team is in need of extra support.
This is made much more difficult when there are different technologies involved.

Hiring may also become more difficult if different teams have vastly different requirements.

What can I do if my main tech stack really is unsuitable for this one particular use case?

A former colleague of mine, in a ruby shop, needed to develop a system that renders pixel-perfect PDFs.
They found that ruby lacked the tools and libraries to do that.
On the other hand – java has plenty of solid libraries for PDF rendering.

So they did something simple (but genius) – they ran their ruby system on the JVM.
This allowed them to use java libraries from within ruby code.
Literally the best of all worlds.
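
To make that concrete, here’s a minimal sketch of what “running ruby on the JVM” buys you: JRuby code calling the Java PDFBox library directly. (This is my own illustration, not my colleague’s actual code; the jar path and library choice are assumptions.)

```ruby
# Run with JRuby. Assumes the Apache PDFBox jar has been downloaded locally.
require "java"
require "./pdfbox-app-2.0.27.jar" # hypothetical path to the jar

java_import "org.apache.pdfbox.pdmodel.PDDocument"
java_import "org.apache.pdfbox.pdmodel.PDPage"

# Plain Java objects, driven from ordinary Ruby code:
doc = PDDocument.new
doc.add_page(PDPage.new) # JRuby exposes Java's addPage in snake_case too
doc.save("hello.pdf")
doc.close
```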

This is not unique to my colleague’s case, though.
You can run many languages on the JVM, and benefit from the rich java ecosystem.
You can call performant C or Rust code from ruby, python, .NET, etc.
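
As a hedged sketch of that second route – calling into a C library from Ruby via the ffi gem, without leaving your main stack (the library and function here are chosen purely for illustration):

```ruby
# Requires the `ffi` gem (gem install ffi).
require "ffi"

module LibM
  extend FFI::Library
  ffi_lib "m"                               # the standard C math library
  attach_function :cos, [:double], :double  # bind C's cos() as LibM.cos
end

puts LibM.cos(0.0) # => 1.0
```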

It’s possible to use the right tool at just the right place where it’s needed, without going ‘all-in’.

What can I do if I can’t get away with using my familiar tools?

Your existing tools probably cover 80% of all cases. But there will always be those 20% where you simply have to use “the right tool”. So let’s think about how to mitigate the above drawbacks of using an unfamiliar tool.

The most obvious option is to buy that familiarity: Bring in someone from the outside who’s already proficient with this tool. This can be in the form of permanent employees, or limited-time consultancy / contractors.

There’s a problem with any purchased capability, though.
They may be experts in using the tool, but they are complete novices in using it in your specific context.
While they won’t make the beginner mistakes with the tool, as mentioned above, they’ll likely make beginner mistakes regarding your specific domain and context.

For this reason, I’d try and avoid using the consultancy model here. Firstly – they won’t have enough time to learn your domain. Secondly – your team won’t have enough time to learn the tool, to see where it doesn’t fit well with your domain.

Even hiring in full-time experts should be done with caution. They, too, will have no knowledge of your specific business context to begin with.
It may seem like a good idea to hire a whole team of experts who can get up and running quickly. But consider pairing them with existing engineers who have a good understanding of your product and domain. The outside experts can level-up the existing engineers on the technology. The existing engineers can help the experts with context and domain knowledge.

It may seem slower to begin with, but it can help avoid costly mistakes. It also has the benefit of spreading knowledge of the new tech stack, raising its bus factor.

Exceptions to the “rule”

Like any made-up internet advice, the position I outlined above is not a hard and fast rule.
There are cases where it would make complete sense to develop expertise in a technology that is not your core competency.

The most obvious example would be a new delivery method for your services: if you want to start serving your customers via a mobile app, for example, then building the knowledge and tools around mobile development makes perfect sense.
Or creating an API / developer experience capability, if you want to start exposing your service via a developer API / SDK.

Or if you’re a huge organization with thousands of developers: you’ll naturally have many employees with prior experience in different technologies. In that case you may find that many of the issues outlined here do not apply.

In summary

Going all-in on a technical capability can have many benefits.
Richness of tools, flexibility of developers being able to move around different codebases, knowledge sharing, and more.
It makes sense to try and preserve that depth of expertise, and not to dilute it by bringing in more technologies and tools into the mix. Today, with every technology being so broad and multi-purpose, it’s easier to do than ever.

And remember – “select isn’t broken”: many times I’ve thought that some technology “cannot do” some task, only to find out that it can, actually, do that. It was just that I couldn’t.

Stop lying to yourself – you will never “fix it later”

Recently I approved a pull request from a colleague that had the following description: “That’s a hacky way of doing this, but I don’t have time today to come up with a better implementation”.
It got me thinking about when this “hack” might be fixed.
I could recall many times when I, or my colleagues, shipped code that we were not completely happy with (from a maintainability / quality / cleanliness aspect, sub-par functionality, inferior user experience etc.).
On the other hand, I could recall far far fewer times where we went back and fixed those things.

I’ve read somewhere (unfortunately I can’t find the source) that “The longer something remains unchanged, the less likely it is to change in the future”.
Meaning – from the moment we shipped this “hack”, it then becomes less and less likely to be fixed as time goes on.
If we don’t fix it today, then tomorrow it’ll be less likely to be fixed. And even less so the day after, the week after, the month after. I observed this rule to be true, and I think there are a few reasons for it.

Surprisingly, it isn’t because we’re bad at our jobs, unprofessional, or simply uncaring.
It’s not even because of evil product managers who “force” us to move on to the next feature, not “allowing” us to fix things.

There are a few, interconnected, reasons:

Loss of context and confidence

The further removed you are from the point where the code was written, the less you understand it. You remember less about what it does, what it’s supposed to do, how it does it, where it’s used, etc.
If you don’t understand all its intended use cases, then you’re not confident that you can test all of them.
Which means you’re worried that any change you make might break some use case you were unaware of. (Yes, good tests help, but how many of us trust our test suites even when we’re not very familiar with the code?)

This type of thinking leads to fear, which inhibits change.
The risk of breaking something isn’t “worth” the benefit of improving it.

Normalization

The more you’ve lived with something, the more used you are to it.
It feels like less and less of a problem with time.

For example – I recently moved house. In the first few days of unpacking, we didn’t have time or energy to re-assemble our bed frame.
It wasn’t a priority – we can sleep just as well on a mattress on the floor. There are more important things to sort out.
We eventually did get round to assembling it. SIX MONTHS after we moved in.
For the first few days, it was weird walking past the different bits and pieces of the bed lying on the floor.
But we got used to it. And eventually, barely thought about it.

Higher priority

This is a result of the previous two reasons.
On the one hand, we have something that we’re used to living with, which we are afraid to change.
We perceive it as high risk, low reward.
On the other hand, we have some new thing that we want to build / improve. There’s always a new thing.
The choice seems obvious. Every single time.

You’re now relying on that bad code

Even though we know that this code is “bad”, we need to build other features on top of it.
And we need to do it quickly, before there’s a chance to fix this “bad” code.
So now we have code that depends on the “bad” code, and will probably break if we change the bad code.

For example, we wrote our data validation at the UI layer. But we know that data validation should happen at the domain layer. So we intend to move that code “later”.
But after a while, we wrote some domain-level code, assuming that data received from the UI is already valid.
So moving the validation out of the UI will break that new code.
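
A minimal sketch of that trap, with invented names (plain Ruby, no framework):

```ruby
# Validation lives in the "UI" layer -- here, a form object.
class SignupForm
  def initialize(params)
    @email = params[:email].to_s.strip.downcase
  end

  # The only place the email is ever checked.
  def valid?
    @email.include?("@")
  end

  attr_reader :email
end

# Domain code written later quietly assumes the email is already valid.
class WelcomeEmail
  def self.subject_for(email)
    # Misbehaves if an unvalidated email ever reaches the domain layer,
    # so moving validation out of SignupForm now risks breaking this.
    "Welcome, #{email.split("@").first}!"
  end
end

form = SignupForm.new(email: " Ada@example.com ")
puts WelcomeEmail.subject_for(form.email) if form.valid?
# => "Welcome, ada!"
```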

A more serious, and “architectural” example:
“We’re starting out with a schema-less database (such as mongoDB), because we don’t know what our data will look like. We want to be able to change its shape quickly and easily. We can re-evaluate it when our data model is more stable”.
I’ve worked at 3 different companies that used this exact same thinking. What I found common to all of them is:
1. They’re still using mongoDB, years later. 2. They’re very unhappy with mongoDB.
But they can’t replace it, because they built so much functionality on top of it!

What’s the point, then?

So, we realise that if we don’t fix something right away, we’re likely to never fix it. So what? Why is that important?

Because it allows us to make informed decisions. Up until now, we thought that our choice was “either fix it now, or defer it to some point in the future”.
Now we can state our options more truthfully – “either fix it now, or be OK with it never being fixed”. That’s a whole new conversation. One which is much more realistic.

We can use this knowledge to inform our priorities:
If we know that it’s “now or never”, we may be able to prioritise that important fix, rather than throwing it into the black hole that is the bottom 80% of our backlog.

We can even use this to inform our work agreements and processes.
One process that worked pretty well for my team in the past, was to allocate a “cleanup” period immediately after each project.
The team doesn’t move on to the next thing right away when a feature is shipped. Rather, it has time to improve all those things that are important, but would otherwise never be fixed.

If we keep believing that a deferred task will one day be done, we’ll never fix anything.
If we acknowledge that we only have a small window of opportunity to act, we can make realistic, actionable plans.

(Case in point: I thought about writing this post while in the shower yesterday. When I got out, I told my wife about it. She said “Go and write this post right now. Otherwise you’ll never do it”.
And she was right: compare this, written and published post, to the list full of “blog post ideas” that never materialized. I never “wrote it later”.
I’m going to take my own advice now, and delete that list entirely.)