This post documents the way I, personally, think a successful software organization could be structured and run.
I previously wrote about the way a successful individual developer could work.
This post is taking a much broader view of the entire tech organization.
You’ll find that the principles and practices described here are similar to, identical to, or enablers of the ones mentioned in the above post.
As always, the opinions here are influenced by well-known best practices (DevOps, agile), and by my own ~15 years of experience in different software organizations.
It’s also important to mention that these opinions are not informed by actual experience of holding a senior leadership role. So this post is quite one-sided in favour of the “internal” tech organization, without much (or any) consideration of the CTO’s role as part of the wider management team.
This is meant to serve as a living document – I’m sure I’ll change it based on readers’ feedback, and my own learnings and observations.
Goals (The “Why?”)
I don’t currently, nor do I ever intend to, serve as a CEO, CTO, or any other ‘chief’.
So this isn’t an instruction manual for my future self.
I also don’t presume to be a “consultant” or “executive coach” (yet?).
I’ll never have the authority to actually implement all the items in this list. But I think it’s still valuable to put in writing, for a few reasons:
- To put my own thoughts in order – writing down this list will force me to articulate my “philosophy”, and to clarify it to myself. Clarifying my values and priorities is a valuable exercise. Especially for times such as job searching, when I consider whether a company is a good fit for me.
- To use in my own little domain – even though I’ll never be a “top dog”, I may have the opportunity to lead a team again. Some of the principles and practices outlined here can be implemented even at a small scale.
- To influence others – I hope to influence the thoughts and actions of my employers in the direction I believe is right. Even if this document itself is not enough to effect change, it could serve as a starting point for a conversation.
These are a high-level overview of “How we win” – if we succeed in the below, then we will win as a team.
They are outcomes, or metrics, rather than concrete actions and steps (see “practices” below for a breakdown of the principles into actionable items).
You won’t find anything ground-breaking here; as I mentioned, this is built on top of well-established philosophies.
A generative culture (sometimes referred to as “Westrum organizational culture”) is an organizational culture that is goal- and mission-driven, fosters collaboration, encourages risk taking, and implements novel ideas.
It is informed by the belief that employees are internally motivated.
Meaning – Everyone wants to do a good job. There’s no need to “force” the workers to do a good job. (this is known as “Theory Y”)
If we espouse this theory, then there’s no need for management to overly supervise, check up, or impose limitations on employees. But rather, give them the necessary tools, knowledge, and training, to succeed.
Some concrete examples of this may be:
- Team autonomy in what they do – projects and tasks are decided on by the people who do the work. Management is responsible for priorities, vision, and “big picture”. Not the everyday work.
- Team autonomy in how they work – teams are free to choose how they go about achieving their goals. Scrum / kanban / waterfall / anarchy.. whatever gets good, consistent results.
- Trust – no “code owners” who must approve every change. No requirement for X number of “approvals”. We employ grown-ups – they won’t start riffing on main, committing bugs and spaghetti code, just because they can. We trust them to be responsible, and to come up with quality mechanisms that work for them, without forcing anything on them.
- Failure (such as a bug, production outage, miscommunication with a customer) does not lead to punishment. After all, the person(s) who made the mistake had the best of intentions. This means that the system they operated in allowed for that mistake to happen (even in the case where that system tasked them with doing a job that they’re not qualified for).
Therefore, failure is an opportunity to learn and improve for the company, as well as the individual.
Psychological safety is the belief that one will not be punished or humiliated for speaking up with ideas, questions, concerns, or mistakes, and that the team is safe for interpersonal risk taking (definition by Dr. Amy Edmondson).
That’s a core aspect of a generative culture. In a generative culture, we rely on individuals, not “leadership”, to come up with ideas, initiatives and the execution that drive the company forward. This cannot be done if employees don’t feel safe expressing their opinions.
Some other required attributes of a generative organization include:
- Continuous learning (and improvement)
- Cross-team collaboration
The “Four Key Metrics”
DevOps Research and Assessment (DORA) has consistently found that excelling at these four metrics leads to excelling in business outcomes (profitability, market share, customer satisfaction, employee satisfaction, and more):
Deployment Frequency – how often you put new code in front of customers
Lead Time for Changes – how long it takes from the first commit on a developer’s machine, until that code is in front of customers
Time to Restore Services – the time between introducing a failure (bug / outage), and resolving it.
Change Failure Rate – % of deployments that cause a failure in production.
Moving these needles in the right direction (more frequent deployments, shorter lead and restore times, fewer failed changes) requires being really good at quite a few behaviours and practices, as outlined below.
While they are not “principles” per se, they are very helpful high-level goals that we can use as guidance.
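To make the definitions above concrete, here is a minimal sketch of how the four metrics could be computed from a list of deployment records. The `Deployment` record shape, the field names, and the choice of the median as the summary statistic are all assumptions made for illustration; they are not DORA's tooling or an official formula.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    first_commit_at: datetime               # first commit included in the change
    deployed_at: datetime                   # when the change reached customers
    failed: bool = False                    # did it cause a failure in production?
    restored_at: Optional[datetime] = None  # when the failure was resolved, if any


def four_key_metrics(deployments: list, period_days: int) -> dict:
    """Summarise the four key metrics over a reporting period."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.first_commit_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    # Time from introducing the failure (the deployment) to resolving it.
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


# Made-up sample data: four deployments over a two-day period, one of which failed.
start = datetime(2024, 1, 1, 9, 0)
sample = [
    Deployment(start, start + timedelta(hours=2)),
    Deployment(start, start + timedelta(hours=4), failed=True,
               restored_at=start + timedelta(hours=5)),
    Deployment(start, start + timedelta(hours=6)),
    Deployment(start, start + timedelta(hours=8)),
]
metrics = four_key_metrics(sample, period_days=2)
```

In practice the records would come from the delivery pipeline and the incident tracker rather than being written by hand, and a team may prefer a different summary (for example a percentile) over the median.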
We recognise that we are often wrong. But we don’t know that we’re wrong, or in what way.
Therefore, we solicit, we value, and we act on, feedback. We aim to get feedback, and act on it, as quickly as possible.
This has multiple manifestations –
- Feedback about whether our product meets customers’ needs
- Feedback about whether our software behaves as intended
- Feedback about the quality of our code
- Feedback about us, our processes, and tools
The above principles are nothing in their own right. Only daily behaviours and incentive structures can bring a principle to life.
Here are some concrete practices and processes that, I believe, help realise the above principles:
Product teams / reverse Conway manoeuvre
Conway’s law dictates that our software structure will reflect the company’s organizational structure.
So, if we want to create a software architecture of independent, loosely-coupled components, then we need to structure our organization in such a way.
This would look different in every problem domain. But generally, a team is assigned a cohesive, independent sub-domain of the company’s business. For example – a “loans” team, a “savings” team, a “mortgages” team, and so on.
Each team has the responsibility for, and the personnel / tools to, provide the best loans / savings / mortgage software product. Starting from ideation, up to maintaining a service in production.
The team may be asked to provide some big picture outcome (e.g. “x% more savings account customers”, or “y% less churn for mortgage holders”). But the way they go about it is up to the team itself.
Generative culture / autonomy, trust – by making teams self-sufficient. They’re not dependent on anyone outside the team (e.g. QA team, ops team) to accomplish their goals.
Four key metrics / delivery – by removing dependencies and coupling, there’s less need for communication and coordination. Teams are free to work as quickly as they’d like.
Communication – Communities of practice
While teams are autonomous, none of them is an island. Teams still need to effectively work together, communicate about what they’re doing, coordinate, etc.
Additionally, learnings from one team (e.g. how to solve a specific problem) can be applicable to other teams.
The “standard” approach to these needs is a hierarchical one, traversing the organizational “tree”:
if team A needs to coordinate with team B, then it will go up through team A’s manager, who will go to team A’s director, who will go to their VP, who will go down to team B’s director, who will go to team B’s manager.
This approach is wasteful, and contradicts the principles of autonomy and Theory Y.
An alternative would be to create structures where teams and individuals can communicate directly.
This could be communities of practice (e.g. “frontend devs”, “DB administrators”), technical all-hands (e.g. weekly open engineering meeting), or ad-hoc working groups. The details should be self-organized by the team(s) for whatever works for them. Management’s role is to allow the time and space (and encouragement) for these structures to emerge.
Generative culture / autonomy – even when coordination outside the team is necessary, the team has the autonomy to choose how to do that.
Generative culture / collaboration – allowing (and encouraging) direct communication, rather than a hierarchical one, increases collaboration.
Generative culture / continuous learning – by providing opportunities for individuals and teams to learn from each other.
We already touched on many things that management does not do – supervision, validation, tactical decision-making.
So what does management do, in our fairytale, rainbows-and-unicorns organization?
The main responsibilities of management are, generally, twofold:
- Provide context and “big picture” – making sure that everyone in the organization knows what the overall company goals and priorities are. So that they’re able to prioritise their own work accordingly.
Making connections between different parts of the organization (e.g. “oh, you’re doing project X? well, Joanne from marketing is doing project Y which is related. You should talk!”)
- Reinforce the desired organizational culture – not by talking about it, but by consciously incentivizing desired behaviour
- When a mistake / outage occurs, celebrate it as an opportunity to learn and improve, rather than playing blame games. Encourage teams to look at how the system / work processes can be improved to prevent the next issue.
- Proactively reward employees who exemplify desired principles. Get rid of employees who don’t.
- Back teams up when they need to invest resources in improving their processes, even in the face of external pressure (e.g. customer requests)
- Back teams up even when they’re going in a different direction than what the people in management would’ve done
- Share their own mistakes and vulnerabilities openly, to promote a culture of psychological safety
As you can see, the job of management is extremely important, but not large in volume.
This means that the organization requires fewer managers to function (for example, there’s no need for “directors” who have several teams under them, or “VPs” who have several directors under them, etc.).
Generative culture – this style of management allows individuals, not management, to generate value for the company (hence, “generative” culture).
As said earlier, each team can choose whatever workflow works for them.
There are a few common guidelines that are helpful across the board, though:
Relationship with the customer
Everyone on the team is expected to engage with, and learn from, customers.
Customers are not hidden away from developers behind product managers, business analysts etc.
This has the added benefit of removing menial tasks from product people (such as acting as a go-between for developers and customers, or writing down work tickets). They are free to focus on value-adding activities, such as customer behaviour analysis, market research, forward-planning, etc.
Feedback – developers have access to direct feedback from customers about what works and what doesn’t.
There are many technical practices required to achieve the above principles (especially the “four metrics”). DORA has a comprehensive list of them. I’ll only mention the ones that I’ve found especially important or valuable:
Continuous integration and delivery
In order to understand as quickly as possible whether our software behaves as intended, we must integrate all changes as frequently as possible, and check whether the software indeed behaves as it should.
In order to understand as quickly as possible whether the changes we’ve made are useful to customers, we must put them in front of customers as quickly (and frequently) as possible.
The pursuit of continuous integration and delivery is beneficial in itself.
It forces us to improve in many aspects of our work – automated testing, configuration and source management (to maintain safety while going fast), loose coupling (to avoid teams being blocked) etc.
Delivery pipeline as a first-class citizen
If we can’t (safely) deploy our software, then our customers can’t benefit from anything that we do. In this case, there’s no point to any other activity (e.g. developing a new feature, or fixing a bug), as it will not go in front of customers.
However obvious this seems, it has profound implications. It means that any “blockage” of our deployment pipeline (bad configuration, flaky tests, even a significant slowdown of the pipeline) is as bad as a customer-facing outage. (Actually, it is a customer-facing outage: the customer does not get the functionality that they should.)
Feedback (at multiple levels)
The four key metrics / delivery
Automated tests / test-driven development
I’ve actually seen an organization that did great on delivery metrics (e.g. multiple deployments per day), without emphasizing automated tests. As expected, their stability metrics (e.g. number of bugs) were incredibly poor. And it was noticeable – the company lost multiple contracts because customers were dissatisfied with the quality of the software.
If we aim to be able to release frequently, with confidence, we must have a reliable test suite.
Writing tests first also provides invaluable feedback about our software design.
Feedback – about whether our software behaves as intended
The four key metrics – all of them. Testing decreases the odds of introducing a bug (i.e. change failure rate). But it also gives us the confidence to deploy rapidly, without lengthy manual verification.
Many of us have an aversion to changing working code. Whether it’s because we don’t see the value (it’s working; so what if it’s hard to read?), or because we’re afraid of the consequences (i.e. introducing a bug).
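As a small illustration of the test-first idea, here is a minimal sketch using Python’s built-in `unittest` module. The `SavingsAccount` class, the `InsufficientFunds` exception, and the business rule itself are hypothetical, made up for this example. The tests appear first in the file because, in a test-first workflow, they would have been written (and seen to fail) before the implementation existed.

```python
import unittest


class WithdrawTest(unittest.TestCase):
    """Written before SavingsAccount.withdraw existed; each test pins down one behaviour."""

    def test_withdraw_reduces_balance(self):
        account = SavingsAccount(balance=100)
        account.withdraw(30)
        self.assertEqual(account.balance, 70)

    def test_cannot_overdraw(self):
        account = SavingsAccount(balance=100)
        with self.assertRaises(InsufficientFunds):
            account.withdraw(101)
        # A rejected withdrawal must leave the balance untouched.
        self.assertEqual(account.balance, 100)


class InsufficientFunds(Exception):
    """Raised when a withdrawal would take the balance below zero."""


class SavingsAccount:
    """Hypothetical domain object, implemented just far enough to pass the tests."""

    def __init__(self, balance: int = 0) -> None:
        self.balance = balance  # whole currency units, to keep the sketch simple

    def withdraw(self, amount: int) -> None:
        if amount <= 0:
            raise ValueError("amount must be positive")
        if amount > self.balance:
            raise InsufficientFunds(f"balance {self.balance} is less than {amount}")
        self.balance -= amount
```

Saved as a module, this can be run with `python -m unittest`. Note how writing the tests first forced two design decisions early: what the caller sees when a withdrawal is refused (an exception, not a silent no-op), and that a refused withdrawal has no side effects.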
However, if we aim for excellence in delivery and reliability, we can’t accept code that is hard (for us) to maintain.
Code that’s hard to maintain means slower speed (since it takes longer to change). It also jeopardizes our reliability (because it makes it easier to introduce a bug).
Therefore, we must encourage (and expect) developers to improve code that is difficult to understand or to change.
More than that – It’s also important to change code based on improved understanding of the problem:
We’ve all seen cases where code was built to accommodate use case X, but use case Y is what the customer actually needed. So the code implements use case X, with some hacks and workarounds to make it behave like Y.
This is another case of code that’s hard to understand and maintain, and must be changed.
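A contrived sketch of what this can look like, with entirely hypothetical names: a notification function was built for email only (use case X), and SMS support (use case Y) was later bolted on by smuggling the channel into the address string. The refactored version models the channel explicitly, while keeping the observable behaviour the same.

```python
from enum import Enum

SMS_MAX_LENGTH = 160


# Before: built for email only, then bent towards SMS with a string-prefix hack.
def notify_before(address: str, message: str) -> str:
    if address.startswith("sms:"):  # workaround added later
        return f"SMS to {address[4:]}: {message[:160]}"
    return f"Email to {address}: {message}"


# After: the code says what the customer actually needed, namely a choice of channel.
class Channel(Enum):
    EMAIL = "email"
    SMS = "sms"


def notify_after(channel: Channel, recipient: str, message: str) -> str:
    if channel is Channel.SMS:
        return f"SMS to {recipient}: {message[:SMS_MAX_LENGTH]}"
    return f"Email to {recipient}: {message}"
```

Both versions produce the same output for the same intent; the difference is that the second no longer requires the reader to know about the `sms:` convention. A reliable test suite is what makes this kind of change safe to do.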
There are many more valuable technical practices. But, I believe that the teams will find them for themselves, if they aim to improve on the principles and practices already mentioned.
For example –
- Observability and monitoring – a team will naturally invest in those areas if it aims to improve its reliability metrics
- Change management, version control, deployment automation – a team will naturally invest in those areas if it aims to improve its delivery metrics
If you’ve read this far, and you are not my mother, then thank you very much for bearing with me.
(If you are my mother, then hi mum!)
You may be interested in reading some of DORA’s research, or the “Accelerate” book.
This post has turned out to be a sort of poor-person’s reader’s digest of the DORA materials..