Trillions spent and big software projects are still failing
At a large box retail chain (15 states, ~300 stores) I worked on a project to replace the POS system.
The original plan had us getting everything working (Ha!), then deploying it out to stores, and only then ending up with the two oddball "stores": the company cafeteria and the surplus store. They were technically stores in that they had all the same setup and processes, but they were odd.
When the team that I was on was brought into this project, we flipped that around and first deployed to those two several months ahead of the schedule to deploy to the regular stores.
In particular, the surplus store had a few dozen transactions a day. If anything broke, you could do reconciliation by hand. The cafeteria had single register transaction volume that surpassed a surplus store on most any other day. Furthermore, all of its transactions were payroll deductions (swipe your badge rather than credit card or cash). This meant that if anything went wrong there we weren't in trouble with PCI and could debit and credit accounts.
Ultimately, we made our deadline to get things out to stores. We did have one nasty bug that showed up in late October (or was it early November?) with repackaging counts (if a box of 6 was $24 and a single item was $4.50, then buying 6 single items was "repackaged" to cost $24 rather than $27), which interacted with a BOGO sale. That bug resulted in absurd receipts with sales and discounts: the receipt showed you spent $10,000 but were discounted $9,976, and then the GMs got alerts that the store was not able to make payroll because of a $9,976 discount. One of the devs pulled an all-nighter to fix that one and it got pushed to the stores.
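For anyone trying to picture the repackaging rule above, here's a minimal sketch in Python. It is not the actual POS code: the function name and the integer-cents representation are my own assumptions, and it deliberately leaves out the BOGO interaction, since the details of that bug aren't described here.

```python
def repackaged_total_cents(quantity, single_cents, pack_size, pack_cents):
    """Price `quantity` single items, charging the pack price for every
    full pack's worth of singles and the single price for the rest.

    All prices are integer cents to avoid floating-point rounding.
    Hypothetical helper for illustration only.
    """
    if quantity < 0:
        raise ValueError("quantity must be non-negative")
    if pack_size <= 0:
        raise ValueError("pack_size must be positive")

    # Split the purchase into full packs and leftover loose items.
    full_packs, loose_items = divmod(quantity, pack_size)
    return full_packs * pack_cents + loose_items * single_cents


if __name__ == "__main__":
    # Box of 6 is $24.00, a single item is $4.50.
    for qty in (5, 6, 7):
        total = repackaged_total_cents(qty, 450, 6, 2400)
        print(f"{qty} singles -> ${total / 100:.2f}")
```

Six singles come out at $24.00 instead of $27.00, which is the intended behavior. The trouble in a real system starts when a second promotion gets applied on top of the same line items, so any discount logic needs to agree on whether it sees the items before or after repackaging.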
I shudder to think about what would have happened if we had tried to push the POS system out to customer-facing stores without first working out the performance issues in the cafeteria, or if we had had to reconcile transactions to hunt down incorrect tax calculations.
Scale is separately a Product and Engineering question. You are correct that you cannot scale a Product to delight many users without it first delighting a small group of users. But there are plenty of scaled Engineering systems that were designed from the beginning to reach massive scale. WhatsApp is probably the canonical example of something that was a rather simple Product with very highly scaled Engineering and it's how they were able to grow so much with such a small team.
Okay, but how much more software is used? If IT spending has tripled since 2005 but we use 10x more software, I'd say the trend is good.
Mostly the issues are non-technical and grounded in a lack of accountability and being too big to fail. A lot of these failures are failing top down. Unrealistic expectations and hand-wavy leadership come first, and then that gets translated into action. Once these big projects get going and are burning big budgets and it's obvious that they aren't working, people get very creative at finding ways to tap into these budgets.
Here in Germany, the airport in Berlin was opened only a few years ago after being stuck in limbo a decade after it was supposed to open and the opening was cancelled only 2 weeks before it was supposed to happen. It was hilarious, they had signs all over town announcing how they were going to shut down the highway so the interior of the old airport could be transported to the new one. I kid you not. They were going to move all the check-in counters and other stuff over and then bang on it for a day or two and then open the airport. Politicians, project leadership, etc. kept insisting it was all fine right up until the moment they could not possibly ignore the fact that there was lots wrong with the airport and that it wasn't going to open. It then took a decade to fix all that. There's a railway station in Stuttgart that is at this point very late in opening. Nuclear plant projects tend to be very late and over budget too.
Government IT projects aren't that different from these. It's a very similar dynamic. Big budgets and companies looking to tap into them, highly political decision making, a lack of accountability, lots of top-down pretending it's going to be fine, and a lot of wishful thinking. These are all common ingredients in big project failures.
The software methodology is the least of the challenges these projects face.
One reason AWS got so big is that, inside many companies, it took months to get the infrastructure team to provision a virtual machine.
After every single project, the org comes together to do a retrospective and ask "What can devs do differently next time to keep this from happening again?" People leading the project take no action items, management doesn't hold itself accountable at all, and product isn't held accountable for late-changing requirements. And so, the cycle repeats next time.
I led an effort one time, after a big bug made it to production following one of those crunches, that painted a picture of the root cause: a huge, complicated project handed off to offshore junior devs with no supervision, with the junior devs managing it completely switched out twice during the 8-month project, with no handover and no introspection by leadership. My manager's manager killed the document and wouldn't allow publication until I removed any action items that would constrain management.
And thus, the cycle continues to repeat, balanced on the backs of developers.
This leads to higher and higher towers of abstraction that eat up resources while providing little more functionality than if the problem had been solved lower down. This has been further enabled by a long history of rapidly increasing compute capability and vastly increasing memory and storage sizes. Because developers only interact with these older parts of their systems at the interface level, they often don't know that problems were solved years prior, or are capable of being solved efficiently.
I'm starting to see ideas that will probably form into entire pieces of software "written" on top of AI models as the new floor, where the model basically handles all of the mainline computation, control flow, and business logic. What would have required a dozen MHz and 4MB of RAM to run now requires TFLOPS and gigabytes, and being built from a fresh start again, it will fail to learn from any of the lessons learned when it was done 30 years ago and 30 layers down.
It's also hard when the team actually cares, but there are skills you can learn. Early in my career, I got into solving some of the barriers to software project management (e.g., requirements analysis and otherwise understanding needs, sustainable architecture, work breakdown, estimation, general coordination, implementation technology).
But once you're a bit comfortable with the art and science of those, big new challenges are more about political and environment reality. It comes down to alignment and competence of: workers, internal team leadership, partners/vendors, customers, and investors/execs.
Discussing this is a little awkward, but maybe start with alignment, since most of the competence challenges are rooted in mis-alignments: never developing nor selecting for the skills that alignment would require.
Was there any literature or other findings that you came across that ended up clicking and working for you that you can recommend to us?
Most sizeable software projects require understanding, in detail, what is needed by the business, what is essential and what is not, and whether any of that is changing over the lifetime of the project. I don't think I've ever been on a project where any of that was known; it was all guesswork.
I very rarely hear actual technical reasons for why a decision was made. They're almost always invented after the fact to retroactively justify some tool or design pattern the developer wanted to use. Capabilities and features get tacked on just because it's something someone wanted to do, not because they solve an actual problem or can be traced back to requirements in any meaningful way.
Frankly as an industry we could learn a lot from other engineering fields, aerospace and electrical engineering in particular. They aren't perfect, but in general they're much better at keeping technical decisions tied to requirements. Their processes tend to be too slow for our industry of course, but that doesn't mean there aren't lessons to be learned.
I guess that’s the real problem I have with SV’s endemic ageism.
I was personally offended, when I encountered it, myself, but that’s long past.
I just find it offensive, that experience is ignored, or even shunned.
I started in hardware, and we all had a reverence for our legacy. It did not prevent us from pursuing new/shiny, but we never ignored the lessons of the past.
The only thing that seems to change this is consequences. Take a random person and just ask them to do something, and whether they do it or not is just based on what they personally want. But when there's a law that tells them to do it, and enforcement of consequences if they don't, suddenly that random person is doing what they're supposed to. A motivation to do the right thing. It's still not a guarantee, but more often than not they'll work to avoid the consequences.
Therefore, if you want software projects to stop failing, create laws that enforce doing the things in the project that ensure it succeeds. Create consequences big enough that people will actually do what's necessary. Like a law that says how to build a thing to ensure it works, and how to test it, and then an independent inspection to ensure it was done right. Do that throughout the process, and impose some kind of consequence if those things aren't done. (The more responsibility, the bigger the consequence, so there's motivation commensurate with impact.)
That's how we manage other large-scale physical projects. Of course those aren't guaranteed to work; large-scale public works projects often go over-budget and over-time. But I think those have the same flaw, in that there isn't enough of a consequence for each part of the process to encourage humans to do the right thing.
If there was sufficient consequence for this stuff, no one would ever take on any risk. No large works would ever even be started because it would be either impossible or incredibly difficult to be completely sure everything will go to plan.
So instead we take a medium amount of caution and take on projects knowing it's possible for them to not work out or to go over budget.
Ah finally - I've had to scroll halfway down to find a key reason big software projects fail.
<rant>
I started programming in 1990 with PL/1 on IBM mainframes and for 35 years have dipped in and out of the software world. Every project I've seen fail was mainly down to people - egos, clashes, laziness, disinterest, inability to interact with end users, rudeness, lack of motivation, toxic team culture etc etc. It was rarely (never?) a major technical hurdle that scuppered a project. It was people and personalities, clashes and confusion.
</rant>
Of course the converse is also true - big software projects I've seen succeed were down to a few inspired leaders and/or engineers who set the tone. People with emotional intelligence, tact, clear vision, ability to really gather requirements and work with the end users. Leaders who treated their staff with dignity and respect. Of course, most of these projects were bland corporate business data ones... so not technically very challenging. But still big enough software projects.
Geez... don't know why I'm getting so emotional (!) But the hard-core software engineering world is all about people at the end of the day.
https://en.wikipedia.org/wiki/Auburn_Dam
https://en.wikipedia.org/wiki/Columbia_River_Crossing
If you're 97% over budget, are you successful? https://en.wikipedia.org/wiki/Big_Dig
I don't like this as a metric of success, because who came up with the budget in the first place?
If they did a good job and you're still 97% over then sure, not successful.
But if the initial budget was a dream with no basis in reality then 97% over budget may simply have been "the cost of doing business".
It's easier to say what the budget could be when you're doing something that has already been done a dozen times (as skyscraper construction used to be for New York City). It's harder when the effort is novel, as is often the case for software projects since even "do an ERP project for this organization" can be wildly different in terms of requirements and constraints.
That's why the other comment about big projects ideally being evolutions of small projects is so important. It's nearly impossible to accurately forecast a budget for something where even the basic user needs aren't yet understood, so the best way to bound the amount of budget/cost mismatch is to bound the size of the initial effort.