One of the biggest fires in my career
Let me tell you a story of how one project I worked on ran into engineering issues caused by accumulated organisational problems, and how it all affected me just after coming back from holidays.
In October 2016 I went to Japan for the second time in my life, with my then-girlfriend-now-wife and our 5 friends. All of us had common interests, which allowed us to stick together and powered a lot of conversations. I had mixed feelings about that trip immediately after coming back, mostly due to problems with coordinating activities for a group of seven people (something I am keen on writing about in a separate post), but in retrospect those were three really amazing weeks. I was able to completely recharge my mental and motivational batteries, as I did not have to think about work or professional things at all.
After coming back to the office on Tuesday, it only took until the end of the same week to completely deplete both of those reserves and fall back to where I was before going on holiday.
This is a story of one of the biggest fires in my professional career.
At that time, I was working as a software engineer in a team of 8 people, consisting of software engineers, quality assurance, and a product owner. I was also the team leader for that team, making sure that the team was healthy, processes were efficient, work was done on time and both the customer and team members were happy.
The company we worked for was a software house. If you have never heard this term, think of an agency, but on a slightly bigger scale. Our customers were mostly located in Western Europe, and were businesses that wanted to quickly scale up their engineering department, but lacked human resources to do so. My company was pretty efficient at spinning up a full-fledged software engineering team (or teams) in a matter of just a few weeks, and “renting” these teams to the customer for a longer period of time. Think years, not weeks or months, as is common with agencies.
The customer itself is not really relevant to this post. Let me just say it was a medium-sized, established business from Western Europe, with about 20-30 engineers on their side and several other non-engineering roles.
Their tech stack was predominantly PHP (CodeIgniter flavoured) and Python 2.7 (Twisted flavoured) in the backend, and, as is common, spaghetti in the frontend, where the common ingredient was jQuery, sprinkled with attempts at introducing something fancier here and there, like Angular.
Now, this post is not really about that tech stack. It ultimately does not matter. You can write terrible code in the best framework and language that exist. You can also write excellent, readable and scalable code in PHP. The language itself is just a tool; it is up to its wielder how it will be used.
That customer also had a microservices architecture in their backend code. Except that all microservices were dependent on a single “shared models” library, so they were all intertwined, limiting opportunities to deploy them separately.
If you are reading these words in 2019, then in just a few weeks life support for Python 2 is going to be turned off. Everything indicates that this is final and not reversible, and thank goodness for that! Python 3 has been available for more than 10 years now.
Three years ago, my team and I spent quite a lot of time discussing our customer’s situation: they were bound to 2.7 with no plans or motivation to upgrade. The management on their side mentioned once that they would prefer to keep using the existing tech stack, because otherwise they would need to spend time training people and getting used to new things. If there was an issue in production, it would take an untrained person more time to fix.
There were a few reasons why I thought it was a good time to consider experimenting with Python 3:
- The next big project for my team required writing a new microservice. What is more, due to the nature of the project we did not have to use the shared models library, so we had much more freedom than usual.
- We would make the customer’s tech stack more up to date with standards and more exciting, so that they would have a bigger chance of attracting talented developers on their side. After all, it can be a minor or major advantage for the company if it can advertise that they use a modern tech stack, as opposed to the jQuery spaghetti and PHP.
- It would be more exciting for my team to work on it. Not only would we work with a more pleasant tech stack, but we would also be able to drive the initiative of making the codebase better, and act as mentors for the upgrade process.
We created the plan for the new service, described how it would work and spent a few weeks coding. For what it is worth, the service was a bridge between a third-party email campaign service and our database, uploading data to or downloading it from that third-party provider.
On top of that, the team spearheaded everything related to the infrastructure: installing new tooling, making sure that dependencies were satisfied, and that our service fit the way the entire stack was orchestrated. All of this was scripted and documented, so that it would be easy to replicate in the development environment, as well as in production.
So far, so good. But then I left for my trip.
I did not have the opportunity to check my email during the trip because of how intense it was. I trusted my coworkers to solve any issues related to the process. I was also confident in the plan we devised. I felt allowed to disconnect from work for 3 weeks, and what a great time it was.
For a complete change of atmosphere, here are a few photos from that trip.
There were several things that did not work quite well on the customer’s side.
I feel confident in putting the blame on the customer in this case. Even though I only have a few years of professional experience, thanks to the nature of the work (software house) I was able to rotate between several projects done for several organisations, ranging from small startups to large companies with headcounts going into the hundreds. I witnessed a range of problems in different environments, and this one I am positive I can put on the customer.
Moreover, a side note: when working in a software house (or agency, it is pretty much the same thing), you will most likely be considered a lesser sort of engineer than those employed directly by the customer, and thus your opinion will matter less. You will also get less information and access, and much less trust. This is worthy of its own post, so I will not go too deep into this topic here.
All of the above unfortunately manifested in our relationship with the customer, regardless of what we did to gain their trust, and regardless of having already launched two major projects for them.
We got the approval from lead engineers and management to proceed with our plan. We made sure to announce our plans early on, at least one month in advance. We communicated what we wanted to do, what the advantages would be, and what we would need from other teams to succeed, especially from infrastructure engineers. The response from the engineering body was… mild, let’s say. I do not think that anybody from other teams put more thought than “oh, cool ¯\_(ツ)_/¯” into it. It definitely felt that way whenever we reminded everyone that our release day was coming and we would be using this new version of Python, and some infrastructure things would be different from what everyone was used to.
Moreover, even though we were working on the same product and had access to the codebase, we did not receive anything more than that. I understand the production environment, databases and credentials would be guarded against outsiders like us, as that data is really sensitive. That was fair. But we were not able to do anything on the staging environment either: all installations, upgrades, migrations, bootstrapping and so on would have to be done by someone on the customer’s side. We made sure that there was such a person, and we prepared extensive guidelines and instructions for them, including prerequisites that would ideally be completed before the release day.
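To give an idea of what a prerequisite can look like when it is written as code rather than prose, here is a minimal sketch of a check that someone with access to an environment could run before release day. This is not the script we actually used: the function name `check_prerequisites`, the minimum version and the module list are all hypothetical and chosen only for illustration.

```python
import importlib.util
import sys


def check_prerequisites(min_version=(3, 5), modules=("json",)):
    """Return a list of human-readable problems; an empty list means all good.

    Both arguments are illustrative defaults, not the real requirements
    of the project described in this post.
    """
    problems = []
    # Compare only (major, minor) against the required minimum.
    if sys.version_info[:2] < tuple(min_version):
        problems.append(
            "Python %d.%d or newer is required, found %d.%d"
            % (min_version[0], min_version[1],
               sys.version_info[0], sys.version_info[1])
        )
    # find_spec returns None when a top-level module cannot be located.
    for name in modules:
        if importlib.util.find_spec(name) is None:
            problems.append("missing module: %s" % name)
    return problems
```

Running something like this on staging and pasting the returned list into a chat would, at least in theory, tell both sides whether the environment matched the instructions without anyone needing extra access.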
Unfortunately, that was not sufficient. On that day, we found out that the steps we prepared were not followed at all. After a short conversation the customer’s manager decided to postpone the release, which was not a decision taken lightly because of the release process. It is material for a separate post, but let me just say that for various reasons there could be just one feature release per day, and the next 2 weeks were already booked by other teams. We had to wait, and it did not help that we were late on an already stretched deadline.
The days before the next release day were spent preparing the engineer from the customer’s side and walking them through the instructions. When that day came, however, they ran into some issues they could not solve by themselves. The issues were easy for us to reproduce and investigate, since we were more familiar with setting the environment up. At the same time, we were not trusted enough to be given access to staging. After spending a long time trying to debug it remotely via HipChat, we tried to solve this by video calling the other engineer, asking them to share their screen, and telling them in real time what to type.
That attempt at releasing failed as well. It turned out that our code, which worked nicely in our sandboxed development environment, was not compatible with the cloud environment (to which we were explicitly not given access). The release had to be rescheduled once again. This was also when the customer’s management complained to our management, one day before the end of my holiday.
In the evening on the day I landed home, after unpacking and relaxing after the end of the journey, I decided to check my email in order to see if there was anything requiring my attention in the morning. The very first message at the top of the queue was from my boss, to the entire team, informing us about an emergency meeting that would happen first thing the following day.
The meeting took place; we talked about the issues and decided that we would fix them first in order to unblock the release, and then chat with the customer about what went wrong and what should be changed about the release process. Immediately after coming back, I was thrown into the “fix everything related to this project” initiative: not only the obviously broken parts, but everything that the customer deemed not to be of high enough quality. By itself it would have been alright, but the team was also working on another big project for that same customer simultaneously, and we still maintained two previous projects we had built for them. In order to regain the trust, we agreed to fix everything by the end of that week. We purposely did not point fingers at that time, knowing from previous experience that we would be in a much better position if we rectified the situation as soon as possible. The work continued into the weekend, and we spent Saturday in the office and Sunday finishing the work.
My batteries were depleted by then. All the energy accumulated during the trip disappeared. It was probably mostly because of that trip that I did not go into negative territory and just fell back to normal.
All of this may not seem like a big deal: a release was delayed by about two weeks, so what. Not the end of the world. It did not seem so at that time, though. We were an external company helping our customers solve their complex software engineering and product problems. Due to the nature of that work, we had to work really hard to get them to trust us. It was also really easy to lose that trust, since outsiders are the first to be blamed whenever anything goes wrong, even if the problems are ingrained in the customer’s organisation, and I bet that was the thinking here. This contributed to me stressing far more than necessary about the entire situation, and putting all the energy I had into improving this project.
My team and I knew what the root cause of the entire situation was, but for business reasons we could not simply point at the customer’s organisation and say “this is broken” with nothing to back it up other than our experience. So I spent a few hours doing a detailed root cause analysis. After pondering it for some time, I asked my boss for advice, and he told me that it would probably not be necessary any longer, as that customer had decided to cut their overseas teams due to problems with funding. Interestingly enough, apparently it was coming anyway and the fire we had just extinguished had little to do with it.
I have been lucky not to be in a similar situation since then. But every time I resync the work email account on my phone after coming back from a trip, I remember that event from three years ago and shiver for a second.