display: none;
My most cherished production incident.
In my short career as a Software Engineer, I've been involved with countless production incidents - from leap seconds, to sharks, to the classic database deletions. But there's one story that will always retain a special place in my heart.
March 16th, 2016. I was a recently promoted senior engineer, tasked with getting advertising on the new version of FT.com.
The FT.com rebuild - codename Next - had been in beta for a while, and was gearing up towards a full roll-out later that year. Just the day before we had increased the amount of traffic going to the new site from 2% to 10% of logged-out users, so it was very much considered a production site.
That day was Budget Day in the UK, up there with elections as one of the busiest traffic days for FT.com. As with every major news event, we were in a period of "heightened awareness" - essentially, avoid doing anything risky.
With Next, we went to great lengths to ensure that deploying code to production was simple and painless. The site was split into many microservices (e.g. front page, article page, sign up form), each comprising some API calls and some front end components. We had automated tests on every build that checked all the important pages were rendering, as well as monitoring to ensure the live site was OK. And so we'd be happily deploying to production dozens of times a day.
My task: hide one of the ad elements from all our pages. It was a shared component, with some CSS that targeted the various parts of the site using a data attribute on the body. A simple one-line change plus a copy/paste job. I pushed the change up, got it reviewed, merged it, and released all the apps.
[data-page-type="article-page"] {
.my-advert {
display:none;
}
}
[data-page-type="list-page"] {
.my-advert {
display:none;
}
}
[data-page-type="front-page"] {
display:none;
.my-advert {
}
}
15 minutes later...
None of the tests failed - all the pages were returning valid HTML. Pingdom was getting 200 OK responses for the front page.
We had lots of safeguards in place to handle a broken website. Even if something broken snuck through our build process, we had Fastly serving stale-if-error. And if a user still saw an error page, they'd not only get treated to some delightful economics jokes, but critically they'd have been able to opt out and go back to the old (at the time, still current) FT.com.
None of that applies if someone just hides it all.
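In case the slip in the snippet above is easy to miss: in the front-page block, the `display:none` sits directly under the `[data-page-type="front-page"]` selector, which matches the body, instead of inside `.my-advert`. A sketch of what that block was presumably meant to be, assuming it should simply mirror the article-page and list-page blocks:

```scss
// Assumed intent: same shape as the article-page and list-page rules.
[data-page-type="front-page"] {
  .my-advert {
    display: none; // hides only the advert, not the whole page
  }
}
```

With the declaration one level up, the server still returns a perfectly good page with a 200 status; the browser just never shows any of it.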
I'm sure there were many learnings from this. We had an incident report, with some actions. Some were probably carried out, some not.
I'm sure this tale might also provide a glimmer of inspiration or comfort to anybody else who has made mistakes in their job.
That's not why I wanted to write about it.
It was just pretty funny is all.