16 Jobs

Gosh! It's been too long, and by that I mean I haven't seen a green light on our CI build server for months. Let's get technical.

Now, green lights are very important to our prod-dev process, to our sanity, and to our ability to adapt fast to our customers. And for the past few months (I am too embarrassed to say how many) we have been getting our fix of green lights only in our personal development environments, and even then only in fits and starts. (That's embarrassing enough given my past roles in this game.)

You see, we currently have over 3000 unit tests and over 1500 integration/e2e tests automated across our back-end and front-end architecture. Yes, we code test-first, and yes, we invest a lot of time and effort testing our stuff at various layers in the code base. We have unit tests at every layer, and layer upon layer of integration tests, even in the Angular JavaScript and WebAPI, and we cover all of that with integration tests that call our APIs and UI tests that drive the whole stack end to end. If you are interested, we also stub out all our 3rd-party services (i.e. googlemaps, paystation, runthered, sendgrid, etc.) in testing, and we even have numerous tests that call those live services to make sure they still respond in the way we expect. (Never assume they don't change - they do, and without warning!)
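
(For the curious, here is a minimal sketch of what we mean by stubbing a 3rd-party service. It's written in TypeScript for illustration only - the interface, class and test names below are ones I've made up for this post, not our actual code.)

```typescript
// Illustrative sketch: the app codes against its own interface rather than the
// live 3rd-party client, so tests can swap in a stub that never hits the network.

interface IEmailGateway {
  send(to: string, subject: string, body: string): Promise<void>;
}

// Production adapter would wrap the real service's HTTP API (omitted in this sketch).
class LiveEmailGateway implements IEmailGateway {
  async send(to: string, subject: string, body: string): Promise<void> {
    throw new Error("not wired up in this sketch");
  }
}

// Test stub: records calls so tests can assert on them, and returns instantly.
class StubEmailGateway implements IEmailGateway {
  public sent: Array<{ to: string; subject: string; body: string }> = [];
  async send(to: string, subject: string, body: string): Promise<void> {
    this.sent.push({ to, subject, body });
  }
}

// In a test, the stub is injected wherever the live gateway would normally go.
async function exampleTest(): Promise<void> {
  const gateway = new StubEmailGateway();
  // ... exercise the code under test with `gateway` injected ...
  await gateway.send("rider@example.com", "Booking confirmed", "See you soon!");
  console.assert(gateway.sent.length === 1, "expected exactly one email to be 'sent'");
}

void exampleTest();
```

The separate live-service tests mentioned above are the complement to this: the stubs keep the bulk of the suite fast and deterministic, while a handful of live calls catch the providers changing under us.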

It goes without saying that automated testing of all these parts comes together for the most important reason of all: to give us the confidence to continually refine the architecture, evolve the features and refactor everything we create - continuously. We practice the boy scout rule daily, and maintainability at our age and stage is paramount for a solid future. Go slow to go fast!

So, what about CI? We love it, we live by it, and we had been depending on it up until about 6 months ago, when our online CI service hit a limitation. We use a CI service called AppVeyor, and we love them. They are a fabulous little outfit in Canada, and the support and service are superb. I am not kidding. The guy who supports every project (he's called Feodor) is an absolute legend. Day or night, the support is amazing. Their service just works, and for our architecture and tech choices it is just perfect. It even allows us to auto-deploy our Azure cloud solution after it hits green lights, and it deals with complex things like encrypted config, git submodules, private repos and a bunch of other stuff that is just hard to do right in CI. It is not without its limitations, though, and the one that pulled us up months ago is the limit on build and test time.

You see, although the 3000 unit tests build and run in about 30 secs, our 1500-odd integration and e2e UI tests take about 6hrs on the current hosted hardware! And the number of those kinds of tests is still increasing every day. (BTW: it used to be 2hrs about 6 months ago!)

Now, I don't know about you, but I don't look forward to the end of a great tech-debt refactoring episode or a new feature dev, only to have to wait for hours to get a green light on my latest changes before I push them. In fact, I have such a strong incentive to ensure I have a green light that I will sit with my head in my hands watching progress bars for about 40mins, after which time - it is just time to move on, man! And moving on is exactly where all this discipline comes undone. I won't explain all the hows here, but let's just say that once you reluctantly move on there is a tendency to stack red lights upon red lights, and at some point you've lost a lot of confidence in what you have, and you have gained a lot of dread and overwhelm about what is coming next - and that is the real killer in this game.

Initially, our build/test time limit was 40mins on a free plan, and we started exceeding that limit within the first few weeks of development. We then bought a paid plan, and the great guys over at AppVeyor had the good grace to extend that first to 1 hour, and then later to 2hrs, where it stands today. "No can do more!" - and fair enough. They have been very generous to us to this point. That just forced us to optimize our testing patterns, identify some testing bottlenecks and even make some optimizations in our backend APIs. But shortly after gaining back 30mins or so, we shot past the 2hr limit again, and haven't seen green since then on our CI build. We have so many long-running tests that the service simply times out and fails the whole process. And that has left us with only one way to see green lights: running the tests on our more powerful local machines, which execute them all in half the time, for whatever reason. (I suspect 16GB RAM, a 3.5GHz CPU and SSD disks have a small part to play in that.) But that is not CI, and even then, speed is only the tip of the iceberg of why CI is so powerful for any dev team.

So for the last few months, the CI build has become more and more irrelevant to our daily cadence, and slowly but surely test-rot has been setting in. Now, that is not to say we haven't been dealing with it - we absolutely have, when we find it. The problem is that to find it you need to find a few hours in the day to run all the tests, and then a few hours after that to fix them. We just don't have that most days, so it has become normal practice to run all the long-running tests overnight (or when you leave the office for a meeting or whatever), fix the broken tests in the morning, and leave the early to late afternoon for new creation/refactoring. And boy, it took some months of soldiering through that before I finally had enough, had lost my self-worth, and decided it was far past time to pay back this technical debt.

So I spoke to Feodor at AppVeyor. Actually, I pleaded with Feodor at AppVeyor to put me right and help us find a solution. The goal: get the CI server to give us a green light within about 30mins. Any longer than 30mins and it becomes a long feedback loop, and that leads to irrelevancy, and loss of sanity again. (You can rightly argue that 30mins is also too long, but for us it's a great starting point to get back in the saddle.)

Now, prior to this little story, I had been researching all kinds of resolutions to our testing problem for months. I researched testing grids, parallel testing, even new versions or repurposing of ReSharper and other tools we have or know about. But most of these tools are geared to unit testing, not long-running tests like e2e tests. There was Selenium Grid, which looked like it would take care of parallel testing for our UI tests, until you realise you have to find the tin to support it - and that's the whole value proposition of cloud-based CI right there. (And it's not just having the tin in the cloud; it goes way past that, in having an extensible automated platform and a community to support it.) Surprisingly, in this day and age, there is not much out there tailored for our needs, and it seems like not too many people either practice this skill or, heck, even understand the difference between unit testing and integration testing, let alone how to scale it up!
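
(If "parallel testing" sounds abstract, the core idea is just partitioning the suite. Here's a rough TypeScript sketch of the round-robin split that any grid or parallel-job setup relies on - the environment variable names and test list are hypothetical, not AppVeyor's or Selenium Grid's actual API.)

```typescript
// Illustrative sketch: each parallel job is told its index and the total job count,
// and runs only its slice of the long-running test suite.

const allE2eTests: string[] = [
  "booking-flow.spec",
  "payment-flow.spec",
  "map-search.spec",
  // ... the rest of the ~1500 long-running tests ...
];

// These would typically come from the CI environment; the names here are made up.
const jobIndex = Number(process.env.PARALLEL_JOB_INDEX ?? "0"); // 0-based index of this job
const jobCount = Number(process.env.PARALLEL_JOB_COUNT ?? "1"); // total parallel jobs

// Round-robin split: test i runs on job (i mod jobCount).
const myTests = allE2eTests.filter((_, i) => i % jobCount === jobIndex);

console.log(`Job ${jobIndex + 1}/${jobCount} will run ${myTests.length} tests`);
// myTests would then be handed to this job's test runner.
```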

So when I was confronted with the only sustainable option of applying more CPU power to the problem, needless to say, I was reluctant to spend, spend, spend on cloud CPU power, having already pissed away hundreds of dollars a month on it for little value.

So here is the deal. Unless you can make your tests run faster (i.e. optimize the code or the testing patterns), or you stop writing so many tests (I love that many devs out there on the public forums actually recommend that as a viable solution!), you have one of two choices: you either apply more CPU speed and RAM to the problem, or you apply more CPUs of the same power to it. The reality is (as Feodor soberly administered to me): "If you have 100 tests and each test takes 1 minute of CPU time then it's either 100 minutes on 1 CPU or 1 minute on 100 CPUs". (You can't pay for that kind of wisdom these days - credit to Feodor for waking me up.)

So, we had to do the math. (It helps that my business partner Andrew is an accountant.) We figured every hour of every day that we sit and wait for tests costs us about, let's say, $100 (we are a startup). Then, when we discover a broken test, we spend about another hour fixing it. That's an hour or two later, and of course it assumes nothing else has been done during the wait. If any of those broken tests compound (which is what tends to happen if you do something else while you wait, of course), the fixes compound, and eventually those compounding fixes delay discovering new broken tests. Now, if we are pushing multiple times a day (as we should), we are doing a lot of waiting for red lights, fixing in haste and scrambling around to verify the fix again, for no good reason. Believe me, it drives me crazy. (Of course, what really drives me crazy is that I know all this implicitly; heck, I've been coaching dev teams for years about the costs and pitfalls of doing this! Forgive my hypocrisy.) So, an extra 'job' (that is what we call an extra CPU running in parallel) costs us about, say, $50/month (just for this discussion; the reality is actually cheaper). So, for every hour of every day that we are waiting for tests to give us a red light, we can afford to spend that money on a new parallel job, as long as that extra job reduces the overall time to complete the whole set of tests (which is what needs to tell you the whole truth, not just part of it).
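
(Spelling that back-of-the-envelope maths out, with our rough numbers plugged into a TypeScript sketch - the figures are ours and approximate, not universal:)

```typescript
// Break-even sketch using the rough numbers from the paragraph above.
const costPerHourWaiting = 100;   // $ per hour of dev time spent waiting/fixing (our rough figure)
const costPerExtraJob = 50;       // $ per month for one additional parallel job (rough figure)
const workingDaysPerMonth = 20;

// If one extra job saves `hoursSavedPerDay` hours of waiting per day, the monthly saving is:
function monthlySaving(hoursSavedPerDay: number): number {
  return hoursSavedPerDay * costPerHourWaiting * workingDaysPerMonth;
}

// Break-even: the extra job pays for itself if it saves at least this many hours per day.
const breakEvenHoursPerDay = costPerExtraJob / (costPerHourWaiting * workingDaysPerMonth);
console.log(breakEvenHoursPerDay); // 0.025 hours, i.e. about 1.5 minutes of waiting saved per day
console.log(monthlySaving(1));     // saving 1 hour/day is worth ~$2000/month against a $50/month job
```

On those numbers, an extra job only has to claw back a minute or two of waiting per day before it pays for itself, which is why the decision stopped feeling like spending at all.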

So, you figure in our case, 1500 long-running tests take 2 parallel jobs 6hrs to complete. We then need 16 jobs in parallel to get that test run down near our 30min target (4 jobs gets it to 3hrs, 8 jobs gets it to 1.5hrs, and 16 jobs gets it to 45mins). In our case, with some test-group refactoring, we got it down a little further, closer to the target 30mins. 16 jobs, wicked!
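
(The same job-count arithmetic as a small TypeScript sketch, just to make the halving obvious - nothing here beyond the numbers already quoted above:)

```typescript
// 1500 long-running tests take 2 jobs 6 hours, so the total work is 2 * 6 = 12 job-hours,
// and (ideally) wall-clock time is total work divided by the number of parallel jobs.
const totalJobHours = 2 * 6; // 12 job-hours of long-running tests

function wallClockMinutes(jobs: number): number {
  return (totalJobHours / jobs) * 60;
}

for (const jobs of [2, 4, 8, 16, 32]) {
  console.log(`${jobs} jobs -> ${wallClockMinutes(jobs)} minutes`);
}
// 2 -> 360, 4 -> 180, 8 -> 90, 16 -> 45, 32 -> 22.5
// 16 jobs lands at ~45 minutes; the last stretch towards ~30 came from regrouping the slowest tests.
```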

Before we were forced down this track, I was reluctant to even pay for 2 parallel jobs! Our monthly subscription bill for various cloud services and tools is already painfully high, and I didn't want to exacerbate that any further. But now I couldn't be happier, because by taking the economic point of view we are not only saving money each day, we are also getting back our velocity and our sanity - and we need those more than anything to keep this process sustainable. We were insane for not doing this sooner, and the universe is better aligned, because I deeply believe we should be using the best tools we can afford. It actually feels like a relief and a pleasure to be spending good money on this.

Would you pay $800 a month for your CI service? You probably should. It's worth every penny.

We now have green lights galore, and a desktop 'siren of shame' (and the 'Shamebrero') to let us know pretty quickly when we've just broken the build, and to wallow in our evil, foolish shortcuts.

And yesterday, we went live at www.roamride.co.nz!