Planet Release Engineering

March 11, 2019

Chris AtLee (catlee)

Smaller Firefox Updates

Back in 2014 I blogged about several ideas about how to make Firefox updates smaller.

Since then, we have been able to implement some of these ideas, and we also landed a few unexpected changes!

tl;dr

It's hard to measure exactly what the impact of all these changes is over time. As Firefox continues to evolve, new code and dependencies are added and old code is removed, while at the same time the build system and installer/updater continue to see improvements. Nevertheless, I was interested in comparing what the impact of all these changes would be.

To attempt a comparison, I've taken the latest release of Firefox as of March 6, 2019, which is Firefox 65.0.2. Since most of our users are on Windows, I've downloaded the win64 installer.

Next, I tried to reverse some of the changes described below. I re-compressed omni.ja, used bz2 compression for the MAR files, re-added the deleted images and startup cache, and used the old version of mbsdiff to generate the partial updates.

 Format                        | Current Size | "Old" Size | Improvement (%)
 ----------------------------- | ------------ | ---------- | ---------------
 Installer                     | 45,693,888   | 56,725,712 | 19%
 Complete Update               | 49,410,488   | 70,366,869 | 30%
 Partial Update (from 64.0.2)  | 14,935,692   | 28,080,719 | 47%
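
For reference, the "Improvement (%)" column is just the relative size reduction between the two builds; a quick sketch of the arithmetic using the byte counts from the table:

    # Reproduce the "Improvement (%)" column: relative reduction from old to new size.
    sizes = {
        "Installer": (45_693_888, 56_725_712),
        "Complete Update": (49_410_488, 70_366_869),
        "Partial Update (from 64.0.2)": (14_935_692, 28_080_719),
    }

    for name, (new, old) in sizes.items():
        improvement = (old - new) / old * 100
        print(f"{name}: {improvement:.0f}% smaller")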

Small updates FTW!

Ideally most of our users are getting partial updates from version to version, and a nearly 50% reduction in partial update size is quite significant! Smaller updates mean users can update more quickly and reliably!

One of the largest contributors to our partial update sizes right now is the binary diff size for compiled code. For example, the patch for xul.dll alone accounts for 13.8MB of the 14.9MB partial update. Diffing algorithms like courgette could help here, as could investigations into making our PGO process more deterministic.

Here are some of the things we've done to reduce update sizes in Firefox.

Shipping uncompressed omni.ja files

This one is a bit counter-intuitive. omni.ja files are basically just zip files, and originally were shipped as regular compressed zips. The zip format compresses each file in the archive independently, in contrast to something like .tar.bz2 where the entire archive is compressed at once. Having the individual files in the archive compressed makes both types of updates inefficient: complete updates are larger because compressing (in the MAR file) already compressed data (in the ZIP file) doesn't yield good results, and partial updates are larger because calculating a binary diff between two compressed blobs also doesn't yield good results. In addition, our Windows installers have been using LZMA compression for a long time, and after switching to LZMA for update compression, we can achieve much better compression ratios with LZMA of the raw data than with LZMA of zip (deflate) compressed data.
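
As an illustration of the idea (not the actual build-system code; real omni.ja files use a slightly nonstandard zip layout, so treat this purely as a sketch), repacking a jar so its members are stored rather than deflated is straightforward with Python's zipfile module:

    import zipfile

    def repack_uncompressed(src_jar, dst_jar):
        """Rewrite a zip/jar so every member is stored uncompressed.

        Compressing the MAR with LZMA (and binary diffing for partial updates)
        then operates on raw data instead of deflated blobs.
        """
        with zipfile.ZipFile(src_jar) as src, \
             zipfile.ZipFile(dst_jar, "w", compression=zipfile.ZIP_STORED) as dst:
            for info in src.infolist():
                dst.writestr(info.filename, src.read(info.filename))

    # Hypothetical usage:
    # repack_uncompressed("omni.ja", "omni-stored.ja")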

The expected impact of this change was ~10% smaller complete updates, ~40% smaller partial updates, and ~15% smaller installers for Windows 64 en-US builds.

Using LZMA compression for updates

Pretty straightforward idea: LZMA does a better job of compression than bz2. We also looked at brotli and zstd for compression, but LZMA performs the best so far for updates, and we're willing to spend quite a bit of CPU time to compress updates for the benefit of faster downloads.
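
A rough way to see the bz2 versus LZMA difference on your own data (just a sketch; the real MAR tooling uses the xz utilities rather than Python):

    import bz2
    import lzma
    import sys

    def compare(path):
        data = open(path, "rb").read()
        bz2_size = len(bz2.compress(data, 9))
        # Preset 9 "extreme" trades extra CPU time for a smaller result,
        # roughly the trade-off described above.
        xz_size = len(lzma.compress(data, preset=9 | lzma.PRESET_EXTREME))
        print(f"original: {len(data):>12,}")
        print(f"bz2 -9:   {bz2_size:>12,}")
        print(f"xz -9e:   {xz_size:>12,}")

    if __name__ == "__main__":
        compare(sys.argv[1])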

LZMA compressed updates were first shipped for Firefox 56.

The expected impact of this change was 20% reduction for Windows 64 en-US updates.

Disable startup cache generation

This came out of some investigation into why partial updates were so large. I remember digging into this in the Toronto office with Jeff Muizelaar, and we noticed that one of the largest contributors to partial update sizes was the startup cache files. The buildid was encoded into the header of startup cache files, which effectively changes the entire compressed file. It was unclear whether shipping these provided any benefit, and so we experimented with turning them off. Telemetry didn't show any impact to startup times, and so we stopped shipping the startup cache as of Firefox 55.
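
The effect is easy to reproduce: change a few header bytes in otherwise identical content, compress both versions, and see how little of the compressed output still lines up. This is a toy sketch with a made-up header layout, not the real startup cache format:

    import bz2

    body = b"function f() { return 42; }\n" * 20000  # stand-in for cached script data
    old = b"buildid:20170701000000" + body           # hypothetical header layout
    new = b"buildid:20170702000000" + body           # only the buildid differs

    old_c, new_c = bz2.compress(old), bz2.compress(new)
    same = sum(a == b for a, b in zip(old_c, new_c))
    print(f"identical compressed bytes: {same}/{min(len(old_c), len(new_c))}")
    # Uncompressed, the two blobs differ by a single digit; compressed, far fewer
    # bytes typically line up, so a binary diff of the compressed files balloons.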

The expected impact of this change was about 25% for a Windows 64 en-US partial update.

Optimized bsdiff

Adam Gashlin was working on a new binary diffing tool called bsopt, meant to generate patch files compatible with bspatch. As part of this work, he discovered that a few changes to the current mbsdiff implementation could substantially reduce partial update sizes. This first landed in Firefox 61.
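
If you want to get a feel for patch sizes locally, the third-party bsdiff4 Python package (not Mozilla's mbsdiff, but the same family of patch formats) can be handy; a rough sketch:

    # Requires the third-party "bsdiff4" package (pip install bsdiff4).
    import bsdiff4

    def patch_size(old_path, new_path):
        old = open(old_path, "rb").read()
        new = open(new_path, "rb").read()
        patch = bsdiff4.diff(old, new)
        assert bsdiff4.patch(old, patch) == new  # round-trip sanity check
        return len(patch)

    # Hypothetical filenames:
    # print(patch_size("xul-64.0.2.dll", "xul-65.0.2.dll"))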

The expected impact of this change was around 4.5% for partial updates for Windows 64 builds.

Removed unused theme images

We removed nearly 1MB of unused images from Firefox 55. This shrinks all complete updates and full installers by about 1MB.

Optimize png images

By using a tool called zopflipng, we were able to losslessly recompress PNG files in-tree, and reduce the total size of these files by 2.4MB, or about 25%.
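
A batch recompression sketch, assuming a zopflipng binary on PATH that accepts "zopflipng -y input.png output.png" (overwriting in place):

    import pathlib
    import subprocess

    def optimize_pngs(root):
        """Losslessly recompress every PNG under root and report the savings."""
        for png in pathlib.Path(root).rglob("*.png"):
            before = png.stat().st_size
            subprocess.run(["zopflipng", "-y", str(png), str(png)], check=True)
            after = png.stat().st_size
            print(f"{png}: {before:,} -> {after:,} bytes")

    # Hypothetical in-tree directory:
    # optimize_pngs("browser/themes")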

Reduce duplicate files we ship

We removed a few hundred kilobytes of duplicate files from Firefox 52, and put in place a check to prevent further duplicates from being shipped. It's hard to measure the long term impact of this, but I'd like to think that we've kept bloat to a minimum!
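
A duplicate check can be as simple as grouping packaged files by content hash; a minimal sketch (the paths are hypothetical):

    import hashlib
    import pathlib
    from collections import defaultdict

    def find_duplicates(root):
        """Group files under root by SHA-256 of their contents."""
        by_digest = defaultdict(list)
        for path in pathlib.Path(root).rglob("*"):
            if path.is_file():
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                by_digest[digest].append(path)
        return [paths for paths in by_digest.values() if len(paths) > 1]

    # for group in find_duplicates("dist/bin"):
    #     print("duplicate content:", *group)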

March 11, 2019 10:03 PM

November 20, 2018

Chris AtLee (catlee)

PyCon Canada 2018

I'm very happy to have had the opportunity to attend and speak at PyCon Canada here in Toronto last week.

PyCon has always been a very well organized conference. There is a wide range of talks available, even on topics not directly related to Python. I've attended PyCon events in the past, but never the Canadian one!

My talk was titled How Mozilla uses Python to Build and Ship Firefox. The slides are available here if you're interested. I believe the sessions were recorded, but they're not yet available online. I was happy with the attendance at the session, and the questions during and after the talk.

As part of the talk, I mentioned how Release Engineering is a very distributed team. Afterwards, many people had followup questions about how to work effectively with remote teams, which gave me a great opportunity to recommend John O'Duinn's new book, Distributed Teams.

Some other highlights from the conference:

  • CircuitPython: Python on hardware. I really enjoyed learning about CircuitPython, and the work that Adafruit is doing to make programming and electronics more accessible.

  • Using Python to find Russian Twitter troll tweets aimed at Canada. A really interesting dive into 3 million tweets that FiveThirtyEight made available for analysis.

  • PEP 572: The Walrus Operator. My favourite quote from the talk: "Dictators are people too!" If you haven't followed Python governance, Guido stepped down as BDFL (Benevolent Dictator for Life) after the PEP was resolved. Dustin focused much of his talk on how we in the Python community, and more generally in tech, need to treat each other better.

  • Who's There? Building a home security system with Pi & Slack. A great example of how you can get started hacking on home automation with really simple tools.

  • Froilán Irzarry's Keynote talk on the second day was really impressive.

  • You Don't Need That! Design patterns in Python. My main takeaway from this was that you shouldn't try to write Python code as if it were Java or C++ :) Python has plenty of language features built in that make many classic design patterns unnecessary or trivial to implement.

  • Numpy to PyTorch. Really neat to learn about PyTorch, and leveraging the GPU to accelerate computation.

  • Flying Python - A reverse engineering dive into Python performance. Made me want to investigate Balrog performance, and also look at ways we can improve Python startup time. Some neat tips about examining disassembled Python bytecode.

  • Working with Useless Machines. Hilarious talk about (ab)using IoT devices.

  • Gathering Related Functionality: Patterns for Clean API Design. I really liked his approach for creating clean APIs for things like class constructors. He introduced a module called variants which lets you write variants of a function / class initializer to support varying types of parameters. For example, a common pattern is to have a function that takes either a string path to a file, or a file object. Instead of having one function that supports both types of arguments, variants allows you to make distinct functions for each type, but in a way that makes it easy to share underlying functionality and also not clutter your namespace. A hand-rolled sketch of that pattern appears below.
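
Here is only a hand-rolled sketch of the underlying pattern (distinct entry points per argument type sharing one implementation); it is not the variants package's actual API, just the idea:

    import json

    def load_report(fileobj):
        """Primary implementation: works on an open file object."""
        return json.load(fileobj)

    def load_report_from_path(path):
        """Variant: accepts a string path and delegates to the primary function."""
        with open(path, "r", encoding="utf-8") as f:
            return load_report(f)

    # Callers pick the explicit variant instead of calling one function that
    # type-checks its argument:
    #   report = load_report_from_path("results.json")
    #   report = load_report(io.StringIO('{"ok": true}'))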

November 20, 2018 07:20 PM

September 20, 2018

Chris AtLee (catlee)

So long Buildbot, and thanks for all the fish

Last week, without a lot of fanfare, we shut off the last of the Buildbot infrastructure here at Mozilla.

Our primary release branches have been switched over to taskcluster for some time now. We needed to keep buildbot running to support the old ESR52 branch. With the release of Firefox 60.2.0esr earlier this month, ESR52 is now officially end-of-life, and therefore so is buildbot here at Mozilla.

Looking back in time, the first commit to our buildbot-configs repository was made over 10 years ago, on April 27, 2008, by Ben Hearsum: "Basic Mozilla2 configs". Buildbot usage at Mozilla actually predates that by at least two years; Ben was working on some patches in 2006.

Earlier in my career here at Mozilla, I was doing a lot of work with Buildbot, and blogged quite a bit about our experiences with it.

Buildbot served us well, especially in the early days. There really were no other CI systems at the time that could operate at Mozilla's scale.

Unfortunately, as we kept increasing the scale of our CI and release infrastructure, even buildbot started showing some problems. The main architectural limitations of buildbot we encountered were:

  1. Long lived TCP sessions had to stay connected to specific server processes. If the network blipped, or you needed to restart a server, then any jobs running on workers were interrupted.

  2. Its monolithic design meant that small components of the project were hard to develop independently from each other.

  3. The database schema used to implement the job queue became a bottleneck once we started doing hundreds of thousands of jobs a day.

On top of that, our configuration for all the various branches and platforms had grown over the years to a complex set of inheritance rules, defaults, and overrides. Only a few brave souls outside of RelEng managed to effectively make changes to these configs.

Today, much much more of the CI and release configuration lives in tree. This has many benefits including:

  1. Changes are local to the branches they land on. They ride the trains naturally. No need for ugly looooooops.

  2. Developers can self-service most of their own requests. Adding new types of tests, or even changing the compiler are possible without any involvement from RelEng!

Buildbot is dead! Long live taskcluster!

September 20, 2018 01:24 PM

June 27, 2018

Armen Zambrano G. (@armenzg)

Workshop experience at Smashing Conf

This week I attended Toronto’s first Smashing Conf.

<figcaption>One of the many posters around the event</figcaption>

On Monday I attended one of the pre-conference workshops, Dan Mall's “Design workflow for a multi-device world”. Dan guided us through the process of defining a problem and brainstorming objectives and key results (aka OKRs), and made us work together to build some of what we decided to tackle. We divided the whole classroom into five or six teams with 5 to 7 members each. Each team had various skill sets (e.g. designers and coders).

Development time

In my team, the “bike shed” team, we decided to rewrite TTC's trip planning feature. We did not manage to finish the product; however, we did manage to build one of the three objectives and partially complete another. The team had two people who did compositions, two who could code, and one person helping us collaborate and coordinate.

This exercise included a few things that were new to me. For instance, I worked within a team context to create a product rather than building a feature by myself.

<figcaption>The button found in codepen(at the top) versus what I ended up with (at the bottom)</figcaption>

It was also a new experience for me to work closely with a designer. We chose to build a multi-option toggle feature to mix transit methods. The process started with him writing down on paper what he had in mind. I tried building a prototype from scratch to see if I understood what he wanted. I did not get it quite right the first time, so we decided to search CodePen for something similar. Once we found something we liked, I started iterating on the code while he prepared the icons for me to use. By the end of it we had something that worked, but we did not have enough time to complete it. This is the codepen I forked and this is the unfinished feature where I left it.

<figcaption>One of our objectives and key results</figcaption>

This exercise was a very humbling experience, as I felt the pressure to produce something for a designer (Scott from Motorola services) who was right there beside me while I was “the coding expert”. I quote “coding expert” as I barely have a year of frontend experience. I started from a forked pen that had roughly what Scott wanted; however, I knew I was going to face a very difficult time before long. The codepen had been written using a few non-standard languages (pugjs and SCSS) instead of standard HTML & CSS. Another difficulty I knew I would face was that I did not have experience turning an image into a toggle button; the forked pen only had text inside the buttons. I deferred that problem by dealing only with text labels at first, addressing other issues before integrating the icons, which would require some extra research.

It was also my first time working with another coder (Sheneille Patil) in a fast-paced environment. We needed to quickly figure out our own development culture. Creating a GitHub repository was not an option, as she was not comfortable with it, so we decided to turn to codepen.io and to build features that would not conflict with each other. The final plan was to collect our different pieces of code and merge them into a single pen. We did not have enough time to get to this.

I hope you found something interesting in this post; it's not my typical programming-related post. I'm very grateful to SmashingConf for having lined up such great speakers and very practical workshops, and to Mozilla for supporting my learning.

June 27, 2018 07:43 PM

June 06, 2018

Armen Zambrano G. (@armenzg)

AreWeFastYet UI refresh

For a long time Mozilla’s JS team and others have been using https://arewefastyet.com to track the JS engine performance against various benchmarks.

<figcaption>Screenshot of landing page</figcaption>

In the last little while, there's been work to move those benchmarks to another continuous integration system, and we now have the metrics in Mozilla's Perfherder. This rewrite will focus on using the newly generated data.

If you're curious about the details of the UI refresh, please visit this document. Feel free to add feedback. Stay tuned for an update next month.

June 06, 2018 07:44 PM

June 04, 2018

Armen Zambrano G. (@armenzg)

Some webdev knowledge gained

Earlier this year I had to split a Koa/SPA app into two separate apps. As part of that, I switched from webpack to Neutrino.

Through this work I learned a lot about full-stack development (frontend, backend, and deployments for both). I could write a blog post per item; however, listing them all here is better than never getting to write a post for any of them.

Note, I’m pointing to commits that I believe have enough information to understand what I learned.

Npm packages to the rescue:

Node/backend notes:

Neutrino has been a great ally to me and here’s some knowledge on how to use it:

Heroku was my tool for deployment and here you have some specific notes:

June 04, 2018 03:24 PM

May 30, 2018

Armen Zambrano G. (@armenzg)

Neutrino: Deploying to Netlify

Neutrino is my preferred tool to kickstart a React app and Netlify is my preferred SPA deployment service.

Netlify makes it very easy to deploy your static sites, however, it needs some initial configuration.

You won't find Neutrino listed as one of the tools in their docs, so I'm adding some docs here. We'll see if my instructions are right, and maybe I'll ask them to include them in their docs.

When you create a new site you will connect your repository and you will be asked to fill in the following:

NOTE: I prefer yarn over npm.

In a few minutes your site will be up and running. You won't need to do anything else.

May 30, 2018 02:44 PM

May 28, 2018

Armen Zambrano G. (@armenzg)

Splitting the Firefox Health dashboard

Back in January I had to make a critical decision: whether to separate the Firefox health dashboard (formerly known as Platform health) into backend and frontend projects, or to keep it all together.

The intent was to make the project easier to maintain by reducing the complexity of mixing presentational code with processing code. I also wanted to remove the boilerplate needed for webpack and babel. It was also beneficial to have the liberty of changing packages without worrying about regressing the frontend or the backend. The only disadvantages were having to do the work, and that we might need coordinated changes (or versioned APIs) in the future. We did not see code duplication as a disadvantage since there wasn't any (or much, I can't recall now) code shared between the two apps.

<figcaption>Tracking Firefox’s tab close performance</figcaption>

This all came from hitting a very odd production-specific issue. I thought it was all caused by the complex webpack configuration the project had. Because we were not making progress determining the root cause, I decided to switch to Neutrino. Switching to Neutrino made everything easier; however, it was unclear how to make it work with the original project's design, which had the frontend files served as static assets of the Koa app. Switching to Neutrino took away the webpack headaches since it provides good default configuration for the project.

Keeping both frontend and backend apps within the same repository complicated the deployment story because of some Heroku restrictions. I tried using subtrees; however, that still required manual intervention (see explanation). I didn't know at the time that we could have deployed the backend to Heroku while deploying the frontend to Netlify, which would have allowed us to keep both projects within the same repository. Alas! We now have two repositories.

If you want to look at the code changes you can see them here.

May 28, 2018 02:33 PM

Additions to Firefox’s health dashboard

At the beginning of the month I came back from my last few weeks of parental leave (thanks Mozilla!). While I was away Sarah Clements took over some Firefox Quantum release criteria work and I’m pleased to see that she managed to tackle everything well by herself.

One of the major changes she made was to separate the Quantum criteria page into 32-bit and 64-bit views. This simplifies the graphs and allows release stakeholders to see more clearly how one specific architecture is doing.

<figcaption>Quantum release criteria focused on Windows 64-bit</figcaption>

She also added the new release criteria for Firefox’s GeckoView efforts.

<figcaption>Android’s release criteria</figcaption>

To learn more you can visit https://health.graphics to see the changes.
If you would like to contribute visit https://github.com/mozilla/firefox-health-dashboard.

May 28, 2018 01:01 PM

April 25, 2018

Chris AtLee (catlee)

Firefox release speed wins

Sylvestre wrote about how we were able to ship new releases for Nightly, Beta, Release and ESR versions of Firefox for Desktop and Android in less than a day in response to the pwn2own contest.

People commented on how much faster the Beta and Release releases were compared to the ESR release, so I wanted to dive into the releases on the different branches to understand if this really was the case, and if so, why?

Chemspill timings

                    | Firefox ESR 52.7.2 | Firefox 59.0.1  | Firefox 60.0b4
 ------------------ | ------------------ | --------------- | --------------
 Fix landed in HG   | 23:33:06           | 23:31:28        | 23:29:54
 en-US builds ready | 03:19:03 +3h45m    | 01:16:41 +1h45m | 01:16:47 +1h46m
 Updates ready      | 08:43:03 +5h42m    | 04:21:17 +3h04m | 04:41:02 +3h25m
 Total              | 9h09m              | 4h49m           | 5h11m

(All times UTC from 2018-03-15 -> 2018-03-16)
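
Since the fix landed just before midnight UTC, the deltas in the table wrap around to the next day; a small sketch of that arithmetic:

    from datetime import datetime, timedelta

    def elapsed(start, end):
        """Elapsed time between two HH:MM:SS stamps, where end may fall on the next day."""
        fmt = "%H:%M:%S"
        s, e = datetime.strptime(start, fmt), datetime.strptime(end, fmt)
        if e < s:
            e += timedelta(days=1)
        return e - s

    # Firefox ESR 52.7.2 column:
    print(elapsed("23:33:06", "03:19:03"))  # 3:45:57 -> en-US builds ready (~3h45m)
    print(elapsed("23:33:06", "08:43:03"))  # 9:09:57 -> updates ready, total ~9h09m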

Summary

We can see that Firefox 59 and 60.0b4 were significantly faster to run than ESR 52 was! What's behind this speedup?

Release Engineering has been busy migrating release automation from buildbot to taskcluster. Much of ESR52 still runs on buildbot, while Firefox 59 is mostly done in Taskcluster, and Firefox 60 is entirely done in Taskcluster.

In ESR52 the initial builds are still done in buildbot, which has been missing out on many performance gains from the build system and AWS side. Update testing is done via buildbot on slower mac minis or windows hardware.

The Firefox 59 release had much faster builds, and update verification is done in Taskcluster on fast linux machines instead of the old mac minis or windows hardware.

The Firefox 60.0b4 release also had much faster builds, and ended up running in about the same time as Firefox 59. It turns out that we hit several intermittent infrastructure failures in 60.0b4 that caused this release to be slower than it could have been. Also, because we had multiple releases running simultaneously, we did see some resource contention for tasks like signing.

For comparison, here's what 60.0b11 looks like:

                    | Firefox 60.0b11
 ------------------ | --------------- 
 Fix landed in HG   | 18:45:45
 en-US builds ready | 20:41:53 +1h56m
 Updates ready      | 22:19:30 +1h37m
 Total              | 3h33m

Wow, down to 3.5 hours!

In addition to the faster builds and faster update tests, we're seeing a lot of wins from increased parallelization that we can do now using taskcluster's much more flexible scheduling engine. There's still more we can do to speed up certain types of tasks, fix up intermittent failures, and increase parallelization. I'm curious just how fast this pipeline can be :)

April 25, 2018 05:20 PM

April 20, 2018

Chris AtLee (catlee)

Taskcluster migration update: we're finished!

We're done!

Over the past few weeks we've hit a few major milestones in our project to migrate all of Firefox's CI and release automation to taskcluster.

Firefox 60 and higher are now 100% on taskcluster!

Tests

At the end of March, our Release Operations and Project Integrity teams finished migrating Windows tests onto new hardware machines, all running taskcluster. That work was later uplifted to beta so that CI automation on beta would also be completely done using taskcluster.

This marked the last usage of buildbot for Firefox CI.

Periodic updates of blocklist and pinning data

Last week we switched off the buildbot versions of the periodic update jobs. These jobs keep the in-tree versions of blocklist, HSTS and HPKP lists up to date.

These were the last buildbot jobs running on trunk branches.

Partner repacks

And to wrap things up, yesterday the final patches landed to migrate partner repacks to taskcluster. Firefox 60.0b14 was built yesterday and shipped today 100% using taskcluster.

A massive amount of work went into migrating partner repacks from buildbot to taskcluster, and I'm really proud of the whole team for pulling this off.

So, starting today, Firefox 60 and higher will be built and shipped completely on taskcluster, with no reliance on buildbot.

It feels really good to write that :)

We've been working on migrating Firefox to taskcluster for over three years! Code archaeology is hard, but I think the first Firefox jobs to start running in Taskcluster were the Linux64 builds, done by Morgan in bug 1155749.

Into the glorious future

It's great to have migrated everything off of buildbot and onto taskcluster, and we have endless ideas for how to improve things now that we're there. First we need to spend some time cleaning up after ourselves and paying down some technical debt we've accumulated. It's a good time to start ripping out buildbot code from the tree as well.

We've got other plans to make release automation easier for other people to work with, including doing staging releases on try(!!), making the nightly release process more similar to the beta/release process, and for exposing different parts of the release process to release management so that releng doesn't have to be directly involved with the day-to-day release mechanics.

April 20, 2018 04:50 PM

Chris Cooper (coop)

New to me: the Taskcluster team

All entities move and nothing remains still.

At this time last year, I had just moved on from Release Engineering to start managing the Sheriffs and the Developer Workflow teams. Shortly after the release of Firefox Quantum, I also inherited the Taskcluster team. The next few months were *ridiculously* busy as I tried to juggle the management responsibilities of three largely disparate groups.

By mid-January, it became clear that I could not, in fact, do it all. The Taskcluster group had the biggest ongoing need for management support, so that’s where I chose to land. This sanity-preserving move also gave a colleague, Kim Moir, the chance to step into management of the Developer Workflow team.

Meet the Team

Let me start by introducing the Taskcluster team. We are:

We are an eclectic mix of curlers, snooker players, pinball enthusiasts, and much else besides. We also write and run continuous integration (CI) software at scale.

What are we doing?

Socrates gets booked
The part I understand is excellent, and so too is, I dare say, the part I do not understand…

One of the reasons why I love the Taskcluster team so much is that they have a real penchant for documentation. That includes their design and post-mortem processes. Previously, I had only managed others who were using Taskcluster…consumers of their services. The Taskcluster documentation made it really easy for me to plug-in quickly and help provide direction.

If you’re curious about what Taskcluster is at a foundational level, you should start with the tutorial.

The Taskcluster team currently has three, big efforts in progress.

1. Redeployability

Many Taskcluster team members initially joined the team with the dream of building a true, open source CI solution. Dustin has a great post explaining the impetus behind redeployability. Here’s the intro:

Taskcluster has always been open source: all of our code is on Github, and we get lots of contributions to the various repositories. Some of our libraries and other packages have seen some use outside of a Taskcluster context, too.

But today, Taskcluster is not a project that could practically be used outside of its single incarnation at Mozilla. For example, we hard-code the name taskcluster.net in a number of places, and we include our config in the source-code repositories. There’s no legal or contractual reason someone else could not run their own Taskcluster, but it would be difficult and almost certainly break next time we made a change.

The Mozilla incarnation is open to use by any Mozilla project, although our focus is obviously Firefox and Firefox-related products like Fennec. This was a practical decision: our priority is to migrate Firefox to Taskcluster, and that is an enormous project. Maintaining an abstract ability to deploy additional instances while working on this project was just too much work for a small team.

The good news is, the focus is now shifting. The migration from Buildbot to Taskcluster is nearly complete, and the remaining pieces are related to hardware deployment, largely by other teams. We are returning to work on something we’ve wanted to do for a long time: support redeployability.

We’re a little further down that path than when he first wrote about it in January, but you can read more about our efforts to make Taskcluster more widely deployable in Dustin’s blog.

2. Support for packet.net

packet.net provides some interesting services, like baremetal servers and access to ARM hardware, that other cloud providers are only starting to offer. Experiments with our existing emulator tests on the baremetal servers have shown incredible speed-ups in some cases. The promise of ARM hardware is particularly appealing for future mobile testing efforts.

Over the next few months, we plan to add support for packet.net to the Mozilla instance of Taskcluster. This lines up well with the efforts around redeployability, i.e. we need to be able to support different and/or multiple cloud providers anyway.

3. Keeping the lights on (KTLO)

While not particularly glamorous, maintenance is a fact of life for software engineers supporting code that is running in production. That said, we should actively work to minimize the amount of maintenance work we need to do.

One of the first things I did when I took over the Taskcluster team full-time was halt *all* new and ongoing work to focus on stability for the entire month of February. This was precipitated by a series of prolonged outages in January. We didn’t have an established error budget at the time, but if we had, we would have completely blown through it.

Our focus on stability had many payoffs, including more robust deployment stories for many of our services, and a new IRC channel (#taskcluster-bots) full of deployment notices and monitoring alerts. We needed to put in this stability work to buy ourselves the time to work on redeployability.

What are we *not* doing?

With all the current work on redeployability, it’s tempting to look ahead to when we can incorporate some of these improvements into the current Firefox CI setup. While we do plan to redeploy Firefox CI at some point this year to take advantage of these systemic improvements, it is not our focus…yet.


One of the other things I love about the Taskcluster team is that they are really good at supporting community contribution. If you’re interested in learning more about Taskcluster or even getting your feet wet with some bugs, please drop by the #taskcluster channel on IRC and say Hi!

April 20, 2018 01:03 PM

March 03, 2018

Chris AtLee (catlee)

Taskcluster migration update, the sequel

Firefox, now 100% buildbot-free!

First, the good news - Developer Edition 60.0b1 will be the first release in nearly 10 years done without using buildbot. This is an amazing milestone, and I'm incredibly proud of everybody who has contributed to make this possible!

Long time, no update

How did we get here? It's been, uh, almost 6 months since I last posted an update about our migration to Taskcluster.

In my last update, I described our plans for the end of 2017...

We're on track to ship builds produced in Taskcluster as part of the
56.0 release scheduled for late September. After that the only Firefox
builds being produced by buildbot will be for ESR52.

Meanwhile, we've started tackling the remaining parts of release
automation. We prioritized getting nightly and CI builds migrated to
Taskcluster, however, there are still parts of the release process
still implemented in Buildbot.

We're aiming to have release automation completely migrated
off of buildbot by the end of the year. We've already seen many
benefits from migrating CI to Taskcluster, and migrating the release
process will realize many of those same benefits.

How'd we do?

We're past the end of 2017, so how are we doing?

Well, we successfully shipped 56.0 with builds produced in Taskcluster. Our big Firefox Quantum release (57.0), was also shipped with builds produced by Taskcluster.

(side note: 57 had the most complex update scenarios we've ever had to support for Firefox...a subject for another post!)

Release scheduling

Post-56.0, our release process was using Taskcluster exclusively for producing the initial builds, and all the release process scheduling. We were still using Buildbot for many of the post-build tasks, like l10n repacks, publishing updates, pushing files to S3, etc. Once again we relied on the buildbot bridge to allow us to integrate existing buildbot components with the newer taskcluster pipeline. I learned from Kim Moir that this is a great example of the strangler pattern.

In the fall of 2017, we decided to begin migrating all of the scheduling logic for release automation into taskcluster using the in-tree taskgraph scheduling system. We did this for a few reasons...

  1. Having the release scheduling logic ride the trains is much more maintainable. Previous to this we had an externally defined release pipeline in our releasetasks repo. It was hard to keep this repository in sync with changes required for beta/release and ESR branches.

  2. More importantly, having the release scheduling logic in-tree meant that we could then rely on chain-of-trust to verify artifacts produced by the release pipeline.

  3. We felt that having the complete release pipeline defined in taskcluster would make it easier for us to tackle the remaining buildbot bridge tasks in parallel.

We hit this milestone in the 58 cycle. Starting with 58.0b3, Firefox and Fennec releases were completely scheduled using the in-tree taskgraph generation. We also migrated over the l10n repacks at the same time, removing a longstanding source of problems where repacks would fail when we first got to beta due to environmental differences between taskcluster and buildbot.

No-BBB Releases

Still, as of 58, much of release automation still ran on buildbot, even if Taskcluster was doing all the scheduling.

Since December, we've been working on removing these last few pieces of buildbot from the release process. Progress was initially a bit slow, given Austin and Christmas, but we've been hard at work in the new year.

That brings us to today.

We've moved uptake monitoring, update verify (and made it 2x faster too!), update submission, final verify, bouncer submission, version bumping and tagging, and balrog submission all to run in Taskcluster via various kinds of scriptworkers.

As I mentioned above, DevEdition 60.0b1 will be the first release in nearly 10 years done without using buildbot. The rest of the 60 release cycle will follow suit, and once 60 hits the release channel, only ESR52 will remain on buildbot!

March 03, 2018 12:26 PM

February 27, 2018

Armen Zambrano G. (@armenzg)

Introduction to Neutrino

I discovered Neutrino in the last year and it has become my preferred tool for bootstrapping any JS project.

<figcaption>Neutrino’s logo with permission from Eli</figcaption>

Here’s the definition of what Neutrino is from the project’s site:

[You can] create and build modern JavaScript applications with zero initial configuration.
Neutrino combines the power of webpack with the simplicity of presets.

For me, Neutrino's main advantage is that it removes the need to write webpack.config.babel.js configuration files and that starting a project is as simple as running a wizard.

To get started it is as simple as this:

npx @neutrinojs/create-project <directory-name>

That will start a wizard that will help you select the stack you want:

Once the wizard completes, you can change to that directory and start your project with npm start. That's it! You don't need any configuration changes; the minimum set of files for your project to start is now in place.

Neutrino is opinionated and has a bunch of good defaults that work for both production and development. You can always customize the configuration and/or create your own presets.

If you want an example of:

If you want to learn more about Neutrino, Eli Perelman (original author of the project) wrote about Neutrino at hacks.mozilla.org. You can find the official documentation at https://neutrino.js.org.

I hope you give it a try!

February 27, 2018 06:13 PM

February 16, 2018

Chris Cooper (coop)

Experiments in productivity: the shared bug queue

Maybe you have this problem too

You manage or are part of a team that is responsible for a certain functional area of code. Everyone on the team is at a different point in their career. Some people have only been there a few years, or maybe even only a few months, but they're hungry and eager to learn. Other team members have been around forever, and due to that longevity, they are go-to resources for the rest of your organization when someone needs help in that functional area. More-senior people get buried under a mountain of review requests, while those less-senior engineers who are eager to help and grow their reputation get table scraps.

This is the situation I walked into with the Developer Workflow team.

This was the first time that Mozilla had organized a majority (4) of build module peers in one group. There are still isolated build peers in other groups, but we'll get to that in a bit.

With apologies to Ted, he's the elder statesman of the group, having once been the build module owner himself before handing that responsibility off to Greg (gps), the current module owner. Ted has been around Mozilla for so long that he is a go-to resource not only for build system work but for many other projects he's been involved with, e.g. crash analysis. In his position as module owner, Greg bears the brunt of the current review workload for the build system. He needs to weigh in on architectural decisions, but also receives a substantial number of drive-by requests simply because he is the module owner.

Chris Manchester and Mike Shal by contrast are relatively new build peers and would frequently end up reviewing patches for each other, but not a lot else. How could we more equitably share the review load between the team without creating more work for those engineers who were already oversubscribed?

Enter the shared bug queue

When I first came up with this idea, I thought that surely it must have been tried at some point in the history of Mozilla. I was hoping to plug into an existing model in bugzilla, but alas, no such thing already existed. It took a few months of back-and-forth with our resident Bugmaster at Mozilla, Emma, to get something set up, but by early October, we had a shared queue in place.

How does it work?

ICP

We created a fictitious meta-user, core-build-config-reviews@mozilla.bugs. Now whenever someone submits a patch to the Core::Build Config module in bugzilla, the suggested reviewer always defaults to that shared user. Everyone on the team watches that user and pulls reviews from “their” queue.

That’s it. No, really.

Well, okay, there’s a little bit more process around it than that. One of the dangers of a shared queue is that since no specific person is being nagged for pending reviews, the queue could become a place where patches go to die. As with any defect tracking system, regular triage is critically important.

Is it working?

In short: yes, very much so.

Subjectively, it feels great. We’ve solved some tricky people problems with a pretty straightforward technical/process solution and that’s amazing. From talking to all the build peers, they feel a new collective sense of ownership of the build module and the code passing through it. The more-senior people feel they have more time to concentrate on higher level issues or deeper reviews. The less-senior people are building their reputations, both among the build peers and outside the group to review requesters.

Numerically speaking, the absolute number of review requests for the Core::Build Config module is consistent since the adoption of the shared queue. The distribution of actual reviewers has changed a lot though. Greg and Ted still end up reviewing their share of escalated requests — it’s still possible to assign reviews to specific people in this system — but Mike Shal and Chris have increased their review volume substantially. What’s even more awesome is that the build peers who are *NOT* in the Developer Workflow team are also fully onboard, regularly pulling reviews off the shared queue. Kudos to Nick Alexander, Nathan Froyd, Ralph Giles, and Mike Hommey for also embracing this new system wholeheartedly.

The need for regular triage has also provided another area of growth for the less-senior build peers. Mike Shal and Chris Manchester have done a great job of keeping that queue empty and forcing the team to triage any backlog each week in our team meeting.

Teh Future

When we were about to set this up in October, I almost pulled the plug.

Over the next six months, Mozilla is planning to switch code review tools from mozreview/splinter to phabricator. Phabricator has more modern built-in tools like Herald that would have made setting up this shared queue a little easier, and that's why I paused…briefly.

Phabricator will undoubtedly enable a host of quality-of-life improvements for developers when it is deployed, but I’m glad we didn’t wait for the new system. Mozilla engineers are already getting accustomed to the new workflow and we’re reaping the benefits *right now*.

February 16, 2018 08:42 PM

February 02, 2018

Chris Cooper (coop)

Welcome, Connor!

Connor McDavid
<figcaption>This is *not* our Connor.</figcaption>

This post is *ahem* several months overdue, but I’m happy to welcome Connor Sheehan to the team.

Connor was a two-time intern with the Mozilla release engineering team. In that capacity, he became well acquainted with some of the bottlenecks in our CI system. We’ve brought him onboard to assist gps with stabilizing and scaling our mercurial infrastructure.

Welcome, Connor!

February 02, 2018 02:46 PM

November 20, 2017

Chris Cooper (coop)

Work Week Logistics, Revisited

I’ve written before about how to be productive when distributed teams get together and was anxious to try it out on my “new” (read: six-month-old) team, Developer Workflow. As mentioned in that previous post, we just had a work week in Mountain View, so here’s a quick recap.

Process Improvements

We often optimize work week location around where the fewest people would need to travel to attend. While this does make things logistically easier, it also introduces imbalance. Some people will have traveled very far, while some people will be able to sleep in their own beds. Conversely, the local people may feel they need to go home every night in order to be with their partners/families/cats and may miss out on the informal bonding that can happen at group dinners and such.

We had originally intended to meet in San Francisco, but other conferences had jacked up hotel rates, so we decided to decamp to the Valley. I offered to have the SF residents book rooms to avoid the daily commute up and down the peninsula. They didn’t all take me up on it, but it was an opportunity to put everyone on more equal footing.

Schedule-wise, I set things up so that we had our discussion and planning pieces in the morning each day while we were still fresh and caffeinated. After lunch, we would get down to hacking on code. Ted threw together a tracking tool to help visualize the Makefile burndown. Ted is also great at facilitating meetings, keeping us on track especially later in the week as we all started to fade.

Accomplishments

So what did we actually get done? Like the old adage about a station wagon full of tapes, never underestimate the review bandwidth of 4 build peers hacking in a room together for an afternoon. We accomplished quite a bit during our time together.

Aside from the 2018 planning detailed in the previous post, we also met with mobile build peer Nick Alexander and planned how to handle the mobile Makefiles. The mobile version of Firefox now builds with gradle, so it was important not to step on each other's toes. Another huge proportion of the remaining Makefiles involves l10n. We figured out how to work around l10n for now, i.e. don't break repacks, to get a tup build working, and we've set up a meeting with the l10n team for Austin to discuss their plans for langpacks and a future that might not involve Makefiles at all. The l10n stuff is hairy, and might be partially my fault (see previous comment re: cargo-culting), so thanks to my team for not shying away from it.

On a concrete level, Ted reports that we’ve removed 13 Makefiles and ~100 lines of other Makefile content in the past month, much of which happened over the past few weeks. Greg has also managed to remove big pieces of complexity from client.mk, assisted by reviews from Chris, Mike, Nick and other build peers. We’re getting into the trickier bits now, but we’re persevering.

All in all, a very successful work week with my “new” team. I continue to find subtle ways to make these get-togethers more effective.

November 20, 2017 03:14 PM

November 16, 2017

Chris Cooper (coop)

Introducing The Developer Workflow Team

I’ve neglected to write about the *other* half of my team, not for any lack of desire to do so, but simply because the code sheriffing situation was taking up so much of my time. Now that the SoftVision contractors have gained the commit access required to be fully functional sheriffs, I feel that I can shift focus a bit.

Meet the team

The other half of my team consists of four Firefox build system peers:

Justice League Unlimited

When the group was first established, we talked a lot about what we wanted to work on, what we needed to work on, and what we should be working on. Those discussions revealed the following common themes:

Based on that list of themes, we’ve adopted the moniker of “Developer Workflow.” We are all build peers, yes, but to pigeon-hole ourselves as the build system group seemed short-sighted. Our unique position at the intersection of the build system, VCS, and other services meant that our scope needed to match what people expect of us anyway.

While new to me, Developer Workflow is a logical continuation of the build system tiger team organized by David Burns in 2016. This is the same effort that yielded sea-change improvements such as artifact builds and sccache.

In many ways, I feel extremely fortunate to be following on the heels of that work. During the previous year, all the members of my team formed the working relationships they would need to be more successful going forward. All the hard work for me as their manager was already done! ;)

What are we doing

We had our first, dedicated work week as a team last week in Mountain View. Aside from getting to know each other a little better, during the week we hashed out exactly what our team will be focused on next year, and made substantial progress towards bootstrapping those efforts.

Next year, we’ll be tackling the following projects:

What are we *not* doing

It’s important to be explicit about things we won’t be tackling too, especially when it’s been unclear historically or where there might be different expectations.

The biggest one to call out here is github integration. Many teams at Mozilla are using github for developing standalone projects or even parts of Firefox. While we’ve had some historical involvement here and will continue to consult as necessary, other teams are better positioned to drive this work.

We are also not currently exploring moving Windows builds to WSL. This is something we experimented with in Q3 this year, but build performance is still so slow that it doesn’t warrant further action right now. We continue to follow the development of WSL and if Microsoft is able to fix filesystem performance, we may pick this back up.

November 16, 2017 02:12 PM

November 10, 2017

Armen Zambrano G. (@armenzg)

Firefox code coverage diff viewer in beta

<figcaption>Firefox’s code coverage diff viewer</figcaption>

A few months ago I decided to become a frontend developer, and this is my first product: the Firefox code coverage diff viewer.

The Firefox code coverage diff viewer lets you determine the coverage status of the lines added in each changeset.

The main purpose of this tool is to determine if Release Management can use code coverage data to help them make risk analysis about individual changesets.

marco has been working on collecting the code coverage data from Mozilla's continuous integration and developed the backend this app uses.

The main view shows you changesets from the last ten pushes on mozilla-central that: 1) have coverage data and 2) are not merges or backouts.
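
To make that filtering rule concrete, here is a minimal sketch in Python. The field names are hypothetical and are not the app's real data model; it only illustrates the two conditions described above.

    # Hypothetical sketch of the filtering rule described above: keep changesets
    # from the last ten mozilla-central pushes that have coverage data and are
    # neither merges nor backouts. Field names are made up for illustration.
    def viewable_changesets(pushes, coverage_index):
        for push in pushes[-10:]:                      # last ten pushes
            for changeset in push["changesets"]:
                description = changeset["desc"].lower()
                if description.startswith("merge") or "backed out" in description:
                    continue                           # skip merges and backouts
                if changeset["node"] not in coverage_index:
                    continue                           # skip changesets without coverage data
                yield changeset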

<figcaption>The code coverage diff viewer only shows coverage status for added lines</figcaption>

From there you can navigate to individual changesets. The diff viewer will only highlight added lines as having coverage or no coverage. Some added lines will not be highlighted at all because they are non-code lines.

Brief technical details

The app got promoted to beta this week and development on it will stop for this quarter. If this tool becomes clearly essential for Release Management, we can reinvest in it.

It’s been a great experience working on this product with marco, ekyle and jmaher. Thank you all for your input.

November 10, 2017 07:30 PM

November 03, 2017

Armen Zambrano G. (@armenzg)

Thank you, Mozilla, for caring for me

Background story: I’ve been working with Mozilla full-time since 2009 (contributor in 2007 — intern in 2008). I’ve been working with the release engineering team, the automation team (A-team) and now within the Product Integrity organization. In all these years I’ve been blessed with great managers, smart and helpful co-workers, and enthusiastic support to explore career opportunities. It is an environment that has helped me flourish as a software engineer.

I will go straight to some of the benefits that I’ve enjoyed this year.

Parental leave

Three months at 100% of my salary. I did not earn bonus payouts during that time; however, it was worth it for the time I spent with my firstborn. We bonded very much during that time, I learned how to take care of my family while my wife worked, and I can proudly say that he's a "daddy's boy" :) (Not that I spoil him!).

Working from home 100% of the time

My favourite benefit. Period.

It really helps me as an employee, as I don’t enjoy commuting and I tend to talk a lot when I’m in the office. My family is very respectful of my work hours and I’m able to have deep-thought sessions in the comfort of my own home.

This is not a benefit that a lot of companies give, especially the bigger ones which expect you to relocate and come often to the office. I chuckle when I hear a company offer that their employees can work from home only a couple of days per week.

Wellness benefits

I appreciate that Mozilla allocates some of its budget to pay for anything related to employee wellness (mental, spiritual & physical). Knowing that if I don't use it I will lose it causes me to think about ways to apply the money to help me stay in shape.

Learning support/budget

This year, after a re-org and many years of doing the same work, I found myself in need of a new adventure — I get bored if I don’t feel as though I’m learning. With my manager’s support (thanks jmaher!), I embarked on a journey to become a front-end developer. Mozilla also supported me by paying for me to complete a React Nanodegree as part of the company’s learning budget.

To my great surprise, React has become rather popular inside Mozilla, and there is great need for front-end work within my org. It was also a nice surprise to see that switching to JavaScript from Python was not as difficult as I thought it would be.

Thank you, Mozilla, for your continued support!

November 03, 2017 08:15 PM

October 25, 2017

Chris Cooper (coop)

Code sheriffing @ Mozilla: Past, Present, and Future

In a github world, developers have certain baseline expectations about interacting with source code and the tooling around it. These expectations can color their choices about which projects to contribute to. If Mozilla wants to compete with other companies and open source projects for developer mindshare (and code), we need to evolve the way we develop and distribute software. Code sheriffing and its associated tooling is one piece of that puzzle.

I inherited the Mozilla code sheriff team back in April. I didn’t initially think anything needed to change with sheriffing at Mozilla. Things had been “fine” for a while, so why rock the boat?

By nature, I dug into the history of my new team when I inherited them. What follows is a brief retrospective of sheriffing at Mozilla, the changes we’re undergoing right now, and my vision for how it might change in the future.

Past

Back to the Future III

I’ve been at Mozilla long enough now to remember when developers themselves acted as code sheriffs. In the beginning, every developer at Mozilla (myself included) rotated through the position1. Some developers were quite conscientious about sheriffing, others never even realized it was their turn. There was no formal training. Not surprisingly, the results were…uneven.

As the number of developers and the volume of code increased, this model became untenable. Code sheriffing as a well-defined role didn’t exist at Mozilla until 2012, initially coming as a response to the staffing increase in the lead-up to Firefox 4. At the same time, Mozilla was moving away from a “strict” waterfall development model tied to Tinderbox. Our new buildbot-based approach to CI allowed us to land more code, more quickly. Dedicated sheriffs were needed to make sense of it all. Even then, in true Mozilla fashion, sheriffing was an activity that blurred the lines between community and staff. Some of the most dedicated code sheriffs we have ever had were/are volunteers.

Whether staff or community, code sheriffs became de facto stewards of code quality. They were responsible for daily merges, selecting changesets with the lowest number of intermittent failures that would be suitable for inclusion in Nightly releases. When things broke, the sheriffs were responsible for backing out code, and even closing the development trees if the situation became sufficiently dire.

With the opening of the Mozilla office in Taipei, and the associated re-tasking of two QA resources as code sheriffs in that office, Mozilla almost had around-the-clock (24/7) coverage for code sheriffing, provided no one ever got sick or took a vacation.

We persevered in this model for a few years, and our developers understandably became accustomed to the freedom it provided them. Developers could functionally land their code and not worry about the outcome: the code sheriffs would ping them if any follow-up action was required. Fire-and-forget, if you will.

Sadly, in June 2017 our last Taipei sheriff resigned, leaving us with a glaring hole in our coverage. Even with community assistance, there were 8-10 hours per day with *no* active sheriffing. This led to an increase in tree closing events as sheriffs often needed to determine the root cause for a failure that had many commits on top of it already. Complaints started coming in about delays in landing code, and also about classification errors, e.g. permanent failures wrongly triaged as intermittent due to the time pressures of working in this mode. People were not happy, least of all the sheriffs.

This is when I realized I needed to rethink how sheriffing at Mozilla should work.

Present

Back to the Future

The knee-jerk reaction would have been to simply hire another sheriff in Taipei, but that still would have left us vulnerable to illness, vacation, and further employment changes. Luckily, another solution presented itself.

Mozilla has an established history of working with SoftVision. I enlisted their help myself a few years ago when I was working in releng to help address our buildduty problem. It came to my attention that SoftVision was creating a 24/7 support service, and I decided to give it a try. That’s where we are now.

The SoftVision sheriffing contractors started in late August. They have spent the last two months learning (and then practicing) how to classify automation failures. The harder piece is learning how to properly select mergeable changesets and perform backouts. Mozilla guards the kind of source control access required to perform these code sheriffing activities pretty closely; it’s not something we simply give away. The contractors are slowly building that trust the same as any other contributor would. We’re getting there though:

Once the SoftVision sheriffs are fully up-to-speed, they will be available 24/7 to assist developers, and to further the Mozilla mission with the usual array of merges, backouts, uplifts, and tree closures.

Right now, we are relying on the magnanimity of the former sheriffs and community sheriffs to help bridge the gap while the contractors are training up. It's true, sheriff throughput is still not back to the level it was at before we lost our sheriff in Taipei, but I can see the light at the end of the tunnel.

Future

Back to the Future II

How can I be sure that light isn’t a train? Well, that’s the trick, isn’t it?

In retrospect, it was naïve of me to think that sheriffing could have existed for any length of time the way it was. Sheriffs felt enormous pressure to work longer hours than they should have because the trees needed to stay open, and “if not them, then who?” The human toll on those performing the work, whether staff or volunteer, was simply too high.

Yes, for the near-future at least, the SoftVision contractors will continue to perform merges and backouts as required in the model to which we’ve become accustomed. That work is still very operational, hands-on, and prone to burnout, and that’s where I think the biggest opportunity for change will come going forward.

Mozilla currently has two integration branches – mozilla-inbound and autoland – in addition to mozilla-central. This makes life much harder for sheriffs because they need to merge code three ways between the different branches. When bad code gets merged around accidentally, we are almost forced to close the trees while we recover.

The obvious change is to simplify the process and remove one of the integration branches. This might actually be feasible in the near future. With the announcement of Mozilla’s adoption of phabricator, 99.9% of code should eventually be able to land directly in the autoland repo, allowing us to decommission the mozilla-inbound repo. Once we return to a single integration branch, developer workflows can be much more streamlined, and streamlined workflows are ideal targets for automation.

My ideal future developer workflow would be:

There are no code sheriffs in that picture at all. That’s a good thing.2

There’s a gulf of tooling improvements between where we are and that potential future, but if Mozilla wants to keep increasing the pace of development and attracting the best developers, I think the tooling investment is one we need to make.


1. Hilariously, a version of that sheriffing calendar still exists, projecting sheriff duty off into the future for a bunch of developers who haven’t even been at Mozilla for years.
2. I’m not naïve enough to think we won’t need *any* sheriffs. Even Facebook’s model still needs some.

October 25, 2017 03:04 PM

August 30, 2017

Chris AtLee (catlee)

Taskcluster migration update

All your nightlies are belong to Taskcluster

In January I announced that we had just migrated Linux nightly builds to Taskcluster.

We completed a huge milestone in July: starting in Firefox 56, we've been doing all our nightly Firefox builds in Taskcluster.

https://media.giphy.com/media/MOWPkhRAUbR7i/giphy.gif

This includes all Windows, macOS, Linux, and Android builds. You can see all the builds and repacks on Treeherder.

In August, after 56 merged to Beta, we've also been doing our Firefox Beta builds using Taskcluster. We're on track to ship Firefox 56, built in Taskcluster, to release users at the end of September.

Windows and macOS each had their own challenges to get them ready to build and ship to our nightly users.

Windows signing

We've had Windows builds running in Taskcluster for quite a while now. The biggest missing piece stopping us from shipping these builds was signing. Windows builds end up being a bit complicated to sign.

First, each compiled .exe and .dll binary needs to be signed. Signing binaries on Windows changes their contents, and so we need to regenerate some files that depend on the exact contents of binaries. Next, we need to create packages in various formats: a "setup.exe" for installing Firefox, and also MAR files for updates. Each of these package formats in turn needs to be signed.

In buildbot, this process was monolithic. All of the binary generation and signing happened as part of the same build process. The same process would also publish symbols to the symbol server and publish updates to Balrog. The downside of this monolithic process is that it adds additional dependencies to the build, which is already a really long process. If something goes wrong with signing, or publishing updates, you don't want to have to restart a 2 hour build!

As part of our migration to Taskcluster, we decided that builds should minimize their external dependencies. This means that the build task produces only unsigned binaries, and it is the responsibility of downstream tasks to sign them. We also wanted discrete tasks for symbol and update submission.

One wrinkle in this approach is that the logic that defines how to create a setup.exe package or a MAR file lives in tree. We didn't want to run that code in the same context as the code that generates signatures.

Our solution to this was to create a sequence of build -> signing -> repackage -> signing tasks. The signing tasks run in a restricted environment while the build and repackage tasks have access to the build system in order to produce the required artifacts. Using the chain of trust, we can demonstrate that the artifacts weren't tampered with between intermediate tasks.
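
To make the shape of that pipeline concrete, here is a minimal sketch in Python. The task labels and dependency layout are illustrative only; the real definitions live in the in-tree taskgraph kinds.

    # Illustrative sketch of the Windows packaging pipeline described above.
    # Task labels are hypothetical; the real definitions live in the in-tree
    # taskgraph kinds.
    PIPELINE = {
        "build-win64": {"depends_on": []},                        # produces unsigned .exe/.dll
        "signing-win64": {"depends_on": ["build-win64"]},         # signs the raw binaries
        "repackage-win64": {"depends_on": ["signing-win64"]},     # builds setup.exe and MAR files
        "repackage-signing-win64": {"depends_on": ["repackage-win64"]},  # signs the packages
    }

    def schedule_order(pipeline):
        """Return task labels in an order that respects the dependency chain."""
        ordered, done = [], set()
        while len(done) < len(pipeline):
            for label, task in pipeline.items():
                if label not in done and all(dep in done for dep in task["depends_on"]):
                    ordered.append(label)
                    done.add(label)
        return ordered

    if __name__ == "__main__":
        print(" -> ".join(schedule_order(PIPELINE)))
        # build-win64 -> signing-win64 -> repackage-win64 -> repackage-signing-win64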

Finally, we need to consider l10n repacks. We ship Firefox in over 90 locales. The repacking process downloads the en-US build and replaces the English strings with localized strings. Each of these repacks needs to be based on the signed en-US build. Each will also generate its own setup.exe and complete MAR for updates.

macOS performance (and why your build directory matters)

Like Windows, we've had macOS builds running on Taskcluster for a long time. Also like Windows, we had to solve signing for macOS.

However, the biggest blocker for the macOS build migration was a performance bug. Builds produced on Taskcluster showed some serious performance regressions as compared to the builds produced on buildbot.

Many very smart people looked at this bug since it was first discovered in February. They compared library versions being used. They compared compiler versions and compiler flags. They even inspected the generated assembly code from both systems.

Mike Shal stumbled across the first clue to what was going on in June: if he stripped the Taskcluster binaries, then the performance problems disappeared! At this point we decided that we could go ahead and ship these builds to nightly users, knowing that the performance regression would disappear on beta and release.

Later on, Mike realized that it's not the presence or absence of symbols in the binary that causes the performance hit, it's which directory the builds are done in. On buildbot we build under /builds/..., and on Taskcluster we build under /home/...

https://media.giphy.com/media/zjQrmdlR9ZCM/giphy.gif

Read the bug for more gory details. This is definitely one of the strangest bugs I've seen.

Lessons learned

We learned quite a bit in the process of migrating Windows and macOS nightly builds to Taskcluster.

First, we gained a huge amount of experience with the in-tree scheduling system. There's a bit of a learning curve to climb, but it's an extremely powerful and flexible system. Many kudos to Dustin for his work creating the foundation of this system here. His blog post, "What's So Special About "In-Tree"?", is a great explanation of why having this code as part of Firefox's repository is so important.

One of the killer features of having all the scheduling logic live in-tree is that you can do quite a bit of work locally, without requiring any build infrastructure. This is extremely useful when working on the complex build / signing / repackage sequence of tasks described above. You can make your changes, generate a new task graph, and inspect the results.

Once you're happy with your local changes, you can push them to try to validate your local testing, get your patch reviewed, and then finally landed in gecko. Your scheduling changes will take effect as soon as they land into the repo. This made it possible for us to do a lot of testing on another project branch, and then merge the code to central once we were ready.

What's next?

We're on track to ship builds produced in Taskcluster as part of the 56.0 release scheduled for late September. After that the only Firefox builds being produced by buildbot will be for ESR52.

Meanwhile, we've started tackling the remaining parts of release automation. We prioritized getting nightly and CI builds migrated to Taskcluster; however, there are still parts of the release process implemented in Buildbot.

We're aiming to have release automation completely migrated off of buildbot by the end of the year. We've already seen many benefits from migrating CI to Taskcluster, and migrating the release process will realize many of those same benefits.

Thanks!

Thank you for reading this far!

Members from the Release Engineering, Release Operations, Taskcluster, Build, and Product Integrity teams were all involved in finishing up this migration. Thanks to everyone involved (there are a lot of you!) for getting us across the finish line here.

In particular, if you come across one of these fine individuals at the office, or maybe on IRC, I'm sure they would appreciate a quick "thank you":

  • Aki Sasaki
  • Dustin Mitchell
  • Greg Arndt
  • Joel Maher
  • Johan Lorenzo
  • Justin Wood
  • Kim Moir
  • Mihai Tabara
  • Mike Shal
  • Nick Thomas
  • Rail Aliiev
  • Rob Thijssen
  • Simon Fraser
  • Wander Costa

August 30, 2017 09:40 AM

June 13, 2017

Part 4: How Mozilla publishes APKs onto Google Play Store, in a reasonably secure and automated way

The Release Engineering team fully automated the publication of Firefox for Android in version 53.0. Let's see what was already there and how things have changed since version 53.0.

This blog post is part of a series. Check out the other posts:

  1. How did the project start?
  2. Presentation of the solution
  3. 5 things I would have loved knowing about Google Play
  4. What’s next? Want to contribute? & Special thanks [Here]

What’s next? Want to contribute?

The initial publication was the first big step of the project. The next big one will be to expand the release workflow to make it simpler to control what percentage of the user base gets the latest update.

If you want to contribute, there are also a few open bugs. You might also want to take a look at MozApkPublisher and pushapkscript to see how we added our checks.

Finally, if you have some stories on how you use Google Play you would like to share, questions or comments about our tools, please leave a comment.

Special thanks

A lot of people were involved in one way or another in this project. In alphabetical order:

Comments

You can read and leave comments on this Github issue.

June 13, 2017 08:56 AM

June 12, 2017

Part 3: How Mozilla publishes APKs onto Google Play Store, in a reasonably secure and automated way

The Release Engineering team fully automated the publication of Firefox for Android in version 53.0. Let's see what was already there and how things have changed since version 53.0.

This blog post is part of a series. Check out the other posts:

  1. How did the project start?
  2. Presentation of the solution
  3. 5 things I would have loved knowing about Google Play [Here]
  4. What’s next? Want to contribute? & Special thanks

5 things I would have loved knowing about Google Play

This part is more oriented toward personal takeaways and a couple of questions that remain unanswered.

It is easy to publish an APK that is not localized

A few checks done in pushapk_scriptworker exist because of previous errors. The number of locales is one of them. A few weeks after Fennec Aurora was shipped to Google Play, some users started to see their browser in English, even though their phones were set to a different locale and Aurora used to be available in their locale.

As said previously, APKs were originally uploaded via a script. This was also true for the first Aurora APKs. Furthermore, automation on Taskcluster was first tried on Aurora. When I started to roll these bits out, there was a configuration issue: pushapk_scriptworker picked the en-US-only APK instead of the multi-locale one. The fix was fairly simple: just change the APK locations.

Google Play has a lot of ways to detect wrong APKs: by signature, by package name (none of the Firefox versions share the same one), by version code, and some others. However, it doesn't warn about a big change in:

Of course, from one stable version to another, a lot of files may change in the archive, and asking Google to watch out for everything doesn't seem reasonable. However, when I started working with Google Play, it left me with the impression of being well-hardened. At the time, I thought Google Play did check the locales within an APK.

The consequence I take away: if your app has different build flavors (like single- vs. multi-locale), I recommend you write your own sanity checks first.

Locales on the Play Store are independent of locales in the APK

It might sound obvious after the explanation of the previous issue, but this error message confused several Mozillians:

Tried to set recent changes text for APK version 12345678 for language es-US. Language is not associated with the app.

We hit it with a couple of locales, at a pace of roughly one per month. As explained in the architecture part, locales are defined in an external service, stores_l10n. We had many theories about it:

The fix ended up being simple. “Recent changes” is something we want to update with every new APK, but because the descriptions were more set in stone, they were not part of the automated workflow. It turned out that stores_l10n had recently released new locales each time we hit the problem, and the error message was actually telling us that the descriptions for these new locales had never been uploaded. Once we figured this out, uploading them became part of the regular update workflow.

You cannot catch everything when you don’t commit transactions

The dry-run feature in MozApkPublisher, which simply doesn't commit the Google Play transaction, helps detect failures early: for instance, wrong version codes or wrong package names. Nevertheless, we have hit cases where dry runs went smoothly and we had to diagnose new issues at commit time.

User fractions can also be specified on other tracks (but that’s not a feature)

Fennec Release 53.0 is the first version which was entirely published via Taskcluster. Mozilla also uses the rollout track on Release only; Beta is pushed to the production track (and that is actually something we are reviewing). Sadly, there was another configuration error: even though the user fraction was specified, the configured track was the production one. Google Play didn't raise any error (even at commit time), starting a full-throttle release. At that point, I contacted Google Play support to ask if it was possible to switch back to rollout. The person was very courteous. He explained they were not able to perform that kind of action, which is why they transmitted my request to the tech team, who would follow up by email.

In parallel, we have fixed the configuration error and implemented our own check in MozApkPublisher.
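
That check is conceptually simple. Here is a hedged sketch of what such a guard might look like; MozApkPublisher's real argument names and error handling may differ.

    # Hypothetical sketch of the consistency check described above: refuse to
    # publish if a user fraction is supplied for any track other than "rollout".
    def check_rollout_configuration(track, rollout_percentage):
        if rollout_percentage is not None and track != "rollout":
            raise ValueError(
                "A rollout percentage ({}%) was given, but the target track is "
                '"{}". Only the "rollout" track supports partial rollouts.'.format(
                    rollout_percentage, track))
        if track == "rollout" and rollout_percentage is None:
            raise ValueError('The "rollout" track requires a rollout percentage.')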

There is no way to rollback to previous APKs, even if you ask a human

The previous configuration error could have remained a brief one if somebody hadn't reported what seemed like an important crash an hour later. At that point, the Release Management team wanted to stall updates in order to avoid spreading the regression too widely. We were still waiting on support's answer, but I reached out to them again since our request had changed. I told the new contact about the previous issue, the new one, and the fact that we were running against the clock. Sadly for us, the person only gave us the option of waiting on the email follow-up.

About 16 hours later, we got the email with the official answer:

Unfortunately, we cannot remove the APK in question, neither can we claw back the APK from the users that have already installed this update.

If you want to stop further users from installing this APK, then you need to make another release that deactivates this APK and add the APK that you want users to install instead.

With that in mind, we have been thinking of another way to use Google Play, where the rollout track would be more extensively used.

What’s next? Want to contribute? & Special thanks

See the next post.

Comments

You can read and leave comments on this Github issue.

June 12, 2017 09:17 AM

June 07, 2017

Part 2: How Mozilla publishes APKs onto Google Play Store, in a reasonably secure and automated way

The Release Engineering team fully automated the publication of Firefox for Android in version 53.0. Let's see what was already there and how things have changed since version 53.0.

This blog post is part of a series. Check out the other posts:

  1. How did the project start?
  2. Presentation of the solution [Here]
  3. 5 things I would have loved knowing about Google Play
  4. What’s next? Want to contribute? & Special thanks

Presentation of the solution

The needs

The solution

Based on Taskcluster

Mozilla, and more specifically the Release Engineering team, uses Taskcluster to implement the Firefox release workflow. The workflow can be summed up as:

  1. Build Firefox with all supported locales (languages)
  2. Sign these builds
  3. Publish them everywhere (on https://archive.mozilla.org/, on https://www.mozilla.org/firefox/, via updates, etc.)

Each step is defined by its own set of tasks. These tasks are processed by specialized workers (represented by worker types). Those workers basically run a script against parameters given in the task definition.

Therefore, publishing to Google Play was a matter of creating a new Taskcluster task, which will be processed by a dedicated script and executed by its own worker type.
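
For illustration, the definition of such a task might look roughly like the following Python dictionary. The worker type, scopes, and payload fields are placeholders, not the exact schema used in production.

    # Rough, hypothetical shape of a "push APK" task definition. Identifiers
    # and payload fields are placeholders for illustration only.
    PUSH_APK_TASK = {
        "provisionerId": "scriptworker-prov-v1",        # assumed provisioner name
        "workerType": "pushapk-v1",                     # the dedicated worker type
        "scopes": ["project:releng:googleplay:aurora"], # restricts which app it may touch
        "payload": {
            "google_play_track": "rollout",
            "upstreamArtifacts": [
                {
                    "taskId": "<signing task id>",
                    "taskType": "signing",
                    "paths": ["public/build/target.apk"],
                },
            ],
        },
    }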

With some extra-security features

The aforementioned script must be bootstrapped to be integrated with the rest of Taskcluster. There are several ways to bootstrap scripts for Taskcluster. One of them is to create a docker image which Taskcluster pulls and runs.

However, because of the needs stated above, we decided to go with a security-focused framework: scriptworker. Scriptworker was initially created to perform one of the most security-critical operations: signing builds. The framework has some very interesting features:

How pieces are wired together
0. Overview

Here’s a general view of how things are wired together:

Architecture overview

1. Task creation

There are many ways to submit the definition of a task to Taskcluster. For example, you can:

Each of them was used at some point, but the ultimate solution relies on the last one. The taskgraph is a graph generator which, depending on given parameters, creates a graph of builds, tests and deployment tasks. Taskgraph generation is run on Taskcluster too, under what we commonly call “the decision task”. This solution benefits from being on hg.mozilla.org: it is versioned and only vouched people are able to modify it.

Moreover, taskgraph generates what is necessary for scriptworker to validate the task definitions and artifacts. To do so, taskgraph:

2. Scriptworker and new tasks

Scriptworker polls for tasks from the Taskcluster queue. That is actually one of the great things about Taskcluster: workers don't have to open inbound (listening) ports, which reduces the potential attack surface. Fetching new tasks is done via this bit of the REST API which workers can poll. Speaking of which, workers are authenticated to Taskcluster, which prevents them from claiming a task they aren't meant to take.

Secure download of artifacts is done by the “Chain of Trust” feature of scriptworker. Once set up, if you define upstreamArtifacts within the task definition, scriptworker will:

  1. Make sure the current task and its dependencies have not changed since they were defined. This is done by comparing the JSON the taskgraph generated with the actual definition.
  2. Check the signatures of every dependency, by looking at a special artifact Chain of Trust creates. This helps to verify that no rogue worker processed an upstream task.
  3. Download artifacts on the worker and verify the checksums.

If all goes well, scriptworker will call pushapkscript.
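
Condensed into code, the three checks amount to something like the sketch below. This is only an illustration of the logic, with inputs assumed to be fetched already; scriptworker's real chain-of-trust implementation is considerably more involved.

    # Self-contained sketch of the three chain-of-trust checks described above.
    # Inputs are assumed to have been fetched beforehand; scriptworker's real
    # code is far more involved.
    import hashlib

    def upstream_artifact_is_trusted(generated_definition, actual_definition,
                                     cot_signature_valid, artifact_bytes,
                                     expected_sha256):
        # 1. The dependency's definition must match what the decision task
        #    (taskgraph) originally generated.
        if actual_definition != generated_definition:
            return False
        # 2. The signature on the dependency's chain-of-trust artifact must be
        #    valid, proving a trusted worker produced it.
        if not cot_signature_valid:
            return False
        # 3. The downloaded artifact's checksum must match the recorded value.
        return hashlib.sha256(artifact_bytes).hexdigest() == expected_sha256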

3. Pushapkscript and APKs

This is where the Android-specific bits start. Pushapkscript performs some extra checks on the APKs:

Pushapkscript knows about the location of the Google Play credentials (P12 certificates). It finally gives all the files (checked APKs and credentials) to MozApkPublisher.

4. MozApkPublisher, locales and Google Play

To be honest, MozApkPublisher could have been implemented within pushapkscript, but the split exists for historical reasons and still has a meaning today: this was the script Release Management used before this project got started. It also remains a way to let a human publish, in case of emergency.

It checks that APKs are multi-locale. We serve the same APK, which includes (almost) every locale in it. That’s a verification Google doesn’t do.
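
A hedged sketch of such a check is below. The assumption that locales can be spotted via chrome/<locale>/ entries in the archive is purely for illustration; the real check in MozApkPublisher may inspect the APK differently.

    # Hypothetical sketch: count the locales bundled in an APK and refuse to
    # publish builds that look single-locale. The "chrome/<locale>/" layout used
    # to detect locales is an assumption for illustration only.
    import re
    import zipfile

    LOCALE_RE = re.compile(r"chrome/([a-zA-Z-]+)/")

    def bundled_locales(apk_path):
        locales = set()
        with zipfile.ZipFile(apk_path) as apk:
            for name in apk.namelist():
                match = LOCALE_RE.search(name)
                if match:
                    locales.add(match.group(1))
        return locales

    def check_multilocale(apk_path, minimum=2):
        locales = bundled_locales(apk_path)
        if len(locales) < minimum:
            raise ValueError("{} looks single-locale (found {!r})".format(
                apk_path, sorted(locales)))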

It also fetches the latest strings to display on the Play Store (like the localized descriptions). These strings are then posted on Google Play, alongside the APKs.

MozApkPublisher provides a dry-run mode thanks to the transaction mechanism exposed by Google’s API. Nothing is effectively published until the transaction is committed.
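
Roughly, that flow looks like the sketch below, assuming the standard Google API Python client is used against the androidpublisher service; the exact API version, fields, and MozApkPublisher's real code may differ.

    # Sketch of the "edit" (transaction) workflow exposed by Google Play's
    # androidpublisher API, illustrating how a dry-run mode can work: everything
    # is staged inside an edit, and nothing becomes visible to users unless the
    # edit is committed.
    from googleapiclient.discovery import build
    from googleapiclient.http import MediaFileUpload

    def publish_apks(credentials, package_name, apk_paths, track, commit=True):
        service = build("androidpublisher", "v3", credentials=credentials)
        edits = service.edits()

        # Open a transaction ("edit").
        edit_id = edits.insert(packageName=package_name, body={}).execute()["id"]

        # Stage the APKs inside the transaction.
        version_codes = []
        for apk in apk_paths:
            media = MediaFileUpload(apk, mimetype="application/octet-stream")
            response = edits.apks().upload(
                editId=edit_id, packageName=package_name, media_body=media
            ).execute()
            version_codes.append(str(response["versionCode"]))

        # Assign the uploaded APKs to a track. A staged rollout would instead
        # use an "inProgress" release with a userFraction.
        edits.tracks().update(
            editId=edit_id,
            packageName=package_name,
            track=track,
            body={"releases": [{"versionCodes": version_codes, "status": "completed"}]},
        ).execute()

        # Dry-run mode: simply skip the commit, and nothing is published.
        if commit:
            edits.commit(editId=edit_id, packageName=package_name).execute()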

5. Pushapk_scriptworker: Scriptworker, Pushapkscript, and MozApkPublisher on the same machine

The 3 pieces live on the same Amazon EC2 instance, under the name pushapk_scriptworker. The configuration of this instance is managed by Puppet. The entire Puppet configuration is public on hg.mozilla.org, with the exception of secrets (Taskcluster credentials, P12 certificates), which are encrypted on a separate machine. Like the main Firefox repository, only vouched people can submit changes to the Puppet configuration.

5 things I would have loved knowing about Google Play

See the next post.

Comments

You can read and leave comments on this Github issue.

June 07, 2017 12:24 PM

June 06, 2017

Part 1: How Mozilla publishes APKs onto Google Play Store, in a reasonably secure and automated way

The Release Engineering team fully automated the publication of Firefox for Android in version 53.0. Let's see what was already there and how things have changed since version 53.0.

This blog post is part of a series. Check out the other posts:

  1. How did the project start? [Here]
  2. Presentation of the solution
  3. 5 things I would have loved knowing about Google Play
  4. What’s next? Want to contribute? & Special thanks

How did the project start?

Mozilla ships Firefox every day

This is true for desktop (Windows, Linux, Mac) and Android. However, we don’t ship that often to every user. We have different channels, receiving updates at different frequencies:

About Firefox Aurora

As you may have heard, Firefox Aurora was discontinued in April 2017. Nevertheless, these blog posts will still talk about it, mainly because most of the experiments were done on Aurora before it was stopped.

Today, the Android users who were on Aurora have been migrated to Nightly. New users are also given Nightly.

Why do we need Firefox for Android on app stores?

Unlike Firefox for desktop, Android apps have to be uploaded onto application stores (like Google Play Store). Otherwise, they have very low visibility. For instance, Firefox for Android Aurora (codenamed “Fennec Aurora”) was not on Google Play until September 2016, but it was downloadable from our official website (now redirected to Nightly). After we started publishing Aurora on Google Play, we increased our number of users by 5x.

Why are we automating the publication today?

Google didn’t offer a way to script a publication on Play Store, before July 2014. It had to be done manually, from their website. Around that time, a few people from Release Management implemented a first script. One person from the Release Management team ran it every time Beta or Release was ready, from his/her own machine. With Aurora being out, we now have several APKs (one per processor architecture/Android API level, which translates to 2 at the moment: one for x86 processors, the other for ARM) to publish each day.

The daily frequency was new for Fennec. It led to 2 concerns:

  1. A human has to repeat the same task every day.
  2. Pushing every day from a workstation increases the surface of security threats.

That is why we decided to make APK publication a part of the automated release workflow.

Presentation of the solution

See the next post.

Comments

You can read and leave comments on this Github issue.

June 06, 2017 01:25 PM

May 17, 2017

Kim Moir (kmoir)

New blog location

I moved my blog to WordPress.

New location is here https://kimmoir.blog/

May 17, 2017 09:12 PM

March 13, 2017

Chris Cooper (coop)

Shameless self (release) promotion: Firefox 53.0b1 from TaskCluster

You may recall two short months ago when we moved Linux and Android nightlies from buildbot to TaskCluster. Due to the train model, this put us (release engineering) on a clock: either we’d be ready to release a beta version of Firefox 53 for Linux and Android using release promotion in TaskCluster, or we’d need to hold back our work for at least the next cycle, causing uplift headaches galore.

I’m happy to report that we were able to successfully release Firefox 53.0b1 for Linux and Android from TaskCluster last week. This is impressive for 3 reasons:

  1. Mac and Windows builds were still promoted from buildbot, so we were able to seamlessly integrate the artifacts of two different continuous integration (CI) platforms.
  2. The process whereby nightly builds are generated has always been different from how we generate release builds. Firefox 53.0b1 represents the first time a beta build was generated using the same taskgraph we use for a nightly, thereby reducing the delta between CI builds and release builds. More work to be done here, for sure.
  3. Nobody noticed. With all the changes under the hood, this may be the most impressive achievement of all.

A round of thanks to Aki, Johan, Kim, and Mihai who worked hard to get the pieces in place for Android, and a special shout-out to Rail who handled the Linux beta while also dealing with the uplift requirements for ESR52. Of course, thanks to everyone else who has helped with the migration thus far. All of that foundational work is starting to pay off.

Much more to do, but I look forward to updating you about Mac and Windows progress soon.

March 13, 2017 07:17 PM

February 21, 2017

Chris Cooper (coop)

RelEng & RelOps highlights - February 21, 2017

It’s been a while. How are you?

Modernize infrastructure:

We finally closed the 9-year-old bug requesting that we redirect all HTTP traffic to hg.mozilla.org to HTTPS! Many thanks to everyone who helped ensure that automation and other tools continued to work normally. It's not every day you get to close bugs that are older than my kids. https://bugzilla.mozilla.org/show_bug.cgi?id=450645

The new TreeStatus page (https://mozilla-releng.net/treestatus) was finally released by garbas, with a proxy in place of the old URL.

Improve Release Pipeline:

Initial work on the Uplift dashboard was done by bastien and released to production by garbas. https://shipit.mozilla-releng.net/release-dashboard

Releng had a workweek in Toronto to plan how release promotion will work in a TaskCluster world. With the uplift for Firefox 52 rapidly approaching (see Release below), we came up with a multi-phase plan that should allow us to release the Linux and Android versions of Firefox 52 from TaskCluster, with the Mac and Windows versions still being created by buildbot.

Improve CI Pipeline:

Alin and Sebastian disabled Windows 10 tests on our CI. Windows 10 tests will be reappearing later this year once we move datacentres and acquire new hardware to support them. https://bugzilla.mozilla.org/show_bug.cgi?id=1330999

Andrei and Relops converted some Windows talos machines to run Linux64 to reduce wait times on this platform. https://bugzilla.mozilla.org/show_bug.cgi?id=1337452

There are some upcoming deadlines involving datacentre moves that, while not currently looming, are definitely focusing our efforts in the TaskCluster migration. As part of the aforementioned workweek, we targeted the next platform that needs to migrate, Mac OS X. We are currently breaking out the packaging and signing steps for Mac so that they can be done on Linux. That work can then be re-used for l10n repacks *and* release promotion.

Operational:

Since most of our Linux64 builds and tests have migrated to TaskCluster, Alin was able to shut down many of our Linux buildbot masters. This will reduce our monthly AWS bill and the complexity of our operational environment. https://bugzilla.mozilla.org/show_bug.cgi?id=1335435

Hal ran our first “hard close” Tree Closing Window (TCW) in quite a while on Saturday, February 11 (https://bugzilla.mozilla.org/show_bug.cgi?id=1324148). It ran about an hour longer than planned due to some strange interactions deep in the back end, which is why it was a “hard close.” The issue may be related to occasional “database glitches” we have seen in the past. This time IT got some data, and have raised a case with our load balancer vendor.

Release:

We are deep in the beta cycle for Firefox 52, with beta 8 coming out this week. Firefox 52 is an important milestone release because it signals the start of another ESR cycle.

See you again soon!

February 21, 2017 04:18 PM

February 17, 2017

Chris Cooper (coop)

Being productive when distributed teams get together, take 2

Salmon jumping<figcaption>Every year, hundreds of release engineers swim upstream because they’re built that way.</figcaption>

Last week, we (Mozilla release engineering) had a workweek in Toronto to jumpstart progress on the TaskCluster (TC) migration. After the success of our previous workweek for release promotion, we were anxious to try the same format once again and see if we could realize any improvements.

Prior preparation prevents panic

We followed all of the recommendations in the Logistics section of Jordan’s post to great success.

Keeping developers fed & watered is an integral part of any workweek. If you ever want to burn a lot of karma, try building consensus between 10+ hungry software developers about where to eat tonight, and then finding a venue that will accommodate you all. Never again; plan that shit in advance. Another upshot of advance planning is that you can also often go to nicer places that cost the same or less. Someone on your team is a (closet) foodie, or is at least a local. If it’s not you, ask that person to help you with the planning.

What stage are you at?

The workweek in Vancouver benefitted from two things:

  1. A week of planning at the All-Hands in Orlando the month before; and,
  2. Rail flying out to Vancouver a week early to organize much of the work to be done.

For this workweek, it turned out we were still at the planning stage, but that’s totally fine! Never underestimate the power of getting people on the same page. Yes, we did do *some* hacking during the week. Frankly, I think it’s easier to do the hacking bit remotely, but nothing beats a bunch of engineers in a room in front of a whiteboard for planning purposes. As a very distributed team, we rarely have that luxury.

Go with it

…which brings me to my final observation. Because we are a very distributed team, opportunities to collaborate in person are infrequent at best. When you do manage to get a bunch of people together in the same room, you really do need to go with discussions and digressions as they develop.

This is not to say that you shouldn’t facilitate those discussions, timeboxing them as necessary. If I have one nit to pick with Jordan’s post it’s that the “Operations” role would be better described as a facilitator. As a people manager for many years now, this is second-nature to me, but having someone who understands the problem space enough to know “when to say when” and keep people on track is key to getting the most out of your time together.


By and large, everything worked out well in Toronto. It feels like we have a really solid format for workweeks going forward.

February 17, 2017 10:48 PM

January 20, 2017

Chris Cooper (coop)

Nightlies in TaskCluster - go team!

As catlee has already mentioned, yesterday we shipped the first nightly builds for Linux and Android off our next-gen Mozilla continuous integration (CI) system known as TaskCluster. I eventually want to talk more about why this important and how we got to here, but for now I’d like to highlight some of the people who made this possible.

Thanks to Aki’s meticulous work planning and executing on a new chain of trust (CoT) model, the nightly builds we now ship on TaskCluster are arguably more secure than our betas and releases. Don’t worry though, we’re hard at work porting the chain of trust to our release pipeline. Jordan and Mihai tag-teamed the work to get the chain-of-trust-enabled workers doing important things like serving updates and putting binaries in the proper spots. Kim did the lion’s share of the work getting our task graphs sorted to tie together the disparate pieces. Callek wrangled all of the l10n bits. On the testing side, gbrown did some heroic work getting reliable test images setup for our Linux platforms. Finally, I’d be remiss if I didn’t also call out Dustin who kept us all on track with his migration tracker and who provided a great deal of general TaskCluster platform support.

Truly it was a team effort, and thanks to all of you for making this particular milestone happen. Onward to Mac, Windows, and release promotion!

January 20, 2017 07:21 PM

Chris AtLee (catlee)

Nightly builds from Taskcluster

Yesterday, for the very first time, we started shipping Linux Desktop and Android Firefox nightly builds from Taskcluster.


We now have a much more secure, resilient, and hackable nightly build and release process.

It's more secure, because we have developed a chain of trust that allows us to verify all generated artifacts back to the original decision task and docker image. Signing is no longer done as part of the build process, but is now split out into a discrete task after the build completes.

The new process is more resilient because we've split up the monolithic build process into smaller bits: build, signing, symbol upload, upload to CDN, and publishing updates are all done as separate tasks. If any one of these fail, they can be retried independently. We don't have to re-compile the entire build again just because an external service was temporarily unavailable.

Finally, it's more hackable - in a good way! All the configuration files for the nightly build and release process are contained in-tree. That means it's easier to inspect and change how nightly builds are done. Changes will automatically ride the trains to aurora, beta, etc.

Ideally you didn't even notice this change! We try and get these changes done quietly, smoothly, in the background.

This is a giant milestone for Mozilla's Release Engineering and Taskcluster teams, and is the result of many months of hard work, planning, coding, reviewing and debugging.

Big big thanks to jlund, Callek, mtabara, kmoir, aki, dustin, sfraser, jlorenzo, coop, jmaher, bstack, gbrown, and everybody else who made this possible!

January 20, 2017 01:35 PM

December 28, 2016

Chris AtLee (catlee)

2016 RelEng Retrospective

As 2016 winds down, I wanted to take some time to highlight all the work our Release Engineering team has done this year. Personally, I really enjoy writing these retrospective posts. I think it's good to spend some time remembering how far we've come in a year. It's really easy to forget what you did last month, and 6 months ago seems like ancient history!

People!

We added four people to our team this year!

Aki (:aki) re-joined us in January and has been working hard on developing a security model for Taskcluster for sensitive tasks like signing and publishing binaries.

Rok (:garbas) started in February and has been working on modernizing our web application framework development and deployment processes.

Johan (:jlorenzo) started in August and has been improving our release automation, Balrog, and automatically publishing Android builds to the Google Play Store.

Simon (:sfraser) started in October and has been improving monitoring of our production systems, as well as getting his feet wet with our partial update generation system.

Releases

This year we released 104 desktop versions of Firefox, and 58 Android versions (including Beta, Release and ESR branches).

5 of those releases were just in the week prior to our all hands meeting in Hawaii!

Several other releases this year were special for particular reasons, and required special efforts on our part. We continued to provide SHA-1 signed installers for Windows XP users. We also produced a special 47.0.2 release in order to try and rescue users stuck on 47. We've never shipped a point release for a previous release branch before! We've also generated partial updates to try and help users on 43.0.1 and 47.0.2 get faster updates to the latest version of Firefox.

Release promotion

We couldn't have shipped so many releases so quickly last week if it weren't for release promotion. Prior to Firefox 46, our release process would generate completely new builds after CI was finished. This wasted a lot of time, and also meant we weren't shipping the exact binaries we had tested. Today, we ship the same builds that CI has generated and tested. This saves a ton of time (up to 8 hours!), and gives us a lot more confidence in the quality of the release.

This is one of those major kinds of changes that really transforms how we approach doing releases. I can't really remember what it was like doing releases prior to release promotion!

We also added support in Shipit to allow starting a release before all the en-US builds are done. This lets our Release Management team kick off a release early, assuming all the builds pass. It saves a person having to wait around watching Treeherder for the coveted green builds.

Windows in AWS

This year we completed our migration to AWS for Windows builds. 100% of our Windows builds are now done in AWS. This means that we now have a much faster and more scalable Windows build platform.

In addition, we also migrated most of the Windows 7 unittests to run in AWS. Previously these were running on dedicated hardware in our datacentre. By moving these tests to AWS, we again get a much more scalable test platform, but we also freed up hardware capacity for other test platforms (e.g. Windows XP).

Taskcluster

One of our major focus areas this year was migrating our infrastructure from Buildbot to Taskcluster. As of today, we have:

  • Fully migrated Linux64 and Android debug builds and tests
  • Builds for all other platforms operating as Tier2
  • Linux64 and Android nightly builds, l10n repacks and updates operating as Tier2
  • Tons of security design & implementation work

Balrog

The Scheduled Changes feature in Balrog means that machines can now set the background update rate to 0% 24 hours after a release, instead of having a human do it.

Balrog itself was migrated from our datacentre in SCL3 into AWS. We now have a much more flexible deployment pipeline.

Balrog has also been one of our best projects for getting volunteer contributions! Much of the work done this year was done by contributors!

RIP

Being able to shut off old, crufty and deprecated stuff is an important part of staying agile. This year we were finally able to develop an end of life plan for Windows XP. In addition, we discontinued support for OSX 10.6-10.8, systems without SSE2, and 32-bit OSX systems. Not having to support these old platforms simplifies managing our infrastructure, and also makes product development easier.

We also shut down all the panda mobile testing infrastructure and legacy vcs-sync.

What's next?

2017 is looking like it's going to be another interesting (and busy!) year for RelEng.

Our top priority is to finish the migration to Taskcluster. Hopefully by the end of 2017, the only thing left on buildbot will be the ESR52 branch. This will require some big changes to our release automation, especially for Fennec.

We're also planning to provide some automated processes to assist with the rest of the release process. Releases still involve a lot of human-to-human handoffs, and places where humans are responsible for triggering automation. We'd like to provide a platform to be able to manage these handoffs more reliably, and allow different pieces of automation to coordinate more effectively.

December 28, 2016 07:35 PM

October 21, 2016

Hal Wine (hwine)

Using Auto Increment Fields to Your Advantage

Using Auto Increment Fields to Your Advantage

I just found, and read, Clément Delafargue’s post “Why Auto Increment Is A Terrible Idea” (via @CoreRamiro). I agree that an opaque primary key is very nice and clean from an information architecture viewpoint.

However, in practice, a serial (or monotonically increasing) key can be handy to have around. I was reminded of this during a recent situation where we (app developers & ops) needed to be highly confident that a replica was consistent before performing a failover. (None of us had access to the back end to see what the DB thought the replication lag was.)
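
As a rough illustration of the kind of check being described (my own sketch, with made-up connection strings and an `events` table with a serial `id` column, not anything from the post), the application side can compare the latest auto-increment value on the primary and the replica:

import psycopg2

# Hypothetical DSNs; the point is that max(id) on a serial column gives the
# application a cheap, back-end-free proxy for replication lag.
PRIMARY_DSN = "host=primary.example.com dbname=app user=app"
REPLICA_DSN = "host=replica.example.com dbname=app user=app"

def max_id(dsn, table="events"):
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT COALESCE(MAX(id), 0) FROM " + table)
            return cur.fetchone()[0]

lag_rows = max_id(PRIMARY_DSN) - max_id(REPLICA_DSN)
print("replica is behind by roughly", lag_rows, "rows")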

Read more...

October 21, 2016 07:00 AM

September 30, 2016

Kim Moir (kmoir)

Beyond the Code 2016 recap

I've had the opportunity to attend the Beyond the Code conference for the past two years. This year the venue moved to Toronto; the last two events had been held in Ottawa. The conference is organized by Shopify, who again managed to put together a really great speaker lineup this year on a variety of interesting topics. It was a two-track conference, so I'll summarize some of the talks I attended.

The conference started off with Anna Lambert of Shopify welcoming everyone to the conference.





The first speaker was Atlee Clark, Director of App and Developer relations at Shopify who discussed the wheel of diversity.


The wheel of diversity is a way of mapping the characteristics that you're born with (age, gender, gender expression, race or ethnicity, national origin, mental/physical ability), along with those that you acquire through life (appearance, education, political belief, religion, income, language and communication skills, work experience, family, organizational role). When you look at your team, you can map how diverse it is by assigning each characteristic a colour. (Of course, some of these characteristics are personal and might not be shared with others.) If you map your team and it's mostly the same colour, then you probably won't bring different perspectives together when you work, because you all have similar backgrounds and life experiences. This is especially important when developing products.



This wheel applies to hiring too. You want to have different perspectives when you're interviewing someone. Atlee mentioned that when she was hiring for a new role, she mapped out the characteristics of the people who would be conducting the interviews and found there was a lot of yellow.


So she switched up the team that would be conducting the interviews to include people with more diverse perspectives.

She finished by stating that this is just a tool, keep it simple, and practice makes it better. 

The next talk was by Erica Joy, who is a build and release engineer at Slack, as well as a diversity advocate.  I have to admit, when I saw she was going to speak at Beyond the Code, I immediately pulled out my credit card and purchased a conference ticket.  She is one of my tech heroes.  Not only did she build the build and release pipeline at Slack from the ground up, she is an amazing writer and advocate for change in the tech industry.   I highly recommend reading everything she has written on Medium, her chapter in Lean Out and all her discussions on twitter.  So fantastic.

Her talk at the conference was "Building a Diverse Corporate Culture: Diversity and Inclusion in Tech". She talked about how literally thousands of companies say they value inclusion and diversity. However, few talk about what they are willing to give up in order to achieve it. Are you willing to give up your window seat with a great view? Something else so that others can be paid fairly? She mentioned that change is never free. People need both mentorship and sponsorship in order to progress in their career.





I really liked her discussion around hiring and referrals. She stated that when you hire people you already know, you're probably excluding equally or better qualified people that you don't know. By default, women of colour are underpaid.

Pay gap for white women, African American women and Hispanic women compared to a white man in the United States.

Some companies have referral systems that give larger referral bonuses for referring people who are underrepresented in tech; she gave the example of Intel, which has this in place. This is a way to incentivize your referral system so you don't just hire all your white friends.

The average white American has 91 white friends and one black friend so it's not very likely that they will refer non-white people. Not sure what the numbers are like in Canada but I'd guess that they are quite similar.
  
In addition, don't ask people to work for free, to speak at conferences or do diversity and inclusion work.  Her words were "We can't pay rent with exposure".

Spend time talking to diversity and inclusion experts. There are people that have spent their entire lives conducting research in this area and you can learn from their expertise. Meritocracy is a myth; we are just lucky to be in the right place at the right time. She mentioned that her colleague Duretti Hirpa at Slack points out the need for accomplices, not allies: people that will actually speak up for others, so people feeling pain or facing a difficult work environment don't have to do all the work of fighting for change.




In most companies, there aren't escalation paths for human issues either.  If a person is making sexist or racist remarks, shouldn't that be a firing offense? 

If people were really working hard on diversity and inclusion, we would see more women and people of colour on boards and in leadership positions.  But we don't.

She closed with a quote from Beyonce:

"If everything was perfect, you would never learn and you would never grow"

💜💜💜

The next talk I attended was by Coraline Ada Ehmke, who is an application engineer at Github. Her talk was about the "Broken Promise of Open Source". Open source has the core principles of the free exchange of ideas, success through collaboration, shared ownership and meritocracy.


However, meritocracy is a myth.  Currently, only 6% of Github users are women.  The environment can be toxic, which drives a lot of people away.  She mentioned that we don't have numbers for diversity in open source other than women, but Github plans to do a survey soon to try to acquire more data.


Gabriel Fayant from Assembly of Seven Generations gave a talk entitled "Walking in Both Worlds, traditional ways of being and the world of technology". I found this quite interesting; she talked about traditional ceremonies and how they promote the idea of living in the moment, and thus looking at your phone during a drum ceremony isn't living the full experience. A question from the audience, from someone who works in the engineering faculty at the University of Toronto, was how we can work with indigenous communities to share our knowledge of technology and make youth producers of tech, not just consumers.

The next talk was by Sandi Metz, entitled "Madame Santi tells your future".  This was a totally fascinating look at the history of printing text from scrolls all the way to computers.

She gave the same talk at another conference earlier, so you can watch it here. It described the progression of printing technology from 7000 years ago until today. Each new technology disrupted the previous one, and it was difficult for those who worked on the previous technology to make the jump to work on the new one.

So according to Sandi, what is your future?

The last talk I attended was by Sabrina Geremia of Google Canada.  She talked about the factors that encourage a girl to consider computer science (encouragement, career perception, self-perception and academic exposure.)


I found that this talk was interesting but it focused a bit too much on the pipeline argument - that the major problem is that girls are not enrolling in CS courses.  If you look at all the problems with environment, culture, lack of pay equity and opportunities for promotion due to bias, maybe choosing a career where there is more diversity is a better choice.  For instance, law, accounting and medicine have much better numbers for these issues, despite there still being an imbalance.

At the end of the day, there was a panel to discuss diversity issues:

Moderator: Ariti Sharma, Shopify, Panelists: Mohammed Asaduallah, Format, Katie Krepps, Capital One Canada, Lateesha Thomas, Dev Bootcamp, Ramya Raghavan, Google, Kara Melton, TWG, Gladstone Grant, Microsoft Canada
Some of my notes from the panel

Compared to the previous two iterations of this conference, it seemed that this time it focused a lot more on solutions to have more diversity and inclusion in your company. The previous two conferences I attended seemed to focus more on technical talks by diverse speakers.


As a side note, there were a lot of Shopify folks in attendance because they ran the conference. They sent a bus of people from their head office in Ottawa to attend it. I was really struck by how diverse some of the teams were. I met a group of women who described themselves as a team of "five badass women developers" 💯 As someone who has been the only woman on her team for most of her career, this was beautiful to see and gave me hope for the future of our industry. I've visited the Ottawa Shopify office several times (Mr. Releng works there) and I know that the representation in their office doesn't match the demographics of the Beyond the Code attendees, which tended to be more women and people of colour. But still, it is refreshing to see a company making a real effort to make their culture inclusive. I've read that it is easier to make your culture inclusive from the start, rather than trying to make difficult culture changes years later when your teams are all homogeneous. So kudos to them for setting an example for other companies.

Thank you Shopify for organizing this conference, I learned a lot and I look forward to the next one!

September 30, 2016 01:10 PM

August 22, 2016

Hal Wine (hwine)

Py Bay 2016 - a First Report

Py Bay 2016 - a First Report

PyBay held their first local Python conference this last weekend (Friday, August 19 through Sunday, August 21). What a great event! I just wanted to get down some first impressions - I hope to do more after the slides and videos are up.

Read more...

August 22, 2016 07:00 AM

July 29, 2016

Kim Moir (kmoir)

Ottawa Python Authors Meetup: Artificial Intelligence with Python

Last night, I attended my first Ottawa Python Authors Meetup. I had wanted to attend for a long time. (Mr. Releng also works with Python, and thus every time there's a meetup, we discuss who gets to go and who gets to stay home and take care of little Releng. It depends on whether the talk is more relevant to our work interests.)

The venue was across the street from Confederation Park aka land of Pokemon.


I really enjoyed it. The people I chatted with were very friendly and welcoming. Of course, I ran into some people I used to work with, as seems to happen at any tech event in Ottawa. Nice to catch up!

The venue had the Canada Council for the Arts as a tenant, thus the quintessentially Canadian art.


The speaker that night was Emily Daniels, a developer from Halogen Software, who spoke on Artificial Intelligence with Python. (Slides here, github repo here.) She mentioned that she writes Java during the day but works on fun projects in Python at night. She started the talk by going through some examples of artificial intelligence on the web. Perhaps the most interesting one I found was a recurrent neural network called Benjamin which generates movie script ideas and was trained on existing sci-fi movies and movie scripts. Also, a short film called Sunspring was made from one of the generated scripts. The dialogue is kind of stilted but it is an interesting concept.

 After the examples, Emily then moved on to how it all works. 

Deep learning is a type of machine learning that derives meaning from data using a hierarchy of multiple layers that mimics the neural networks of our brain.

She then spoke about a project she wrote to create generative poetry from an RNN (recurrent neural network). It was based on an RNN tutorial that she heavily refactored to meet her needs. She went through the code that she developed to generate artificial prose from the works of H.G. Wells and Jane Austen. She talked about how she cleaned up the text to remove EOL delimiters, page breaks, chapter numbers and so on. It then took a week to train the model on the data.

She then talked about another example which used data from Jack Kerouac and Virginia Woolf novels; she posts some of the results to Twitter.


She also created a twitter account which posts generated text from her RNN that consumes the content of Walt Whitman and Emily Dickinson. (I should mention at this point that she chose these authors for her projects because the copyrights on these works have expired and they are available from Project Gutenberg.)

After the talk, she fielded a number of audience questions which were really insightful. There were discussions on the inherent bias in the data because it was written by humans who are sexist and racist. She mentioned that she doesn't automatically post the results of the model to Twitter because some of them are really inappropriate, since the model learned from text written by humans who are inherently biased.

One thing I found really interesting is that Emily mentioned that she felt a need to ensure that the algorithms and data continue to exist, and that they were faithfully backed up. I began to think about all the Amazon instances that Mozilla releng had automatically killed that day as our capacity peaked and declined. And of the great joy I feel ripping out code when we deprecate a platform. I personally feel no emotional attachment to bringing down machines or deleting old code.
 
Perhaps the sense that these recurrent neural networks and the data they create need a caretaker comes from the fact that the algorithms output text that is a simulacrum of the work of an author we enjoy reading. And perhaps that is why we aren't as attached to an ephemeral pool of build machines as we are to our phones: the phone provides a sense of human connection to the larger world when we may be sitting alone.

Thank you Emily for the very interesting talk, to the Ottawa Python Authors Group for organizing the meetup, and Shopify for sponsoring the venue.  Looking forward to the next one!

Further reading

July 29, 2016 07:31 PM

Eclipse Committer Emeritus

I received this very kind email in my inbox this morning.

"David Williams has expired your commit rights to the
eclipse.platform.releng project.  The reason for this change is:

We have all known this day would come, but it does not make it any easier.
It has taken me four years to accept that Kim is no longer helping us with
Eclipse. That is how large her impact was, both on myself and Eclipse as a
whole. And that is just the beginning of why I am designating her as
"Committer Emeritus". Without her, I humbly suggest that Eclipse would not
have gone very far. Git shows her active from 2003 to 2012 -- longer than
most! She is (still!) user number one on the build machine. (In Unix terms,
that is UID 500). The original admin, when "Eclipse" was just the Eclipse
Project.

She was not only dedicated to her job as a release engineer she was
passionate about doing all she could to make other committer's jobs easier
so they could focus on their code and specialties. She did (and still does)
know that release engineering is a field of its own; a specialized
profession (not something to "tack on" at the end) that just anyone can do)
 and good, committed release engineers are critical to the success of any
project.

For anyone reading this that did not know Kim, it is not too late: you can
follow her blog at

http://relengofthenerds.blogspot.com/

You will see that she is still passionate about release engineering and
influential in her field.

And, besides all that, she was (I assume still is :) a well-rounded, nice
person, that was easy to work with! (Well, except she likes running for
exercise. :)

Thanks, Kim, for all that you gave to Eclipse and my personal thanks for
all that you taught me over the years (and I mean before I even tried to
fill your shoes in the Platform).

We all appreciate your enormous contribution to the success of Eclipse and
happy to see your successes continuing.

To honor your contributions to the project, David Williams has nominated
you for Committer Emeritus status."


Thank you David! I really appreciate your kind words. I learned so much working with everyone in the Eclipse community. I had intended to keep contributing to Eclipse when I left IBM, but really felt that I had given all I had to give. Few people have the chance to contribute to two fantastic open source communities during their career. I'm lucky to have had that opportunity.


My IBM friends made this neat Eclipse poster when I left.  The Mozilla dino displays my IRC handle.

July 29, 2016 07:06 PM

July 18, 2016

Hal Wine (hwine)

Legacy vcs-sync is dead! Long live vcs-sync!

Legacy vcs-sync is dead! Long live vcs-sync!

tl;dr: No need to panic - modern vcs-sync will continue to support the gecko-dev & gecko-projects repositories.

Today’s the day to celebrate! No more bash scripts running in screen sessions providing dvcs conversion experiences. Woot!!!

I’ll do a historical retrospective in a bit. Right now, it’s time to PARTY!!!!!

July 18, 2016 07:00 AM

July 12, 2016

Rail Alliev (rail)

Thoughts about partial updates on demand

Firefox has its own built-in update system. The update system supports two types of updates: complete and incremental. Completes can be applied to any older version, unless there are incompatible changes in the MAR format. Incremental updates can be applied only to the release they were generated for.

Usually for the beta and release channels we generate incremental updates against 3-4 versions. This way we try to minimize bandwidth consumption for our end users and increase the number of users on the latest version. For Nightly and Developer Edition builds we generate 5 incremental updates using funsize.

Both methods assume that we know ahead of time what versions should be used for incremental updates. For releases and betas we use ADI stats to be as precise as possible. However, these methods are static and don't use real-time data.

The idea to generate incremental updates on demand has been around for ages. Some of the challenges are:

  • Acquiring real-time (or close to real-time) data for making decisions on incremental update versions
  • Size of the incremental updates. If the size is very close to the size of the corresponding complete, there is no reason to serve incremental updates. One of the reasons is that the updater tries to use the incremental update first, and then falls back to the complete in case something goes wrong. In that case the updater downloads both the incremental and the complete.

Ben and I talked about this today and to recap some of the ideas we had, I'll put them here.

  • We still want to "pre-seed" most possible incremental updates before we publish any updates
  • Whenever Balrog serves a complete-only update, it should generate a structured log entry and/or an event to be consumed by some service, which should contain all the information required to generate an incremental update (see the sketch after this list).
  • The "new system" should be able to decide whether to skip incremental update generation, based on the size. These decisions should be stored, so we don't try to generate the incremental update again next time. This information may be stored in Balrog to prevent further events/logs.
  • Before publishing an incremental update, we should test whether it can be applied without issues, similar to the update verify tests we run for releases, but without hitting Balrog. After it passes this test, we can publish it to Balrog and check that Balrog returns the expected XML with the incremental update info in it.
  • Minimize the number of completes served if we plan to generate incremental updates. One of the ideas was to modify the client to support responses like "Come back in 5 minutes, I may have something for you!"
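
To make the second bullet a bit more concrete, here is a purely hypothetical sketch of the kind of structured event Balrog could emit when it can only serve a complete; the field names are invented, and the idea is just to capture enough information for a downstream service to queue generation of the missing incremental update:

import json
import time

def complete_only_served(product, channel, platform, locale, from_version, to_version):
    """Hypothetical 'complete served, incremental missing' event."""
    event = {
        "type": "complete-only-served",
        "timestamp": time.time(),
        "product": product,
        "channel": channel,
        "platform": platform,
        "locale": locale,
        "from_version": from_version,  # what the client is currently running
        "to_version": to_version,      # what it was offered
    }
    # In a real system this would go to a log pipeline or message queue;
    # printing JSON stands in for that here.
    print(json.dumps(event))

complete_only_served("Firefox", "release", "WINNT_x86_64-msvc", "en-US", "45.0.2", "46.0")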

The only remaining thing is to implement all these changes. :)

July 12, 2016 07:03 PM

Hal Wine (hwine)

End of an Experiment

End of an Experiment

tl;dr: We’ll be shutting down the Firefox mirrors on Bitbucket.

A long time ago we started an experiment to see if there was any support for developing Mozilla products on social coding sites. Well, the community-at-large has spoken, with the results many predicted:

  • YES!!! when the social coding site is GitHub
  • No, when the social coding site is Bitbucket
Read more...

July 12, 2016 07:00 AM

June 09, 2016

Chris AtLee (catlee)

PyCon 2016 report

I had the opportunity to spend last week in Portland for PyCon 2016. I'd like to share some of my thoughts and some pointers to good talks I was able to attend. The full schedule can be found here and all the videos are here.

Monday

Brandon Rhodes' Welcome to PyCon was one of the best introductions to a conference I've ever seen. Unfortunately I can't find a link to a recording... What I liked about it was that he made everyone feel very welcome to PyCon and to Portland. He explained some of the simple (but important!) practical details like where to find the conference rooms, how to take transit, etc. He noted that for the first time, they have live transcriptions of the talks being done and put up on screens beside the speaker slides for the hearing impaired.

He also emphasized the importance of keeping questions short during Q&A after the regular sessions. "Please form your question in the form of a question." I've been to way too many Q&A sessions where the person asking the question took the opportunity to go off on a long, unrelated tangent. For the most part, this advice was followed at PyCon: I didn't see very many long winded questions or statements during Q&A sessions.

Machete-mode Debugging

(abstract; video)

Ned Batchelder gave this great talk about using Python's language features to debug problematic code. He ran through several examples of tricky problems that could come up, and how to use things like monkey patching and the debug trace hook to find out where the problem is. One piece of advice I liked was when he said that it doesn't matter how ugly the code is, since it's only going to last 10 minutes. The point is to get the information you need out of the system the easiest way possible, and then you can undo your changes.
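
In that spirit, here's a small example of my own (not code from the talk): temporarily monkey-patching a standard library function to print a stack trace whenever it's called, then undoing the hack once you've found the culprit.

import socket
import traceback

_real_create_connection = socket.create_connection

def noisy_create_connection(address, *args, **kwargs):
    # Throwaway debugging code: show who is opening network connections.
    print("connecting to {!r} from:".format(address))
    traceback.print_stack()
    return _real_create_connection(address, *args, **kwargs)

socket.create_connection = noisy_create_connection
# ... run the misbehaving code here and read the stack traces ...
socket.create_connection = _real_create_connection  # undo the ugliness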

Refactoring Python

(abstract; video)

I found this session pretty interesting. We certainly have lots of code that needs refactoring!

Security with object-capabilities

(abstract; video; slides)

I found this interesting, but a little too theoretical. Object capabilities are completely orthogonal to access control lists as a way to model security and permissions. It was hard for me to see how we could apply this to the systems we're building.

Awaken your home

(abstract; video)

A really cool intro to the Home Assistant project, which integrates all kinds of IoT type things in your home. E.g. Nest, Sonos, IFTTT, OpenWrt, light bulbs, switches, automatic sprinkler systems. I'm definitely going to give this a try once I free up my raspberry pi.

Finding closure with closures

(abstract; video)

A very entertaining session about closures in Python. Does Python even have closures? (yes!)
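
For anyone wondering what that looks like in practice, a tiny example of my own:

def make_counter():
    count = 0
    def bump():
        nonlocal count      # 'count' lives on in the enclosing scope
        count += 1
        return count
    return bump

counter = make_counter()
print(counter(), counter(), counter())  # 1 2 3 -- the state is kept in the closure
print(counter.__closure__)              # the captured cell is visible here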

Life cycle of a Python class

(abstract; video)

Lots of good information about how classes work in Python, including some details about meta-classes. I think I understand meta-classes better after having attended this session. I still don't get descriptors though!

(I hope Mike learns soon that __new__ is pronounced "dunder new" and not "under under new"!)
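
As a tiny illustration of the class lifecycle (my own example, not from the talk), a metaclass can hook into class creation via its own dunder new:

class Meta(type):
    def __new__(mcls, name, bases, namespace):
        # Runs while the class statement below is being executed.
        print("creating class", name)
        namespace["created_by"] = mcls.__name__
        return super().__new__(mcls, name, bases, namespace)

class Widget(metaclass=Meta):   # Meta.__new__ runs right here
    pass

print(Widget.created_by)        # "Meta" -- attribute added during class creation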

Deep learning

(abstract; video)

Very good presentation about getting started with deep learning. There are lots of great libraries and pre-trained neural networks out there to get started with!

Building protocol libraries the right way

(abstract; video)

I really enjoyed this talk. Cory Benfield describes the importance of keeping a clean separation between your protocol parsing code, and your IO. It not only makes things more testable, but makes code more reusable. Nearly every HTTP library in the Python ecosystem needs to re-implement its own HTTP parsing code, since all the existing code is tightly coupled to the network IO calls.
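
A toy example of that separation (line-based rather than HTTP, and not taken from any particular library): the parser only ever sees bytes, so it can be unit-tested without a socket in sight.

class LineProtocol:
    """Parsing only: feed it bytes, get back complete lines. No sockets here."""
    def __init__(self):
        self._buffer = b""

    def feed(self, data):
        self._buffer += data
        *lines, self._buffer = self._buffer.split(b"\n")
        return lines

# Testable without any network IO:
assert LineProtocol().feed(b"hello\nwor") == [b"hello"]
assert LineProtocol().feed(b"no newline yet") == []

def read_lines(sock):
    """The IO layer is a thin, separate wrapper around the parser."""
    proto = LineProtocol()
    while True:
        data = sock.recv(4096)
        if not data:
            return
        for line in proto.feed(data):
            yield line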

Tuesday

Guido's Keynote

(video)

Some interesting notes in here about the history of Python, and a look at what's coming in 3.6.

Click

(abstract; video)

An intro to the click module for creating beautiful command line interfaces.

I like that click helps you to build testable CLIs.

HTTP/2 and asynchronous APIs

(abstract; video)

A good introduction to what HTTP/2 can do, and why it's such an improvement over HTTP/1.x.

Remote calls != local calls

(abstract; video)

Really good talk about failing gracefully. He covered some familiar topics like adding timeouts and retries to things that can fail, but also introduced to me the concept of circuit breakers. The idea with a circuit breaker is to prevent talking to services you know are down. For example, if you have failed to get a response from service X the past 5 times due to timeouts or errors, then open the circuit breaker for a set amount of time. Future calls to service X from your application will be intercepted, and will fail early. This can avoid hammering a service while it's in an error state, and works well in combination with timeouts and retries of course.

I was thinking quite a bit about Ben's redo module during this talk. It's a great module for handling retries!
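
Here's a stripped-down sketch of the idea (my own simplification, not code from the talk): after enough consecutive failures the breaker opens and calls fail fast for a while, then a single trial call is allowed through.

import time

class CircuitBreaker:
    """Stop calling a service for a while after repeated failures."""
    def __init__(self, max_failures=5, reset_after=30):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open; failing fast")
            self.opened_at = None                # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()     # open (or re-open) the circuit
            raise
        self.failures = 0                        # success closes the circuit again
        return result

Wrapping calls to a flaky dependency in something like breaker.call(requests.get, url) then combines naturally with the timeouts and retries mentioned above.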

Diving into the wreck

(abstract; video)

A look into diagnosing performance problems in applications. Some neat tools and techniques introduced here, but I felt he blamed the DB a little too much :)

Wednesday

Magic Wormhole

(abstract; video; slides)

I didn't end up going to this talk, but I did have a chance to chat with Brian before. magic-wormhole is a tool to safely transfer files from one computer to another. Think scp, but without needing ssh keys set up already, or direct network flows. Very neat tool!

Computational Physics

(abstract; video)

How to do planetary orbit simulations in Python. Pretty interesting talk, he introduced me to Feynman, and some of the important characteristics of the simulation methods introduced.

Small batch artisanal bots

(abstract; video)

Hilarious talk about building bots with Python. Definitely worth watching, although unfortunately it's only a partial recording.

Gilectomy

(abstract; video)

The infamous GIL is gone! And your Python programs only run 25x slower!

Larry describes why the GIL was introduced, what it does, and what's involved with removing it. He's actually got a fork of Python with the GIL removed, but performance suffers quite a bit when run without the GIL.

Lars' Keynote

(video)

If you watch only one video from PyCon, watch this. It's just incredible.

June 09, 2016 07:39 PM

June 07, 2016

Kim Moir (kmoir)

Submissions for Releng 2016: due by July 1, 2016

The CFP for Releng 2016 is open!  The workshop will be held November 18, 2016 in Seattle, in conjunction with FSE 2016 (the ACM Foundations of Software Engineering conference).
Picture by howardignatius- Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0)
https://www.flickr.com/photos/howardignatius/14482954049/sizes/l
If you've done something like

We'd like to encourage people new to speaking to apply, as well as those from underrepresented groups in tech. We'd love to hear from some new voices and new companies!


Submissions are due July 1, 2016. If you have questions on the submission process, topics to submit, or anything else, I'm happy to help! I'm kmoir and I work at mozilla.com, or contact me on twitter. Submit early and often!

June 07, 2016 12:59 AM

June 03, 2016

Kim Moir (kmoir)

DevOpsDays Toronto recap

Last week I attended DevOpsDays Toronto.  It was my first time attending a DevOpsDays event and it was quite interesting.  It was held at CBC's Glenn Gould studios which is a quick walk from the Toronto Island airport where I landed after an hour flight from Ottawa.  This blog post is an overview of some of the talks at the conference.
 
Glenn Gould Studios, CBC, Toronto.  

Statue of Glenn Gould outside the CBC studios that bear his name.

Day 1


The day started out with an introduction from the organizers and a brief overview of history of DevOps days. They also made a point about reminding everyone that they had agreed to the code of conduct when they bought their ticket. I found this explicit mention of the code of conduct quite refreshing.


The first talk of the day was by John Willis, evangelist at Docker. He gave an overview of the state of enterprise DevOps. I found this a fresh perspective because I really don't know what happens in enterprises with respect to DevOps, since I have been working in open source communities for so long. John provided an overview of what DevOps encompasses.


DevOps is a continuous feedback loop.


He talked a lot about how empathy is so important in our jobs. He mentioned that Netflix has a slide deck that describes its company culture. He doesn't know if this is still the case, but he had heard that if you showed up for an interview at Netflix without having read the culture deck, you would be automatically disqualified from further interviews. Etsy and Spotify have similar open documents describing their culture.

Here he discusses the research by Christina Maslach on the six sources of burnout.

He gave us some reading to do. I've read the "Release It!" book, which is excellent and has some fascinating stories of software failure in it; I've added the other books to my already long reading list.

He also discussed the Rugged Manifesto, and the idea that the code you write will always be under attack by malicious actors. ICE stands for Inclusivity, Complexity and Empathy.

He stated that it's a long-standing mantra that you can have two of fast, cheap or good, but recent research shows that today we can make many changes quickly, and if there is a failure the mean time to recovery is short.

He left us with some more books to read.

The second talk was a really interesting one by Hany Fahim, CEO of VM Farms. It was a short mystery novella describing how VM Farms' servers suddenly experienced a huge traffic spike when the Brazilian government banned WhatsApp as a result of a legal order. I love a good war story.

Hany discussed how one day VM Farms suddenly saw a huge increase in traffic.

This was a really important point.  When your system is failing to scale, it's important to decide if it's a valid increase in traffic or malicious.


Looking on Twitter, they found that a court in Brazil had recently ruled that WhatsApp would be blocked for 48 hours. Users started circumventing this block via VPN. Looking at their logs, they determined that most of the traffic was resolving to IP addresses from Brazil and that there were long connection times during SSL handshakes.
  

The government of Brazil encouraged the use of open source software versus Windows, and thus the users became more technically literate, and able to circumvent blocks via VPN.


In conclusion, making changes to use multi-core HAProxy fixed a lot of issues. Also, twitter was and continues to be a great source of information on activity that is happening in other countries. Whatsapp was returned to service and then banned a second time, and their servers were able to keep up with the demand.

After lunch, we were back to more talks. The organizers came on stage for a while to discuss the afternoon's agenda. They also remarked that one individual had violated the code of conduct and had been removed from the conference. So the conference had a code of conduct, and steps were taken when it was violated.

Next up, Bridget Kromhout from Pivotal gave a talk entitled Containers will not Fix your Broken Culture.
I first saw Bridget speak at Beyond the Code in Ottawa in 2014 about scaling the streaming services for Drama Fever on AWS.  At the time, I was moving our mobile test infrastructure to AWS so I was quite enthralled with her talk because 1) it was excellent 2) I had never seen another woman give a talk about scaling services on AWS.  Representation matters.

The summary of the talk last week was that no matter what tools you adopt, you need to communicate with each other about the cultural changes that are required to implement new services. A new microservices architecture is great, but if the teams that are implementing these services are not talking to each other, the implementation will not succeed.

Bridget pointing out that the technology we choose to implement is often about what is fashionable.


Shoutout to Jennifer Davis and Katherine Daniels' Effective DevOps book. (Note: I've read it on Safari online and it is excellent. The chapter on hiring is especially good.)

Loved this poster about the wall of confusion between development and operations.  

In the afternoon, there were lightning talks and then open spaces. Open spaces are free-flowing discussions where the topic is voted upon ahead of time. I attended ones on infrastructure automation, CI/CD at scale and, my personal favourite, horror stories. I do love hearing how distributed systems can go down and how to recover. I found that the conversations were useful, but it seemed like some of them were dominated by a few voices. I think it would be better if the person that suggested the topic for the open space also volunteered to moderate the discussion.

Day 2

The second day started out with a fantastic talk by John Arthorne of Shopify speaking on scaling their deployment pipeline.  As a side note, John and I worked together for more than a decade on Eclipse while we both worked at IBM so it was great to catch up with him after the talk. 



He started by giving some key platform characteristics.  Stores on Shopify have flash sales that have traffic spikes so they need to be able to scale for these bursts of traffic. 

From commit to deploy in 10 minutes.  Everyone can deploy. This has two purposes: Make sure the developer stays involved in the deploy process.  If it only takes 10 minutes, they can watch to make sure that their deploy succeeds. If it takes longer, they might move on to another task.  Another advantage of this quick deploy process is that it can delight customers with the speed of deployment.  They also deploy in small batches to ensure that the mean time to recover is small if the change needs to be rolled back.
 
BuildKite is a third party build and test orchestration service.  They wrote a tool called Scrooge that monitors the number of EC2 nodes based on current demand to reduce their AWS bills.  (Similar to what Mozilla releng does with cloud-tools)


Shopify uses an open source orchestration tool called ShipIt. I was sitting next to my colleague Armen at the conference and he started chuckling at this point, because at Mozilla we also wrote an application called Ship It, which release management uses to kick off Firefox releases. Shopify also has an overall view of the ShipIt deployment process, which allows developers to see the percentage of nodes where their change has been deployed. One of the questions after the talk was why they use AWS for their deployment pipeline when they use machines in data centres for their actual customers. Answer: they use AWS where resiliency is not an issue.
 
Building containers is computationally expensive. He noted that a lot of engineering resources went into optimizing the layers in the Docker containers to isolate changes to the smallest layer. They built a service called Locutus to build the containers on commit and push them to a registry. It employs caching to make the builds smaller.

One key point that John also mentioned is that they had a team dedicated to optimizing their deployment pipeline.  It is unreasonable to expect that developers working on the core Shopify platform to also optimize the pipeline.

In the afternoon, there was a series of lightning talks. Roderick Randolph from Capital One gave an amazing talk about Supporting Developers through DevOps.


It was an interesting perspective.  I've seen quite a few talks about bringing devops culture and practices to the operations side of the house, but the perspective of teaching developers about it is discussed less often.



He emphasized the need to empower developers to use DevOps practices by giving them tools and showing them how to use them. For instance, if they need to run Docker to test something, walk them through it so they will know how to do it next time.





The final talk I'll mention is by Will Weaver. He talked about how it is hard to show prospective clients that he has CI and testing experience when that experience is not open to the public. So he implemented tests and CI for his dotfiles on GitHub.


He had excellent advice on how to work on projects outside of work to showcase skills for future employers.




Diversity and Inclusion


As an aside, whenever I'm at a conference I note the number of people in the "not a white guy" group. This conference had an all-men organizing committee, but not all white men. (I recognize the fact that not all diversity is visible, i.e. mental health, gender identity, sexual orientation, immigration status, etc.) There was only one woman speaker, but there were a few non-white speakers. There were very few women attendees. I'm not sure what the process was to reach out to potential speakers other than the CFP.



 There were slides that showed diverse developers which was refreshing.



Loved Roderick's ops vs dev slide.

I learned a lot at the conference and am thankful for all the time that the speakers took to prepare their talks.  I enjoyed all the conversations I had learning about the challenges people face in the organizations implementing continuous integration and deployment. It also made me appreciate the culture of relentless automation, continuous integration and deployment that we have at Mozilla.

I don't know who said this during the conference but I really liked it

Shipping is the heartbeat of your company

It was interesting to learn how all these people are making their companies' heartbeats stronger via DevOps practices and tools.

June 03, 2016 05:39 PM

May 13, 2016

Kim Moir (kmoir)

Welcome Mozilla Releng summer interns

We're delighted to have Francis Kang and Connor Sheehan join the Mozilla release engineering team as summer interns.  Francis is studying at the University of Toronto while Connor attends McMaster University in Hamilton, Ontario.  We'll have another intern (Anthony) join us later on in the summer who will be working from our San Francisco office.

Francis and Connor will be working on implementing some new features in release promotion as well as  migrating some builds to taskcluster.  I'll be mentoring Francis,  while Rail will be mentoring Connor.  If you are in the Toronto office, please drop by to say hi to them.  Or welcome them on irc as fkang or sheehan. 

Kim, Francis, Connor and Rail
They are both already off to a great start and have pull requests merged into production that fixed some release promotion issues.  Their code was used in the Firefox 47.0 beta 5 release promotion that we ran last night so their first week was quite productive.


Mentoring an intern provides an opportunity to see the systems we run from a fresh perspective. They both have lots of great questions, which makes us revisit why design decisions were made and whether we could do things better. As with all teaching roles, I find that I learn a tremendous amount from the experience, and I hope they have fun learning real-world software engineering concepts with respect to running large distributed systems.

Welcome to Mozilla!

May 13, 2016 03:20 PM

April 27, 2016

Rail Alliev (rail)

Firefox 46.0 and SHA512SUMS

In my previous post I introduced the new release process we have been adopting in the 46.0 release cycle.

Release build promotion has been in production since Firefox 46.0 Beta 1. We have discovered some minor issues; some of them are already fixed, while others are still waiting on fixes.

One of the visible bugs is Bug 1260892. We generate a big SHA512SUMS file, which should contain all important checksums. With numerous changes to the process, the file doesn't represent all required files anymore. Some files are missing, some have different names.

We are working on fixing the bug, but in the meantime you can use the following workaround to verify the files.

For example, if you want to verify http://ftp.mozilla.org/pub/firefox/releases/46.0/win64/ach/Firefox%20Setup%2046.0.exe, you need to use the following two files:

http://ftp.mozilla.org/pub/firefox/candidates/46.0-candidates/build5/win64/ach/firefox-46.0.checksums

http://ftp.mozilla.org/pub/firefox/candidates/46.0-candidates/build5/win64/ach/firefox-46.0.checksums.asc

Example commands:

# download all required files
$ wget -q http://ftp.mozilla.org/pub/firefox/releases/46.0/win64/ach/Firefox%20Setup%2046.0.exe
$ wget -q http://ftp.mozilla.org/pub/firefox/candidates/46.0-candidates/build5/win64/ach/firefox-46.0.checksums
$ wget -q http://ftp.mozilla.org/pub/firefox/candidates/46.0-candidates/build5/win64/ach/firefox-46.0.checksums.asc
$ wget -q http://ftp.mozilla.org/pub/firefox/releases/46.0/KEY
# Import Mozilla Releng key into a temporary GPG directory
$ mkdir .tmp-gpg-home && chmod 700 .tmp-gpg-home
$ gpg --homedir .tmp-gpg-home --import KEY
# verify the signature of the checksums file
$ gpg --homedir .tmp-gpg-home --verify firefox-46.0.checksums.asc && echo "OK" || echo "Not OK"
# calculate the SHA512 checksum of the file
$ sha512sum "Firefox Setup 46.0.exe"
c2ed64298ac2140d8dbdaed28cabc90b38dd9444e9c0d6dd335a2a32cf043a35314945536a5c75124a88bf418a4e2ba77256be223425380e7fcc45a97da8f479  Firefox Setup 46.0.exe
# look up the checksum in the checksums file
$ grep c2ed64298ac2140d8dbdaed28cabc90b38dd9444e9c0d6dd335a2a32cf043a35314945536a5c75124a88bf418a4e2ba77256be223425380e7fcc45a97da8f479 firefox-46.0.checksums
c2ed64298ac2140d8dbdaed28cabc90b38dd9444e9c0d6dd335a2a32cf043a35314945536a5c75124a88bf418a4e2ba77256be223425380e7fcc45a97da8f479 sha512 46275456 install/sea/firefox-46.0.ach.win64.installer.exe

This is just a temporary workaround and the bug will be fixed ASAP.

April 27, 2016 04:47 PM

April 23, 2016

Hal Wine (hwine)

Enterprise Software Writers R US

Enterprise Software Writers R US

Someone just accused me of writing Enterprise Software!!!!!

Well, the “someone” is Mahmoud Hashemi from PayPal, and I heard him on the Talk Python To Me podcast (episode 54). That whole episode is quite interesting - go listen to it.

Read more...

April 23, 2016 07:00 AM

April 05, 2016

Rail Alliev (rail)

Release Build Promotion Overview

Hello from Release Engineering! Once a month we highlight one of our projects to help the Mozilla community discover a useful tool or an interesting contribution opportunity. This month's project is Release Build Promotion.

What is Release Build Promotion?

Release build promotion (or "build promotion", or "release promotion" for short) is the latest release pipeline for Firefox being developed by Release Engineering at Mozilla.

Release build promotion starts with the builds produced and tested by CI (e.g. on mozilla-beta or mozilla-release). We take these builds, and use them as the basis to generate all our l10n repacks, partial updates, etc. that are required to release Firefox. We "promote" the CI builds to the release channel.

How is this different?

The previous release pipeline also started with builds produced and tested by CI. However, when it came time to do a release, we would create an entirely new set of builds with slightly different build configuration. These builds would not get the regular CI testing.

Release build promotion improves the process by removing the second set of builds. This drastically improves the total time to do a release, and also increases our confidence in our products since we now are shipping exactly what's been tested. We also improve visibility of the release process; all the tasks that make up the release are now reported to Treeherder along with the corresponding CI builds.

Current status

Release build promotion is in use for Firefox desktop starting with the 46 beta cycle. ESR and release branches have not yet been switched over.

Firefox for Android is also not yet handled. We plan to have this ready for Firefox 47.

Some figures

One of the major motivations for this project was our release end-to-end times. I pulled some data to compare:

  • One of the Firefox 45 betas took almost 12 hours
  • One of the Firefox 46 betas took less than 3 hours

What's next?

  • Support Firefox for Android
  • Support release and ESR branches
  • Extend this process back to the aurora and nightly channels

Can I contribute?

Yes! We still have a lot of things to do and welcome everyone to contribute.

  • Bug 1253369 - Notifications on release promotion events.
  • (No bug yet) Redesign and modernize Ship-it to reflect the new release workflow. This will include a new UI, multiple sign-offs, a new release-runner, etc.
  • Tracking bug

More information

For more information, please refer to these other resources about build promotion:

There will be multiple blog posts regarding this project. You have probably seen Jordan's blog on how to be productive when distributed teams get together. It covers some of the experience we had during the project sprint week in Vancouver.

April 05, 2016 01:34 PM

March 07, 2016

Kim Moir (kmoir)

RelEng & RelOps Weekly highlights - March 4, 2016

It was a busy week with many releases in flight, as well as preparation for running beta 1 with release promotion next week.  We are also in the process of adding more capacity to certain test platform pools to lower wait times, given all the new e10s tests that have been enabled.

Improve Release Pipeline:
Everyone gets a release promotion!  Source: http://i.imgur.com/WMmqSDI.jpg

Improve CI Pipeline:

Release:

The release calendar is getting busier as we get closer to the end of the cycle. Many releases were shipped or are still in flight:
As always, you can find more specific release details in our post-mortem minutes:
https://wiki.mozilla.org/Releases:Release_Post_Mortem:2016-03-02 
https://wiki.mozilla.org/Releases:Release_Post_Mortem:2016-03-09

Operational:

Until next time!

March 07, 2016 03:56 PM

February 29, 2016

Kim Moir (kmoir)

RelEng & RelOps Weekly highlights - February 26, 2016

It was a busy week for release engineering as several team members travelled to the Vancouver office to sprint on the release promotion project. The goal of the release promotion project is to promote continuous integration builds to release channels, allowing us to ship releases much more quickly.



Improve Release Pipeline:


Improve CI Pipeline:

Release:

Operational:

February 29, 2016 08:22 PM

January 21, 2016

Rail Alliev (rail)

Rebooting productivity

Every new year gives you an opportunity to sit back, relax, have some scotch and rethink the past year. Holidays give you enough free time. Even if you decide not to take a vacation around the holidays, it's usually calm and peaceful.

This time, I found myself thinking mostly about productivity, being effective, feeling busy, overwhelmed with work and other related topics.

When I started at Mozilla (almost 6 years ago!), I tried to apply all my GTD and time management knowledge and techniques. Working remotely and in a different time zone was an advantage - I had close to zero interruptions. It worked perfectly.

Last year I realized that my productivity skills had somehow faded away. 40h+ workweeks, working on weekends, and delivering goals in the last week of the quarter don't sound like good signs. Instead of being productive I felt busy.

"Every crisis is an opportunity". Time to make a step back and reboot myself. Burning out at work is not a good idea. :)

Here are some ideas/tips that I wrote down for myself; you may find them useful.

Concentration

  • Task #1: make a daily plan. No plan - no work.
  • Don't start your day by reading emails. Get one (little) thing done first - THEN check your email.
  • Try to define outcomes, not tasks. "Ship XYZ" instead of "Work on XYZ".
  • Meetings are time consuming, so "Set a goal for each meeting". Consider skipping a meeting if you don't have any goal set, unless it's a beer-and-tell meeting! :)
  • Constantly ask yourself if what you're working on is important.
  • 3-4 times a day ask yourself whether you are doing something towards your goal or just finding something else to keep you busy. If you want to look busy, take your phone and walk around the office with some papers in your hand. Everybody will think that you are a busy person! This way you can take a break and look busy at the same time!
  • Take breaks! Pomodoro technique has this option built-in. Taking breaks helps not only to avoid RSI, but also keeps your brain sane and gives you time to ask yourself the questions mentioned above. I use Workrave on my laptop, but you can use a real kitchen timer instead.
  • Wear headphones, especially at office. Noise cancelling ones are even better. White noise, nature sounds, or instrumental music are your friends.

(Home) Office

  • Make sure you enjoy your work environment. Why on earth would you spend your valuable time working without joy?!
  • De-clutter and organize your desk. Fewer things around - fewer distractions.
  • Desk, chair, monitor, keyboard, mouse, etc - don't cheap out on them. Your health is more important and expensive. Thanks to mhoye for this advice!

Other

  • Don't check email every 30 seconds. If there is an emergency, they will call you! :)
  • Reward yourself at a certain time. "I'm going to have a chocolate at 11am", or "MFBT at 4pm sharp!" are good examples. Don't forget, you are Pavlov's dog too!
  • Don't try to read everything NOW. Save it for later and read in a batch.
  • Capture all creative ideas. You can delete them later. ;)
  • Prepare for next task before break. Make sure you know what's next, so you can think about it during the break.

This is my list of things that I try to use every day. Looking forward to seeing improvements!

I would appreciate your thoughts on this topic. Feel free to comment or send a private email.

Happy Productive New Year!

January 21, 2016 02:06 AM

January 08, 2016

Kim Moir (kmoir)

Tips from a resume nerd

Before I begin this post, a few caveats:
I'm kind of a resume and interview nerd.  I like helping friends fix their resumes and write amazing cover letters. In the past year I've helped a few (non-Mozilla) friends fix up their resumes, write cover letters, and prepare for interviews as they search for new jobs.  This post will discuss some things I've found to be helpful in this process.

Picture by GotCredit - Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0)
https://www.flickr.com/photos/jakerust/16223669794/sizes/l

Preparation
Everyone tends to jump into looking at job descriptions and making their resume look pretty. Another scenario is that people have a sudden realization that they need to get out of their current position and find a new job NOW and frantically start applying for anything that matches their qualifications.  Before you do that, take a step back and make a list of things that are important to you.  For example, when I applied at Mozilla, my list was something like this

People spend a lot of time at work. Life is too short to be unhappy every day.  Writing a list of what is important to you serves as a checklist for when you are looking at job descriptions, so you can immediately weed out the ones that don't match your list.

Picture by Mufidah Kassalias - Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0)
https://www.flickr.com/photos/mufidahkassalias/10519774073/sizes/o/
 
People tend to focus a lot on the technical skills they want to use or the new ones they want to learn.  You should also think about the kind of culture you want to work in.  Do the goals and ethics of the organization align with your own? Who will you be working with? Will you enjoy working with this team?  Are you interested in remote work or do you want to work in an office? How will a long commute or a relocation impact your quality of life? What is the typical career progression of someone in this role? Are there both management and technical tracks for advancement?


Picture by mugley - Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0) https://www.flickr.com/photos/mugley/4221455156/sizes/o/


To summarize, itemize the skills you'd like to use or learn, the culture of the company and the team and why you want to work there.

Cover letter

Your cover letter should succinctly map your existing skills to the role you are applying for and convey enthusiasm and interest.  You don't need a long story about how you worked on a project at your current job that has no relevance to your potential new employer.  Teams that are looking to hire have problems to solve.  Your cover letter needs to paint a picture that you have the skills to solve them.

Picture by Jim Bauer - Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0) https://www.flickr.com/photos/lens-cap/10320891856/sizes/l


Refactoring your resume

Developers have a lot of opportunities these days, but if you intend to move from another industry into a tech company, it can be trickier.  The important thing is to convey the skills you have in a way that shows people how they can be applied to the problems they want to hire you to fix.

Many people describe their skills and accomplishments in a way that is too company specific.  They may have a list of acronyms and product names on their resume that are unlikely to be known by people outside the company.  When describing the work you did in a particular role, describe it in a measurable way that highlights the skills you have.  An excellent example of a resume that describes skills without going into company-specific detail is here. (Julie Pagano also has a terrific post about how she approached her new job search.)

Another tip is to leave out general skills that are very common.  For instance, if you are a technical writer, omit the fact that you know how to use Windows and Word and focus on highlighting your skills and accomplishments. 


Non-technical interview preparation

Every job has different technical requirements and there are many books and blog posts on how to prepare for this aspect of the interview process. So I'm going to just cover the non-technical aspects.

When I interview someone, I like to hear lots of questions.  Questions about the work we do and upcoming projects.  This indicates that they have taken the time to research the team, the company and the work that we do.  It also shows enthusiasm and interest.

Here is a list of suggestions to prepare for interviews:

1.  Research the company and make a list of relevant questions
Not every company is open about the work that they do, but most will have some public information that you can use to formulate questions during the interviews.  Do you know anyone who works for the company that you could have coffee or a skype call with to provide insight? What products/services does the company produce? Is the product nearing end of life?  If so, what will it be replaced by? What is the company's market share, and is it declining, stable or experiencing growth? Who are their main competitors? What are some of the challenges they face going forward? How will this team help address these challenges?

2.  Prepare a list of questions for every person that interviews you ahead of time
Many companies will give you the list of names of people who will interview you.
Have they recently given talks? Watch the videos online or read the slides.
Does the team have github or other open repositories?  What recent projects are they working on? Do they have a blog or are they active on twitter? If so, read it and formulate some questions to bring to the interview.
Do they use open bug tracking tools?  If so, look at the bugs that have recent activity and add them to the list of questions for your interview. 
A friend of mine read a book that one of his interviewers had written, and asked questions about the book in the interview.  That's serious interview preparation!

Photo by https://www.flickr.com/photos/wocintechchat/ https://www.flickr.com/photos/wocintechchat/22506109386/sizes/l


3. Team dynamics and tools
Is the team growing or are you hiring to replace somebody who left?
What's the onboarding process like? Will you have a mentor?
How is this group viewed by the rest of the company? You want to be in a role where you can make a valuable contribution.  Joining a team whose role is not valued by the company or not funded adequately is a recipe for disappointment.
What does a typical day look like?  What hours do people usually work?
What tools do people use? Are there prescribed tools or are you free to use what you'd like?

4.  Diversity and Inclusion
If you're a member of an underrepresented group in tech, the numbers in this industry are lousy, with some notable exceptions. And I say that while recognizing that I'm personally in the group that is the lowest common denominator for diversity in tech.

The entire thread on this tweet is excellent: https://twitter.com/radiomorillo/status/589158122108932096


I don't really have good advice for this area other than to do your research to ensure you're not entering a toxic environment.  If you look around the office where you're being interviewed and nobody looks like you, it's time for further investigation.  Look at the company's website - is the management team page white guys all the way down?  Does the company support diverse conferences, scholarships or internships? Ask on a mailing list like devchix if others have experience working at this company and what it's like for underrepresented groups. If you ask in the interview why there aren't more diverse people in the office and they say something like "well, we only hire on merit", this is a giant red flag. If the answer is along the lines of "yes, we realize this and these are the steps we are taking to rectify this situation", this is a more encouraging response.

A final piece of advice: ensure that you meet with the manager you're going to report to as part of your hiring process.  You want to ensure that you have a rapport with them and can envision a productive working relationship.

What advice do you have for people preparing to find a new job?

Further reading

Katherine Daniels gave a really great talk at Beyond the Code 2014 about how to effectively start a new job.  Press start: Beginning a New Adventure Job
She is also the co-author of Effective Devops, which has a fantastic chapter on hiring.
Erica Joy writes amazing articles about the tech industry and diversity.
Cate Huston has some beautiful posts on how to conduct technical interviews and how to be a better interviewer.
Camille Fournier's blog is excellent reading on career progression and engineering management.
Mozilla is hiring!

January 08, 2016 08:37 PM

December 10, 2015

Nick Thomas (nthomas)

Updates for Nightly on Windows

You may have noticed that Windows has had no updates for Nightly for the last week or so. We’ve had a few issues with signing the binaries as part of moving from a SHA-1 certificate to SHA-2. This needs to be done because Windows won’t accept SHA-1 signed binaries from January 1 2016 (this is tracked in bug 1079858).

Updates are now re-enabled, and the update path looks like this

older builds  →  20151209095500  →  latest Nightly

Some people may have been seeing UAC prompts to run the updater, and there could be one more of those when updating to the 20151209095500 build (which is also the last SHA-1 signed build). Updates from that build should not cause any UAC prompts.

December 10, 2015 03:57 PM

December 03, 2015

Hal Wine (hwine)

Tuning Legacy vcs-sync for 2x profit!

Tuning Legacy vcs-sync for 2x profit!

One of the challenges of maintaining a legacy system is deciding how much effort should be invested in improvements. Since modern vcs-sync is “right around the corner”, I have been avoiding looking at improvements to legacy (which is still the production version for all build farm use cases).

While adding another gaia branch, I noticed that the conversion path for active branches was both highly variable and frustratingly long. It usually took 40 minutes for a commit to an active branch to trigger a build farm build. And worse, that time could easily be 60 minutes if the stars didn’t align properly. (Actually, that’s the conversion time for git -> hg. There’s an additional 5-7 minutes, worst case, for b2g_bumper to generate the trigger.)

The full details are in bug 1226805, but a simple rearrangement of the jobs removed the 50% variability in the times and cut the average time by 50% as well. That’s a savings of 20-40 minutes per gaia push!

Moral: don’t take your eye off the legacy systems – there still can be some gold waiting to be found!

December 03, 2015 08:00 AM

December 01, 2015

Chris AtLee (catlee)

MozLando Survival Guide

MozLando is coming!

I thought I would share a few tips I've learned over the years of how to make the most of these company gatherings. These summits or workweeks are always full of awesomeness, but they can also be confusing and overwhelming.

#1 Seek out people

It's great to have a (short!) list of people you'd like to see in person. Maybe somebody you've only met on IRC / vidyo or bugzilla?

Having a list of people you want to say "thank you" in person to is a great way to approach this. Who doesn't like to hear a sincere "thank you" from someone they work with?

#2 Take advantage of increased bandwidth

I don't know about you, but I can find it pretty challenging at times to get my ideas across in IRC or on an etherpad. It's so much easier in person, with a pad of paper or whiteboard in front of you. You can share ideas with people, and have a latency/lag-free conversation! No more fighting AV issues!

#3 Don't burn yourself out

A week of full days of meetings, code sprints, and blue sky dreaming can be really draining. Don't feel bad if you need to take a breather. Go for a walk or a jog. Take a nap. Read a book. You'll come back refreshed, and ready to engage again.

That's it!

I look forward to seeing you all next week!

December 01, 2015 09:31 PM


November 27, 2015

Chris AtLee (catlee)

Firefox builds on the Taskcluster Index

RIP FTP?

You may have heard rumblings that FTP is going away...


Over the past few quarters we've been working to migrate our infrastructure off of the ageing "FTP" [1] system to Amazon S3.

We've maintained some backwards compatibility for the time being [2], so that current Firefox CI and release builds are still available via ftp.mozilla.org, or preferably, archive.mozilla.org since we don't support the ftp protocol any more!

Our long term plan is to make the builds available via the Taskcluster Index, and stop uploading builds to archive.mozilla.org.

How do I find my builds???


This is pretty big change, but we really think this will make it easier to find the builds you're looking for.

The Taskcluster Index allows us to attach multiple "routes" to a build job. Think of a route as a kind of hierarchical tag, or directory. Unlike regular directories, a build can be tagged with multiple routes, for example, according to the revision or buildid used.

A great tool for exploring the Taskcluster Index is the Indexed Artifact Browser

Here are some recent examples of nightly Firefox builds:

The latest win64 nightly Firefox build is available via the
gecko.v2.mozilla-central.nightly.latest.firefox.win64-opt route

This same build (as of this writing) is also available via its revision:

gecko.v2.mozilla-central.nightly.revision.47b49b0d32360fab04b11ff9120970979c426911.firefox.win64-opt

Or the date:

gecko.v2.mozilla-central.nightly.2015.11.27.latest.firefox.win64-opt

The artifact browser is simply an interface on top of the index API. Using this API, you can also fetch files directly using wget, curl, python requests, etc.:

https://index.taskcluster.net/v1/task/gecko.v2.mozilla-central.nightly.latest.firefox.win64-opt/artifacts/public/build/firefox-45.0a1.en-US.win64.installer.exe [3]
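As a rough sketch, fetching that same artifact non-interactively might look like the commands below (the exact filename and version number change from nightly to nightly, so treat the paths here as illustrative rather than stable):

# follow the redirects from the index route all the way to the artifact
$ curl -L -o firefox-nightly.win64.installer.exe \
    "https://index.taskcluster.net/v1/task/gecko.v2.mozilla-central.nightly.latest.firefox.win64-opt/artifacts/public/build/firefox-45.0a1.en-US.win64.installer.exe"
# the index route itself returns JSON describing the indexed task (including its taskId)
$ curl -s "https://index.taskcluster.net/v1/task/gecko.v2.mozilla-central.nightly.latest.firefox.win64-opt"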

Similar routes exist for other platforms, for B2G and mobile, and for opt/debug variations. I encourage you to explore the gecko.v2 namespace, and see if it makes things easier for you to find what you're looking for! [4]

Can't find what you want in the index? Please let us know!

[1]A historical name referring back to the time when we used the FTP protocol to serve these files. Today, the files are available only via HTTP(S).
[2]In fact, all Firefox builds right now are uploaded to S3; we've just had to implement some compatibility layers to make S3 appear in many ways like the old FTP service.
[3]Yes, you need to know the version number...for now. We're considering stripping that from the filenames; if you have thoughts on this, please get in touch!
[4]Ignore the warning on the right about "Task not found" - that just means there are no tasks with that exact route; kind of like an empty directory.

November 27, 2015 09:21 PM

November 24, 2015

Kim Moir (kmoir)

USENIX Release Engineering Summit 2015 recap

November 13th, I attended the USENIX Release Engineering Summit in Washington, DC.  This summit was alongside the larger LISA conference at the same venue. Thanks to Dinah McNutt, Gareth Bowles, Chris Cooper, Dan Tehranian and John O'Duinn for organizing.



I gave two talks at the summit.  One was a long talk on how we have scaled our Android testing infrastructure on AWS, as well as a look back at how it evolved over the years.

Picture by Tim Norris - Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0)
https://www.flickr.com/photos/tim_norris/2600844073/sizes/o/

Scaling mobile testing on AWS: Emulators all the way down from Kim Moir

I gave a second lightning talk in the afternoon on the problems we face with our large distributed continuous integration, build and release pipeline, and how we are working to address the issues. The theme of this talk was that managing a large distributed system is like being the caretaker for the water system, or some days, the sewer system for a city.  We are constantly looking for system leaks and implementing monitoring, and will probably have to replace the system with something new while keeping the existing one running.

Picture by Korona Lacasse - Creative Commons 2.0 Attribution 2.0 Generic https://www.flickr.com/photos/korona4reel/14107877324/sizes/l


Distributed Systems at Scale: Reducing the Fail from Kim Moir

In preparation for this talk, I did a lot of reading on complex systems design and designing for recovery from failure in distributed systems.  In particular, I read Donella Meadows' book Thinking in Systems. (Cate Huston reviewed the book here). I also watched several talks by people who spoke about the challenges they face managing their distributed systems, including the following:
I'd also like to thank all the members of Mozilla releng/ateam who reviewed my slides and provided feedback before I gave the presentations.
The attendees of the summit attended the same keynote as the LISA attendees.  Jez Humble, well known for his Continuous Delivery and Lean Enterprise books, provided a keynote on Lean Configuration Management which I really enjoyed. (Older versions of the slides, from another conference, are available here and here.)



In particular, I enjoyed his discussion of the cultural aspects of devops. I especially like that he stated that "You should not have to have planned downtime or people working outside business hours to release".  He also talked a bit about how many of the leaders that are looked up to as visionaries in the tech industry are known for not treating people very well and this is not a good example to set for others who believe this to be the key to their success.  For instance, he said something like "what more could Steve Jobs have accomplished had he treated his employees less harshly".

Another concept he discussed which I found interesting was that of the strangler application. When moving away from a large monolithic application, the goal is to split out the existing functionality into services until the original application is left with nothing.  This is exactly what Mozilla releng is doing as we migrate from Buildbot to taskcluster.


http://www.slideshare.net/jezhumble/architecting-for-continuous-delivery-54192503


At the release engineering summit itself, Lukas Blakk from Pinterest gave a fantastic talk Stop Releasing off Your Laptop—Implementing a Mobile App Release Management Process from Scratch in a Startup or Small Company.  It included a grumpy cat picture to depict how Lukas thought the rest of the company felt when a more structured release process was implemented.


Lukas also included a timeline of the tasks she implemented in her first six months working at Pinterest. Very impressive to see the transition!


Another talk I enjoyed was Chaos Patterns - Architecting for Failure in Distributed Systems by Jos Boumans of Krux. (Similar slides from an earlier conference here). He talked about some high profile distributed systems that failed and how chaos engineering can help illuminate these issues before they hit you in production.


For instance, it is impossible for Netflix to model their entire system outside of production given that they consume around one third of nightly downstream bandwidth consumption in the US. 

Evan Willey and Dave Liebreich from Pivotal Cloud Foundry gave a talk entitled "Pivotal Cloud Foundry Release Engineering: Moving Integration Upstream Where It Belongs". I found this talk interesting because they talked about how they built Concourse, a CI system that is more scalable and natively builds pipelines.  Travis and Jenkins are good for small projects, but they simply don't scale for large numbers of commits, platforms to test, or complicated pipelines. We followed a similar path that led us to develop Taskcluster.

There were many more great talks, hopefully more slides will be up soon!

November 24, 2015 03:57 PM

November 16, 2015

Nick Thomas (nthomas)

The latest on firefox/releases/latest

The primary way to download Firefox is at www.mozilla.org, but Mozilla’s Release Engineering team has also maintained directories like

https://ftp.mozilla.org/pub/firefox/releases/latest/

to provide a stable location for scripted downloads. There are similar links for betas and extended support releases for organisations. Read on to learn how these directories have changed, and how you can continue to download the latest releases.

Until recently these directories were implemented using a symlink to the current version, for example firefox/releases/42.0/. The storage backend has now changed to Amazon S3 and this is no longer possible. To implement the same functionality we’d need a duplicate set of keys, which incurs more maintenance overhead. And we already have a mechanism for delivering files independent of the current shipped version – our download redirector Bouncer. For example, here’s the latest release for Windows 32bit, U.S. English:

https://download.mozilla.org/?product=firefox-latest&os=win&lang=en-US

Modifying the product, os, and/or lang parameters allows other combinations. This is described in the README.txt files for beta, release, and esr, as well as the Thunderbird equivalents release and beta.

Please adapt your scripts to use download.mozilla.org links. We hope it will help you simplify at the same time, as scraping to determine the current version is no longer necessary.
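For example, a minimal scripted download might look something like this (the os and lang values below are illustrative; see the README.txt files above for the combinations that are actually supported):

# latest release for Windows 32-bit, U.S. English
$ wget -O "Firefox-latest.en-US.win.exe" "https://download.mozilla.org/?product=firefox-latest&os=win&lang=en-US"
# latest release for 64-bit Linux, German
$ wget -O "firefox-latest.de.linux64.tar.bz2" "https://download.mozilla.org/?product=firefox-latest&os=linux64&lang=de"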

PS. We’ve also removed some latest- directories which were old and crufty, eg firefox/releases/latest-3.6.

November 16, 2015 11:37 PM

November 13, 2015

Hal Wine (hwine)

Complexity & * Practices

Complexity & * Practices

I was fortunate enough to be able to attend Dev Ops Days Silicon Valley this year. One of the main talks was given by Jason Hand, and he made some great points. I wanted to highlight two of them in this post:

  1. Post Mortems are really learning events, so you should hold them when things go right, right? RIGHT!! (Seriously, why wouldn’t you want to spot your best ideas and repeat them?)
  2. Systems are hard – if you’re pushing the envelope, you’re teetering on the line between complexity and chaos. And we’re all pushing the envelope these days - either by getting fancy or getting lean.

Post Mortems as Learning Events

Our industry has talked a lot about “Blameless Post Mortems”, and techniques for holding them. Well, we can call them “blameless” all we want, but if we only hold them when things go wrong, folks will get the message loud and clear.

If they are truly blameless learning events, then you would also hold them when things go right. And go meh. Radical idea? Not really - why else would sports teams study game films when they win? (This point was also made in a great Ignite by Katie Rose: GridIronOps - go read her slides.)

My $0.02 is - this would also give us a chance to celebrate success. That is something we do not do enough, and we all know the dedication and hard work it takes to not have things go sideways.

And, by the way, terminology matters during the learning event. The person who is accountable for an operation is just that: capable of giving an account of the operation. Accountability is not responsibility.

Terminology and Systems – Setting the right expectations

Partway through Jason’s talk, he showed this awesome slide about how system complexity relates to monitoring, which relates to problem resolution. Go look at slide 19 - here’s some of what I find amazing in that slide:

  • It is not a straight line with a destination. Your most stable system can suddenly display inexplicable behavior due to any number of environmental reasons. And you’re back in the chaotic world with all that implies.
  • Systems can progress out of chaos, but that is an uphill battle. Knowing which stage a system is in (roughly) informs the approach to problem resolution.
  • Note the wording choices: “known” vs “unknowable” – for all but the “obvious” case, it will be confusing. That is a property of the system, not a matter of staff competency.

While not in his slide, Jason spoke to how each level really has different expectations. Or should have, but often the appropriate expectation is not set. Here’s how he related each level to industry terms.

Best Practices:

The only level with enough certainty to be able to expect the “best” is the known and familiar one. This is the “obvious” one, because we’ve all done exactly this before over a long enough time period to fully characterize the system, its boundaries, and abnormal behavior.

Here, cause and effect are tightly linked. Automation (in real time) is possible.

Good Practices:

Once we back away from such certainty, it is only realistic to have less certainty in our responses. With the increased uncertainty, the linkage of cause and effect is more tenuous.

Even if we have all the event history and logs in front of us, more analysis is needed before appropriate corrective action can be determined. Even with automation, there is a latency to the response.

Emergent Practices:

Okay, now we are pushing the envelope. The system is complex, and we are still learning. We may not have all the data at hand, and may need to poke the system to see what parts are stuck.

Cause and effect should be related, but how will not be visible until afterwards. There is much to learn.

Novel Practices:

For chaotic systems, everything is new. A lot is truly unknowable because that situation has never occurred before. Many parts of the system are effectively black boxes. Thus resolution will often be a process of trying something, waiting to see the results, and responding to the new conditions.

Next Steps

There is so much more in that diagram I want to explore. The connecting of problem resolution behavior to complexity level feels very powerful.

<hand_waving caffeine_level=”deprived”>

My experience tells me that many of these subjective terms are highly context sensitive, and in no way absolute. Problem resolution at 0300 local with a bad case of the flu just has a way of making “obvious” systems appear quite complex or even chaotic.

By observing the behavior of someone trying to resolve a problem, you may be able to get a sense of how that person views that system at that time. If that isn’t the consensus view, then there is a gap. And gaps can be bridged with training or documentation or experience.

</hand_waving>

November 13, 2015 08:00 AM

October 23, 2015

Nick Thomas (nthomas)

Updates disabled for Android Nightly and Aurora

Due to a bug with the new ftp server we’ve had to disable updates for Nightly and Aurora on Android.

They’ll resume just as soon as we can get the fix landed.

Update (Oct 25th): Updates are re-enabled, thanks to Mike Shal for the fix.

October 23, 2015 08:39 AM

October 21, 2015

Nick Thomas (nthomas)

Try Server – please use up-to-date code to avoid upload failures

Today we started serving an important set of directories on ftp.mozilla.org using Amazon S3, more details on that over in the newsgroups. Some configuration changes landed in the tree to make that happen.

Please rebase your try pushes to use revision 0ee21e8d5ca6 or later, currently on mozilla-inbound. Otherwise your builds will fail to upload, which means they won’t run any tests. No fun for anyone.
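If you're unsure how to do that, a minimal sketch looks something like this (it assumes the rebase extension is enabled and that you have hg paths configured for mozilla-inbound and try):

# pull the changeset containing the upload configuration fix
$ hg pull https://hg.mozilla.org/integration/mozilla-inbound/
# rebase your local work on top of it (or any later revision)
$ hg rebase -d 0ee21e8d5ca6
# then push to try as usual
$ hg push -f try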

October 21, 2015 10:02 AM

October 02, 2015

Hal Wine (hwine)

duo MFA & viscosity no-cell setup

duo MFA & viscosity no-cell setup

The Duo application is nice if you have a supported mobile device, and via TOTP it’s usable even when you have no cell connection. However, getting Viscosity to allow both choices took some work for me.

For various reasons, I don’t want to always use the Duo application, so I would like Viscosity to always prompt for a password. (I had already saved a password - a fresh install likely would not have that issue.) That took a bit of work, and some web searches.

  1. Disable any saved passwords for Viscosity. On a Mac, this means opening up “Keychain Access” application, searching for “Viscosity” and deleting any associated entries.

  2. Ask Viscosity to save the “user name” field (optional). I really don’t need this, as my setup uses a certificate to identify me. So it doesn’t matter what I type in the field. But, I like hints, so I told Viscosity to save just the user name field:

    defaults write com.viscosityvpn.Viscosity RememberUsername -bool true

With the above, you’ll be prompted every time. You have to put “something” in the user name field, so I chose to put “push or TOTP” to remind me of the valid values. You can put anything there, just do not check the “Remember details in my Keychain” toggle.

October 02, 2015 07:00 AM

September 26, 2015

Kim Moir (kmoir)

The mystery of high pending counts

In September, Mozilla release engineering started experiencing high pending counts on our test pools, notably Windows, but also Linux (and consequently Android).  High pending counts mean that there are thousands of jobs queued to run on the machines that are busy running other jobs.  The time developers have to wait for their test results is longer than ideal.


Usually, pending counts clear overnight as less code is pushed during the night (in North America), which invokes fewer builds and tests.  However, as you can see from the graph above, the Windows test pending counts were flat last night; they did not clear up overnight. You will also note that try, which usually comprises 63% of our load, has the highest pending counts compared to other branches.  This is because many people land on try before pushing to other branches, and tests aren't coalesced on try.


The work to determine the cause of high pending counts is always an interesting mystery.
Mystery by ©Stuart Richards, Creative Commons by-nc-sa 2.0

Joel Maher and I looked at the data for this last week and discovered what we believe to be the source of the problem.  We have determined that since the end of August a number of new test jobs were enabled that increased the compute time per push on Windows by 13%, or 2.5 hours per push.  Most of these new test jobs are for e10s.
Increase in seconds that new jobs added to the total compute time per push.  (Some existing jobs also reduced their compute time, for a total difference of about 2.5 more hours per push on Windows)
The e10s initiative is an important initiative for Mozilla to make Firefox performance and security even better.  However, since new e10s and old tests will continue to run in parallel, we need to get creative on how to have acceptable wait times given the limitations of our current Windows test pools.  (All of our Windows tests run on bare metal in our datacentre, not on Amazon.)
 
Release engineering is working to reduce these pending counts given our current hardware constraints, with the following initiatives: 

To reduce Linux pending counts:
  • Added 200 new instances to the tst-emulator64 pool (run Android test jobs on Linux emulators) (bug 1204756)
  • In process of adding more Linux32 and Linux64 buildbot masters (bug 1205409) which will allow us to expand our capacity more

Ongoing work to reduce the Windows pending counts:


How can you help? 

Please be considerate when invoking try pushes and only select the platforms that you explicitly require to test.  Each try push for all platforms and all tests invokes over 800 jobs.
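For example, a narrowly scoped try push might use syntax like the following (the platform and suite names are just an illustration; use the trychooser tool to build the exact string you need):

try: -b o -p win64 -u mochitest-1 -t none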

September 26, 2015 12:34 AM

September 22, 2015

Hal Wine (hwine)

Using Password Store

Using Password Store

Password Store (aka “pass”) is a very handy wrapper for dealing with pgp encrypted secrets. It greatly simplifies securely working with multiple secrets. This is still true even if you happen to keep your encrypted secrets in non-password-store managed repositories, although that setup isn’t covered in the docs. I’ll show my setup here. (See the Password Store page for usage: “pass show -c <spam>” & “pass search <eggs>” are among my favorites.)

Short version:
  1. Have gpg installed on your machine.

  2. Install Password Store on your machine. There are OS specific instructions. Be sure to enable tab completion for your shell!

  3. Setup a local password store. Scroll down in the usage section to “Setting it up” for instructions.

  4. Clone your secrets repositories to your normal location. Do not clone inside of ~/.password-store/.

  5. Set up symlinks inside of ~/.password-store/ to directories inside your clone of the secrets repository. I did:

    ln -s ~/path/to/secrets-git/passwords rePasswords
    ln -s ~/path/to/secrets-git/keys reKeys
  6. Enjoy command line search and retrieval of all your secrets. (Use the regular method for your separate secrets repository to add and update secrets.)
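With the symlinks in place, day-to-day usage might look like this (the entry names below are made up for illustration):

    # search across personal and work secrets alike
    pass search vpn
    # copy a specific secret to the clipboard
    pass show -c rePasswords/example.com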

Rationale:

  • By using symlinks, pass will not allow me to create or update secrets in the other repositories. That prevents mistakes, as the process is different for each of those alternate stores.
  • I prefer to have just one tree of secrets to search, rather than the “multiple configuration” approach documented on the Password Store site.
  • By using symlinks, I can control the global namespace, and use names that make sense to me.
  • I’ve migrated from using KeePassX to using pass for my personal secret management. That is my “main” password-store setup (backed by a git repo).

Notes:

  • If you’d prefer a GUI, there’s qtpass which also works with the above setup.

September 22, 2015 07:00 AM

August 03, 2015

Rail Alliev (rail)

Funsize enabled for Nightly builds

Keep calm and update Firefox

Note: this post has been sitting in the drafts queue for some reason. Better to publish it. :)

As of Tuesday, Aug 4, 2015 Funsize has been enabled on mozilla-central. From now on all Nightly builds will get updates for builds up to 4 days in the past (for yesterday, the day before yesterday, etc). This should make people who don't run their nightlies every day happier.

Firefox Developer Edition partial updates will be enabled after 42.0 hits mozilla-aurora, but early adopters can use the aurora-funsize channel.

Partial updates as a part of builds and L10N repacks on nightly will be disabled as soon as Bug 1173459 is resolved.

As a bonus, you can take a look at the presentation I gave in Whistler during the work week.

Reporting Issues

If you see any issues, please report them to Bugzilla.

August 03, 2015 09:22 PM

July 30, 2015

Hal Wine (hwine)

Decoding Hashed known_hosts Files

Decoding Hashed known_hosts Files

tl;dr: You might find this gist handy if you enable HashKnownHosts

Modern ssh comes with the option to obfuscate the hosts it can connect to, by enabling the HashKnownHosts option. Modern server installs have that as a default. This is a good thing.

The obfuscation occurs by hashing the first field of the known_hosts file - this field contains the hostname, port and IP address used to connect to a host. Presumably, there is a private ssh key on the host used to make the connection, so this process makes it harder for an attacker to utilize those private keys if the server is ever compromised.

Super! Nifty! Now how do I audit those files? Some services have multiple IP addresses that serve a host, so some updates and changes are legitimate. But which ones? It’s a one-way hash, so you can’t decode it.

Well, if you had an unhashed copy of the file, you could match host keys and determine the host name & IP. [1] You might just have such a file on your laptop (at least I don’t hash keys locally). [2] (Or build a special file by connecting to the hosts you expect with the options “-o HashKnownHosts=no -o UserKnownHostsFile=/path/to/new_master”.)
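If you just want to check whether particular hosts are present in a hashed file, ssh-keygen can also do the lookup directly; a minimal sketch (the host name and paths here are only examples):

# build an unhashed reference file for the hosts you expect
$ ssh -o HashKnownHosts=no -o UserKnownHostsFile=/path/to/new_master user@host1.example.com true
# query the hashed known_hosts for a specific host name
$ ssh-keygen -F host1.example.com -f ~/.ssh/known_hosts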

I threw together a quick python script to do the matching, and it’s at this gist. I hope it’s useful - as I find bugs, I’ll keep it updated.

Bonus Tip: https://github.com/defunkt/gist is a very nice way to manage gists from the command line.

Footnotes

[1]A lie - you’ll only get the host name and IP’s that you have connected to while building your reference known_hosts file.
[2]I use other measures to keep my local private keys unusable.

July 30, 2015 07:00 AM