May 20, 2020
May 15, 2020
In the last few months I’ve worked with contributors who wanted to be selected to work on Treeherder during this year’s Google Summer of Code. The initial proposal was to improve various Treeherder developer ergonomics (read: make Treeherder development easier). I’ve had three very active contributors that have helped to make a big difference (in alphabetical order): Shubham, Shubhank and Suyash.
In this post I would like to thank them publicly for all the work they have accomplished as well as list some of what has been accomplished. There’s also listed some work from Kyle who tackled the initial work of allowing normal Python development outside of Docker (more about this later).
After all, I won’t be participating in GSoC due to burn-out and because this project is mostly completed (thanks to our contributors!). Nevertheless, two of the contributors managed to get selected to help with Treeherder (Suyash) and Firefox Accounts (Shubham) for GSoC. Congratulations!
Some of the developer ergonomics improvements that landed this year are:
- Support running Treeherder & tests outside of Docker. Thanks to Kyle we can now set up a Python virtualenv outside of Docker and interact with all dependent services (mysql, redis and rabbitmq). This is incredibly useful to run tests and the backend code outside of Docker and to help your IDE install all Python packages in order to better analyze and integrate with your code (e.g., add breakpoints from your IDE). See PR here.
- Support manual ingestion of data. Before, you could only ingest data when you would set up the Pulse ingestion. This mean that you could only ingest real-time data (and all of it!) and you could not ingest data from the past. Now, you can ingest pushes, tasks and even Github PRs. See documentation.
- Add pre-commit hooks to catch linting issues. Prior to this, linting issues would require you to remember to run a script with all the linters or Travis to let you know. You can now get the linters to execute automatically on modified files (instead of all files in the repo), shortening the linting-feedback cycle. See hooks in pre-commit file
- Use Poetry to generate the docs. Serving locally the Treeherder docs is now as simple as running “poetry install && poetry run mkdocs serve.” No more spinning up Docker containers or creating and activating virtualenvs. We also get to introduce Poetry as a modern dependency and virtualenv manager. See code in pyproject.toml file
- Automatic syntax formatting. The black pre-commit hook now formats files that the developer touches. No need to fix the syntax after Travis fails with linting issues.
- Ability to run the same tests as Travis locally. In order to reduce differences between what Travis tests remotely and what we test locally, we introduced tox. The Travis code simplifies, the tox code can even automate starting the Docker containers and it removed a bash script that was trying to do what tox does (Windows users cannot execute bash scripts).
- Share Pulse credentials with random queue names. In the past we required users to set up an account with Pulse Guardian and generate their own PULSE_URL in order to ingest data. Last year, Dustin gave me the idea that we can share Pulse credentials; however, each consumer must ingest from dynamically generated queue names. This was initially added to support Heroku Review Apps, however, this works as well for local consumers. This means that a developer ingesting data would not be taking away Pulse messages from the queue of another developer.
- Automatically delete Pulse queues. Since we started using shared credentials with random queue names, every time a developer started ingesting data locally it would leave some queues behind in Pulse. When the local consumers stopped, these queues would overgrow and send my team and I alerts about it. With this change, the queues would automatically be destroyed when the consumers ceased to consume.
- Docker set up to automatically ingest data. This is useful since ingesting data locally required various steps in order to make it work. Now, the Docker set up ingests data without manual intervention.
- Use pip-compile to generate requirement files with hashes. Before, when we needed to update or add a Python package, we also had to add the hashes manually. With pip-compile, we can generate the requirement files with all hashes and subdepencies automatically. You can see the documentation here.
There’s many more changes that got fixed by our contributors, however, I won’t cover all of them. You can see the complete list in here.
Thank you for reading this far and thanks again to our contributors for making development on Treeherder easier!
January 31, 2020
For over a decade Mozilla has been using IRC to publicly chat with anyone interested to join the community. Recently, we’ve launched a replacement for it by creating a Mozilla community Matrix instance. I will be focusing on simply documenting what the process looks like to join in as a community member (without an LDAP account/Mozilla email address). For the background of the process you can read it here. Follow along the photos and what each caption says.
If you have managed to get this far, Welcome to Mozilla’s Matrix! 😄
NOTE: If there’s an official page documenting the process I’m not aware of it. I will add it once it is published.
January 30, 2020
Web performance issue — reoccurrence
In June we discovered that Treeherder’s UI slowdowns were due to database slow downs (For full details you can read this post). After a couple of months of investigations, we did various changes to the RDS set up. The changes that made the most significant impact were doubling the DB size to double our IOPS cap and adding Heroku auto-scaling for web nodes. Alternatively, we could have used Provisioned IOPS instead of General SSD storage to double the IOPS but the cost was over $1,000/month more.
Looking back, we made the mistake of not involving AWS from the beginning (I didn’t know we could have used their help). The AWS support team would have looked at the database and would have likely recommended the parameter changes required for a write intensive workload (the changes they recommended during our November outage — see bug 1597136 for details). For the next four months we did not have any issues, however, their help would have saved a lot of time and it would have prevented the major outage we had in November.
There were some good things that came out of these two episodes: the team has learned how to better handle DB issues, there’s improvements we can do to prevent future incidents (see bug 1599095), we created an escalation path and we worked closely as a team to go through the crisis (thanks bobm, camd, dividehex, ekyle, fubar, habib, kthiessen & sclements for your help!).
January 17, 2020
This post focuses on the work I accomplished as part of the Treeherder team during the last half of last year.
Some work I already blogged about:
Deprecated the taskcluster-treeherder service
Create performance dashboard for AWSY
I September I created a performance dashboard for the Are We Slim Yet project (see work in bug 1550235). This was not complicated because all I had to do was to branch off the existing Firefox performance dashboard I wrote last year and deploy it on a different Netlify app.
One thing I did not do at the time is to create different configs between the two branches. This would make it easier to merge changes from the master branch to the awsy branch without conflicts.
On the bright side, Divyanshu came along and fixed the issue! We can now use an env variable to start AWFY & another for AWSY. No need for two different branches!
Other work I did was to create a compare_pushes.py script that allows you to compare two different instances of Treeherder to determine if the ingestion of jobs for a push is different.
I added a management command to ingest all Taskcluster tasks from a push, an individual task/push or all tasks associated to a Github PR. Up until this point, the only way to ingest this data would be by having the data ingestion pipeline set up locally before those tasks started showing up in the Pulse exchanges. This script is extremely useful for local development since you can test ingesting data with precise control and having
I maintain a regular schedule (twice a week) to release Treeherder production deployments. This guarantees that Treeherder would not have too many changes being deployed to production all at once (I remember the odd day when we had gone 3 weeks without any code being promoted).
I created a process to share secrets safely within the team instead of using a Google Doc.
December 23, 2019
Reducing Treeherder’s time-to-deploy
Up until September we had been using code merges from the master branch to the production one to cause production deployments.
A merge to production would trigger few automatic steps:
- The code would get tested in the Travis CI (10 minutes or more)
- Upon success the code would be built by Heroku (few minutes)
- Upon success a Heroku release would happen (less than a minute)
If a regression was to be found on production we would either `git revert` a change out of all merged changes OR use Heroku’s rollback feature to the last known working state (without using Git).
Using `git revert` to get us back into a good state would be very slow since it would take 15–20 minutes to run through Travis, a Heroku build and a Heroku release.
On the other hand, Heroku’s rollback feature would be an immediate step as it would skip steps 1 and 2. Rolling back is possible because a previous build of a commit would still be available and only the release step would be needed .
The procedural change I proposed was to use Heroku’s promotion feature (similar to Heroku’s rollback feature). This would reuse a build from the staging app with the production app. The promotion process is a one-click button event that only executes the release step since steps 1 & 2 had already run on the staging app. Promotions would take less than a minute to be live.
This change made day to day deployments a less involved process since all deployments would take less that a minute. I’ve been quite satisfied with the change since a deployment requires less waiting around to validate a deployment.
December 11, 2019
In bug 1566207 I added support for Heroku Review Apps (link to official docs). This feature allows creating a full Treeherder deployment (backend, frontend and data ingestion pipeline) for a pull request. This gives Treeherder engineers the ability to have their own deployment without having to compete over the Treeherder prototype app (a shared deployment). This is important as the number of engineers and contributors increases.
Once created you get a complete Heroku environment with add-ons and workers configured and the deployment for it.
Looking back, there are few new features that came out of the work, however, Heroku Review apps are not used as widely as I would have hoped for.
One of the benefits that came out of this project is that incidentally we solved a long standing Pulse consumption feature request (Thanks Dustin for the idea!) In order to consume data from Pulse, we need to provide a username and a password. There’s no guest or unauthenticated method for Pulse messages consumption. If credentials were to be shared we would have multiple consumers for the same queues with each consumer competing for the same set of messages. The solution to this problem is by creating queues dynamically with a couple of environment variables (see code). This means that each Heroku Review App will use the same set of credentials, however, consumer from different queues, thus, not competing for messages. This solution also solves the same problem for local development. The local development set up will be able to share credentials across Treeherder developers (each having their own queues).
Another good thing that came out of the project is that we can reduce the number of tasks the pipeline consumes. This is important for the Heroku Review App (as well as the local development set up) because we don’t need to set up too many Heroku workers to process all the data that comes out of Taskcluster (Firefox’s CI). In Heroku, we have close to 20 workers to handle the load. In order to keep the cost down for a Heroku Review App (workers + add-ons) I decided to limit the ingestion to autoland & android-components. This is accomplished with the PROJECTS_TO_INGEST environment variable which can be changed after the app is created. The day is nearer when the local ingestion pipeline could be started automatically without bringing your laptop to its knees.
One last advantage is that we can test different versions of MySql by simply changing one line for the JawsDB Heroku add-on. This is important because it will remove coordinating Treeherder RDS/Terraform changes with dividehex before we’re fully ready. We can also modify other add-ons, however, changing the MySql version is the most significant we can tweak.
Unfortunately, the Heroku Review Apps are not used as much as I would have wished for. Treeherder devs tend to borrow the Heroku treeherder-prototype app instead of creating a Heroku Review App. I know it will be useful in the future since I have experienced at least once where three of us wanted to use the shared Heroku app.
On a separate note, the configuration of the add-ons is not as advertised. You don’t have full control of what plans your add-ons get configured with. For instance, I could specify the Tiger plan for the CloudAMQP add-on yet get the Lemur plan instead. I found out that permitting certain add-ons needs to be requested via a Heroku support ticket. This is because I’m not a Mozilla Heroku org manager but one of the developers using the account. It seems that there are some default plan for each add-on configured at the org level. Fortunately, the Heroku folks were very quick to fix this for me.
Overall, I’m quite satisifed with how simple it is to set up Heroku Review Apps for your Heroku pipeline. Good job Heroku team!
October 24, 2019
See the video there, if it doesn’t display up here.
It’s a fairly common practice to build and test every time someone makes a change in the code. In the industry, we call this process “Continuous Integration” (CI). Another good practice is to automate the deployment of your builds to end-users. This process is called “Continuous Deployment” (CD). There are several CI/CD products on the market. Mozilla has used some of them for years and still uses them in some projects. About 6 years ago, Mozilla needed a CI/CD product that did more than what was available and started Taskcluster. If you’re interested in knowing more about why we made Taskcluster, please let me know in the comments and I’ll write a dedicated post.
Since then, we’ve entirely migrated many Android projects like Firefox, Firefox Focus, Firefox for Amazon’s Fire TV, Android-Components, and the upcoming Firefox Preview to Taskcluster. The rationale for migrating off another CI/CD was usually the following: we want to ensure what we’re shipping to end-users comes from Mozilla, while still providing the development team an easy and consistent way to configure their tasks.
Taskcluster and task definitions
A task is an arbitrary piece of code that is executed by Taskcluster. We tell Taskcluster to execute a task by submitting some configuration, that we call the “task definition”. A task definition contains data about version control repository it deals with or what commands to run. As your project grows, it will likely need more than a single task to perform all actions. In this case, you need a graph of task definitions. You can submit a graph in different ways: either by manually entering each task on the task creator, by defining them all in a .taskcluster.yml file in your repository by defining a single task in .taskcluster.yml and letting this task generate the rest of the graph. We call this single task a “decision task”. It can be implemented: either by using one of the Taskcluster libraries (the Python one for instance), or by using a higher-level framework: taskgraph.
The above graph submission options are ordered by complexity. You may not want to start with option 3 immediately. Although, if your CI/CD pipeline deals with multiple platforms, multiple types of tests, and/or multiple types of pipelines, you may want solution 3b. More on this below.
Solution 3a: What’s inside the decision task
Solution 3b: Inside the decision task
Taskgraph was originally built for the most complex project, Firefox itself, and it was strongly tied to it. So, when a simpler project - Firefox Focus - came up a year and a half ago, we - the Development Team and Release Engineering - agreed on going with solution 2 and later on then 3a. Firefox Focus went great and became the base of a lot of Android projects. They have grown quite big since Firefox Focus started. The way CI/CD is configured there has grown on top of a code which wasn’t meant to be that big, and which was duplicated across projects.
Taskcluster and people
Moreover, people started to come from our main repository (where Firefox for Android is) to our other Android projects. With each project, they had to figure out how things work and Release Engineering has had to be involved in many changes, so we were losing the “provide the development team an easy way to configure their tasks” feature.
Here enters taskgraph
We know Android projects are becoming more and more complex, in terms of CI/CD. We know taskgraph is able to support thousands of jobs by splitting the complexity in small and dedicated files. We know Firefox developers have more or less knowledge about taskgraph to be able to add/modify their job without compromising the rest of the graph. So we decided to make taskgraph a more generic framework so Android projects can reuse it.
Taskgraph, the tour
In a few words
Like its name infers, taskgraph generates graphs of tasks for all types of events. Events can be “someone just pushed a new commit on the repository” or “someone just triggered a release”. Taskgraph knows what type of event happened and submits the corresponding graph to Taskcluster.
- It’s fast! Tens of thousands of tasks are generated in under a minute
- It deals with default values. Oftentimes you have to provide the same data for each task, taskgraph knows about the right defaults, so you can focus on what’s really important to your task
- It validates data before submitting anything to Taskcluster.
- It’s deterministic. You’ll get the exact same result by reusing a set of parameters from a run on another machine.
- It’s both configuration-oriented and programming-oriented. Want to avoid repeating the same configuration lines over and over? You can write a python function in a separate file
On Android projects, when we first implemented our simple solution based off of taskcluster libraries, we did have feature #1 (easy, when you deal with a handful of tasks), but we started to wait minutes to get a hundred tasks submitted. At some point, we needed feature #2, so we had to implement our own default values. We’ve never had feature #3 and #4. Depending on the Android project, feature #5 was more or less implemented.
Other features I like
- Docker images! We can create our build environments directly on Taskcluster as docker images and have our build tasks using them. We don’t need to publish the images somewhere, like on Docker hub.
- Cached tasks. Sometimes you just want to rebuild something when a subset of the code changes (for instance a Dockerfile). Taskgraph knows where to find these tasks and reuse them in the graph it submits
- Graphs that depend on other graphs. Cached tasks are for single tasks, but you can reuse entire graphs. This is useful in Firefox. We generate all the shippable builds whenever a push to the repository happens. Then, if we think they’re good enough, we promote them to be actually shipped.
To me, re-using taskgraph instead of reimplementing a new framework is today a huge win just for the sake of these features, even for small projects. We didn’t do it a year and a half ago, because taskgraph wasn’t a self-served module. Tom Prince put a lot of effort to make taskgraph generic enough to serve other projects than Firefox.
Introduction to taskgraph
Data flow, part 1
Hacking taskgraph is usually dealing with that part of the data flow
In order to help taskgraph to tell if a task should end up in a graph, tasks are grouped in “kinds”. Kinds group tasks that are similar.
For instance: You’d like to have 2 compilation tasks, one for a debug build and the other for an optimized one. Chances are both builds are generated with the same command, modulo a flag. In taskgraph, you will define both tasks under the same kind.
Each kind is stored in its own kind.yml file. It’s the configuration-oriented part of taskgraph. You usually put raw data in them. There are some bits of simple logic that you can put in them.
job-defaults provides some default values for all jobs defined in the kind, it can be useful if the build tasks are generated with the same command.
The loader is in charge of taking every task in the kind, applying
job-defaults and finding what are the right upstream dependencies. The dependency links are what make the graphs of tasks.
For instance: In addition to having common values, both of your build tasks may depend on Docker images, where your full build environment is stored. The loader will output the 2 build tasks and set the docker image as their dependencies,
Here comes the programming oriented part of taskgraph. The configuration defined in kind.yml is usually defined to be as simple as possible. Transforms take them and translate them into actual task definitions. Transforms can be shared among kinds, which gives a way to factorize common logic. Some transforms validate their input, ensuring the dataflow is sane. For example: For the sake of cost, Taskcluster enforces tasks to be deleted after a certain amount of time. The last transform ensures this value is valid and if it’s not defined (which usually happens) sets it to a year from now.
Data flow, part 2
You may not need it, but that’s how optimization happens!
Taskgraph always generates the full graph, then it filters out what’s not needed. That’s what the target task phase does. For instance: If you want to ship a nightly build, target_tasks will only select the right tasks to build, sign, and publish this build.
Some tasks may have been already scheduled. If so, taskgraph takes them out of the target tasks. For example: If the nightly build task was made at the same time as the debug one, taskgraph can reuse the task and just submit the signing and publishing tasks.
At this point, taskgraph knows exactly what subgraph to submit. It delegates the submission to taskcluster via one of its client libraries. Tasks are then generated.
I really enjoyed “taskgraph-ifying” the most important mobile projects. We have leveraged a codebase that can scale while being able to handle edge cases. We’ve been able to factorize some of the common logic, which wasn’t possible with our initial solution. There’s still some improvement we can, and Mitch, Tom and I are working on improving the quality of life. Moreover, having taskgraph enables a better release workflow for Firefox Preview: we may have finer grained permission on who can ship a release. It requires more work and more blog posts, but taskgraph is a necessary stepping stone.
- The official documentation for Firefox
- How tos depending on what you want to do
- “Task Configuration at Scale”, a video that explains the scaling problems taskgraph solves
- The Gecko Hacker’s Guide to Taskcluster
Need any help?
If you’re stuck on something or Taskcluster isn’t behaving as it should, feel free to reach out to Release Engineering in #releaseduty-mobile (Slack).
- Tom Prince for extracting taskgraph out of the Firefox codebase, and converting the very first Android project: Reference-Browser
- Mitch Hentges for converting Firefox-TV and Application-Services, in addition to reviewing my patches
- Mihai Tabara for swiftly reviewing my other patches
- Jordan Lund for coordinating this project, including its communication
- Sebastian Kaspari for helping me to get the patches landed and for reporting any regression he noticed
- Richard Pappalardo and Kate Glazko for making sure the UI tests and the nimbledroid submission are working fine
- Paul Adenot for helping understand and fix the audio in the 3-minute-video.
- The reviewers of this article and the video.
- And anyone I missed above
You can read and leave comments on this Github issue.
September 13, 2019
Back in July and August, I was looking into a performance issue in Treeherder . Treeherder is a Django app running on Heroku with a MySql database via RDS. This post will cover some knowledge gained while investigating the performance issue and the solutions for it.
NOTE: Some details have been skipped to help the readability of this post. It’s a long read as it is!
Treeherder is a public site mainly used by Mozilla staff. It’s used to determine if engineers have introduced code regressions on Firefox and other products. The performance issue that I investigated would make the site unusable for a long period of time (a few minutes to 20 minutes) multiple times per week. An outage like this would require blocking engineers from pushing new code since it would be practially impossible to determine the health of the code tree during an outage. In other words, the outages would keep “the trees” closed for business. You can see the tracking bug for this work here.
The smoking gun
On June 18th during Mozilla’s All Hands conference, I received a performance alert and decided to investigate it. I decided to use New Relic which was my first time using it and it also was my first time investigating a performance issue of a complex web site. New Relic made it easy and intiutive to get to what I wanted to see.
The UI slow downs came from API slow downs (and timeouts) due to database slow downs. The API that was most affected was JobsViewSet API which is heavily used by the front-end developers. The spike shown on the graph above was rather anomoulous. After some investigation I found that a developer unintentionally pushed code with a command that would trigger an absurd number of performance jobs. A developer normally would request one performance job per code push rather than ten. As these jobs finished (very close together in time) their performance data would be inserted into the database and make the DB crawl.
Since I was new to the team and the code-base, I tried to get input from the rest of my coworkers. We discussed using Django’s bulk_create to reduce the impact on the DB. I was not completely satisfied with the solution because we did not yet understand the root issue. From my Release Engineering years I remembered that you need to find the root issue or you’re just putting a band-aid on that will fall off sooner or later. Treeherder’s infrastructure had a limitation somewhere and a code change might only solve the problem temporarily. We would hit a different performance issue down the road. A fix at the root of the problem was required.
I knew I needed proper insight as to what was happening plus an understanding of how each part of the data ingestion pipeline worked together. In order to know these things I needed metrics, and New Relic helped me to create a custom dashboard.
Similar set-up to test fixes
I made sure that the Heroku and RDS set-up between production and stage were as similar as possible. This is important if you want to try changes on stage first, measure it, and compare it with production.
For instance, I requested EC2 type instance changes plus upgrading to the current EC M5 instance types. I can’t find the exact Heroku changes that I produced, but I made the various ingestion workers to be similar in type and in number.
I had a very primitive knowledge of MySql at scale and I knew that I would have to lean on others to understand the potential solution. I want to thank dividehex, coop and ckolos for all their time spent listening and all the knowledge they shared with me.
The cap you didn’t know you have
After reading a lot of documentation about Amazon’s RDS set-up I determined that slow downs in the database were related to IOPS spikes. Amazon gives you 3 IOPS per Gb and with a storage of 1 Terabyte we had 3,000 IOPS as our baseline. The graph below shows that at times we would get above that max baseline.
To increase the IOPS baseline we could either increase the storage size or switch from General SSD to Provisioned IOPS storage. The cost of the different storage type was much higher so we decided to double our storage, thus, doubling our IOPS baseline. You can see in the graph below that we’re constantly above our previous baseline. This change helped Treeherder’s performance a lot.
In order to prevent getting into such a state in the future, I also created a CloudWatch alert. We would get alerted if the combined IOPS is greater than 5,700 IOPS for 6 datapoints within 10 minutes.
One of the problems with Treeherder’s UI is that it hits the backend quite heavily. The load depends on the number of users using the site, the number of pushes that are in view and the number of jobs that each push has determines the load on the backend.
Fortunately, Heroku allows auto scaling for web nodes. This required upgrading from the Standard 2x nodes to the Performance nodes. Configuring the auto scaling is very simple as you can see in the screenshot below. All you have to do is define the minimum and maximum number of nodes, plus the threshold after which you want new nodes to be spun up.
Troubleshooting this problem was quite a learning experience. I learned a lot about the project, the monitoring tools available, the RDS set up, Treeherder’s data pipeline, the value of collaboration and the importance of measuring.
I don’t want to end this post without mentioning that this was not excruciating because of the great New Relic set up. This is something that Ed Morley accomplished while at Mozilla and we should be very greatful that he did.
August 21, 2019
Frontend security — thoughts on Snyk
I can’t remember why but few months ago I started looking into keeping my various React projects secure. Here’s some of what I discovered (more to come). I hope some will be valuable to you.
A while ago I discovered Snyk and I hooked it up my various projects with it. Snyk sends me a weekly security summary with the breakdown of various security issues across all of my projects.
Snyk also gives me context about the particular security issues found:
It also analyzes my dependencies on a per-PR level:
Other features that I’ve tried from Snyk:
- It sends you an email when there’s a vulnerable package (no need to wait for the weekly report)
- Open PRs upgrading vulnerable packages when possible
- Patch your code while there’s no published package with a fix
The above features I have tried and I decided not to use them for the following reasons (listed in the same order as above):
- As a developer I already get enough interruptions in a week. I don’t need to be notified for every single security issue in my dependency tree. My projects don’t deal with anything sensitive, thus, I’m OK with waiting to deal with them at the beginning of the week
- The PR opened by Snyk does not work well with Yarn since it does not update the yarn.lock file, thus, requirying me to fetch the PR, run yarn install and push it back (This wastes my time)
- The feature to patch your code (Runtime protection or snyk protect) adds a very high set up cost (1–2 minutes) everytime you need to run yarn install. This is because it analyzes all your dependencies and patches your code in-situ. This gets on the way of my development workflow.
Overall I’m very satisfied with Snyk and I highly recommend using it.
In the following posts I’m thinking of speaking on:
- How Renovate can help reduce the burden of keeping your projects up-to-date (reducing security work later on)
- Differences between GitHub’s security tab (DependaBot) and Snyk
- npm audit, yarn audit & snyk test
NOTE: This post is not sponsored by Snyk. I love what they do, I root for them and I hope they soon fix the issues I mention above.
June 13, 2019
The GPG key used to sign the Firefox release manifests is expiring soon, and so we're going to be switching over to new key shortly.
The new GPG subkey's fingerprint is
097B 3130 77AE 62A0 2F84 DA4D F1A6 668F BB7D 572E, and it expires 2021-05-29.
The public key can be fetched from KEY files from Firefox 68 beta releases, or from below. This can be used to validate existing releases signed with the current key, or future releases signed with the new key.
-----BEGIN PGP PUBLIC KEY BLOCK----- mQINBFWpQAQBEAC+9wVlwGLy8ILCybLesuB3KkHHK+Yt1F1PJaI30X448ttGzxCz PQpH6BoA73uzcTReVjfCFGvM4ij6qVV2SNaTxmNBrL1uVeEUsCuGduDUQMQYRGxR tWq5rCH48LnltKPamPiEBzrgFL3i5bYEUHO7M0lATEknG7Iaz697K/ssHREZfuuc B4GNxXMgswZ7GTZO3VBDVEw5GwU3sUvww93TwMC29lIPCux445AxZPKr5sOVEsEn dUB2oDMsSAoS/dZcl8F4otqfR1pXg618cU06omvq5yguWLDRV327BLmezYK0prD3 P+7qwEp8MTVmxlbkrClS5j5pR47FrJGdyupNKqLzK+7hok5kBxhsdMsdTZLd4tVR jXf04isVO3iFFf/GKuwscOi1+ZYeB3l3sAqgFUWnjbpbHxfslTmo7BgvmjZvAH5Z asaewF3wA06biCDJdcSkC9GmFPmN5DS5/Dkjwfj8+dZAttuSKfmQQnypUPaJ2sBu blnJ6INpvYgsEZjV6CFG1EiDJDPu2Zxap8ep0iRMbBBZnpfZTn7SKAcurDJptxin CRclTcdOdi1iSZ35LZW0R2FKNnGL33u1IhxU9HRLw3XuljXCOZ84RLn6M+PBc1eZ suv1TA+Mn111yD3uDv/u/edZ/xeJccF6bYcMvUgRRZh0sgZ0ZT4b0Q6YcQARAQAB tC9Nb3ppbGxhIFNvZnR3YXJlIFJlbGVhc2VzIDxyZWxlYXNlQG1vemlsbGEuY29t PohGBBARAgAGBQJVrP9LAAoJEHYlQD1/DRWxU2QAoOOFRbkbIU1zKP2i3jy/6VKH kYEgAJ9N6f9Gmjm1/vtSrvjjlxWzzQQrkIhGBBARAgAGBQJVrTrjAAoJEMNOV0fi PdZ3BbkAoJUNHEqNv9dioaGMEIpiFtDjEm44AJ9UinMTfAYsL9yb15SdJWe/56VC coheBBARCAAGBQJWBldjAAoJEAJasBBrF+oerNYA/13MQehk3AfkljGi252/cU6i 1VOFpCuOeT7lK2c5unGcAP0WZjIDJgaHijtrF4MKCZbUnz37Vxm0OcU8qcGkYUwH i4heBBARCgAGBQJVrSz+AAoJEPCp59zTnkUulAYA/31nYhIpb7sVigone8OvFO19 xtkR9/vy5+iKeYCVlvZtAP9rZ85ymuNYNqX06t+ruDqG2RfdUhJ6aD5IND+KD5ve 7IkBHAQQAQIABgUCVaz9fgAKCRCzxalYUIpD8muMB/sH58bMSzzF9zTXRropldw7 Vbj9VrRD7NyoX4OlDArtvdLqgPm0JUoP2gXINeSuVPpOfC676yVnBEMjIfqEjq09 vcbwayS+Ncx4vQh2BmzDUNLE3SlnRn2bEWr9SQL/pOYUDUgmY5a0UIf/WKtBapsP E+Zan51ezYSEfxDNfUpA4T2/9iWwJ2ZOy0yIfLdHyvumuyiekJrfrMaF4L9Q0OnJ wp1PwkvN4IVwhZeYDtIJN4nRcJK5LrwU7B97uef2hqBBll7/qCHl5y4Khb0csFan Ig+pQLPUJdIiYtzoFtlgykB61pxqtU9rqGKW02JzEUT8DdPUXxmMBy6A8oGeBRH/ iQEcBBABAgAGBQJVrRdcAAoJEGVzgtv/JREKQJgH/3nD/3/SumL7nG2g7Y1HQqWp hUbn40XWvjZcHq3uBUn1QYXeZ5X56SANLM2t+uirGnNaZXW3cxEl5IyZVLbmcLWE BlVAcp2Bf3FXFbdJK59f+M+y2+jZT9feTyrw+EtLoiGTxgkLdJyMyI0xGmQhMx5V 1ex1CxhZK2JPjzCVYriBI0wIbmKi90YNMQoSsdMhYmX9bHl6XWS9TCDWsqj25FLY JL+WeVXpjO0NjRwEE6pc/qldeJYG5Vbf0snGxIerXe+l5D8Yd4PEAnpj58+5pXeo GYZn3WjX8eTFMAEU+QhLKWQ+j/Y8Kijge7fUxnSNBZ2KEnuDN/4Hv/DrCFLv14CJ ARwEEAECAAYFAlWtZVoACgkQ5DJ8bD4CmcBzsAf/RMqDdVHggQHc0/YLt1f/vY9Y 7QQ6HwnDrtcNxxErSVcMguD8K6Oxir0TMSh+/YuZAW8K4KSgEURwZqz4na8/eOxj 8bluNmlcAseQDHswqU6CyB95Woy3BocihH7L0eDXZOMzsa33vRQHBMioLxIbpnVt VbFR1z7tmyfjcOrzP32xo5QoPoczKX26luMBjAvbw1FC0is2INnmUSYM4uH7iFZu XGPFYxcAqODqy5ys3MoPa4oZ71d0HoiRil1+s0Y+2ByddZ19pE2TXp4ZXNYNUj/2 aRj8b4sTjR4rqhHIx/vfoK+VCNy/skFUZOyPdbbymE0stTRSJ1gr9CZLcBWYF4kB HAQQAQIABgUCVcFZcAAKCRCJFz+VfFX5XqApB/938p+CJiDRnh2o7eDWnjSyAu7F WmWGkOQnjI/kraKx1vojsYnKRXD6mjq1QJ8Hsp4taJnLQjcokNTUiST4m/e4ZJEx PWuJKkwlralWGH6NpqYcgWPajSYb0eYQC4YqS0kfyzolrHdKI8Y4NGEU7yy5zsHw WkHt/mpNQMrYnXwyWdIrc03X/OXo51dJyshJDRw3InREyBblFJcLvArNHz219wMr XAicPytw4wfPpVrmDx6GrZcI8q8ECWCjwSXXv7hRpEuFLSy5XPhMc+wYBJjNlUoi FBAF/7zENd3rMn9SCQLiIFYe0ubmO+bpeGy7TizbxOaCIfgUouyy0BQXNuJBiQEc BBABAgAGBQJV0hrqAAoJEK18uZ+CSLoPzEIH/1D6sJMNAJtZCRGhJXvv6SYhv4pU VNyDF9FnUvRsovliojoe4IkuBTWKhPGrxbiD5IO/izr38shqNhhm9JE2/SQZHObY Pi+lyfDKbJgImTNxmS4F7JHnRLr37VxK1sVvuNkynJnqvCcp1g5xwNIx1rKcka3i uqJj6toM8XQfgsTHH1rUkWHbUV3QwNzXm+yhFm2s6QzxBooPzmFn8AY7CXD4pvcM R+M0Zy+e42nngd8lzRnmTBVig4pRq0GCMulFG+XjeVQZFpoIIxo2k1lczbRmGttO NdGWSjxBUxReoTbSwM3C/50NrobycGQgY0gd6LGtWtU8/uEfklEy2NluxYWJARwE EAEIAAYFAlWtAUYACgkQVu5xjc4OFUs0OAf+LM0dyyvUFGdXfJDpP2xMknXzsHAX WFEtH5jein58mv6dD3fTVcCouo1vMQH3WFFSLYZvwtNnHGrSBqFbNKqZ0ATQ5tcY aWsSZ+MVJJMXJDXFG/Oihg1nNOM33VdfV0RGPKP1I4cEROxms3TUFkHW3cSCgMzs 8I1OxfSoLrm6da8EN+2ct2InqzdQL2yisyTyrdmXoNpwXDxApKYkvVHQ4+9eJI5m 0ZAr0mBjIeJdATcw4/lIVKTrV7UhrChxiffYJcz4SSC1crmr+2Fzw53CyAsAmYal UHep3Yr05oQ4oJRX9X3VrY/yELHwwxXaxCAdwwHbbXAMhZsPk9Mc20J6BokBHAQQ AQgABgUCVa0isQAKCRCj1lIXO3Y+j6ZeB/91Q9/qr5oMWgOMsix8kflBLw2f/t+t RR0SWDw90bG1npJB6nq5Hl+Bz4/A4SWFTFrrrlZi1Enjn1FYBiZuHaSQ/+loYF/2 dbQDbBKShfIk3J0lxqfKPAfKopRsEuxckC8YW1thGxt5eQQ8zkJoqBFTBzwiXOj3 /ncJkX9q9krgUlfTSVmrT9nx0hjyNQQXrghsmBtpR7WCS7G7vNRGCNUorhtviUvL +ze1F7TTSGspVsVxo2ghmz5WT/cD9MV1gcVjojYmksh5JIl39jCHr9hl8aRId/Of zsN+TKuBcpAxDkm9BCAps7oY8FlLKDFZTtHa000AkodKHT88nwnvKuqPiQEcBBAB CAAGBQJVrTkDAAoJEPbQ92HczOykK9YH/0MARo3HlYXeS2bDqM/lwK/rQcPCCyYk e6wbICjncbCOjgXHqG/lBhClNs7hp/7gqkUaR7H5tmeI4lalP40mSHHnnFvMD3Tc yhn350igK0bgrjWQDaYxhKlHT3vIXd/C24/vRSAxmqIKbP+IoXOyt2GMTQq8GOm2 dgYRaTkwyHnGWnMaibctX8D4oCYR0/D4YJqPkfqobf8+1ZfP5GaMbSxE/Jwdo0kJ a4vPjEzFXbygAbncapzdwN6zgel2zh885rz7B7vIpMr/Y7eV85Q68qdyyhLe8cL8 Y18YPzpFf+/PZNbgYxouafvnFwBhPQwg0gUF/+1eM3UE2ua+saSTGduJARwEEAEK AAYFAlWtCVsACgkQM0LhtmejiGMovwf8CfYJHNbwiwSMUoP4n7FrmElhBtxvlbnC MZKz08v+lFsfS3wU1LUN69GqirfF0vkQRSlSBp7niCLHQCfSoqHMLgxF0P2xgXLj aYM/t/rxXDawJmW18G04dqFrtCPZTbwMT2PsPHTiWQdaN0e50lXk9Vo+l6VbwQMg 4zH7icZadeJgQooxFalHYFVXUVeex9t8/YdanFVrHFa3tao6azBTSUkJvZtIu14S fxigDWIIwsx0xpVfJf3a/xC6HY3Q1a3NeBz3i6DwaK5wYqijZKl0WVdULKyqU98o F6y0mUv3d2o/p07Cqgeo6xxMkHqu83OLa2a0C7tYPLgL4EFc2FtikYkCHAQQAQIA BgUCVaz7KAAKCRCWO3gxCjexfKxrD/4npm1rB7+pPlotbqK37Mur7egPbVSAzVNU /zUKPAuGUeP3C64YN77ETx1kDuS+meAqMDHFc9Bf8HivPbtj6QcK96U5KstbmSh1 Ow9YiQtxJgxGjg/CzREgZAFcjy0MhoklyPsFhv07s6MLOJMSM/krEN5nqjifQ0Wd mTk02FLoHVWcLdjfgMiPiSjGbU3k7luvjPyRNzk831szE5mfa74rEYh4TBklse+2 uB4DFQ/3oHZ1Sj6OBK6ujmNKQjIP7Cl+jmjr7+QK0OJcRaj/8AckDA5qXTZACh1S 2syCDDMnX0V+dTxGCIoWOK+tt9mLohMzpEeD4NIX4qdpbbCRzeYZMHSomyBIsbA6 B+/ftDE7W1N0/FtJ9adkkCynKULvh2CH5c5hgOOL22M+2spnywRoeJRUWU7hBM5O UH3JjA4Tu4j/cwp7dD7QzZrzmC9f5LQJ3OelejvVowWPQd3/tky4o1q6wlmFqAcA gtu97UwgBOSR9sJPGDlt1iC91UYAiBQQAA7ya8uXUS84mCQwTlr8j+YrowvEHK4I xpPREytT1LzzV/4Am4ndDFtujy83QjL0qaIIim1xIwoEosd4yidhpczw7f3b9dQp uBIFeQuhM7JsxP4tmE7S6k6GlEmqa3INPVaPGnsUGS7+xSMlcJXLtimPCSQvFma9 YiGV5vtLy4kCHAQQAQIABgUCVaz8uAAKCRASy06X4H5n0dg0D/9QoxIh9LRt1jor 7OHG4xKUjKiXxn/KeQNlJnxI55dlWIvJEJGheFjaDomzKBYuxmm2Ejx+eV5CHDLU YsLFYwWf8+JGOP75Ueglgr8A0/bdsL63KX6NP2DCg8XR4Z1aeei3WMY7p/qMWpqb QoAv9c3p49Ss2jSNuthWsRR6vbQ9iwze2oaUaA44WKQyhhbCwBU4SHYjlKCLqIBh /HXZFhZ4rDfuWgPBKvYU1nnOPF0jJRCco3Vgx3T9F+LZ3zo5UPt1Xapr3hMVS9ia Jyl1w4z2miApUaZuHPuWKuO4CJ1GF1mS5T6vG8gB3Ts5zdtBF2xQIkCz+SM7vW/2 i/82oq6P8EuLHEhrQPR4oTjXIvXdEJ9kgbjqcj8Xk+8teEOnuwh6iEhay9i/bf0D 3Jd+roFN5dnWPxhOVjzrI3fwlK1/ylsZYqUYBEzt7Wj0MdhjeKssI5YICcqYXXjB ttMw4B7DZXPFXzz3kHB56jZ/II4YUjpLO85Jo5A9SV+aIqa0mvCt6DvVWy/rhfxf oUdqNlhX11gkVLaA7xxgn/NqPOf+h5hVO2mwWkmart9YHKMZ3ukCdke65ITL/nsY Sm2ZhG7OYjaCfu9jPWtkBstOEWyT9q4JTdViR7wN3eMefEG6rb49rxOYvGJu+cTV kp3SCpl0w1j+tPj4tkj7ENzPMXdnuYkCHAQQAQIABgUCVa0s4gAKCRCKsTKWOgZT euMyEACKOySKAd/xDcPcHg7Prvdws04Z8DIR0dY2qUlbRVx2jTmIXyry63CqbOJF bDg9uk5x0+lSotvrWtZ+NKSrg9VM6vyV4cc2P9rhqIBi3wO2elzAmpOaS2KKOjQ+ 2fS/xqh91ElJUu09xXQXJ0vMrqgui+zN1YBDiJV0WOmm90Mm2NPiihcWZmBmDorO qMQabwbjBLi0yUVHgAlkilY3mAB4tmEKDeN+4pYSAAhXAll9U+nyoVMgwMJscZya zOp4MqMbmFjyr4p5AGzv+OOJtjtCNKT6oW9Y+URLY0YKeOsPk0v5PlbQCVBlLeSB sNZudKav/Gvo7Mvz5uLTcneBFb+haYIiXO/FQm4uBHkzdNFLgaph81Wzh62AhbtB lfBOj/lbzN3k/xRwo64QU+2Z9GOhFlhjfROquY70FCQcspwNuqCdZybnkdpF2Qrr 6Pi0qKR/Xb9Vd7PW0/gKQdwwlYTiDemgA21mYeJrYw873/7U/+kLFRvmPAEX4IOI OEN6XVjxvu78REi6CmXxOoYnH4aRSXDRyi1nsGjB43AtfAMMNCUigDgFP4sUsZAG 1RAoxBhOsO/g9S5wx8H3rKITCXDjQh2SYeBwHFcU03EMcyzEQhbZNighN+aRKGIi bteRxISiKU+kcWaHolemeo6wGF87QXEpJaQ2OwIoIxQYvDDmQokCHAQQAQgABgUC Vaz/8QAKCRA/8xuvEEv54t06D/9n1Nyn2QSUN1mXd7pomoaka+I2ogDbQpu9iuFq bkqfcH3UuG8yTKlPp9lYDBs0IEfG85Js6iVxJIultocrcDmOyDkyEsnYbdel/tn3 X4yqD8eI6ImRoCE+gnQ3LoEIHuODfJoosM/jAHANs4fsla4/u5CZDXaaq7pYXGiT t7ndsfmLiCa7dAg7bVFfJagsnL/VjlfeWM9nW01rDL9LPxSN4tq7ZKXWZDonFZYJ 4unsK/Cn6Pqco4Wb+FUOWCcWt8in1pgeNHZ9WnAgXG999/3iCbbQTLB6uVwY4Ax5 P7VApnLVXV6QFVf7bN1DxE8kZk+pfLGcuD1LJSF0skE80M17kAt+iV+fam8EYzeG dG6cY6w+srndaMaq9ddiHIiQkR35SjJAGnrNRj8ooUr/vKOBnFfuwJLA2MOUVPZ8 HWB+WXW8qhihw9CXa38Hdt4o5knMGRIyTWEF0TQDtRGQ6hisVBN3OxJRXBj7/QgC G/GoYpweGKcsMU43p57TzbnXVVUytJsLFyexOGNzrUIxgDVPEvTUnNvdAihNZPdb W3YdFkP9pdwOyDpQwebXELUx1kp4ql0laueex4L1v+0a6rDYQeK1gOq5UGY+THRS gB2xsHl5zeryfgnjlUkUlxKuumz+9FI2fRtSpxmWllJkRF2oFMGRuLPGAWe8nHvf gkuGVokCHAQQAQgABgUCVa0bowAKCRCVY0f2+/OkFWKREACZ9TOmzvY6mrfWVEdl dcYPj8cU/1LJhGdbNo5YYMx+A72nchxGXepHA65OEK+f6rFMeZFPwpQPy6Sj3MhT 623H/PECfeG87WcLOyJbfc3i9T5jvxS+ztG6abYI2J/50oMvjUWdWkDX3VvdPc0Z Z+KC+oHvx9a/9Yki48m4CEKglgVsrRW/b9AXZQCj07bB0GjQQtkqY/m1Z8m4ttzx fO7OBo/jHNF2An4/4gUDirXNDj0UdB5FYFJaTEUCneIj2x0fk1r4u6na8tINhiZ0 M7IgjnDlBD5jwzvwG+3kYE6TnYp9Mfeg2MPC13tp7jrJatLLutrOzvmSVLGLXbkh 9w+v+vx7qO3TxZUNlFqTmYs+vI2V/9j7KYV7Ttoind6Io7X9ImnYrvd8JOyVcO38 67MplKnrnqHJvFStE+JcHEcw5aRw+WVmoFd/obGc34V3K62T977QQGOkrTYDEdje KADfjXXZkZMZc0IvzLBOJ1XB45+PKqJYCcJJS8Xr55+NGCDaaUPWDpkNGIqmX2n9 kYROMKG6uWkZIqG0JlZkga3THSJIvLiy6uoOvDC4GoQ9JnTwpGv6r1Hwcg+4DCOr YKOoPKMMU24vHx2FtRRUgCXtr2cmi2ymHlUrtz8EXS4tblic8lixcbvPUqLEvbJ2 gfWQvjXNd1whYE/wfvI9WBTEIokCHAQQAQgABgUCVa0b3wAKCRC8FzAbSRs/IQhX EADiKbCnsN/+Plllxn6SQHACEU75ackx+Q02XiD/u+wUptYUGmJi4aaW9f6mgzed OxYK4S+/dCiFtkcYlL+FjaR0C7G6tMjrDgW+8nQCTPUNQA0gX2B8n06a7Zmdv3Eb V/PIJJwTNSBp/dqKbvPKnRquOOpH+ayZ3awKOq/LlWBErbW1gB+FabN0lCe0iUIQ TF9OH3GC4QsMtIrePueBmVrVPcHATV2Vw9UPqX1uX/tlXm5eai06oVT7V0FwUbg0 o1eacblNXvHciHpe33zZIKkGBWwSjDVcU9/SN+U8GfoMYmyCma4iN3KaCklpzBkJ iQZtNKPAB5KJti8LDUxFi2sJd3sqWaZDGFhO+/PKhBKpqIhAzx1ppd11zLgh0eg6 gQlXN8D8ELISRvQqGGNNZdChEFdzGElg5SMfmeEd37OaX4wceLLV0v7EA0doHMVo 0enFhSwU3YwtwxbiukKc7H/ylG7+jvntjY+z7KktRsY/FkklrbrNhddMBQMMSAQU Uz1GJ+6NUKmzXjqxFuuh3OAhqNzhJyABZWQcNMph+rogEslkenwoHV9gWRWtS3CM ybJkKkbsWpYhMZNY6hFtgCwida7NPs8369v+yTTE6TU/NIlXUKYIf2LMqtOpEBTj aN3jKpUi5DeE3zBeh6iVKUrfCXbt8O0rYQPNWGSW+MZ2t4kCHAQQAQgABgUCVvA4 GwAKCRBE9G4UbQI5XfS9D/9XPK7jg0lmsNZ2sDIyeAw5n6ohSR5F20ocTMAVeXqN 7VkvJdNpIqHJa13EP408DgTy9BsSptym/OQGE6B82BU7FZTEL6eMHnGGDg+5ktx9 +b73xLedzK75ti6ED+QuA4kDYcvW8hASht0zRcmFUzwbtuEopJ1Lk1R3oFLwCAov lhduC45nANWrTK5U+D1U2obl5PAvx+9mEfgvojlGH/C/WD74W+cQZFH7t4+muRza mckLyPftnTxjNF/lpYIm7z0QOwvzBYj+PJ09wYueK00RE5+i9Ff8DrjtVSXsziQv SjJuUlv0kVvM8r3th4zBBNRhA4cinwqxhgqO4G+r2r9Gv0M2nKKOnWmyF+MSIRnh gONOQZe5a7kQxKVWkLicS2IGUpPeQyTWaqZzYXsD+Dm6DXD57vYTURtUkwO0CDON zT5XiS1HG1MZrw+V/Jai4HAvpF5WkTJXPc1Lv75BxJj3wOAw4MzEWCCdr/N/dt5/ +ULpEaSQfIg4L4iEj6rvabQyN0KbOxIDx+pPQ81izfj36wIrDqhyCNIdmVH/yARl tkL4XDEl/pt7Y3t6jqFhy057lektowClWcPeq3DoL0LFYnjNPpYvIjRIAXdhaYiA u2ViF8WdGzQ5tFeI7u3PQUG5NcPe+WOPOru3wMMrUhLgLHkCdNkjivP79qIPSTkC GYkCHAQQAQgABgUCVvA48gAKCRC3hu8lqKOJoLRMEACmlyePsyE5CH7JALOWPDjT f+ERbn+JUTKF+QS0XyWclA/BIK8qmGWfgH38T9nocFnkw17D3GP8msv8ll+T4TzW 9Kz9+GCUJcHzdsWj99npyeqG5tw+VfJctIBjsnX3mf4N0idvNrkAG5olbpR5UdsY Yz62HstLqxibOg4zWhTyYvO6CjnszZrRJk0TYZON4cXN14WYq2OTrMaElx0My8o1 qVBnK58pIRzv72PmvQqUk5ZjhUyp9gxjqqCJDz0hVK61ZuGP6iKK8KCLTfSxeat0 5LAbz8aC58qlg5DVktevHOjBgnTa8B7BgJ7bQ9PLMa3lF4H1eSiR9+8ecpzEfGHI LoeIDIYH7z7J/S0mTgV3u5brOMYO+mE9CEfps85tVVoyJrIR8mGEdtE2YmdQpdFz YIYvRfq9tnXZjVsAAsC20Smw0LnjhYzAt9QJwZ9pFMXUTg6lC5xT+6LNrEY+JR3w C16q36bcbCNj0cBv1A3x6OI5OQfpexhLPDgoDiI+qozJIdj8MzJ8W6KU1Z3yb3dq ACk77yv37rGO6uduSHnSti26c/cUIy6XZBbXBdobE9O3tr8hwvTQ1FXBmYnBrdiz U6tgxEA5czRC9HOkdk6y6ocbjmONpF6MxkpJAvTMk7IqC2/hisbV9x4utla+7tmN ZU137QGcaK2AGQablVAy4YkCHAQQAQgABgUCVvCMigAKCRCkhaDtUbi3xAU7D/9g UPZSJ8pbZV9TLaKD57Bc7B78HNV/B438ib4dI33iihMTBHnCB1giPE9X54QoV8AS xrO/xveS1kkj78jERqUcED6ZHhMLb9SWs6CxUKdMdgovnIlFUc+t05D5mb6STi+z NihwO0JI+n79qhETy73WLpC7RR0aMx7zYcbqp3NWPptcf1kVGJZGx+QbEHfVye98 T5pkH5Wp+7LSlup6AldQT/oifxdGxLXbECTnwozRvyMpAaphoEHrET1YOmKnmw/J yi6DLpTb3XvSf5Tntzr7HklCEcL9FvYCoHxiXWawLhuPhSyrFYeYtF1ypmzTgaJW yuTZ8sN9J+y7Tbchk/I6FpX+3YoTgPCcC7hv1Krs803N/3KuyBEvhzg7NYRikzO3 fxXlBG0RMm+662E7KlERU24izbWhGiYwl34+MaxrIO4oDvF79LEN7y0+SjL4V0B9 689d+HI1ZfS9O1xkOlW6y0QyagOzsTOUF12s2mWydFmipbYnIwsSsu6Nzk3yO4M+ qYABJXJ3tIFQPTd7xqmPNlJ8mFtmzHDhb3Pv6sRNFLLujYM9cJpuNMbAHWdohz1b jBT9pZQ3zWpll5wotUvGmJd6hTAXdUgmZ7lh7Uq6axClMmiLe1WYntcNpb04PyyE m2+GU5x123UTiSX2LGKa4t+HNSM8nJL8BJiGk80xVIkCHAQQAQoABgUCVa0OAwAK CRDDvTXkbdRdpVR+D/4/37e8WqKOHNPteQu42sj0ZOfcqyVMA9TQ578F0s9MwoQu qfVhXGSWevOctuMv2qTBjBfFjkdPrKR5L4LNAgMsu1epHU0DPcRZUCbh1P7Gpolm Z8KgnjT5Wpl1AcuOCaP08VMrt/e/JndTHp6btn6HsLVtryNhlL7oaeYbDr6/ovHN GHVIVSZgGP9f4Y8FiDpyfKav71vYLBMxtzM7lc3eFT1S10XhSW6k+8S5XldYWkLD riRXDE85C+9QndpOoQaIICp3ye3JVnUxa1qhvsYj9uPt1M6hKiBSoXdplrB+hQc+ nqLNN3jxpGdmGmwrjtjqMhocMIguEqgARJOek3XKOppEhu+IcnJgU4edARJNLsBa uiVBWY/6mZOFlZq6H48tVyziS2n/oIpi+aCc/fQeGs9zMTtFUohPfYtTcy9PecXM OYpSu4p4tQ07oucnxfBkRUgTdM5VwX7YwTcRwp9XhHACUEGBhrwMH8Iz+sK2jLF3 FhJGkef1vFs0vqSf4I8DBFkYAKF848YyEcGHeINQloi3v0Kr2PpBxlRh+GPWwi++ QPKXQFzlTiyVtMzoo/lpmAWUJwj0dbAbH/mohtvWtA1WPHC2JRZ52JLThhpDrK3t //Jdt2WHE91cMx7/2B0PK4O8/j7UVlsOJXpVPsGX5SFCeTB/iS4JtIwWN275zIkC MwQQAQgAHRYhBFnKni0qMx3iUaokJ18Dx2fCR6TVBQJZDvZCAAoJEF8Dx2fCR6TV oGkQAIjqaQ7tpdhDJ6ORNtLIt0TsWg0jg2rpoq+9Au36+UYBMuBJ3Py/tAsZ3cqQ lig7lJiQqOuQZkbg1vcY4Kdad7AGa8Kq3sLn8h2XUlNU90X0KAwdCTA/YXxODlfU CD2hl4vJEoH/FZtfUsaLNHLmz0brKGrWvChq00j5bPfp90KYKqamGb3a4/LG4DHL 4lmEBtP++YA0YqUQ3laOvKune2YwSGe4nKRarZnFiIn2OnH9w0vKN/x9IMGEtc5M bQVgGtmT5km3DUuXMDforshue6c7ao4nMOC96ajkWYZhybqHJgLOrEGPVUkOaEe7 s1kx4ye9Ph3w/LXEE8Y8VFiZorkA/8PTtx0M9hrCVkDp0w8YTzFJ9DFutrImuPT6 +mNIk+0NQeuDsv492m/JXGLw/LRl97TmHpKME+vDd5NBLo4OShlDKHwPszYcpSJT G9+5++csR95al3tWnuGX9V0/dO1s7Mv0f/z07nLB/tL+hEpqqA5aRiGzdx/KOrPZ uhCTyfA3b2wvOblwf4A/E1yO7uzPTuSWnx1E14iZuaCPyZPXEh3XSYCLEnQ05jy5 0uGXCDVR+xiE/5i/L3IxyhJk6zn5GOW5b8Taq5s/dFS3zWiFS6l0zQ1VQmJH8jdG LoBFvdVLZoAa1bihLo+nJVPR2RauWnxWoWk1NQoT3l02Lk6DiQI4BBMBAgAiBQJV qUAEAhsDBgsJCAcDAgYVCAIJCgsEFgIDAQIeAQIXgAAKCRBht7Um2Y8DU1CqD/9G vr9Xu4uqsjDHRQWSfI0lqxElmFSRjF0awsPXzM7Q1rxV7dCxik4LeiOmpoVTOmqb oo2/x5d938q7uPdYav2Q+RuNk2CG/LpXku9rgmTE7oszEqQliqKoXajUZ91rw19w rTwYXLgLQvzM3CUAO+Z0yjjfza2Yc0ZtNN+3sF5VpGsT3Fb14aYZDaNg6yPFvkyx p0B1lS4rwgL3lkeVQNHeAf0qqF9tBankGj3bgqK/5/YlTM2usb3x46bVBvwX2t4/ NnYM5hEnI57inwamX6SiMJc2e2QmBzAnVrXJETrDL1HOl4GUJ6hC4tL3Yw2d7515 BlSyRNkWhhdRp1/q9t1+ovSe48Ip2X2WF5/VA3ATfQhHKa3p+EkIV98VCMZ14x9K IIeBwjyJyFBuvOEEIYZHdsAdqf1zYRtD6m6obcBrRiNfoNsYmNY4joDrVupI96ks IxVpepXaZkQhplZ1mQ4eOdGtToIl1cb/4PibVgFnBgzrR4mQ27h4wzAwWdGweJZ/ tuGoqm3C6TwfIganajiPyKqsVFUkRsr9y12EDcfUCUq6D182t/AJ+qE0JIGO73tX TdTbqPTgkyf2etnZQQZum3L7w41NvfxZfn+gLrUGDBXwqLjovDJvt8iZTPPyMTze mOHuzf40Iq+9sf5V9PXZ/5X9+ymE3cTAbAk9MLd9fbkCDQRVqUD0ARAAr/Prvt+m hVSPjNDPSDrTBVZ/7XLaUZvyIVggKa+snJoStrlJGTKKFgDVaYTOE3hP/+0fDdQh 97rjr4aRjd4hBbaNj0MzZdoSWYw3yT+/nidufmgPus0TIJMVO8I6rl3vgcfW/D3o vNrLW/LjkTuM9a+p+D1J7woCfMSWiFMmOLPKFT7RBuY8edCVjyA6RP9K9Gj1sURS eqNaHR9Gr4rW10s+FwUHWxxzbmIWqH0gApQYO6vyND5IMcKOBCWQU6Detuq1pQ6d Uc+iF+sEz3Rk3C6d4WBBjtkVJSJ0KKan8Q3gJefOCMNhdRQDjZLwbzr4bgoAkLba BFCjiZxWZ6HAdMfSCV8uZQrtMS7b0DUpY0vdH9Htl3JqOOkK9RorYDQBuPdkTYFI NsmtWVsFV/LmR891mOF3fBRaoVoMeJVwiZyNlFY+dyWWFzLp+GoTLcQtmuR7OkmO cBGxWSKPcZfPqhf4dVQud7bDR2RNfJ1Hqa5kj8Z422sseYDwHf/T9OWWYvLwKGZh lUgpnzO3WCGrd/6EVNeC1mKXt4F7BmADov4Rdcrp1mPXiVt7oIxLaS6eBNf2y1TW zjYj5ZFuKqIukDEJfqpwsE5asnCw56nae+7luGs8em1J9GEXhWzXG15UVyQJaFwu B1iL8l7VcEQz4ABVrSTUWLLAKDsyqUbq2gsAEQEAAYkERAQYAQIADwUCValA9AIb AgUJA8JnAAIpCRBht7Um2Y8DU8FdIAQZAQIABgUCValA9AAKCRAcacTlXpkF2y/F D/oDrZm143Rv9NV9InnVJ0brpqbB7aulFfhR1LDuJ/GjeqGAQgJCZdHlzT2pfCXX swUlYzcWEatvGcDkoaB5Ya2qs+6nhBk8pT6XYRrZAtIlKIGrlCqoSBm9HXguGv+E IaEECr2z/Funx9so0mP+5aJn65M9u3lPmuAonj6DcHoM07WsfsXvQ4ut3fabFmzi lLGeAdEDKIw8Hn3JBUOxUyFrQlOoL4/3qK1TO+cidz/2bATQQyIG2kNOSgHBslU+ e6/7sWOQ4ufmzm7dEsf197zPXGdXR88LT+d2uU2K4GkCffNUKxZqy9bXxXPwr4JB jxLDQnDvl50GAWjPZAwXEd8Okwl5+8xp0HuZ217WUqT8ib0oUUfwh2H1vrMPRr/4 6i6O6THpCkV8BWF7axPYIibaeYwC4BkjZwK3tIL5ESf2f0xK4hbE3xhMTeqABQHo Xd5rQ7SEaUuX7PlQ59fRs0Cz55vH8/o9zMm0PN6qmZFvRBeqjnklZcu+ZdP9+CMX t81NMuzIK1X7EfpkUoam8YkYkwcCkRvPZrSHLXZFkfnx4jW543dPOfycjnv6hhKy oXD9CBx0ZcOicsYmw9XMilBGD3b8ZdK6RYX4ywKNU6KUdFJjXB88+Ynv6QxDit1e mMCHA1glzV9/k36iYLEIqgWBiwJeUUIcUqzgnBFtN13cyS6oEACUGUiPKbw3IkgG W19ZyS6FBNfgGIGW0Y82Br0KlCyaXnX0R4+4u2h7kfR9NSnhRhsvRnPIkiZATa7D +Ew1nfpsDTnti0c6g/gVw9TC/rCyXkkLztRHVcWEBdvnFJTSp2LeFaHSGbvvZfoI GUzyUzoa1P98NmRIY1cxBoizVf8729/zAaD4fAslxoK/JsjjDvDUrRHtaNZmUle6 0Jl/yFFzR3zxb+pJliigoP2rZLt+ipomHJIhoXXWwfkRO9U/egJ8ZUhWEpZvROna Nc9eVct5EBADxL7gHWjlceIz4ndI1eE9AdEZDdUZwOfjmK2DcXjFBfZC+jhJXjY0 xh3pPKQz90h9DIkM5WDcJPf6ep+MKSd/3hI2/JmmscQ+alwN6x6g8zDySMo3APA9 cUvEFGe0+CepVcNw03jU4faSrHiMXsUuVGbA2kHaYVUfzF5W5GbuHZZlGxoSiq+K +HNG0RJUDa6bkSDvrcJVNw1iUrowP+LLwnNsy5kGuU4evnwcoN1w7LVbTPaq4RIa iqvAD33kiA9q//UNKnK4k81z+hRNaWGliyGpgqh+V7MDIqPfT5TMLdH+ZjTeuLrN S8KBcc2BmUpSwzdUReTqHmgO5peeIcsvO7GNMFWsgucZiAdIVE/zQv+SfP6jhS+r jCPs0eeu5zl8/V+gXFE2wy3jTJEl9bkCDQRZS9m1ARAAvh1Nh4GgjpTFZy7uQRFz 5PPXdZTBI+Y4hTpF2heoFzZDI6SLyz64Ooglum3ZglQ9ac+ChTSsO36aw4b22kCM 9WDmkcl7wf21fG9o8gJDVjFjDWbwTWREaKjgS6s/Yb8f9gje/BGySojxynTi3zyT UN94q9dhVjfiQ79UzXZdN9FyyIx2YO5tOo09hTWSZg16oxP47Mj1ATaS6UIrQMcM nOp0kuc6SufXPSWsUA+g2lW0dmHgPvIHwUfcjWqT2elF01e9KOFe7im29G6zOS2M Rx8cr6KRg/eNWpHh5aI4quRUhYk4Kw4ohQTbs9ed0YttS4PMK+sq6xHpb28X6Zgr WnelPY9hfwcR4m7Ot3VQUG8JY9/aTlFCoeTgkhop+MCUI+dJeY8depIa0PTzdEmE WRvPhTTv+CUdZ6v4z5LD6FhP+/5c6FCbcIb89Rp5fa53oYV5/KZf+0DUVgmpXFU7 J7ZrGgDeU7vIzmwr8kcx0vtsVm1dVwYLACpTaaQPbISQUDM8sEcqKAqD7hWKaxNs b2M85L6q2/rnHq4g46yJzdR3b8EH+V9u+mUi9DIljDwcpvw7ReRQ9wPdDWLynngl IeGImbjYfr324yaIl4vNORAkbsoCkS/qc5v6MvKvYNle5fzb9S9kCbNZmD9c5/bH Pjj9ENeQvzrl2pFh6dc1o5cAEQEAAYkEcgQYAQgAJhYhBBTyZoLQkWzdgeN7bWG3 tSbZjwNTBQJZS9m1AhsCBQkDwmcAAkAJEGG3tSbZjwNTwXQgBBkBCAAdFiEE3OrF 2WE1uRxOpnKru769uyTG81UFAllL2bUACgkQu769uyTG81UFUw//bW5T7w2k8ukG fpIcm0gB98VgxKenSCmU6N+Ii0DwcNtzW+pmVWl2TbHIXDpvuD69ODWBDMXu6gBk rVzNEsK3uhzGe0tWA+5I7Vke3iEkbll7VRQlIOrw+n5NMvjeuDqKsMt1gMEEdgRK ddYApEAi49vV7XnqkB2lLKfAnf6o/KqPm8MuQ+u0xYanupZCldwdpcx5rybj79Es 0iO9Gh/+3qOtR6ubOz3Vn78Lc3y6AP9pmtdOI2QX8foGK4hNmgHSP6uPLh/ERC9N ir0Lc2hoEhHEkQ8CnEaccp70r03VkEQuMJQJPUyRsGZ/gIm0SAm9JJxWHXJk2/5N UN83pHAX0LA4zxtWs4fVW5f8v9eIhFFPTZ4au+/cS9D4GFx4mlY34awcpAzrny2t ntGEejY9HSJv4PuFZCmtyS2q61N9EU8yuBwVM9cp5HntzG+OT4HYugtI6ibehM0S 1Roy4ETwT+Ns41ffhCwdYMp8tzdeksQ35s7rkB9OJHj+q2dkGaV0FQb3FutbSpxb P4zk/dLqyxuivdUPHGtf4W/qklxzCWBg0VDFA7PwatmEXRxTjx77RelTY0V7K54d DyVv3Jh2+FzuaQZzzuIhv4gtqHntaqLnYl3h/QNLbOTE3ppvn9RUSR983Bd+M3Qh bbwZrgG1m+hdUZUmji+wbK0wV0xHNEH+4BAAjbVzdNOs7hMvjY1wVDRFjvICVorN dNdU3ELy/9BAoiwOs2+zjDXmsX+3YtdzwKvdpQ24O0TvH4Vo3BkvKkJ75EU7LroA bYQ2423m1MY3eaBslmX7TUJ3XE+k7OZF8AmcftgP4nhC4IQSCtoBc9+ncyGN4da1 BpYO7b19tO0/HST8GHSrEcU9bGGdimS2eNkSgybA8wF6K0K9yvrpTNSZ7OBVlzQf En8s70Gyzs/d6C/rTA+defnv3AMaciuINSEdFyfYq4wjt5PikvgceMAAkH/z69xT Ng+6q3FQt/lyK7xX5qPMe2oFyDA1H+Cb/uL7ioo+jXh9gF+0fk8OP2IPzxYhBful pVtgclmOuaekzaKeIv8NFW7GoA9OghziExePxg95OpL/VyQ7PJiAUj1pFovFk5HS 6ejVZNEGJ/A5zLc1PBIcr/phu0luqhXAhImsZS6858GWQllWULNWw8bX5Blo8Avc fFVdq9iAK7aHN7g45ZR7Ze6qKHDyFv4XWuE/rj9C2mM/GAstvU0gGmbo6B1mNGMJ uX3Gd3dG8fqFjE77OB2feJyfZ8UeF1nvG1hxlmuD1A5e6/osO9V7kjhXKzM2zSO1 1zHQ/5PlUisoUBjJ/QIK4v9RBNGtbRKso5X9Fke692lVgrdggDJ3j2QqMuTo71rA VDLtxerc+GNq0GK5Ag0EXPA56gEQAK3x5otbuNfefm1BD4gJ4Y4EhVvMCdeUf1uN 1OqiWUz6KLCR2UE00QaS7v65D11Oh96bpxtJRe6HmQk0vLm1QgbigZku+kCMt0c/ zOyWBOCVDKahZJu/ayzimrzGgNunoTnV2SjCTIVK4QtkowUeA+Ilt9hfFCZvASIW Eazj7bWXC0ji1U73OzVdvMkP1e2ibGRxhp494HGkj0metYhQq3vQk9mL9BzE8wba yL9iK2cJVqZxXDxauL5bswAlbdLI2MlSrH9wg6Jvy91lXG91ZEhmZ2RaQU7q+/gi i0QEJ+W3pwPQQjLciOfecMN/n2/vCl1ZeJ2PNWMoXQXyNSJfpL9m4TPiW92i2kKA kaEOZ2LilUJhuZKEpt9+BxIn8FafjAI4co8D5cHvnPM+PXt1xZHwP0Yz+yC6wnLZ 9scNw50AM/Kcpbc6TUIQ0kPHsNMPCbZ89Ey63i61IvUUxKXvMENFptw+vWfD7MRG OZKObdH3VrKtD1QqU//g/xeSGNzcUVatFDVDv9KWRSatOpKi5Bw6iUYxrYQ02Co1 45jugTnGc+zrhMsDND3Prq7R0dMw4MBLUDoLSXrous9VgahAsOQztZCfvPOcV2cp Haew85SLSYx3Ksb/raIBG02+4XjF2xQCgIq2LfA6xdG5SGJd6Lg3ip2vwhogQgWB gQoq1qrPABEBAAGJBHIEGAEKACYWIQQU8maC0JFs3YHje21ht7Um2Y8DUwUCXPA5 6gIbAgUJA8JnAAJACRBht7Um2Y8DU8F0IAQZAQoAHRYhBAl7MTB3rmKgL4TaTfGm Zo+7fVcuBQJc8DnqAAoJEPGmZo+7fVcuilAQAIq/yZi3hoxzSgDblQlcxGKDaVck Oo5kFRK8UWby3peB0bTMV3MWjiCqmN7ryQ3K3126G84yxXCTNSv7kOLsVxy+DBx5 TdfmeBamdX8ZxeDv2bm+KKlZR0FGjJn3SiLgPkrp+GWazOZbKIPsoOuN20h0aeMC 6mvO2Rz8kGUjCWa7/jQnINHkN/GnbnoOXOwn8cfwtt2Mkb8XRe+1lTBwd59ruYNe 0hazOwygmcSgvM6Rm8nXeuElxMdVAiYKXJLej34+ToNN/LgB3TD72cXStmn7z8Zd kaKV+f81cyPR6Vi7laLIIGpquBGSsel5s0pb2PFFxrTZoNLZit/x2Wx47V30Vdk8 4FYjENEb75OpGEErwVRbTQOKREOutXlGwiJe2C1Gof6HCScTfeLybz8ooLn+lSXc B/l/68wqxPezWYxfbMD42YtTEkJcvPvRaouhKEjxASP5EuT3K6T3tujsqg1K/rE3 Bei0krOnTxUCdU1f9B12ZXpntKehwSz5LpnQvr2wR98yBRoKQfT2Yi3XK7QNp46X t8zaIPdVw7qMn3gODyiWSEGCtkRF7exdHZ9yIW2yofFhz3M/tjIvu89tRDwGiDER 8hBk8Z2mcbmlMzQ1Jf+aCQ5YjZUkUN7+s4VN1OT7yYpWcV1pR8cDjyoh5LRXTMwt 3g5mStiVbeIq9mmzogAQAJq6KK9Z28q8mAozKLd++JxefiovydXVfqWY4GoOfhtv Q3Sa2mKpd+8b3nT7b2twIlKPqpIxWPxke7BWS2sShZfqWkltbAR1SH7fkOeN8Y5g 3yBbxvHQJ/KoTHnRb74MrlaP4v4MvuOAn+Xt0wrEt2OKOSD6/jIjnXqfNFuvHfHv tTyf4fVX1sesN9JwRKevc+1ZLjN7jIOuv1en5grwDtGQhy0fZOjISLGKfy6299rn p0alRqLuAhFn7Ru1ZjQVfqD9VEfDCEsjNOxoW95Aa7MVtdoaQ6FEBKK0q1tCjZ8P f2oAyLGQo5y030D+cDd1lTwGUmPvbnz4/fqNAM6KWvZjtGVNJNlLlpd4A9vAgvAo I1wONXSR56sZKqJXbOSIfS8AlzFWiSf6IyPpU5ozVHLGySAVqtrRW3ci4COeKsRL 85W4iSGCl5uYZcE9weotakUO0R0u/pi07BJwTR5hvn9IDdeyYreqDMQDMF9yHCXg +oCGJAB/+1vHkSABRsWbCERyB33ExjnDr7z/kWOhfAdiZ1+MpP2IPtDEXcufKayS gpBAZYEeLAmNoDwtg+milSD3B7TTM11IY1e/Utq70Az/8BOK6wfyIsmPfeh9CHoW +JAbpbdOUs+rH+gjHE2Pls8BDLp1EoHVPGVavhfSlRbG6oF45s+gdL2PTKzEo+u7 =xFcH -----END PGP PUBLIC KEY BLOCK-----
March 11, 2019
Back in 2014 I blogged about several ideas about how to make Firefox updates smaller.
Since then, we have been able to implement some of these ideas, and we also landed a few unexpected changes!
It's hard to measure exactly what the impact of all these changes are over time. As Firefox continues to evolve, new code and dependencies are added, old code removed, while at the same time the build system and installer/updater continue to see improvements. Nevertheless I was interested in comparing what the impact of all these changes would be.
To attempt a comparison, I've taken the latest release of Firefox as of March 6, 2019, which is Firefox 65.0.2. Since most of our users are on Windows, I've downloaded the win64 installer.
Next, I tried to reverse some of the changes made below. I re-compressed omni.ja, used bz2 compression for the MAR files, re-added the deleted images and startup cache, and used the old version of mbsdiff to generate the partial updates.
|Format||Current Size||"Old" Size||Improvement (%)|
|Partial Update (from 64.0.2)||14,935,692||28,080,719||47%|
Small updates FTW!
Ideally most of our users are getting partial updates from version to version, and a nearly 50% reduction in partial update size is quite significant! Smaller updates mean users can update more quickly and reliably!
One of the largest contributors to our partial update sizes right now are the binary diff size for compiled code. For example, the patch for xul.dll alone is 13.8MB of the 14.9MB partial update right now. Diffing algorithms like courgette could help here, as could investigations into making our PGO process more deterministic.
Here are some of the things we've done to reduce update sizes in Firefox.
This one is a bit counter-intuitive. omni.ja files are basically just zip files, and originally were shipped as regular compressed zips. The zip format compressed each file in the archive independently, in contrast to something like .tar.bz2 where the entire archive is compressed at once. Having the individual files in the archive compressed makes both types of updates inefficient: complete updates are larger because compressing (in the MAR file) already compressed data (in the ZIP file) doesn't yield good results, and partial updates are larger because calculating a binary diff between two compressed blobs also doesn't yield good results. Also, our Windows installers have been using LZMA compression for a long time, and after switching to LZMA for update compression, we can achieve much greater compression ratios with LZMA of the raw data versus LZMA of zip (deflate) compressed data.
The expected impact of this change was ~10% smaller complete updates, ~40% smaller partial updates, and ~15% smaller installers for Windows 64 en-US builds.
Pretty straightforward idea: LZMA does a better job of compression than bz2. We also looked at brotli and zstd for compression, but LZMA performs the best so far for updates, and we're willing to spend quite a bit of CPU time to compress updates for the benefit of faster downloads.
LZMA compressed updates were first shipped for Firefox 56.
The expected impact of this change was 20% reduction for Windows 64 en-US updates.
This came out of some investigation about why partial updates were so large. I remember digging into this in the Toronto office with Jeff Muizelaar, and we noticed that one of the largest contributors to partial update sizes were the startup cache files. The buildid was encoded into the header of startup cache files, which effectively changes the entire compressed file. It was unclear whether shipping these provided any benefit, and so we experimented with turning them off. Telemetry didn't show any impact to startup times, and so we stopped shipping the startup cache as of Firefox 55.
The expected impact of this change was about 25% for a Windows 64 en-US partial update.
Adam Gashlin was working on a new binary diffing tool called bsopt, meant to generate patch files compatible with bspatch. As part of this work, he discovered that a few changes to the current mbsdiff implementation could substantially reduce partial update sizes. This first landed in Firefox 61.
The expected impact of this change was around 4.5% for partial updates for Window 64 builds.
We removed nearly 1MB of unused images from Firefox 55. This shrinks all complete updates and full installers by about 1MB.
By using a tool called
zopflipng, we were able to losslessly recompress PNG
files in-tree, and reduce the total size of these files by 2.4MB, or about 25%.
We removed a few hundred kilobytes of duplicate files from Firefox 52, and put in place a check to prevent further duplicates from being shipped. It's hard to measure the long term impact of this, but I'd like to think that we've kept bloat to a minimum!
November 20, 2018
I've very happy to have had the opportunity to attend and speak at PyCon Canada here in Toronto last week.
PyCon has always been a very well organized conference. There are a wide range of talks available, even on topics not directly related to Python. I've attended previous PyCon events in the past, but never the Canadian one!
My talk was titled How Mozilla uses Python to Build and Ship Firefox. The slides are available here if you're interested. I believe the sessions were recorded, but they're not yet available online. I was happy with the attendance at the session, and the questions during and after the talk.
As part of the talk, I mentioned how Release Engineering is a very distributed team. Afterwards, many people had followup questions about how to work effectively with remote teams, which gave me a great opportunity to recommend John O'Duinn's new book, Distributed Teams.
Some other highlights from the conference:
Using Python to find Russian Twitter troll tweets aimed at Canada A really interesting dive into 3 million tweets that FiveThirtyEight made available for analysis.
PEP 572: The Walrus Operator My favourite quote from the talk: "Dictators are people too!" If you haven't followed Python governance, Guido stepped down as BDFL (Benevolent Dictator for Life) after the PEP was resolved. Dustin focused much of his talk about how we in the Python community, and more generally in tech, need to treat each other better.
Who's There? Building a home security system with Pi & Slack A great example of how you can get started hacking on home automation with really simple tools.
Froilán Irzarry's Keynote talk on the second day was really impressive.
You Don't Need That! Design patterns in Python My main takeaway from this was that you shouldn't try and write Python code as if it were Java or C++ :) Python has plenty of language features built-in that make many classic design patterns unnecessary or trivial to implement.
Numpy to PyTorch Really neat to learn about PyTorch, and leveraging the GPU to accelerate computation.
Flying Python - A reverse engineering dive into Python performance Made me want to investigate Balrog performance, and also look at ways we can improve Python startup time. Some neat tips about examining disassembled Python bytecode.
Working with Useless Machines Hilarious talk about (ab)using IoT devices.
Gathering Related Functionality: Patterns for Clean API Design I really liked his approach for creating clean APIs for things like class constructors. He introduced a module called variants which lets you write variants of a function / class initializer to support varying types of parameters. For example, a common pattern is to have a function that takes either a string path to a file, or a file object. Instead of having one function that supports both types of arguments, variants allows you to make distinct functions for each type, but in a way that makes it easy to share underlying functionality and also not clutter your namespace.
September 20, 2018
Last week, without a lot of fanfare, we shut off the last of the Buildbot infrastructure here at Mozilla.
Our primary release branches have been switched over to taskcluster for some time now. We needed to keep buildbot running to support the old ESR52 branch. With the release of Firefox 60.2.0esr earlier this month, ESR52 is now officially end-of-life, and therefore so is buildbot here at Mozilla.
Looking back in time, the first commits to our buildbot-configs repository was over 10 years ago on April 27, 2008 by Ben Hearsum: "Basic Mozilla2 configs". Buildbot usage at Mozilla actually predates that by at least two years, Ben was working on some patches in 2006.
Earlier in my career here at Mozilla, I was doing a lot of work with Buildbot, and blogged quite a bit about our experiences with it.
Buildbot served us well, especially in the early days. There really were no other CI systems at the time that could operate at Mozilla's scale.
Unfortunately, as we kept increasing the scale of our CI and release infrastructure, even buildbot started showing some problems. The main architectural limitations of buildbot we encountered were:
Long lived TCP sessions had to stay connected to specific server processes. If the network blipped, or you needed to restart a server, then any jobs running on workers were interrupted.
Its monolithic design meant that small components of the project were hard to develop independently from each other.
The database schema used to implement the job queue became a bottleneck once we started doing hundreds of thousands of jobs a day.
On top of that, our configuration for all the various branches and platforms had grown over the years to a complex set of inheritance rules, defaults, and overrides. Only a few brave souls outside of RelEng managed to effectively make changes to these configs.
Changes are local to the branches they land on. They ride the trains naturally. No need for ugly looooooops.
Developers can self-service most of their own requests. Adding new types of tests, or even changing the compiler are possible without any involvement from RelEng!
Buildbot is dead! Long live taskcluster!
June 27, 2018
This week I attended Toronto’s first Smashing Conf.
On Monday I attended one of the pre-conference workshops by Dan Mall’s “Design workflow for a multi-device world”. Dan guided us through the process of defining a problem, brainstorming objectives and key results (aka OKRs) and made us work together to build some of what we decided to tackle. We divided the whole classroom into five or six teams with 5 to 7 members each. Each team had various skillsets (e.g. designers and coders).
In my team, the “bike shed” team, we decided to rewrite TTC’s trip planning feature. We did not manage to finish the product, however, we managed to build one of the three objectives and partially complete another. The team had two people who did compositions, two people who could code and one person helping us collaborate and coordinate.
This exercise included few new things to me. For instance, I worked within a team context to create a product rather than building a feature by myself.
It was also a new experience for me to work closely with a designer. We chose to build a multi-option toggle feature to mix transit methods. The process started with writing on paper what he had in mind. I tried building a prototype from scratch to see if I understood what he wanted. I did not get it quite right the first time so wedecided to search in codepen to find similar. Once we found something we liked I started iterating on the code while he started preparing the icons for me to use. By the end of it we had something that worked but did not have enough time to complete. This is the codepen I forked and this is the unfinished feature where I left it at.
This exercise was a very humbling experience as I felt the pressure to produce something for a designer (Scott from Motorola services) that was right there beside me and I was “the coding expert”. I quote coding expert as I barely have a year of frontend experience. I started from a forked pen that had roughly what Scott wanted, however, I knew I was going to face a very difficult time not before long. The codepen had been written using few non-standard languages (pugjs and SCSS) instead of writing standard HTML & CSS. Another difficulty I knew I was going to face was that I did not have experience turning an image into a toggle button. The forked pen only had text inside the buttons. I deferred solving it by only dealing with text labels at first, instead I addressed other issues before integrating the icons which would require some extra research.
It was also the first time working with another coder (Sheneille Patil) on a fast paced environment. We needed to quickly figure our own development culture. Creating a GitHub repository was not an option as she was not comfortable with it. We decided to turn to codepen.io and to build features that would not conflict with each other. The final plan was to collect our different pieces of code and merge them in a single pen. We did not have enough time to get to this.
I hope you found something interesting out of this post. It’s not my tipical programming related post. I’m very grateful to SmashingConf for having lined up such great speakers and very practical workshops and for Mozilla to support my learning.
June 06, 2018
For a long time Mozilla’s JS team and others have been using https://arewefastyet.com to track the JS engine performance against various benchmarks.
In the last little while, there’s been work moving those benchmarks to another continuous integration system and we have the metrics in Mozilla’s Perfherder. This rewrite will focus on using the new generated data.
If you’re curious on the details about the UI refresh please visit this document. Feel free to add feedback. Stay tuned for an update next month.
June 04, 2018
Easlier this year I had to split a Koa/SPA app into two separate apps. As part of that I switched from webpack to Neutrino.
Through this work I learned a lot about full stack development (frontend, backend and deployments for both). I could write a blog post per item, however, listing it all in here is better than never getting to write a post for any of them.
Note, I’m pointing to commits that I believe have enough information to understand what I learned.
Npm packages to the rescue:
- How to better configure HMR
- Make HMR work by closing the previous server
- Add source-map-support to help development
- Start project on debugging mode (use --inspect rather than inspect)
- How to properly start Neutrino in production
Neutrino has been a great ally to me and here’s some knowledge on how to use it:
- Add PostCss support
- Specify custom env variables to be passed down into the app
- Add a favicon.ico to your app
- Configure Travis to use --inspect with Neutrino after any failures
- Include a custom font to your SPA
- How to properly call Neutrino with Istanbul
Heroku was my tool for deployment and here you have some specific notes:
- Add a postbuild script for Heroku as well as adding a static.json to configure how to server the frontend code
- Tell the server to send all routes from the root (since we use dynamic routing via React-router)
- Same change but for Netlify
- Changes to enable Heroku review apps and configue for SPAs
May 30, 2018
Netlify makes it very easy to deploy your static sites, however, it needs some initial configuration.
You won’t find Neutrino as one of the tools listed in their docs, thus, adding some docs in here. We’ll see if my instructions are right and maybe ask them to include them in their docs.
When you create a new site you will connect your repository and you will be asked to fill in the following:
- Branch to deploy: master // Selected by default
- Build command: yarn build
- Publish directory: build // Neutrino’s default
NOTE: I prefer yarn over npm .
In few minutes your site will be up and running. You won’t need to do anything else.
May 28, 2018
Back in January I had to make a critical decision. I had to determine if to separate the Firefox health dashboard (formely known as Platform health) into a backend and frontend projects or to keep it together.
The intent was to make it easier to maintain the project by reducing the complexity of having code that is presentational versus processing code. I also wanted to remove the boilerplate needed for webpack and babel. It was also beneficial to have the liberty of changing packages without worrying of regressing the frontend or the backend. The only disadvantages was to have to do the work and that we might need in the future coordinated changes (or versioned APIs). We did not see the disadvantage of code being duplicated since there wasn’t any (or much — I can’t recall now) shared between the two apps.
This all came from hitting a very odd production specific issue. I thought this was all caused from the complex webpack configuration the project had. Because we were not making progress determining the root issue I decided to switch to Neutrino. Switching to Neutrino made everything easier, however, it was unclear how to make it work with the original project’s design. The original design had the frontend files being served as static assets of the Koa app. Switching to Neutrino took away webpack headaches since it makes good default configuration options for the project.
Keeping both frontend and backend apps within the same repository complicated the deployment story since there were some Heroku restrictions. I tried using subtrees, however, it still required manual intervention (see explanation). I didn’t know at the time that we could have deployed the backend to Heroku while deploying the frontend to Netlify. This would have allowed to keep both project within the same repository. Alas! We now have two repositories.
If you want to look at the code changes you can see them here.
At the beginning of the month I came back from my last few weeks of parental leave (thanks Mozilla!). While I was away Sarah Clements took over some Firefox Quantum release criteria work and I’m pleased to see that she managed to tackle everything well by herself.
Some of the major changes she made was to separate the Quantum criteria page into 32-bit and 64-bit. This simplifies the graphs and allows release stakeholders to see more clearly how one specific architecture is doing.
She also added the new release criteria for Firefox’s GeckoView efforts.
To learn more you can visit https://health.graphics to see the changes.
If you would like to contribute visit https://github.com/mozilla/firefox-health-dashboard.
April 25, 2018
People commented on how much faster the Beta and Release releases were compared to the ESR release, so I wanted to dive into the releases on the different branches to understand if this really was the case, and if so, why?
| Firefox ESR 52.7.2 | Firefox 59.0.1 | Firefox 60.0b4 ------------------ | ------------------ | --------------- | -------------- Fix landed in HG | 23:33:06 | 23:31:28 | 23:29:54 en-US builds ready | 03:19:03 +3h45m | 01:16:41 +1h45m | 01:16:47 +1h46m Updates ready | 08:43:03 +5h42m | 04:21:17 +3h04m | 04:41:02 +3h25m Total | 9h09m | 4h49m | 5h11m
(All times UTC from 2018-03-15 -> 2018-03-16)
We can see that Firefox 59 and 60.0b4 were significantly faster to run than ESR 52 was! What's behind this speedup?
Release Engineering have been busy migrating release automation from buildbot to taskcluster . Much of ESR52 still runs on buildbot, while Firefox 59 is mostly done in Taskcluster, and Firefox 60 is entirely done in Taskcluster.
In ESR52 the initial builds are still done in buildbot, which has been missing out on many performance gains from the build system and AWS side. Update testing is done via buildbot on slower mac minis or windows hardware.
The Firefox 59 release had much faster builds, and update verification is done in Taskcluster on fast linux machines instead of the old mac minis or windows hardware.
The Firefox 60.0b4 release also had much faster builds, and ended up running in about the same time as Firefox 59. It turns out that we hit several intermittent infrastructure failures in 60.0b4 that caused this release to be slower than it could have been. Also, because we had multiple releases running simultaneously, we did see some resource contention for tasks like signing.
For comparison, here's what 60.0b11 looks like:
| Firefox 60.0b11 ------------------ | --------------- Fix landed in HG | 18:45:45 en-US builds ready | 20:41:53 +1h56m Updates ready | 22:19:30 +1h37m Total | 3h33m
Wow, down to 3.5 hours!
In addition to the faster builds and faster update tests, we're seeing a lot of wins from increased parallelization that we can do now using taskcluster's much more flexible scheduling engine. There's still more we can do to speed up certain types of tasks, fix up intermittent failures, and increase parallelization. I'm curious just how fast this pipeline can be :)
April 20, 2018
Over the past few weeks we've hit a few major milestones in our project to migrate all of Firefox's CI and release automation to taskcluster.
Firefox 60 and higher are now 100% on taskcluster!
At the end of March, our Release Operations and Project Integrity teams finished migrating Windows tests onto new hardware machines, all running taskcluster. That work was later uplifted to beta so that CI automation on beta would also be completely done using taskcluster.
This marked the last usage of buildbot for Firefox CI.
Periodic updates of blocklist and pinning data
Last week we switched off the buildbot versions of the periodic update jobs. These jobs keep the in-tree versions of blocklist, HSTS and HPKP lists up to date.
These were the last buildbot jobs running on trunk branches.
And to wrap things up, yesterday the final patches landed to migrate partner repacks to taskcluster. Firefox 60.0b14 was built yesterday and shipped today 100% using taskcluster.
A massive amount of work went into migrating partner repacks from buildbot to taskcluster, and I'm really proud of the whole team for pulling this off.
So, starting today, Firefox 60 and higher will be completely off taskcluster and not rely on buildbot.
It feels really good to write that :)
We've been working on migrating Firefox to taskcluster for over three years! Code archaeology is hard, but I think the first Firefox jobs to start running in Taskcluster were the Linux64 builds, done by Morgan in bug 1155749.
Into the glorious future
It's great to have migrated everything off of buildbot and onto taskcluster, and we have endless ideas for how to improve things now that we're there. First we need to spend some time cleaning up after ourselves and paying down some technical debt we've accumulated. It's a good time to start ripping out buildbot code from the tree as well.
We've got other plans to make release automation easier for other people to work with, including doing staging releases on try(!!), making the nightly release process more similar to the beta/release process, and for exposing different parts of the release process to release management so that releng doesn't have to be directly involved with the day-to-day release mechanics.
March 03, 2018
Firefox, now 100% buildbot-free!
First, the good news - Developer Edition 60.0b1 will be the first release in nearly 10 years done without using buildbot. This is an amazing milestone, and I'm incredibly proud of everybody who has contributed to make this possible!
Long time, no update
How did we get here? It's been, uh, almost 6 months since I last posted an update about our migration to Taskcluster.
In my last update, I described our plans for the end of 2017...
We're on track to ship builds produced in Taskcluster as part of the 56.0 release scheduled for late September. After that the only Firefox builds being produced by buildbot will be for ESR52. Meanwhile, we've started tackling the remaining parts of release automation. We prioritized getting nightly and CI builds migrated to Taskcluster, however, there are still parts of the release process still implemented in Buildbot. We're aiming to have release automation completely migrated off of buildbot by the end of the year. We've already seen many benefits from migrating CI to Taskcluster, and migrating the release process will realize many of those same benefits.
How'd we do?
We're past the end of 2017, so how are we doing?
(side note: 57 had the most complex update scenarios we've ever had to support for Firefox...a subject for another post!)
Post-56.0, our release process was using Taskcluster exclusively for producing the initial builds, and all the release process scheduling. We were still using Buildbot for many of the post-build tasks, like l10n repacks, publishing updates, pushing files to S3, etc. Once again we relied on the buildbot bridge to allow us to integrate existing buildbot components with the newer taskcluster pipeline. I learned from Kim Moir that this is a great example of the strangler pattern.
In the fall of 2017, we decided to begin migrating all of the scheduling logic for release automation into taskcluster using the in-tree taskgraph scheduling system. We did this for a few reasons...
Having the release scheduling logic ride the trains is much more maintainable. Previous to this we had an externally defined release pipeline in our releasetasks repo. It was hard to keep this repository in sync with changes required for beta/release and ESR branches.
More importantly, having the release scheduling logic in-tree meant that we could then rely on chain-of-trust to verify artifacts produced by the release pipeline.
We felt that having the complete release pipeline defined in taskcluster would make it easier for us to tackle the remaining buildbot bridge tasks in parallel.
We hit this milestone in the 58 cycle. Starting with 58.0b3, Firefox and Fennec releases were completely scheduled using the in-tree taskgraph generation. We also migrated over the l10n repacks at the same time, removing a longstanding source of problems where repacks would fail when we first got to beta due to environmental differences between taskcluster and buildbot.
Still, as of 58, much of release automation still ran on buildbot, even if Taskcluster was doing all the scheduling.
Since December, we've been working on removing these last few pieces of buildbot from the release process. Progress was initially a bit slow, given Austin and Christmas, but we've been hard at work in the new year.
That brings us to today.
We've moved uptake monitoring, update verify (and made it 2x faster too!), update submission, final verify, bouncer submission, version bumping and tagging, balrog submission all to run in Taskcluster via various kinds of scriptworkers.
As I mentioned above, DevEdition 60.0b1 will be the first release in nearly 10 years done without using buildbot. The rest of the 60 release cycle will follow suit, and once 60 hits the release channel, only ESR52 will remain on buildbot!
February 27, 2018
I discovered Neutrino in the last year and it has become my preferred tool to bootstrap any JS project.
Here’s the definition of what Neutrino is from the project’s site:
Neutrino combines the power of webpack with the simplicity of presets.
For me, the main advantage Neutrino has, is that it removes the need to write webpack.config.babel.js configuration files and that starting a project is a simple wizard.
To get started it is as simple as this:
npx @neutrinojs/create-project <directory-name>
That will start a wizard that will help you select the stack you want:https://medium.com/media/4d6a5ad4e44a77f220ead444c12dc8b2/href
Once the wizard completes you can change to that directory and start your project with npm start; That’s it! You don’t need any configuration changes. The minimum number of files for your project to start are now in place.
Neutrino is opinionated and has a bunch of good defaults that works for both production and development. You can always customize the configuration and/or create your own presets.
If you want an example of:
- A basic React project with Neutrino check this .neutrinorc.js file
- A basic Koa project with Neutrino check this .neutrinorc.js file
I hope you give it a try!
November 10, 2017
Few months ago I decided to become a frontend developer and this is my first product: Firefox code coverage diff viewer.
The Firefox code coverage diff viewer allows determining code coverage changes for added lines per changeset.
The main purpose of this tool is to determine if Release Management can use code coverage data to help them make risk analysis about individual changesets.
marco has been working on collecting the code coverage data from Mozilla’s continuos integration and developed the backend this app uses.
The main view shows you changesets from the last ten pushes on mozilla-central that: 1) have coverage data and 2) are not merges or backouts.
From there you can navigate to individual changesets. The diff viewer will only highlight added lines as having coverage or no coverage. There are also some added lines that will not have highlighting since it is non-code added lines.
Brief technical details
- The app is using React written with JSX and ES8
- The app is auto deployed to Heroku
- The app’s state is normalized in case we wanted to use Redux
The app got promoted to beta this week and development on it will stop for this quarter. When this tool becomes clearly essential for Release Management we can reinvest on it.
November 03, 2017
Background story: I’ve been working with Mozilla full-time since 2009 (contributor in 2007 — intern in 2008). I’ve been working with the release engineering team, the automation team (A-team) and now within the Product Integrity organization. In all these years I’ve been blessed with great managers, smart and helpful co-workers, and enthusiastic support to explore career opportunities. It is an environment that has helped me flourish as a software engineer.
I will go straight to some of the benefits that I’ve enjoyed this year.
Three months at 100% of my salary. I did not earn bonus payouts during that time, however, it was worth the time I spent with my firstborn. We bonded very much during that time, I learned how to take care of my family while my wife worked, and I can proudly say that he’s a “daddy’s boy” :) (Not that I spoil him!).
Working from home 100% of the time
My favourite benefit. Period.
It really helps me as an employee, as I don’t enjoy commuting and I tend to talk a lot when I’m in the office. My family is very respectful of my work hours and I’m able to have deep-thought sessions in the comfort of my own home.
This is not a benefit that a lot of companies give, especially the bigger ones which expect you to relocate and come often to the office. I chuckle when I hear a company offer that their employees can work from home only a couple of days per week.
I appreciate that Mozilla allocaters some of their budget to pay for anything related to employee wellness (mental, spiritual & physical). Knowing that if I don’t use it I will lose it causes me to think about ways to apply the money to help me stay in shape.
This year, after a re-org and many years of doing the same work, I found myself in need of a new adventure — I get bored if I don’t feel as though I’m learning. With my manager’s support (thanks jmaher!), I embarked on a journey to become a front-end developer. Mozilla also supported me by paying for me to complete a React Nanodegree as part of the company’s learning budget.
Thank you, Mozilla, for your continued support!
August 30, 2017
All your nightlies are belong to Taskcluster
We completed a huge milestone in July: starting in Firefox 56, we've been doing all our nightly Firefox builds in Taskcluster.
In August, after 56 merged to Beta, we've also been doing our Firefox Beta builds using Taskcluster. We're on track to be shipping Firefox 56, built from Taskcluster to release users at the end of September.
Windows and macOS each had their own challenges to get them ready to build and ship to our nightly users.
We've had Windows builds running in Taskcluster for quite a while now. The biggest missing piece stopping us from shipping these builds was signing. Windows builds end up being a bit complicated to sign.
First, each compiled .exe and .dll binary needs to be signed. Signing binaries in windows changes their contents, and so we need to regenerate some files that depend on the exact contents of binaries. Next, we need to create packages in various formats: a "setup.exe" for installing Firefox, and also MAR files for updates. Each of these package formats in turn need to be signed.
In buildbot, this process was monolithic. All of the binary generation and signing happened as part of the same build process. The same process would also publish symbols to the symbol server and publish updates to Balrog The downside of this monolithic process is that it adds additional dependencies to the build, which is already a really long process. If something goes wrong with signing, or publishing updates, you don't want to have to restart a 2 hour build!
As part of our migration to Taskcluster, we decided that builds should minimize their external dependencies. This means that the build task produces only unsigned binaries, and it is the responsibility of downstream tasks to sign them. We also wanted discrete tasks for symbol and update submission.
One wrinkle in this approach is that the logic that defines how to create a setup.exe package or a MAR file lives in tree. We didn't want to run that code in the same context as the code that generates signatures.
Our solution to this was to create a sequence of build -> signing -> repackage -> signing tasks. The signing tasks run in a restricted environment while the build and repackage tasks have access to the build system in order to produce the required artifacts. Using the chain of trust, we can demonstrate that the artifacts weren't tampered with between intermediate tasks.
Finally, we need to consider l10n repacks. We ship Firefox in over 90 locales. The repacking process downloads the en-US build and replaces the English strings with localized strings. Each of these repacks needs to be based on the signed en-US build. Each will also generate its own setup.exe and complete MAR for updates.
macOS performance (and why your build directory matters)
Like Windows, we've had macOS builds running on Taskcluster for a long time. Also like Windows, we had to solve signing for macOS.
However, the biggest blocker for the macOS build migration, was a performance bug. Builds produced on Taskcluster showed some serious performance regressions as compared to the builds produced on buildbot.
Many very smart people looked at this bug since it was first discovered in February. They compared library versions being used. They compared compiler versions and compiler flags. They even inspected the generated assembly code from both systems.
Mike Shal stumbled across the first clue to what was going on in June: if he stripped the Taskcluster binaries, then the performance problems disappeared! At this point we decided that we could go ahead and ship these builds to nightly users, knowing that the performance regression would disappear on beta and release.
Later on, Mike realized that it's not the presence or absence of symbols in the binary that cause the performance hit, it's what directory the builds are done in. On buildbot we build under /builds/..., and on Taskcluster we build under /home/...
Read the bug for more gory details. This is definitely one of the strangest bugs I've seen.
We learned quite a bit in the process of migrating Windows and macOS nightly builds to Taskcluster.
First, we gained a huge amount of experience with the in-tree scheduling system. There's a bit of a learning curve to climb, but it's an extremely powerful and flexible system. Many kudos to Dustin for his work creating the foundation of this system here. His blog post, "What's So Special About "In-Tree"?", is a great explanation of why having this code as part of Firefox's repository is so important.
One of the killer features of having all the scheduling logic live in-tree is that you can do quite a bit of work locally, without requiring any build infrastructure. This is extremely useful when working on the complex build / signing / repackage sequence of tasks described above. You can make your changes, generate a new task graph, and inspect the results.
Once you're happy with your local changes, you can push them to try to validate your local testing, get your patch reviewed, and then finally landed in gecko. Your scheduling changes will take effect as soon as they land into the repo. This made it possible for us to do a lot of testing on another project branch, and then merge the code to central once we were ready.
We're on track to ship builds produced in Taskcluster as part of the 56.0 release scheduled for late September. After that the only Firefox builds being produced by buildbot will be for ESR52.
Meanwhile, we've started tackling the remaining parts of release automation. We prioritized getting nightly and CI builds migrated to Taskcluster, however, there are still parts of the release process still implemented in Buildbot.
We're aiming to have release automation completely migrated off of buildbot by the end of the year. We've already seen many benefits from migrating CI to Taskcluster, and migrating the release process will realize many of those same benefits.
Thank you for reading this far!
Members from the Release Engineering, Release Operations, Taskcluster, Build, and Product Integrity teams all were involved in finishing up this migration. Thanks to everyone involved (there are a lot of you!) to getting us across the finish line here.
In particular, if you come across one of these fine individuals at the office, or maybe on IRC, I'm sure they would appreciate a quick "thank you":
- Aki Sasaki
- Dustin Mitchell
- Greg Arndt
- Joel Maher
- Johan Lorenzo
- Justin Wood
- Kim Moir
- Mihai Tabara
- Mike Shal
- Nick Thomas
- Rail Aliiev
- Rob Thijssen
- Simon Fraser
- Wander Costa
June 13, 2017
The Release Engineering team fully-automated the publication of Firefox for Android in version 53.0. Let’s see what was already there and how things have changed since version 53.0.
This blog post is a part of a serial. Checkout the other posts:
- How did the project start?
- Presentation of the solution
- 5 things I would have loved knowing about Google Play
- What’s next? Want to contribute? & Special thanks [Here]
What’s next? Want to contribute?
The initial publication was the first big step of the project. The next big one will be to expand the release workflow to simplify what percentage of the user base is getting the latest update.
Finally, if you have some stories on how you use Google Play you would like to share, questions or comments about our tools, please leave a comment.
A lot of people were involved in a way or another in this project. By alphabetical order:
- Rail Aliiev, for the help spawning nightly tasks against the correct events and more hints
- Chris AtLee, for guiding me during the entire project and always pointing me in the right direction
- Chris Cooper, for organizing the Taskcluster migration work week, which allowed to put the last bits for Firefox 53.0
- Julien Cristau, for giving feedback from Release Management’s point of view at almost every step we made
- Guillaume Destuynder, for reviewing the security risks
- Ben Hearsum, for reading and testing out the documentation in place
- Liz Henry, for helping with Google Play permissions
- Sylvestre Ledru, for the initial solution and the overall administration of Google Play
- Francesco Lodolo, for detecting and helping diagnose locales-related bugs
- Jordan Lund, for the help in the 52 release duty cycle, while the taskgraph was being finished up in 53
- Ritu Kothari, for initiating the new rollout process
- Amy Rich, for watching my back with Puppet
- Aki Sasaki, for helping me with scriptworker and reviewing so many of my patches across several components
- Mihai Tabara, for wiring the last parts of the Taskcluster graph together
- Justin Woods, for the tips and tricks of Taskcluster’s taskgraph
- Anybody involved in the Android Taskcluster migration, and who might not be in this list
- The reviewers of these articles
- And anyone I missed above
You can read and leave comments on this Github issue.
June 12, 2017
The Release Engineering team fully-automated the publication of Firefox for Android in version 53.0. Let’s see what was already there and how things have changed since version 53.0.
This blog post is a part of a serial. Checkout the other posts:
5 things I would have loved knowing about Google Play
This part is more oriented to personal takeaways and a couple of questions that remain unanswered.
It is easy to publish an APK that is not localized
A few checks done in pushapk_scriptworker are actually because of previous errors. The number of locales is one of them. A few weeks after Fennec Aurora was shipped to Google Play, some users started to see their browser in English, even though their phone is set in a different locale and Aurora used to be in their own locales.
Like said previously, APKs were originally uploaded via a script. This was true also for the first Aurora APKs. Furthermore, automation on Taskcluster was first tried on Aurora. When I started to roll these bits out, there was a configuration issue. Pushapk_scriptworker picked the en-US-only-APK, instead of the multi-locale one. The fix was fairly simple: just change the APK locations.
Google Play has a lot of ways to detect wrong APKs: by signatures, by package name (none of the Firefox versions share the same one), by version code, and some others. Although, it doesn’t warn about a big change in:
- Size. There is approximately a 10-MB-difference between a single locale build and a multi-locale one. Multi-locale APK are usually less than 40 MB. That APK shrunk by 25%, but not for good reasons.
- The directory list of the archive. Manifest files were smaller, there were 90 times less files in some directories.
Of course, from one stable version to another, a lot of files may change in the archive. Asking Google to watch out for everything doesn’t seem reasonable. Although, when I started working on Google Play, it left me the feeling of being well-hardened. At that time, I thought Google Play did check the locales within an APK.
The consequence I take away: If your app has different build flavors (like single vs multi-locales), I recommend you write your own sanity checks, first.
Locales on the Play Store are independent from locales in APK
It might sound obvious after explaining of the previous issue, but this error message confused several Mozillians:
Tried to set recent changes text for APK version 12345678 for language es-US. Language is not associated with the app.
We hit it with a couple of locales, at pace of 1 per month, approximately. Like explained in the architecture part, locales are defined in an external service, stores_l10n. We have had many theories about it:
- This locale is not supported by Google Play
- The locale code (for instance “es-US”) expected by Google Play is not the one we provide. We may want to find the list of the locales officially supported
- We don’t ship this locale within the APK, and the Play Store detects it
Actually, the fix ended up being simple. “Recent changes” is something we want to update on every new APK. But because the descriptions were more set in stone, they were not a part of the automated workflow. Actually, stores_l10n released had recently released new locales each time we hit the problem. That error message was actually telling us the descriptions of these new locales had never been uploaded. Once we figured this out, it became a part of the regular update workflow.
You cannot catch everything when you don’t commit transactions
The dry-run feature in MozApkPublisher, which just doesn’t commit the Google Play transaction, helps in detecting early failures. For instance: wrong version codes, wrong package names. Nevertheless, we have hit cases where dry runs went smoothly and we had to diagnose new issues at commit time.
- Locales mismatch. The previous issue was not dectected at the upload time, likely for the reason that listings and the recent changes are 2 different API calls. I assume, because you can call these API in whatever order within the same transaction, Google doesn’t check until the final state is fully known.
- Permissions not granted to the account. Due to Firefox Aurora being stopped, we restricted the Google Play account in charge of Aurora to not be able to upload any APK anymore. Yet, we decided to publish Nightly to the same product as Aurora (in order to not strand users on a unmaintained version). Our tests went fine, Google Play accepted our APKs. But the day of the go-live, a new error came up saying we are not able to upload APKs after all. I don’t have any theory for this scenario, but if you do, I would love to hear it!
User fractions can also be specified on other tracks (but that’s not a feature)
Fennec Release 53.0 is the first version which was entirely published via Taskcluster. Mozilla also uses the rollout track on Release only. Beta is pushed to the production one (and that is actually something we are reviewing). Sadly, there was another configuration error: even though, the user fraction was specified, the configured track was the production one. Google Play didn’t raise any error (even at commit time), starting a full-throttled release. At that time, I contacted the Google Play support, to ask if it was possible to switch back to rollout. The person was very courteous. He explained they were not able to perform that kind of action. This is why they transmitted my request to the tech team, who will follow up by email.
In parallel, we have fixed the configuration error and implemented our own check in MozApkPublisher.
There is no way to rollback to previous APKs, even if you ask a human
The previous configuration error could have remained brieve, if somebody didn’t report what seemed like an important crash, 1 hour later. At that point, the Release Management team wanted to stall updates, in order to not spread the regression too much. We were still waiting on the support’s answer, but I reached out to them again since our request became different. I told the new contact I had about the previous issue, the new one, and the fact we were running against the clock. Sadly for us, the person only gave us the option to wait on the email follow up.
About 16 hours later, we got the email with the official answer:
Unfortunately, we cannot remove the APK in question, neither can we claw back the APK from the users that have already installed this update.
If you want to stop further users from installing this APK, then you need to make another release that deactivates this APK and add the APK that you want users to install instead.
With that in mind, we have been thinking of another way to use Google Play, where the rollout track would be more extensively used.
What’s next? Want to contribute? & Special thanks
See the next post.
You can read and leave comments on this Github issue.
June 07, 2017
The Release Engineering team fully-automated the publication of Firefox for Android in version 53.0. Let’s see what was already there and how things have changed since version 53.0.
This blog post is a part of a serial. Checkout the other posts:
- How did the project start?
- Presentation of the solution [Here]
- 5 things I would have loved knowing about Google Play
- What’s next? Want to contribute? & Special thanks
Presentation of the solution
- Interact with the Play Store.
- Securely authenticate to it. It offers several ways to authenticate, including P12 certificates.
- Securely store the authentication credentials.
- Securely fetch the builds.
Based on Taskcluster
Mozilla, and more specifically the Release Engineering team, uses Taskcluster to implement the Firefox release workflow. The workflow can be summed up as:
- Build Firefox with all supported locales (languages)
- Sign these builds
- Publish them everywhere (on https://archive.mozilla.org/, on https://www.mozilla.org/firefox/, via updates, etc.)
Each step is defined by its own set of tasks. These tasks are processed by specialized workers (represented by worker types). Those workers basically run a script against parameters given in the task definition.
Therefore, publishing to Google Play was a matter of creating a new Taskcluster task, which will be processed by a dedicated script and executed by its own worker type.
With some extra-security features
The aforementioned script must be bootstrapped to be integrated to the rest of Taskcluster. There are several ways to bootstrap scripts for Taskcluster. One of them is to create a docker image which Taskcluster pulls and run.
However, because of the needs stated above, we decided to go with a security-focused framework: scriptworker. Scriptworker was initially created to perform one of the most critical operation security-wise: sign builds. The framework has some great interesting features:
- It securely downloads artifacts. Files are forced to be downloaded over https. Checksums and signatures are checked.
- It validates that the task definitions were not changed between the time of creation and the time of execution. This prevents some tasks to be duplicated and edited to introduce extra-commands, which may be used to tamper a build, for instance.
- It abstracts some of the Taskcluster details, thus you just have write a script (language-agnostic) that does what your worker has to do.
How pieces are wired together
Here’s a general view of how things are wired together:
- 1. The “decision task” creates a task for pushapk_scriptworker
- 2/3. Scriptworker polls for pending tasks and check their scopes. It downloads APKs via Chain of Trust. Scriptworker checks if the upstream tasks were altered.
- 4. Scriptworker defers valid tasks to pushapkscript. The latter validates APKs signatures, makes sure every APK architecture is present.
- 5. Pushapkscript calls MozApkPublisher with credentials and on-disk locations of APKs
- 6/7. MozApkPublisher verifies whether APKs contain several locales. It fetches localized strings displayed on Google Play Store (aka “listings” and “what’s new section”)
- 8. MozApkPublisher opens the Google play credentials.
- 9. MozApkPublisher publishes APKs, listings and “what’s new”
1. Task creation
There are many ways to submit the definition of a task to Taskcluster. For example, you can:
- Use the web task creator (useful for debugging)
- Call the Taskcluster API via one of the libraries (interesting if you want a bot that spawns tasks)
- Generate the task via taskcluster/taskgraph that lives “in-tree”, that is to say, alonside the Firefox code.
Each of them was used at some point, but the ultimate solution relies on the last one. The taskgraph is a graph generator which, depending on given parameters, creates a graph of builds, tests and deployment tasks. Taskgraph generation is run on Taskcluster too, under what we commonly call “the decision task”. This solution benefits from being on hg.mozilla.org: it is versioned and only vouched people are able to modify it.
Moreover, taskgraph generates what is necessary for scriptworker to validate the task definitions and artifacts. To do so, taskgraph:
- Creates a JSON representation of the graph,
- Creates another JSON file that describes the artifacts generated (including the JSON of the graph) and signs it.
2. Scriptworker and new tasks
Scriptworker polls tasks from Taskcluster queue. That is actually one of the great things about Taskcluster: workers don’t have to open inbound (listening) ports. This reduces the potential surface of attack. Fetching new tasks is done via this bit of REST API which workers can poll. Speaking of which, workers are authenticated to Taskcluster, which prevents them from claiming a task that it isn’t meant to take.
- Make sure the current task and its dependencies have not changed since its definition. This is done by comparing the JSON the taskgraph generated and the actual definition.
- Check the signatures of every dependency, by looking at a special artifact Chain of Trust creates. This helps to verify no rogue worker processed a upstream task.
- Download artifacts on the worker and verify the checksums.
If all goes well, scriptworker will call pushapkscript
3. Pushapkscript and APKs
Here starts the Android-specific bits. Pushapkscript performs some extra checks on the APKs:
APKs are signed with the correct certificates. In the previous steps, we have only checked the origin of the tasks. Now, we verify the APK itself. This may not sound extremely important because Google Play is vigilant about APK signatures and will refuse any APK for which the signature is not valid. However, it is safer to bail out before any outbound traffic is done to Google Play. Besides, with this check, Google acts as a second factor instead of being the only actor accountable for signatures.
No required processor architecture is missing, in order to upload them all in the same request. We have to publish them at the same time because some Android devices support several architectures. We have already had one big crash on these devices because an x86 APK was overseeded by its “brother in ARM”.
Pushapkscript knows about the location of the Google Play credentials (P12 certificates). It finally gives all the files (checked APKs and credentials) to MozApkPublisher.
4. MozApkPublisher, locales and Google Play
To be honest, MozApkPublisher could have been implemented within pushapkscript, but the split exists for historical reasons and still has a meaning today: this was the script Release Management used before this project got started. It also remains a way to let a human publish, in case of emergency.
It checks that APKs are multi-locale. We serve the same APK, which includes (almost) every locale in it. That’s a verification Google doesn’t do.
It also fetches the latest strings to display on the Play Store (like the localized descriptions). These strings are then posted on Google Play, alongside the APKs.
MozApkPublisher provides a dry-run mode thanks to the transaction mechanism exposed by Google’s API. Nothing is effectively published until the transaction is committed.
5. Pushapk_scriptworker: Scriptworker, Pushapkscript, and MozApkPublisher on the same machine
The 3 pieces live on the same Amazon EC2 instance, under the name pushapk_scriptworker. The configuration of this instance is managed by Puppet. The entire Puppet configuration is public on hg.mozilla.org, with the exception of secrets (Tascluster credentials, P12 certificates) which are encrypted on a seperate machine. Like the main Firefox repository, only vouched people can submit changes to the Puppet configuration.
5 things I would have loved knowing about Google Play
See the next post.
You can read and leave comments on this Github issue.
June 06, 2017
This blog post is a part of a serial. Checkout the other posts:
- How did the project start? [Here]
- Presentation of the solution
- 5 things I would have loved knowing about Google Play
- What’s next? Want to contribute? & Special thanks
How did the project start?
Mozilla ships Firefox every day
This is true for desktop (Windows, Linux, Mac) and Android. However, we don’t ship that often to every user. We have different channels, receiving updates at different frequencies:
- Firefox Nightly (and Aurora until we stopped it) gets updated usually every day (unless an important breakage happens).
- Firefox Beta and Developer Edition get two updates every week on Desktop. On Android, Beta is usually shipped once a week.
- Firefox Release (also known as simply “Firefox”) gets one every six weeks.
About Firefox Aurora
You may have heard, Firefox Aurora has been discontinued in April 2017. Although, these blog posts will talk about it. The main reason is: Most of the experiments were done on Aurora, before it was stopped.
Today, the Android users who were on Aurora have been migrated to Nightly. New users are also given Nightly.
Why do we need Firefox for Android on app stores?
Unlike Firefox for desktop, Android apps have to be uploaded onto application stores (like Google Play Store). Otherwise, they have very low visibility. For instance, Firefox for Android Aurora (codenamed “Fennec Aurora”) was not on Google Play until September 2016, but it was downloadable from our official website (now redirected to Nightly). After we started publishing Aurora on Google Play, we increased our number of users by 5x.
Why are we automating the publication today?
Google didn’t offer a way to script a publication on Play Store, before July 2014. It had to be done manually, from their website. Around that time, a few people from Release Management implemented a first script. One person from the Release Management team ran it every time Beta or Release was ready, from his/her own machine. With Aurora being out, we now have several APKs (one per processor architecture/Android API level, which translates to 2 at the moment: one for x86 processors, the other for ARM) to publish each day.
The daily frequency was new for Fennec. It led to 2 concerns:
- A human has to repeat the same task every day.
- Pushing every day from a workstation increases the surface of security threats.
That is why we decided to make APK publication a part of the automated release workflow.
Presentation of the solution
See the next post.
You can read and leave comments on this Github issue.
May 17, 2017
January 20, 2017
Yesterday, for the very first time, we started shipping Linux Desktop and Android Firefox nightly builds from Taskcluster.
We now have a much more secure, resilient, and hackable nightly build and release process.
It's more secure, because we have developed a chain of trust that allows us to verify all generated artifacts back to the original decision task and docker image. Signing is no longer done as part of the build process, but is now split out into a discrete task after the build completes.
The new process is more resilient because we've split up the monolithic build process into smaller bits: build, signing, symbol upload, upload to CDN, and publishing updates are all done as separate tasks. If any one of these fail, they can be retried independently. We don't have to re-compile the entire build again just because an external service was temporarily unavailable.
Finally, it's more hackable - in a good way! All the configuration files for the nightly build and release process are contained in-tree. That means it's easier to inspect and change how nightly builds are done. Changes will automatically ride the trains to aurora, beta, etc.
Ideally you didn't even notice this change! We try and get these changes done quietly, smoothly, in the background.
This is a giant milestone for Mozilla's Release Engineering and Taskcluster teams, and is the result of many months of hard work, planning, coding, reviewing and debugging.
Big big thanks to jlund, Callek, mtabara, kmoir, aki, dustin, sfraser, jlorenzo, coop, jmaher, bstack, gbrown, and everybody else who made this possible!
December 28, 2016
As 2016 winds down, I wanted to take some time to highlight all the work our Release Engineering team has done this year. Personally, I really enjoy writing these retrospective posts. I think it's good to spend some time remembering how far we've come in a year. It's really easy to forget what you did last month, and 6 months ago seems like ancient history!
We added four people to our team this year!
Aki (:aki) re-joined us in January and has been working hard on developing a security model for Taskcluster for sensitive tasks like signing and publishing binaries.
Rok (:garbas) started in February and has been working on modernizing our web application framework development and deployment processes.
Johan (:jlorenzo) started in August and has been improving our release automation, Balrog, and automatically publishing Android builds to the Google Play Store.
Simon (:sfraser) started in October and has been improving monitoring of our production systems, as well as getting his feet wet with our partial update generation system.
This year we released 104 desktop versions of Firefox, and 58 android versions (including Beta, Release and ESR branches).
5 of those releases were just in the week prior to our all hands meeting in Hawaii!
Several other releases this year were special for particular reasons, and required special efforts on our part. We continued to provide SHA-1 signed installers for Windows XP users. We also produced a special 47.0.2 release in order to try and rescue users stuck on 47. We've never shipped a point release for a previous release branch before! We've also generated partial updates to try and help users on 43.0.1 and 47.0.2 get faster updates to the latest version of Firefox.
We couldn't have shipped so many releases so quickly last week if it weren't for release promotion. Previous to Firefox 46, our release process would generate completely new builds after CI was finished. This wasted a lot of time, and also meant we weren't shipping the exact binaries we had tested. Today, we ship the same builds that CI has generated and tested. This saves a ton of time (up to 8 hours!), and gives us a lot more confidence in the quality of the release.
This is one of those major kinds of changes that really transforms how we approach doing releases. I can't really remember what it was like doing releases prior to release promotion!
We also added support in Shipit to allow starting a release before all the en-US builds are done. This lets our Release Management team kick off a release early, assuming all the builds pass. It saves a person having to wait around watching Treeherder for the coveted green builds.
Windows in AWS
This year we completed our migration to AWS for Windows builds. 100% of our Windows builds are now done in AWS. This means that we now have a much faster and more scalable Windows build platform.
In addition, we also migrated most of the Windows 7 unittests to run in AWS. Previously these were running on dedicated hardware in our datacentre. By moving these tests to AWS, we again get a much more scalable test platform, but we also freed up hardware capacity for other test platforms (e.g. Windows XP).
One of our major focus areas this year was migrating our infrastructure from Buildbot to Taskcluster. As of today, we have:
Scheduled Changes in Balrog means that now we can have machines set background update rate to 0% 24 hours after release, instead of having a human do it.
Balrog itself was migrated from our datacentre in SCL3 into AWS. We now have a much more flexible deployment pipeline.
Balrog has also been one of our best projects for getting volunteer contributions! Many of the work done this year was done by contributors!
Being able to shut off old, crufty and deprecated stuff is an important part of staying agile. This year we were finally able to develop an end of life plan for Windows XP. In addition, we discontinued support for OSX 10.6-10.8, systems without SSE2, and 32-bit OSX systems. Not having to support these old platforms simplifies managing our infrastructure, and also makes product development easier.
2017 is looking like it's going to be another interesting (and busy!) year for RelEng.
Our top priority is to finish the migration to Taskcluster. Hopefully by the end of 2017, the only thing left on buildbot will be the ESR52 branch. This will require some big changes to our release automation, especially for Fennec.
We're also planning to provide some automated processes to assist with the rest of the release process. Releases still involve a lot of human to human handoffs, and places where humans are responsible for triggering automation. We'd like to provide a platform to be able to manage these handoffs more reliably, and allow different pieces of automation to coordinate more effectively.
October 21, 2016
Using Auto Increment Fields to Your Advantage
I just found, and read, Clément Delafargue’s post “Why Auto Increment Is A Terrible Idea” (via @CoreRamiro). I agree that an opaque primary key is very nice and clean from an information architecture viewpoint.
However, in practice, a serial (or monotonically increasing) key can be handy to have around. I was reminded of this during a recent situation where we (app developers & ops) needed to be highly confident that a replica was consistent before performing a failover. (None of us had access to the back end to see what the DB thought the replication lag was.)Read more...
September 30, 2016
The conference started off with Anna Lambert of Shopify welcoming everyone to the conference.
The first speaker was Atlee Clark, Director of App and Developer relations at Shopify who discussed the wheel of diversity.
The wheel of diversity is a way of mapping the characteristics that you're born with (age, gender, gender expression, race or ethnicity, national origin, mental/physical ability), along with those that you acquire through life (appearance, education, political belief, religion, income, language and communication skills, work experience, family, organizational role). When you look at your team, you can map how diverse it is by colour. (Of course, some of these characteristics are personal and might not be shared with others). You can see how diverse the team is by mapping different characteristics with different colours. If you map your team and it's mostly the same colour, then you probably will not bring different perspectives together when you work because you all have similar backgrounds and life experiences. This is especially important when developing products.
This wheel also applies to hiring too. You want to have different perspectives when you're interviewing someone. Atlee mentioned when she was hiring for a new role, she mapped out the characteristics of the people who would be conducting the hiring interviews and found there was a lot of yellow.
So she switched up the team that would be conducting the interviews to include people with more diverse perspectives.
The next talk was by Erica Joy, who is a build and release engineer at Slack, as well as a diversity advocate. I have to admit, when I saw she was going to speak at Beyond the Code, I immediately pulled out my credit card and purchased a conference ticket. She is one of my tech heroes. Not only did she build the build and release pipeline at Slack from the ground up, she is an amazing writer and advocate for change in the tech industry. I highly recommend reading everything she has written on Medium, her chapter in Lean Out and all her discussions on twitter. So fantastic.
Her talk at the conference was "Building a Diverse Corporate Culture: Diversity and Inclusion in Tech". She talked about how literally thousands of companies say they value inclusion and diversity. However, few talk about what they are willing to give up to order to achieve it. Are you willing to give up your window seat with a great view? Something else so that others can be paid fairly? She mentioned that change is never free. People need both mentorship and sponsorship in in order to progress in their career.
I really liked her discussion around hiring and referrals. She stated that when you're hire people you already know you're probably excluding equally or better qualified that you don't know. By default, women of colour are underpaid.
|Pay gap for white woman, African American women and Hispanic women compared to a white man in the United States.|
Some companies have referral system to give larger referral bonuses to people who are underrepresented in tech, she gave the example of Intel which has this in place. This is a way to incentivize your referral system so you don't just hire all your white friends.
|The average white American has 91 white friends and one black friend so it's not very likely that they will refer non-white people. Not sure what the numbers are like in Canada but I'd guess that they are quite similar.|
In addition, don't ask people to work for free, to speak at conferences or do diversity and inclusion work. Her words were "We can't pay rent with exposure".
Spend time talking to diversity and inclusion experts. There are people that have spent their entire lives conducting research in this area and you can learn from their expertise. Meritocracy is a myth, we are just lucky to be in the right place in the right time. She mentioned that her colleague Duretti Hirpa at Slack points out the need for accomplices, not allies. People that will actually speak up for others. So people feeling pain or facing a difficult work environment don't have to do all the work of fighting for change.
In most companies, there aren't escalation paths for human issues either. If a person is making sexist or racist remarks, shouldn't that be a firing offense?
If people were really working hard on diversity and inclusion, we would see more women and people of colour on boards and in leadership positions. But we don't.
She closed with a quote from Beyonce:
Gabriel Fayant from Assembly of Seven Generation's talk was entitled "Walking in Both Worlds, traditional ways of being and the world of technology". I found this quite interesting, she talked about traditional ceremonies and how they promote the idea of living in the moment, and thus looking at your phone during a drum ceremony isn't living the full experience. A question from the audience from someone who worked in the engineering faculty at the University of Toronto was how we can work with indigenous communities to share our knowledge of the technology and make youth both producers of tech, not just consumers.
Read more at: http://www.brainyquote.com/quotes/quotes/b/beyoncekno596349.html
Read more at: http://www.brainyquote.com/quotes/quotes/b/beyoncekno596349.html
conference earlier so you watch it here. It described the progression of printing technology from 7000 years ago until today. Each new technology disrupted the previous one, and it was difficult for those who worked on the previous technology to make the jump to work on the new one.
So according to Sandi, what is your future?
- What you are working on now probably won't be relevant in 10 years
- You will all die
- All the people you love will die
- Your body will start to fail you
- Life is short
- Tell people that you love them
- Guard your health
- Spend time with your kids
- Get some exercise (she loves to bike)
- We are bigger than tech
- Community and schools need help
- She gave the example of Habitat for Humanity where she volunteers
- These organizations also need help to write code, they might not have the knowledge or time to do it right
The last talk I attended was by Sabrina Geremia of Google Canada. She talked about the factors that encourage a girl to consider computer science (encouragement, career perception, self-perception and academic exposure.)
I found that this talk was interesting but it focused a bit too much on the pipeline argument - that the major problem is that girls are not enrolling in CS courses. If you look at all the problems with environment, culture, lack of pay equity and opportunities for promotion due to bias, maybe choosing a career where there is more diversity is a better choice. For instance, law, accounting and medicine have much better numbers for these issues, despite there still being an imbalance.
At the end of the day, there was a panel to discuss diversity issues:
|Moderator: Ariti Sharma, Shopify, Panelists: Mohammed Asaduallah, Format, Katie Krepps, Capital One Canada, Lateesha Thomas, Dev Bootcamp, Ramya Raghavan, Google, Kara Melton, TWG, Gladstone Grant, Microsoft Canada|
- Be intentional about seeking out talent
- Fix culture to be more diverse
- Recruit from bootcamps. Better diversity today. Don't wait for universities to change the ratios.
- Environment impacts retention
- Conduct and engagement survey to see if underrepresented groups feel that their voices are being heard.
- There is a need for sponsorship, not just mentoring. Define a role that doesn't exist at the company. A sponsor can make that role happen by advocating for it at higher levels
- Mentors do better if matched with demographics. They will realize the challenges that you will face in the industry better than a white man who has never directly experienced sexism or racism.
- Sponsors tend to be men due to the demographics of our industry
- At Microsoft, when you reach a certain level your are expected to mentor an unrepresented person
- Look at compensation and representation across diverse groups
- Attrition is normal, it varies by region, especially acute in San Francisco.
- Women leave companies at 2x the rate of men due to culture
- You shouldn't stay at a place if you are burnt out, take care of yourself.
Compared to the previous two iterations of this conference, it seemed that this time it focused a lot more on solutions to have more diversity and inclusion in your company. The previous two conferences I attended seemed to focus more on technical talks by diverse speakers.
As a side note, there were a lot of Shopify folks in attendance because they ran the conference. They sent a bus of people from their head office in Ottawa to attend it. I was really struck at how diverse some of the teams were. I met group of women who described themselves as a team of "five badass women developers" 💯 As someone who has been the only woman on her team for most of her career, this was beautiful to see and gave me hope for the future of our industry. I've visited the Ottawa Shopify office several times (Mr. Releng works there) and I know that the representation of of their office doesn't match the demographics of the Beyond the Code attendees which tended to be more women and people of colour. But still, it is refreshing to see a company making a real effort to make their culture inclusive. I've read that it is easier to make your culture inclusive from the start, rather than trying to make difficult culture changes years later when your teams are all homogeneous. So kudos to them for setting an example for other companies.
Thank you Shopify for organizing this conference, I learned a lot and I look forward to the next one!
August 22, 2016
Py Bay 2016 - a First Report
PyBay held their first local Python conference this last weekend (Friday, August 19 through Sunday, August 21). What a great event! I just wanted to get down some first impressions - I hope to do more after the slides and videos are up.Read more...
July 29, 2016
|The venue was across the street from Confederation Park aka land of Pokemon.|
I really enjoyed it. The people I chatted with were very friendly and welcoming. Of course, I ran into some people I used to work with, as is with any tech event in Ottawa it seems. Nice to catch up!
|The venue had the Canada Council for the Arts as a tenant, thus the quintessentially Canadian art.|
She then spoke about a project she wrote to create generative poetry from a RNN (recurrent neural network). It was based on a RNN tutorial that she heavily refactored to meet her needs. She went through the code that she developed to generate artificial prose from the works of H.G. Wells and Jane Austen. She talked about how she cleaned up the text to remove EOL delimiters, page breaks, chapters numbers and so on. And then it took a week to train it with the data.
She then talked about another example which used data from Jack Kerouac and Virginia Woolf novels, which she posts some of the results to twitter.
"David Williams has expired your commit rights to the
eclipse.platform.releng project. The reason for this change is:
We have all known this day would come, but it does not make it any easier.
It has taken me four years to accept that Kim is no longer helping us with
Eclipse. That is how large her impact was, both on myself and Eclipse as a
whole. And that is just the beginning of why I am designating her as
"Committer Emeritus". Without her, I humbly suggest that Eclipse would not
have gone very far. Git shows her active from 2003 to 2012 -- longer than
most! She is (still!) user number one on the build machine. (In Unix terms,
that is UID 500). The original admin, when "Eclipse" was just the Eclipse
She was not only dedicated to her job as a release engineer she was
passionate about doing all she could to make other committer's jobs easier
so they could focus on their code and specialties. She did (and still does)
know that release engineering is a field of its own; a specialized
profession (not something to "tack on" at the end) that just anyone can do)
and good, committed release engineers are critical to the success of any
For anyone reading this that did not know Kim, it is not too late: you can
follow her blog at
You will see that she is still passionate about release engineering and
influential in her field.
And, besides all that, she was (I assume still is :) a well-rounded, nice
person, that was easy to work with! (Well, except she likes running for
Thanks, Kim, for all that you gave to Eclipse and my personal thanks for
all that you taught me over the years (and I mean before I even tried to
fill your shoes in the Platform).
We all appreciate your enormous contribution to the success of Eclipse and
happy to see your successes continuing.
To honor your contributions to the project, David Williams has nominated
you for Committer Emeritus status."
Thank you David! I really appreciate your kind words. I learned so much working with everyone in the Eclipse community. I had the intention to contribute to Eclipse when I left IBM but really felt that I have given all I had to give. Few people have the chance to contribute to two fantastic open source communities during their career. I'm lucky to have that opportunity.
|My IBM friends made this neat Eclipse poster when I left. The Mozilla dino displays my IRC handle.|
July 18, 2016
Legacy vcs-sync is dead! Long live vcs-sync!
Today’s the day to celebrate! No more bash scripts running in screen sessions providing dvcs conversion experiences. Woot!!!
I’ll do a historical retrospective in a bit. Right now, it’s time to PARTY!!!!!
July 12, 2016
Firefox has it's own built-in update system. The update system supports 2 types of updates: complete and incremental. Completes can be applied to any older version, unless there are some incompatible changes in the MAR format. Incremental updates can be applied only to a release they were generated for.
Usually for the beta and release channels we generate incremental updates against 3-4 versions. This way we try to minimize bandwidth consumption for our end users and increase the number of users on the latest version. For Nightly and Developer Edition builds we generate 5 incremental updates using funsize.
Both methods assume that we know ahead of time what versions should be used for incremental updates. For releases and betas we use ADI stats to be as precise as possible. However, these methods are static and don't use real-time data.
The idea to generate incremental updates on demand has been around for ages. Some of the challenges are:
- Acquiring real-time (or close to real-time) data for making decisions on incremental update versions
- Size of the incremental updates. If the size is very close to the size of the corresponding complete, there is reason to serve incremental updates. One of the reasons is that the that the updater tries to use the incremental update first, and then falls back to the complete in case if something goes wrong. In this case the updater downloads both the incremental and the complete.
Ben and I talked about this today and to recap some of the ideas we had, I'll put them here.
- We still want to "pre-seed" most possible incremental updates before we publish any updates
- Whenever Balrog serves a complete-only update, it should generate a structured log entry and/or an event to be consumed by some service, which should contain all information required to generate a incremental update.
- The "new system" should be able to decide if we want to discard incremental update generation, based on the size. These decisions should be stored, so we don't try to generate incremental update again next time. This information may be stored in Balrog to prevent further events/logs.
- Before publishing the incremental update, we should test if they can be applied without issues, similar to the update verify tests we run for releases, but without hitting Balrog. After they pass this test, we can publish them to Balrog and check if Balrog returns expected XML with partial info in it.
- Minimize the amount of served completes, if we plan to generate incremental updates. One of the ideas was to modify the client to support responses like "Come in 5 minutes, I may have something for you!"
The only remaining thing is to implement all these changes. :)
End of an Experiment
tl;dr: We’ll be shutting down the Firefox mirrors on Bitbucket.
A long time ago we started an experiment to see if there was any support for developing Mozilla products on social coding sites. Well, the community-at-large has spoken, with the results many predicted:
June 07, 2016
|Picture by howardignatius- Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0)|
- Migrated to a new build or continuous integration system
- Implemented a new release or deployment pipeline
- Implemented tooling to simplify managing your apps in a mobile store
- Significantly reduced build time with parallelization or some other interesting optimization!
- Moved your build and test system to containers
- Refactored your infrastructure code for a live production environment
- ... we'd love to see your submission to the workshop
We'd like to encourage people new to speaking to apply, as well as those from underrepresented groups in tech. We'd love to hear from some new voices and new companies !
Submissions are due July 1, 2016. If you have questions on of the submission process, topics to submit, or anything else, I'm happy to help! I'm kmoir and I work at mozilla.com or contact me on twitter. Submit early and often!
June 03, 2016
|Glenn Gould Studios, CBC, Toronto.|
|Statue of Glenn Gould outside the CBC studios that bear his name.|
The day started out with an introduction from the organizers and a brief overview of history of DevOps days. They also made a point about reminding everyone that they had agreed to the code of conduct when they bought their ticket. I found this explicit mention of the code of conduct quite refreshing.
The first talk of the day was John Willis, evangelist at Docker. He gave an overview of the state of enterprise devops. I found this a fresh perspective because I really don't know what happens in enterprises with respect to DevOps since I have been working in open source communities for so long. John providing an overview of what DevOps encompasses.
DevOps is a continuous feedback loop.
He talked a lot about how empathy is so important in our jobs. He mentions that at Netflix has a slide deck that describes company culture. He doesn't know if this is still the case, but it he had heard that if you hadn't read the company culture deck and show up for an interview at Netflix, you would be automatically disqualified for further interviews. Etsy and Spotify have similar open documents describing their culture.
Christina Maslach on the six sources of burnout.
Inclusivity, Complexity and Empathy.
The second talk was a really interesting talk by Hany Fahim, CEO of VM Farms. It was a short mystery novella describing how VM Farms servers suddenly experienced a huge traffic spike when the Brazilian government banned Whatsapp as a result of a legal order. I love a good war story.
Looking on twitter, they found that a court case in Brazil had recently ruled that Whatsup would be blocked for 48 hours. Users started circumventing this block via VPN. Looking at their logs, they determined that most of the traffic was resolving to ip addresses from Brazil and that there was a large connection time during SSL handshakes.
The government of Brazil encouraged the use of open source software versus Windows, and thus the users became more technically literate, and able to circumvent blocks via VPN.
In conclusion, making changes to use multi-core HAProxy fixed a lot of issues. Also, twitter was and continues to be a great source of information on activity that is happening in other countries. Whatsapp was returned to service and then banned a second time, and their servers were able to keep up with the demand.
After lunch, we were back to to more talks. The organizers came on stage for a while to discuss the afternoon's agenda. They also remarked that one individual had violated the code of conduct and had been removed from the conference. So, the conference had a code of conduct and steps were taken if it was violated.
Next up, Bridget Kromhout from Pivotal gave a talk entitled Containers will not Fix your Broken Culture.
I first saw Bridget speak at Beyond the Code in Ottawa in 2014 about scaling the streaming services for Drama Fever on AWS. At the time, I was moving our mobile test infrastructure to AWS so I was quite enthralled with her talk because 1) it was excellent 2) I had never seen another woman give a talk about scaling services on AWS. Representation matters.
The summary of the talk last week was that no matter what tools you adopt, you need to communicate with each other about the cultural changes are required to implement new services. A new microservices architecture is great, but if these teams that are implementing these services are not talking to each other, the implementation will not succeed.
In the afternoon, there were were lightning talks and then open spaces. Open spaces are free flowing discussions where the topic is voted upon ahead of time. I attended ones on infrastructure automation, CI/CD at scale and my personal favourite, horror stories. I do love hearing how distributed system can go down and how to recover. I found that the conversations were useful but it seemed like some of them were dominated by a few voices. I think it would be better if the person that suggested to topic for the open space also volunteered to moderate the discussion.
Day 2The second day started out with a fantastic talk by John Arthorne of Shopify speaking on scaling their deployment pipeline. As a side note, John and I worked together for more than a decade on Eclipse while we both worked at IBM so it was great to catch up with him after the talk.
He started by giving some key platform characteristics. Stores on Shopify have flash sales that have traffic spikes so they need to be able to scale for these bursts of traffic.
Armen at the conference and he started chuckling at this point because at Mozilla we also wrote an application called ship-it which release management uses to kick off Firefox releases. Shopify also has a overall view of the ship it deployment process which allows developers to see the percentages of nodes where their change has been deployed. One of the questions after the talk was why they use AWS for their deployment pipeline when they have use machines in data centres for their actual customers. Answer: They use AWS where resilency is not an issue.
He emphasized the need to empower developers to use DevOp practices by giving them tools, and showing them how to use them. For instance, if they needed to run docker to test something, walk them through it so they will know how to do it next time.
He had excellent advice on how to work on projects outside of work to showcase skills for future employers.
Diversity and Inclusion
As an aside, whenever I'm at a conference I note the number of people in the "not a white guy" group. This conference had an all men organizing committee but not all white men. (I recognize the fact that not all diversity is visible i.e. mental health, gender identity, sexual orientation, immigration status etc) They was only one woman speaker, but there were a few non-white speakers. There were very few women attendees. I'm not sure what the process was to reach out to potential speakers other than the CFP.
There were slides that showed diverse developers which was refreshing.
I learned a lot at the conference and am thankful for all the time that the speakers took to prepare their talks. I enjoyed all the conversations I had learning about the challenges people face in the organizations implementing continuous integration and deployment. It also made me appreciate the culture of relentless automation, continuous integration and deployment that we have at Mozilla.
I don't know who said this during the conference but I really liked it
It was interesting to learn how all these people are making their companies heart beat stronger via DevOps practices and tools.
May 13, 2016
Francis and Connor will be working on implementing some new features in release promotion as well as migrating some builds to taskcluster. I'll be mentoring Francis, while Rail will be mentoring Connor. If you are in the Toronto office, please drop by to say hi to them. Or welcome them on irc as fkang or sheehan.
|Kim, Francis, Connor and Rail|
Mentoring an intern provides an opportunity to see the systems we run from a fresh perspective. They both have lots of great questions which makes us revisit why design decisions were made, could we do things better? Like all teaching roles, I always find that I learn a tremendous amount from the experience, and hope they have fun learning real world software engineering concepts with respect to running large distributed systems.
Welcome to Mozilla!
April 27, 2016
In my previous post I introduced the new release process we have been adopting in the 46.0 release cycle.
Release build promotion has been in production since Firefox 46.0 Beta 1. We have discovered some minor issues; some of them are already fixed, some still waiting.
One of the visible bugs is Bug 1260892. We generate a big SHA512SUMS file, which should contain all important checksums. With numerous changes to the process the file doesn't represent all required files anymore. Some files are missing, some have different names.
We are working on fixing the bug, but you can use the following work around to verify the files.
For example, if you want to verify http://ftp.mozilla.org/pub/firefox/releases/46.0/win64/ach/Firefox%20Setup%2046.0.exe, you need use the following 2 files:
# download all required files $ wget -q http://ftp.mozilla.org/pub/firefox/releases/46.0/win64/ach/Firefox%20Setup%2046.0.exe $ wget -q http://ftp.mozilla.org/pub/firefox/candidates/46.0-candidates/build5/win64/ach/firefox-46.0.checksums $ wget -q http://ftp.mozilla.org/pub/firefox/candidates/46.0-candidates/build5/win64/ach/firefox-46.0.checksums.asc $ wget -q http://ftp.mozilla.org/pub/firefox/releases/46.0/KEY # Import Mozilla Releng key into a temporary GPG directory $ mkdir .tmp-gpg-home && chmod 700 .tmp-gpg-home $ gpg --homedir .tmp-gpg-home --import KEY # verify the signature of the checksums file $ gpg --homedir .tmp-gpg-home --verify firefox-46.0.checksums.asc && echo "OK" || echo "Not OK" # calculate the SHA512 checksum of the file $ sha512sum "Firefox Setup 46.0.exe" c2ed64298ac2140d8dbdaed28cabc90b38dd9444e9c0d6dd335a2a32cf043a35314945536a5c75124a88bf418a4e2ba77256be223425380e7fcc45a97da8f479 Firefox Setup 46.0.exe # lookup for the checksum in the checksums file $ grep c2ed64298ac2140d8dbdaed28cabc90b38dd9444e9c0d6dd335a2a32cf043a35314945536a5c75124a88bf418a4e2ba77256be223425380e7fcc45a97da8f479 firefox-46.0.checksums c2ed64298ac2140d8dbdaed28cabc90b38dd9444e9c0d6dd335a2a32cf043a35314945536a5c75124a88bf418a4e2ba77256be223425380e7fcc45a97da8f479 sha512 46275456 install/sea/firefox-46.0.ach.win64.installer.exe
This is just a temporary work around and the bug will be fixed ASAP.
April 23, 2016
Enterprise Software Writers R US
Someone just accused me of writing Enterprise Software!!!!!Read more...
April 05, 2016
Hello from Release Engineering! Once a month we highlight one of our projects to help the Mozilla community discover a useful tool or an interesting contribution opportunity. This month's project is Release Build Promotion.
What is Release Build Promotion?
Release build promotion (or "build promotion", or "release promotion" for short), is the latest release pipeline for Firefox being developed by Release Engineering at Mozilla.
Release build promotion starts with the builds produced and tested by CI (e.g. on mozilla-beta or mozilla-release). We take these builds, and use them as the basis to generate all our l10n repacks, partial updates, etc. that are required to release Firefox. We "promote" the CI builds to the release channel.
How is this different?
The previous release pipeline also started with builds produced and tested by CI. However, when it came time to do a release, we would create an entirely new set of builds with slightly different build configuration. These builds would not get the regular CI testing.
Release build promotion improves the process by removing the second set of builds. This drastically improves the total time to do a release, and also increases our confidence in our products since we now are shipping exactly what's been tested. We also improve visibility of the release process; all the tasks that make up the release are now reported to Treeherder along with the corresponding CI builds.
Release build promotion is in use for Firefox desktop starting with the 46 beta cycle. ESR and release branches have not yet been switched over.
Firefox for Android is also not yet handled. We plan to have this ready for Firefox 47.
One of the major reasons of this project was our release end-to-end times. I pulled some data to compare:
- One of the Firefox 45 betas took almost 12 hours
- One of the Firefox 46 betas took less than 3 hours
- Support Firefox for Android
- Support release and ESR branches
- Extend this process back to the aurora and nightly channels
Can I contribute?
Yes! We still have a lot of things to do and welcome everyone to contribute.
For more information, please refer to these other resources about build promotion:
- March 7th, 2016 Mozilla Project Meeting - Jordan Lund (:jlund) presentation about release build promotion (starts at ~3:20 and ends at about ~10:15)
- Join us on IRC in #promotion!
There will be multiple blog posts regarding this project. You have probably seen Jordan's blog on how to be productive when distributed teams get together. It covers some of our experience we had during the project sprint week in Vancouver.
March 07, 2016
Improve Release Pipeline:
- Nick ran a staging release for 46.0b1 to check for issues before the merge, preventing some bustage for Fennec and ensuring we can fall back to the old system if any unexpected issues show up with release promotion
- Varun improved Balrog’s detection of certain types of bad data.
- Ben finished most of the work involved with preparing Balrog to move to CloudOps infrastructure, including automatically building Docker images.
- We’re hoping to do our first beta with the new release build promotion pipeline next week for 47.0b1. Stay tuned!
|Everyone gets a release promotion! Source:|
Improve CI Pipeline:
- Dustin deployed a new version of the TaskCluster tools/login system with much improved UI for handling signing in and out and editing clients and roles. He also simplified the existing roles, with the result that the set of roles now fits on one screen, and is entirely composed of human-readable names. All of this works toward two important goals: building a sign-in system that is useful and usable by all mozillians; and configuring the access-control system to give everyone their appropriate permissions and no more.
The releases calendar is getting busier as we get closer to the end of the cycle. Many releases were shipped or are still in-flight:
- Firefox 45.0b10
- Fennec 45.0b11
- Fennec 45.0 (in-progress)
- Firefox 45.0 (in-progress) - we shipped the RC to the beta channel
- Firefox 45.0esr (in-progress)
- Firefox 38.7.0esr (in-progress)
- Vlad, Alin and Amy reallocated 28 Windows test machines to the w7 pool to help with backlog and e10s testing.
- Jake deployed new OpenSSL packages to protect our infrastructure from Drown Attack and various other recent OpenSSL vulnerabilities
Until next time!
February 29, 2016
Improve Release Pipeline:
- Chris, Jordan, Callek (remotely), Kim, Mihai and Rail had a sprint on Release Promotion. We made so much progress on this project that we decided to use the new process for Firefox 46.0b1. https://bugzil.la/1118794 So many green jobs!
- A new contributor, Kumar, fixed a longstanding annoyance with how Balrog shows changes to Releases.
- Ben and Nick came up with a new way to model GMP and System Addons in Balrog which will allow for more fine grained permissions and simpler rules.
- Mihai added a functionality in tctalker to walk the graph and cancel all pending/running tasks. He also added release sanity check logic for en-US binaries within promotion.
- Kim added support for virus scanning of release promotion artifacts, a new docker image for this task, a task to move source bundles and checksum artifacts, and sha256 checksums for the docker images we use in release promotion.
Improve CI Pipeline:
- Rok added a button “Purge workers cache” in Taskcluster
- A new contributor, Varun, is helping to improve our l10n repack process.
- Aki added python3 compatibility to taskcluster-client.py https://github.com/taskcluster/taskcluster-client.py/pull/44
- Ben, Mihai, Nick, Rail and Callek Shipped Firefox 45.0b6, Firefox 45.0b7, Firefox 45.0b9, Fennec 45.0b6, Thunderbird 45.0b2
- Alin landed changes to run mochitest-push-e10s tests on Windows 7 https://bugzil.la/1248729. This is another step toward completing the enabling of e10s tests.
January 21, 2016
Every new year gives you an opportunity to sit back, relax, have some scotch and re-think the passed year. Holidays give you enough free time. Even if you decide to not take a vacation around the holidays, it's usually calm and peaceful.
This time, I found myself thinking mostly about productivity, being effective, feeling busy, overwhelmed with work and other related topics.
When I started at Mozilla (almost 6 years ago!), I tried to apply all my GTD and time management knowledge and techniques. Working remotely and in a different time zone was an advantage - I had close to zero interruptions. It worked perfect.
Last year I realized that my productivity skills had faded away somehow. 40h+ workweeks, working on weekends, delivering goals in the last week of quarter don't sound like good signs. Instead of being productive I felt busy.
"Every crisis is an opportunity". Time to make a step back and reboot myself. Burning out at work is not a good idea. :)
Here are some ideas/tips that I wrote down for myself you may found useful.
- Task #1: make a daily plan. No plan - no work.
- Don't start your day by reading emails. Get one (little) thing done first - THEN check your email.
- Try to define outcomes, not tasks. "Ship XYZ" instead of "Work on XYZ".
- Meetings are time consuming, so "Set a goal for each meeting". Consider skipping a meeting if you don't have any goal set, unless it's a beer-and-tell meeting! :)
- Constantly ask yourself if what you're working on is important.
- 3-4 times a day ask yourself whether you are doing something towards your goal or just finding something else to keep you busy. If you want to look busy, take your phone and walk around the office with some papers in your hand. Everybody will think that you are a busy person! This way you can take a break and look busy at the same time!
- Take breaks! Pomodoro technique has this option built-in. Taking breaks helps not only to avoid RSI, but also keeps your brain sane and gives you time to ask yourself the questions mentioned above. I use Workrave on my laptop, but you can use a real kitchen timer instead.
- Wear headphones, especially at office. Noise cancelling ones are even better. White noise, nature sounds, or instrumental music are your friends.
- Make sure you enjoy your work environment. Why on the earth would you spend your valuable time working without joy?!
- De-clutter and organize your desk. Less things around - less distractions.
- Desk, chair, monitor, keyboard, mouse, etc - don't cheap out on them. Your health is more important and expensive. Thanks to mhoye for this advice!
- Don't check email every 30 seconds. If there is an emergency, they will call you! :)
- Reward yourself at a certain time. "I'm going to have a chocolate at 11am", or "MFBT at 4pm sharp!" are good examples. Don't forget, you are Pavlov's dog too!
- Don't try to read everything NOW. Save it for later and read in a batch.
- Capture all creative ideas. You can delete them later. ;)
- Prepare for next task before break. Make sure you know what's next, so you can think about it during the break.
This is my list of things that I try to use everyday. Looking forward to see improvements!
I would appreciate your thoughts this topic. Feel free to comment or send a private email.
Happy Productive New Year!
January 08, 2016
- I don't work in HR
- I'm not a manager (but I interview Mozilla releng candidates)
- I'm not looking for a new job.
- These are just my observations after working in the tech industry for a long time.
Everyone tends to jump into looking at job descriptions and making their resume look pretty. Another scenario is that people have a sudden realization that they need to get out of their current position and find a new job NOW and frantically start applying for anything that matches their qualifications. Before you do that, take a step back and make a list of things that are important to you. For example, when I applied at Mozilla, my list was something like this
- learn release engineering at scale + associated tools/languages
- open source
- no relocation
- work on a team of release engineers (not be the only one)
- good team dynamics - people happy to share knowledge and like to ship
- work in an organization where release engineering is valued for increasing the productivity of the organization as a whole and is funded (hardware/software/services/training) accordingly
- support to attend and present at conferences
|Picture by Mufidah Kassalias - Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0)|
People tend focus a lot on the technical skills they want to use or new ones you want to learn. You should also think about what kind of culture where you want to work. Do the goals and ethics of the organization align with your own? Who will you be working with? Will you enjoy working with this team? Are you interested in remote work or do you want to work in an office? How will a long commute impact or relocation your quality of life? What is the typical career progression of someone in this role? Are there both management and technical tracks for advancement?
|Picture by mugley - Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0) https://www.flickr.com/photos/mugley/4221455156/sizes/o/|
To summarize, itemize the skills you'd like to use or learn, the culture of the company and the team and why you want to work there.
Your cover letter should succinctly map your existing skills to the role you are applying for and convey enthusiasm and interest. You don't need to have a long story about how you worked on a project at your current job that has no relevance to your potential new employer. Teams that are looking to hire have problems to solve. Your cover letter needs to paint a picture that your have the skills to solve them.
|Picture by Jim Bauer - Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0) https://www.flickr.com/photos/lens-cap/10320891856/sizes/l|
Refactoring your resume
Developers have a lot of opportunities these days, but if you intend to move from another industry, into a tech company, it can be more tricky. The important thing is to convey the skills you have in a a way that people can see they can be applied to the problems they want to hire you to fix.
Many people describe their skills and accomplishments in a way that is too company specific. They may have a list of acronyms and product names on their resume that are unlikely to be known by people outside the company. When describing the work you did in a particular role, describe the work that you did in a that is measurable way that highlights the skills you have. An excellent example of a resume that describes the skills that without going into company specific detail is here. (Julie Pagano also has a terrific post about how she approached her new job search.)
Another tip is to leave out general skills that are very common. For instance, if you are a technical writer, omit the fact that you know how to use Windows and Word and focus on highlighting your skills and accomplishments.
Non-technical interview preparation
Every job has different technical requirements and there are many books and blog posts on how to prepare for this aspect of the interview process. So I'm going to just cover the non-technical aspects.
When I interview someone, I like to hear lots of questions. Questions about the work we do and upcoming projects. This indicates that have taken the time to research the team, company and work that we do. It also shows that enthusiasm and interest.
Here is a list suggestions to prepare for interviews
1. Research the company make a list of relevant questions
Not every company is open about the work that they do, but most will be have some public information that you can use to formulate questions during the interviews. Do you know anyone you can have coffee or skype with to who works for the company and can provide insight? What products/services do the company produce? Is the product nearing end of life? If so, what will it be replaced by? What is the companies market share, is it declining, stable or experiencing growth? Who are their main competitors? What are some of the challenges they face going forward? How will this team help address these challenges?
2. Prepare a list of questions for every person that interviews you ahead of time
Many companies will give you the list of names of people who will interview you.
Have they recently given talks? Watch the videos online or read the slides.
Does the team have github or other open repositories? What are recent projects are they working on? Do they have a blog or are active on twitter? If so, read it and formulate some questions to bring to the interview.
Do they use open bug tracking tools? If so, look at the bugs that have recent activity and add them to the list of questions for your interview.
A friend of mine read the book of a person that interviewed him had written and asked questions about the book in the interview. That's serious interview preparation!
|Photo by https://www.flickr.com/photos/wocintechchat/ https://www.flickr.com/photos/wocintechchat/22506109386/sizes/l|
3. Team dynamics and tools
Is the team growing or are you hiring to replace somebody who left?
What's the onboarding process like? Will you have a mentor?
How is this group viewed by the rest of the company? You want to be in a role where you can make a valuable contribution. Joining a team where their role is not valued by the company or not funded adequately is a recipe for disappointment.
What does a typical day look like? What hours do people usually work?
What tools do people use? Are there prescribed tools or are you free to use what you'd like?
4. Diversity and Inclusion
If you're a member of an underrepresented group in tech, the numbers are lousy in this industry with some notable exceptions. And I say that while recognizing that I'm personally in the group that is the lowest common denominator for diversity in tech.
|The entire thread on this tweet is excellent https://twitter.com/radiomorillo/status/589158122108932096|
I don't really have good advice for this area other than do your research to ensure you're not entering a toxic environment. If you look around the office where you're being interviewed and nobody looks like you, it's time for further investigation. Look at the company's website - is the management team page white guys all the way down? Does the company support diverse conferences, scholarships or internships? Ask on a mailing list like devchix if others have experience working at this company and what it's like for underrepresented groups. If you ask in the interview why there aren't more diverse people in the office and they say something like "well, we only hire on merit" this is a giant red flag. If the answer is along the lines of "yes, we realize this and these are the steps we are taking to rectify this situation", this is a more encouraging response.
A final piece of advice, ensure that you meet with your manager that you're going to report to as part of your hiring process. You want to ensure that you have rapport with them and can envision a productive working relationship.
What advice do you have for people preparing to find a new job?
Katherine Daniels gave at really great talk at Beyond the Code 2014 about how to effectively start a new job. Press start: Beginning a New Adventure Job
She is also the co-author of Effective Devops which has fantastic chapter on hiring.
Erica Joy writes amazing articles about the tech industry and diversity.
Cate Huston has some beautiful posts on how to conduct technical interviews and how to be a better interviewer
Camille Fournier's blog is excellent reading on career progression and engineering management.
Mozilla is hiring!
December 10, 2015
You may have noticed that Windows has had no updates for Nightly for the last week or so. We’ve had a few issues with signing the binaries as part of moving from a SHA-1 certificate to SHA-2. This needs to be done because Windows won’t accept SHA-1 signed binaries from January 1 2016 (this is tracked in bug 1079858).
Updates are now re-enabled, and the update path looks like this
older builds → 20151209095500 → latest Nightly
Some people may have been seeing UAC prompts to run the updater, and there could be one more of those when updating to the 20151209095500 build (which is also the last SHA-1 signed build). Updates from that build should not cause any UAC prompts.
December 03, 2015
Tuning Legacy vcs-sync for 2x profit!
One of the challenges of maintaining a legacy system is deciding how much effort should be invested in improvements. Since modern vcs-sync is “right around the corner”, I have been avoiding looking at improvements to legacy (which is still the production version for all build farm use cases).
While adding another gaia branch, I noticed that the conversion path for active branches was both highly variable and frustratingly long. It usually took 40 minutes for a commit to an active branch to trigger a build farm build. And worse, that time could easily be 60 minutes if the stars didn’t align properly. (Actually, that’s the conversion time for git -> hg. There’s an additional 5-7 minutes, worst case, for b2g_bumper to generate the trigger.)
The full details are in bug 1226805, but a simple rearrangement of the jobs removed the 50% variability in the times and cut the average time by 50% as well. That’s a savings of 20-40 minutes per gaia push!
Moral: don’t take your eye off the legacy systems – there still can be some gold waiting to be found!
November 24, 2015
I gave two talks at the summit. One was a long talk on how we have scaled our Android testing infrastructure on AWS, as well as a look back at how it evolved over the years.
Picture by Tim Norris - Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0)
I gave a second lightning talk in the afternoon on the problems we face with our large distributed continuous integration, build and release pipeline, and how we are working to address the issues. The theme of this talk was that managing a large distributed system is like being the caretaker for the water, or some days, the sewer system for a city. We are constantly looking system leaks and implementing system monitoring. And probably will have to replace it with something new while keeping the existing one running.
|Picture by Korona Lacasse - Creative Commons 2.0 Attribution 2.0 Generic https://www.flickr.com/photos/korona4reel/14107877324/sizes/l|
In preparation for this talk, I did a lot of reading on complex systems design and designing for recovery from failure in distributed systems. In particular, I read Donatella Meadows' book Thinking in Systems. (Cate Huston reviewed the book here). I also watched several talks by people who talked about the challenges they face managing their distributed systems including the following:
- All Your Base 2014: Laura Thomson, Director of Engineering, Cloud Services Engineering and Operations, Mozilla – Many moving parts: monitoring complex systems
- Velocity Santa Clara 2015: Astrid Atkinson, Director Software Engineering, Google - Engineering for the long game: Managing complexity in distributed systems
- Mountain West Ruby Conference, Paul Hinze, Hashicorp - Primitives of High Availability
- Strange Loop 2015, Camille Fournier - Hopelessness and Confidence in Distributed Systems Design
In particular, I enjoyed his discussion of the cultural aspects of devops. I especially like that he stated that "You should not have to have planned downtime or people working outside business hours to release". He also talked a bit about how many of the leaders that are looked up to as visionaries in the tech industry are known for not treating people very well and this is not a good example to set for others who believe this to be the key to their success. For instance, he said something like "what more could Steve Jobs have accomplished had he treated his employees less harshly".
Another concept he discussed which I found interesting was that of the strangler application. When moving from a large monolithic application, the goal is to split out the existing functionality into services until the originally application is left with nothing. Exactly what Mozilla releng is doing as we migrate from Buildbot to taskcluster.
At the release engineering summit itself, Lukas Blakk from Pinterest gave a fantastic talk Stop Releasing off Your Laptop—Implementing a Mobile App Release Management Process from Scratch in a Startup or Small Company. This included grumpy cat picture to depict how Lukas thought the rest of the company felt when that a more structured release process was implemented.
Lukas also included a timeline of the tasks that implemented in her first six months working at Pinterest. Very impressive to see the transition!
Another talk I enjoyed was Chaos Patterns - Architecting for Failure in Distributed Systems by Jos Boumans of Krux. (Similar slides from an earlier conference here). He talked about some high profile distributed systems that failed and how chaos engineering can help illuminate these issues before they hit you in production.
For instance, it is impossible for Netflix to model their entire system outside of production given that they consume around one third of nightly downstream bandwidth consumption in the US.
Evan Willey and Dave Liebreich from Pivotal Cloud Foundry gave a talk entitled "Pivotal Cloud Foundry Release Engineering: Moving Integration Upstream Where It Belongs". I found this talk interesting because they talked about how the built Concourse, a CI system that is more scaleable and natively builds pipelines. Travis and Jenkins are good for small projects but they simply don't scale for large numbers of commits, platforms to test or complicated pipelines. We followed a similar path that led us to develop Taskcluster.
There were many more great talks, hopefully more slides will be up soon!
November 16, 2015
The primary way to download Firefox is at www.mozilla.org, but Mozilla’s Release Engineering team has also maintained directories like
to provide a stable location for scripted downloads. There are similar links for betas and extended support releases for organisations. Read on to learn how these directories have changed, and how you can continue to download the latest releases.
Until recently these directories were implemented using a symlink to the current version, for example firefox/releases/42.0/. The storage backend has now changed to Amazon S3 and this is no longer possible. To implement the same functionality we’d need a duplicate set of keys, which incurs more maintenance overhead. And we already have a mechanism for delivering files independent of the current shipped version – our download redirector Bouncer. For example, here’s the latest release for Windows 32bit, U.S. English:
Please adapt your scripts to use download.mozilla.org links. We hope it will help you simplify at the same time, as scraping to determine the current version is no longer necessary.
PS. We’ve also removed some latest- directories which were old and crufty, eg firefox/releases/latest-3.6.
November 13, 2015
Complexity & * Practices
I was fortunate enough to be able to attend Dev Ops Days Silicon Valley this year. One of the main talks was given by Jason Hand, and he made some great points. I wanted to highlight two of them in this post:
- Post Mortems are really learning events, so you should hold them when things go right, right? RIGHT!! (Seriously, why wouldn’t you want to spot your best ideas and repeat them?)
- Systems are hard – if you’re pushing the envelope, you’re teetering on the line between complexity and chaos. And we’re all pushing the envelope these days - either by getting fancy or getting lean.
Post Mortems as Learning Events
Our industry has talked a lot about “Blameless Post Mortems”, and techniques for holding them. Well, we can call them “blameless” all we want, but if we only hold them when things go wrong, folks will get the message loud and clear.
If they are truly blameless learning events, then you would also hold them when things go right. And go meh. Radical idea? Not really - why else would sports teams study game films when they win? (This point was also made in a great Ignite by Katie Rose: GridIronOps - go read her slides.)
My $0.02 is - this would also give us a chance to celebrate success. That is something we do not do enough, and we all know the dedication and hard work it takes to not have things go sideways.
And, by the way, terminology matters during the learning event. The person who is accountable for an operation is just that: capable of giving an account of the operation. Accountability is not responsibility.
Terminology and Systems – Setting the right expectations
Part way through Jason’s talk, he has this awesome slide about how system complexity relates to monitoring which relates to problem resolution. Go look at slide 19 - here’s some of what I find amazing in that slide:
- It is not a straight line with a destination. Your most stable system can suddenly display inexplicable behavior due to any number of environmental reasons. And you’re back in the chaotic world with all that implies.
- Systems can progress out of chaos, but that is an uphill battle. Knowing which stage a system is in (roughly) informs the approach to problem resolution.
- Note the wording choices: “known” vs “unknowable” – for all but the “obvious” case, it will be confusing. That is a property of the system, not a matter of staff competency.
While not in his slide, Jason spoke to how each level really has different expectations. Or should have, but often the appropriate expectation is not set. Here’s how he related each level to industry terms.
- Best Practices:
The only level with enough certainty to be able to expect the “best” is the known and familiar one. This is the “obvious” one, because we’ve all done exactly this before over a long enough time period to fully characterize the system, its boundaries, and abnormal behavior.
Here, cause and effect are tightly linked. Automation (in real time) is possible.
- Good Practices:
Once we back away from such certainty, it is only realistic to have less certainty in our responses. With the increased uncertainty, the linkage of cause and effect is more tenuous.
Even if we have all the event history and logs in front of us, more analysis is needed before appropriate corrective action can be determined. Even with automation, there is a latency to the response.
- Emergent Practices:
Okay, now we are pushing the envelope. The system is complex, and we are still learning. We may not have all the data at hand, and may need to poke the system to see what parts are stuck.
Cause and effect should be related, but how will not be visible until afterwards. There is much to learn.
- Novel Practices:
- For chaotic systems, everything is new. A lot is truly unknowable because that situation has never occurred before. Many parts of the system are effectively black boxes. Thus resolution will often be a process of trying something, waiting to see the results, and responding to the new conditions.
There is so much more in that diagram I want to explore. The connecting of problem resolution behavior to complexity level feels very powerful.
My experience tells me that many of these subjective terms are highly context sensitive, and in no way absolute. Problem resolution at 0300 local with a bad case of the flu just has a way of making “obvious” systems appear quite complex or even chaotic.
By observing the behavior of someone trying to resolve a problem, you may be able to get a sense of how that person views that system at that time. If that isn’t the consensus view, then there is a gap. And gaps can be bridged with training or documentation or experience.
October 23, 2015
Due to a bug with the new ftp server we’ve had to disable updates for
- Firefox for Android Nightly
- Firefox for Android Aurora
They’ll resume just as soon as we can get the fix landed.
Update (Oct 25th): Updates are re-enabled, thanks to Mike Shal for the fix.
October 21, 2015
Today we started serving an important set of directories on ftp.mozilla.org using Amazon S3, more details on that over in the newsgroups. Some configuration changes landed in the tree to make that happen.
Please rebase your try pushes to use revision 0ee21e8d5ca6 or later, currently on mozilla-inbound. Otherwise your builds will fail to upload, which means they won’t run any tests. No fun for anyone.
October 02, 2015
duo MFA & viscosity no-cell setup
The Duo application is nice if you have a supported mobile device, and it’s usable even when you you have no cell connection via TOTP. However, getting Viscosity to allow both choices took some work for me.
For various reasons, I don’t want to always use the Duo application, so would like for Viscosity to alway prompt for password. (I had already saved a password - a fresh install likely would not have that issue.) That took a bit of work, and some web searches.
Disable any saved passwords for Viscosity. On a Mac, this means opening up “Keychain Access” application, searching for “Viscosity” and deleting any associated entries.
Ask Viscosity to save the “user name” field (optional). I really don’t need this, as my setup uses a certificate to identify me. So it doesn’t matter what I type in the field. But, I like hints, so I told Viscosity to save just the user name field:
defaults write com.viscosityvpn.Viscosity RememberUsername -bool true
With the above, you’ll be prompted every time. You have to put “something” in the user name field, so I chose to put “push or TOTP” to remind me of the valid values. You can put anything there, just do not check the “Remember details in my Keychain” toggle.
September 26, 2015
Usually, pending counts clear overnight as less code is pushed during the night (in North America) which invokes fewer builds and tests. However, as you can see from the graph above, the Windows test pending counts were flat last night. They did not clear up overnight. You will also note that try, which usually comprises 63% of our load, has very highest pending counts compared to other branches. This is because many people land on try before pushing to other branches, and tests aren't coalesced on try.
The work to determine the cause of high pending counts is always an interesting mystery.
- Are tests being chunked into smaller jobs that increase end to end time due to the added start up time?
|Mystery by ©Stuart Richards, Creative Commons by-nc-sa 2.0|
Joel Maher and I looked at the data for this last week and discovered what we believe to be the source of the problem. We
|Increase in seconds that new jobs added to the total compute time per push. (Some existing jobs also reduced their compute time for a total difference about about 2.5 more hours per push on Windows)|
Release engineering is working to reduce this pending counts given our current hardware constraints with the following initiatives:
September 22, 2015
Using Password Store
Password Store (aka “pass”) is a very handy wrapper for dealing with pgp encrypted secrets. It greatly simplifies securely working with multiple secrets. This is still true even if you happen to keep your encrypted secrets in non-password-store managed repositories, although that setup isn’t covered in the docs. I’ll show my setup here. (See the Password Store page for usage: “pass show -c <spam>” & “pass search <eggs>” are among my favorites.)
- Short version:
Have gpg installed on your machine.
Setup a local password store. Scroll down in the usage section to “Setting it up” for instructions.
Clone your secrets repositories to your normal location. Do not clone inside of ~/.password-store/.
Set up symlinks inside of ~/.password-store/ to directories inside your clone of the secrets repository. I did:
ln -s ~/path/to/secrets-git/passwords rePasswords ln -s ~/path/to/secrets-git/keys reKeys
Enjoy command line search and retrieval of all your secrets. (Use the regular method for your separate secrets repository to add and update secrets.)
- By using symlinks, pass will not allow me to create or update secrets in the other repositories. That prevents mistakes, as the process is different for each of those alternate stores.
- I prefer to have just one tree of secrets to search, rather than the “multiple configuration” approach documented on the Password Store site.
- By using symlinks, I can control the global namespace, and use names that make sense to me.
- I’ve migrated from using KeePassX to using pass for my personal secret management. That is my “main” password-store setup (backed by a git repo).
- If you’d prefer a GUI, there’s qtpass which also works with the above setup.
August 03, 2015
Note: this post has been sitting in the drafts queue for some reason. Better to publish it. :)
As of Tuesday, Aug 4, 2015 Funsize has been enabled on mozilla-central. From now on all Nightly builds builds will get updates for builds up to 4 days in the past (for yesterday, for the day before yesterday, etc). This should make people who don't run their nightlies every day happier.
Firefox Developer Edition partial updates will be enabled after 42.0 hits mozilla-aurora, but early adopters can use the aurora-funsize channel.
Partial updates as a part of builds and L10N repacks on nightly will be disabled as soon as Bug 1173459 is resolved.
As a bonus, you can take a look at the presentation I gave in Whistler during the work week.
If you see any issues, please report them to Bugzilla.
July 30, 2015
Decoding Hashed known_hosts Files
tl;dr: You might find this gist handy if you enable HashKnownHosts
Modern ssh comes with the option to obfuscate the hosts it can connect to, by enabling the HashKnownHosts option. Modern server installs have that as a default. This is a good thing.
The obfuscation occurs by hashing the first field of the known_hosts file - this field contains the hostname,port and IP address used to connect to a host. Presumably, there is a private ssh key on the host used to make the connection, so this process makes it harder for an attacker to utilize those private keys if the server is ever compromised.
Super! Nifty! Now how do I audit those files? Some services have multiple IP addresses that serve a host, so some updates and changes are legitimate. But which ones? It’s a one way hash, so you can’t decode.
Well, if you had an unhashed copy of the file, you could match host keys and determine the host name & IP.  You might just have such a file on your laptop (at least I don’t hash keys locally).  (Or build a special file by connecting to the hosts you expect with the options “-o HashKnownHosts=no -o UserKnownHostsFile=/path/to/new_master”.)
I through together a quick python script to do the matching, and it’s at this gist. I hope it’s useful - as I find bugs, I’ll keep it updated.
Bonus Tip: https://github.com/defunkt/gist
Is a very nice way to manage gists from the command line.
|||A lie - you’ll only get the host name and IP’s that you have connected to while building your reference known_hosts file.|
|||I use other measures to keep my local private keys unusable.|
July 17, 2015
This xckd reminded me of the challenges of managing our buildfarm somedays :-)
I took three courses from Coursera's Data Science track from John Hopkins University. As with previous coursera classes I took, all the course material is online (lecture videos and notes). There are quizzes and assignments that are due each week. Each course below was about four weeks long.
The Data Scientist's Toolbox - This course was pretty easy. Basically a introduction to the questions that data scientists deal as well a primer on installing R, RStudio (IDE for R), and using GitHub.
R Programming - Introduction to R. Most of the quizzes and examples used publicly available data for the programming exercises. I found I had to do a lot of reading in the R API docs or on stackoverflow to finish the assignments. The lectures didn't provide a lot of the material needed to complete the assignments. Lots of techniques to learn how to subset data using R which I found quite interesting, reminded me a lot of querying databases with SQL to conduct analysis.
Getting and Cleaning Data - More advanced techniques using R. Using publicly available data sources to clean different data sources in different formats, XML, excel spreadsheets, comma or tab delimited. Given this data, we had to answer many questions and conduct specific analysis by writing R programs. The assignments were pretty challenging and took a long time. Again, the course material didn't really cover all the material you needed to do the assignments so a lot of additional reading was required.
There are six more courses in the Data Science track that I'll start tackling again in the fall that cover subjects such as reproducible research, statistical inference and machine learning. My next coursera class is Introduction to Systems Engineering which I'll start in a couple of weeks. I've really become interested in learning more about this subject after reading Thinking in Systems.
The other course I took this spring was the Software Carpentry Instructor training course. The Software Carpentry Foundation teachers researchers basic software skills. For instance, if you are a biologist analyzing large data sets it would be useful to learn how to use R, Python, and version control to store the code you wrote to share with others. These are not skills that many scientists acquire in their formal university training, and learning them allows them to work more productively. The instructor course was excellent, thanks Greg Wilson for your work teaching us.
We read two books for this course:
Building a Better Teacher: An interesting overview of how teacher is taught in different countries and how to make it more effective. Most important: Have more opportunities for other teachers to observe your classroom and provide feedback which I found analogous to how code review makes us better software developers.
How Learning Works: Seven Research-Based Principles for Smart Teaching: A book summarizing the research in disciplines such as education, cognitive science and psychology on the effective techniques for teaching students new material. How assessing student's prior knowledge can help you better design your lessons, how to to ask questions to determine what material students are failing to grasp, how to understand student's motivation for learning and more. Really interesting research.
For the instructor course, we met every couple of weeks online where Greg would conduct a short discussion on some of the topics on a conference call and we would discuss via etherpad interactively. We would then meet in smaller groups later in the week to conduct practice teaching exercises. We also submitted example lessons to the course repo on GitHub. The final project for the course was to conduct a short lesson to a group of instructors that gave feedback, and submit a pull request to update an existing lesson with a fix. Then we are ready to sign up to teach a Software Carpentry course!
In conclusion, data science is a great skill to have if you are managing large distributed systems. Also, using evidence based teaching methods to help others learn is the way to go!
Other fun data science examples include
Tracking down the Villains: Outlier Detection at Netflix - detecting rogue servers with machine learning
Finding Shoe Stores in 100k Merchants: Using Data to Group All Things - finding out what Shopify merchants sell shoes using Apache Spark and more
Looking Through Camera Lenses: The Application of Computer Vision at Etsy
July 07, 2015
Funsize is very close to be enabled in production! It has undergone a Rapid Risk Assessment procedure, which found a couple of potential issues. Most of them are either resolved or waiting for deployment.
To make sure everything works as expected and to catch some last-minute bugs, I added new update channels for Firefox Nightly and Developer Edition. If you are brave enough, you can use the following instructions and change your update channels to either nightly-funsize (for Firefox Nightly) or aurora-funsize (for Firefox Developer Edition).
TL;DR instruction look like this:
- Enable update logging. Set app.update.log to true in about:config. This step is optional, but it may help with debugging possible issues.
- Shut down all running instances of Firefox.
- Edit defaults/pref/channel-prefs.js in your Firefox installation directory and change the line containing app.update.channel with:
- Continue using you Firefox
You can check your channel in the About dialog and it should contain the channel name:
If you see any issues, please report them to Bugzilla.
June 16, 2015
We run SETA on two branches (mozilla-inbound and fx-team) and on 18 types of builds. Collectively, these two branches represent about 20% of pushes each month. Implementing SETA allowed us to move from ~400 -> ~240 jobs per push on these two branches1 We run the tests identified as not reporting regressions on every 10th commit or 90 minutes since the last test was scheduled. We run the critical tests on every commit.2
|Reduction in number of jobs per push on mozilla-inbound as SETA scheduling is rolled out|
A graph for the fx-team branch shows a similar trend. It was a staged rollout starting in early April, as I enabled platforms and as the SETA data became available. The dip in early June reflects where I enabled SETA for Android 4.3.
This data will continue to be updated in our scheduling configuration as it evolves and is updated by the code that Joel and Vaibhav wrote to analyze regressions. The analysis identifies that there were
Jobs to ignore: 440
Jobs to run: 114
Total number of jobs: 554
which is significant. Our buildbot configurations are updated the latest SETA data with every reconfig, which occurs usually occurs every couple of days.
The platforms configured to run fewer tests for both opt and debug are
MacOSX (10.6, 10.10)
Windows (XP, 7, 8)
Ubuntu 12.04 for linux32, linux64 and ASAN x64
Android 2.3 armv7 API 9
Android 4.3 armv7 API 11+
1Tests may have been disabled/added at the same time, this is not taken into account
2There still some scheduling issues to be fixed see bug 1174870 and bug 1174746 for further details
June 12, 2015
Running a large continuous integration farm forces you to deal with many dynamic inputs coupled with capacity constraints. The number of pushes increase. People add more tests. We build and test on a new platform. If the number of machines available remains static, the computing time associated with a single push will increase. You can scale this for platforms that you build and test in the cloud (for us - Linux and Android on emulators), but this costs more money. Adding hardware for other platforms such as Mac and Windows in data centres is also costly and time consuming.
Do we really need to run every test on every commit? If not, which tests should be run? How often do they need to be run in order to catch regressions in a timely manner (i.e. able to bisect where the regression occurred)
Several months ago, jmaher and vaibhav1994, wrote code to analyze the test data and determine the minimum number of tests required to run to identify regressions. They named their software SETA (search for extraneous test automation). They used historical data to determine the minimum set of tests that needed to be run to catch historical regressions. Previously, we coalesced tests on a number of platforms to mitigate too many jobs being queued for too few machines. However, this was not the best way to proceed because it reduced the number of times we ran all tests, not just less useful ones. SETA allows us to run a subset of tests on every commit that historically have caught regressions. We still run all the test suites, but at a specified interval.
|encouragement, Creative Commons by-nc-sa 2.0by ©|
- MacOSX (10.6, 10.10)
- Windows (XP, 7, 8)
- Ubuntu 12.04 for linux32, linux64 and ASAN x64
- Android 2.3 armv7 API 9
As we gather more SETA data for newer platforms, such as Android 4.3, we can implement SETA scheduling for it as well and reduce our test load. We continue to run the full suite of tests on all platforms other branches other than m-i and fx-team, such as mozilla-central, try, and the beta and release branches. If we did miss a regression by reducing the tests, it would appear on other branches mozilla-central. We will continue to update our configs to incorporate SETA data as it changes.
How does SETA scheduling work?
We specify the tests that we would like to run on a reduced schedule in our buildbot configs. For instance, this specifies that we would like to run these debug tests on every 10th commit or if we reach a timeout of 5400 seconds between tests.
Previously, catlee had implemented a scheduling in buildbot that allowed us to coallesce jobs on a certain branch and platform using EveryNthScheduler. However, as it was originally implemented, it didn't allow us to specify tests to skip, such as mochitest-3 debug on MacOSX 10.10 on mozilla-inbound. It would only allow us to skip all the debug or opt tests for a certain platform and branch.
I modified misc.py to parse the configs and create a dictionary for each test specifying the interval at which the test should be skipped and the timeout interval. If the tests has these parameters specified, it should be scheduled using the EveryNthScheduler instead of the default scheduler.
Joel Maher: SETA – Search for Extraneous Test Automation
The number of pushes decreased from those recorded in the previous month (8894) with a total of 8363.
- 8363 pushes
- 270 pushes/day (average)
- Highest number of pushes/day: 445 pushes on May 21, 2015
- 16.03 pushes/hour (highest average)
- Try has around 62% of all the pushes now
- The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account around 27% of all the pushes.
- August 2014 was the month with most pushes (13090 pushes)
- August 2014 had the highest pushes/day average with 422 pushes/day
- July 2014 had the highest average of "pushes-per-hour" with 23.51 pushes/hour
- October 8, 2014 had the highest number of pushes in one day with 715 pushes
The number of pushes decreased from those recorded in the previous month with a total of 8894. This is due to the fact that gaia-try is managed by taskcluster and thus these jobs don't appear in the buildbot scheduling databases anymore which this report tracks.
- 8894 pushes
- 296 pushes/day (average)
- Highest number of pushes/day: 528 pushes on Apr 1, 2015
- 17.87 pushes/hour (highest average)
- Try has around 58% of all the pushes now that we no longer track gaia-try
- The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account around 28% of all the pushes.
- August 2014 was the month with most pushes (13090 pushes)
- August 2014 had the highest pushes/day average with 422 pushes/day
- July 2014 had the highest average of "pushes-per-hour" with 23.51 pushes/hour
- October 8, 2014 had the highest number of pushes in one day with 715 pushes
I've changed the graphs to only track 2015 data. Last month they were tracking 2014 data as well but it looked crowded so I updated them. Here's a graph showing the number of pushes over the last few years for comparison.
April 28, 2015
|pinomoscato, Creative Commons by-nc-sa 2.0by ©|
April 21, 2015
ftp.mozilla.org has been around for a long time in the world of Mozilla, dating back to original source release in 1998. Originally it was a single server, but it’s grown into a cluster storing more than 60TB of data, and serving more than a gigabit/s in traffic. Many projects store their files there, and there must be a wide range of ways that people use the cluster.
This quarter there is a project in the Cloud Services team to move ftp.mozilla.org (and related systems) to the cloud, which Release Engineering is helping with. It would be very helpful to know what functionality people are relying on, so please complete this survey to let us know. Thanks!
April 15, 2015
The number of pushes increased from those recorded in the previous month with a total of 10943.
- 10943 pushes
- 353 pushes/day (average)
- Highest number of pushes/day: 579 pushes on Mar 11, 2015
- 23.18 pushes/hour (highest average)
- Try keeps on having around 49% of all the pushes
- The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account around 26% of all the pushes.
- August 2014 was the month with most pushes (13090 pushes)
- August 2014 had the highest pushes/day average with 422 pushes/day
- July 2014 had the highest average of "pushes-per-hour" with 23.51 pushes/hour
- October 8, 2014 had the highest number of pushes in one day with 715 pushes
March 31, 2015
The nature of Funsize is that we may start hundreds of jobs at the same time, then stop sending new jobs and wait for hours. In other words, the service is very bursty. Elastic Beanstalk is not ideal for this use case. Scaling up and down very fast is hard to configure using EB-only tools. Also, running zero instances is not easy.
I tried using Terraform, Cloud Formation and Auto Scaling, but they were also not well suited. There were too many constrains (e.g. Terraform doesn't support all needed AWS features) and they required considerable bespoke setup/maintenance to auto-scale properly.
The next option was Taskcluster, and I was pleased that its design fitted our requirements very well! I was impressed by the simplicity and flexibility offered.
All tasks are run inside Docker containers which are published on the docker.com registry (other registries can also be used). The task definition essentially comprises of the docker image name and a list of commands it should run (usually this is a single script inside a docker image). In the same task definition you can specify what artifacts should be published by Taskcluster. The artifacts can be public or private.
Things that I really liked
- Predefined task IDs. This is a great idea! There is no need to talk to the Taskcluster APIs to get the ID (or multiple IDs for task graphs) nor need to parse the response. Fire and forget! The task IDs can be used in different places, like artifact URLs, dependant tasks, etc.
- Task graphs. This is basically a collection of tasks that can be run in parallel and can depend on each other. This is a nice way to declare your jobs and know them in advance. If needed, the task graphs can be extended by its tasks (decision tasks) dynamically.
- Simplicity. All you need is to generate a valid JSON document and submit it using HTTP API to Taskcluster.
- User defined docker images. One of the downsides of Buildbot is that you have a predefined list of slaves with predefined environment (OS, installed software, etc). Taskcluster leverages Docker by default to let you use your own images.
Things that could be improved
- Encrypted variables. I spent 2-3 days fighting with the encrypted variables. My scheduler was written in Python, so I tried to use a half dozen different Python PGP libraries, but for some reason all of them were generating an incompatible OpenPGP format that Taskcluster could not understand. This forced me to rewrite the scheduling part in Node.js using openpgpjs. There is a bug to address this problem globally. Also, using ISO time stamps would have saved me hours of time. :)
- It would be great to have a generic scheduler that doesn't require third party Taskcluster consumers writing their own daemons watching for changes (AMQP, VCS, etc) to generate tasks. This would lower the entry barrier for beginners.
March 20, 2015
This project had two major constraints:
1) Use the existing hardware pool (~100 r5 mac minis)
2) Keep wait times sane1. (The machines are constantly running tests most of the day due to the distributed nature of the Mozilla community and this had to continue during the migration.)
So basically upgrade all the machines without letting people notice what you're doing!
|Yosemite Valley - Tunnel View Sunrise by ©jeffkrause, Creative Commons by-nc-sa 2.0|
Why didn't we just buy more minis and add them to the existing pool of test machines?
- We run performance tests and thus need to have all the machines running the same hardware within a pool so performance comparisons are valid. If we buy new hardware, we need to replace the entire pool at once. Machines with different hardware specifications = useless performance test comparisons.
- We tried to purchase some used machines with the same hardware specs as our existing machines. However, we couldn't find a source for them. As Apple stops production of old mini hardware each time they announce a new one, they are difficult and expensive to source.
|Apple Pi by ©apionid, Creative Commons by-nc-sa 2.0|
Given that Yosemite was released last October, why we are only upgrading our test pool now? We wait until the population of users running a new platform2 surpass those the old one before switching.
Mountain Lion -> Yosemite is an easy upgrade on your laptop. It's not as simple when you're updating production machines that run tests at scale.
The first step was to pull a few machines out of production and verify the Puppet configuration was working. In Puppet, you can specify commands to only run certain operating system versions. So we implemented several commands to accommodate changes for Yosemite. For instance, changing the default scrollbar behaviour, new services that interfere with test runs needed to be disabled, debug tests required new Apple security permissions configured etc.
Once the Puppet configuration was stable, I updated our configs so the people could run tests on Try and allocated a few machines to this pool. We opened bugs for tests that failed on Yosemite but passed on other platforms. This was a very iterative process. Run tests on try. Look at failures, file bugs, fix test manifests. Once we had to the opt (functional) tests in a green state on try, we could start the migration.
- Disable selected Mountain Lion machines from the production pool
- Reimage as Yosemite, update DNS and let them puppetize
- Land patches to disable Mountain Lion tests and enable corresponding Yosemite tests on selected branches
- Enable Yosemite machines to take production jobs
- Reconfig so the buildbot master enable new Yosemite builders and schedule jobs appropriately
- Repeat this process in batches
- Enable Yosemite opt and performance tests on trunk (gecko >= 39) (50 machines)
- Enable Yosemite debug (25 more machines)
- Enable Yosemite on mozilla-aurora (15 more machines)
As a I mentioned earlier, the two constraints with this project were to use the existing hardware pool that constantly runs tests in production and keep the existing wait times sane. We encountered two major problems that impeded that goal:
- Persistent and increasing numbers of DNS failures as we migrated more machines to Yosemite. The default Yosemite configuration broadcasts multicast messages via Bonjour. This is fine for a few Apple devices talking to each other in your house. This doesn't scale for 100 machines in a colo. I saw many DNS timeout messages in the system log. This manifested itself large numbers of performance tests failing because they couldn't resolve the name of the graphing server to upload their results. We disabled this multicast broadcast via Puppet and our tests turned green again.
- Debug tests initially took 1.5-2x longer to complete on Yosemite than they did on Mountain Lion. If we deployed these tests in production, our wait times would increase because of the longer time required to complete each test run. This was fixed by bug 1138616 Don't track addref and release counts in the bloat log.
It's a compliment when people say things like "I didn't realize that you updated a platform" because it means the upgrade did not cause large scale fires for all to see. So it was a nice to hear that from one of my colleagues this week.
Thanks to philor, RyanVM and jmaher for opening bugs with respect to failing tests and greening them up. Thanks to coop for many code reviews. Thanks dividehex for reimaging all the machines in batches and to arr for her valiant attempts to source new-to-us minis!
1Wait times represent the time from when a job is added to the scheduler database until it actually starts running. We usually try to keep this to under 15 minutes but this really varies on how many machines we have in the pool.
2We run tests for our products on a matrix of operating systems and operating system versions. The terminology for operating system x version in many release engineering shops is a platform. To add to this, the list of platform we support varies across branches. For instance, if we're going to deprecate a platform, we'll let this change ride the trains to release.
Bug 1121175: [Tracking] Fix failing tests on Mac OSX 10.10
Bug 1121199: Green up 10.10 tests currently failing on try
Bug 1126493: rollout 10.10 tests in a way that doesn't impact wait times
Bug 1144206: investigate what is causing frequent talos failures on 10.10
Bug 1125998: Debug tests initially took 1.5-2x longer to complete on Yosemite
Why don't you just run these tests in the cloud?
- The Apple EULA severely restricts virtualization on Mac hardware.
- I don't know of any major cloud vendors that offer the Mac as a platform. Those that claim they do are actually renting racks of Macs on a dedicated per host basis. This does not have the inherent scaling and associated cost saving of cloud computing. In addition, the APIs to manage the machines at scale aren't there.
- We manage ~350 Mac minis. We have more experience scaling Apple hardware than many vendors. Not many places run CI at Mozilla scale :-) Hopefully this will change and we'll be able to scale testing on Mac products like we do for Android and Linux in a cloud.