Planet Release Engineering

May 28, 2015

Armen Zambrano G. (@armenzg)

mozci 0.7.0 - Fewer network fetches - great speed improvements!

This release is not large in scope, but it brings many performance improvements.
The main improvement is that we have reduced the number of times we fetch information over the network and now use a cache where possible. The network cost was very high.
You can read more about it here:


Thanks to @adusca @parkouss @vaibhavmagarwal for their contributions on this release.

How to update

Run "pip install -U mozci" to update

Major highlights

  • Reduce drastically the number of requests by caching where possible
  • If a failed build has uploaded good files let's use them
  • Added support for retriggering and cancelling jobs
  • Retrigger a job once with a count of N instead of triggering individually N times
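
As an illustration of the caching idea in the highlights above, here is a minimal Python sketch that memoizes fetches by URL so that repeated lookups only hit the network once. It is not mozci's actual implementation; `CachedFetcher`, `fake_fetch` and the example URL are hypothetical names used only for this sketch.

```python
class CachedFetcher:
    """Wrap a fetch function so that each URL is only requested once."""

    def __init__(self, fetch):
        self._fetch = fetch      # hypothetical callable: url -> data
        self._cache = {}         # in-memory cache keyed by URL
        self.network_calls = 0   # number of real fetches performed

    def get(self, url):
        # Only go to the network when the URL has not been seen before.
        if url not in self._cache:
            self.network_calls += 1
            self._cache[url] = self._fetch(url)
        return self._cache[url]


def fake_fetch(url):
    # Stand-in for a real network request.
    return "payload for " + url


fetcher = CachedFetcher(fake_fetch)
first = fetcher.get("https://example.com/jobs.json")
second = fetcher.get("https://example.com/jobs.json")
```

The second call is answered from the cache, so the cost of the network round trip is paid once per URL.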

Minor improvements

  • Documentation updates
  • Add badge

All changes

You can see all changes in here:

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

May 28, 2015 02:01 PM

May 27, 2015

Armen Zambrano G. (@armenzg)

Welcome adusca!

It is my privilege to announce that adusca (blog) joined Mozilla on Monday as an Outreachy intern for the next 4 months.

adusca has made an outstanding number of contributions over the last few months, including to Mozilla CI Tools (which we're working on together).

Here's a bit about her, from her blog:
Hi! I’m Alice. I studied Mathematics in college. I was doing a Master’s degree in Mathematical Economics before getting serious about programming.
She is also a graduate of Hacker School.

Even though Alice has not been a programmer for many years, she has already shown lots of potential. For instance, she wrote a script to generate scheduling relations for buildbot; for this and many other reasons I tip my hat.

adusca will initially help me out with creating a generic pulse listener to handle job cancellations and retriggers for Treeherder. The intent is to create a way for Mozilla CI tools to manage scheduling on behalf of TH, pave the way for more sophisticated Mozilla CI actions and allow other people to piggyback on this pulse service and trigger their own actions.

If you have not yet had a chance to welcome her and get to know her, I highly encourage you to do so.

Welcome Alice!


May 27, 2015 05:10 PM

May 26, 2015

Chris AtLee (catlee)

RelEng 2015 (part 1)

Last week, I had the opportunity to attend RelEng 2015 - the 3rd International Workshop of Release Engineering. This was a fantastic conference, and I came away with lots of new ideas for things to try here at Mozilla.

I'd like to share some of my thoughts and notes I took about some of the sessions. As of yet, the speakers' slides aren't collected or linked to from the conference website. Hopefully they'll get them up soon! The program and abstracts are available here.

For your sake (and mine!) I've split up my notes into a few separate posts. This post covers the introduction and keynote.


"Continuous deployment" of web applications is basically a solved problem today. What remains is for organizations to adopt best practices. Mobile/desktop applications remain a challenge.

Cisco relies heavily on collecting and analyzing metrics to inform their approach to software development. Statistically speaking, quality is the best driver of customer satisfaction. There are many aspects to product quality, but new lines of code introduced per release gives a good predictor of how many new bugs will be introduced. It's always challenging to find enough resources to focus on software quality; being able to correlate quality to customer satisfaction (and therefore market share, $$$) is one technique for getting organizational support for shipping high quality software. Common release criteria such as bugs found during testing, or bug fix rate, are used to inform stakeholders as to the quality of the release.

Introductory Session

Bram Adams and Foutse Khomh kicked things off with an overview of "continuous deployment" over the last 5 years. Back in 2009 we were already talking about systems where pushing to version control would trigger tens of thousands of tests, and do canary deployments up to 50 times a day.

Today we see companies like Facebook demonstrating that continuous deployment of web applications is basically a solved problem. Many organizations are still trying to implement these techniques. Mobile [and desktop!] applications still present a challenge.


Pete Rotella from Cisco discussed how he and his team measured and predicted release quality for various projects at Cisco. His team is quite focused on data and analytics.

Cisco has relatively long release cycles compared to what we at Mozilla are used to now. They release 2-3 times per year, with each release representing approximately 500kloc of new code. Their customers really like predictable release cycles, and also don't like releases that are too frequent. Many of their customers have their own testing / validation cycles for releases, and so are only willing to update for something they deem critical.

Pete described how he thought software projects had four degrees of freedom in which to operate, and how quality ends up being the one sacrificed most often in order to compensate for constraints in the others:

  • resources (people / money): It's generally hard to hire more people or find room in the budget to meet the increasing demands of customers. You also run into the mythical man month problem by trying to throw more people at a problem.

  • schedule (time): Having standard release cycles means organizations don't usually have a lot of room to push out the schedule so that features can be completed properly.

    I feel that at Mozilla, the rapid release cycle has helped us out to some extent here. The theory is that if your feature isn't ready for the current version, it can wait for the next release which is only 6 weeks behind. However, I do worry that we have too many features trying to get finished off in aurora or even in beta.

  • content (features): Another way to get more room to operate is to cut features. However, it's generally hard to cut content or features, because those are what customers are most interested in.

  • quality: Pete believes this is where most organizations steal resources from to make up for people/schedule/content constraints. It's a poor long-term play, and despite "quality is our top priority" being the Official Party Line, most organizations don't invest enough here. What's working against quality?

    • plethora of releases: lots of projects / products / special requests for releases. Attempts to reduce the # of releases have failed on most occasions.
    • monetization of quality is difficult. Pete suggests tying the cost of a poor quality release to this. How many customers will we lose with a buggy release?
    • having RelEng and QA embedded in Engineering teams is a problem; they should be independent organizations so that their recommendations can have more weight.
    • "control point exceptions" are common. e.g. VP overrides recommendations of QA / RelEng and ships the release.

Why should we focus on quality? Pete's metrics show that it's the strongest driver of customer satisfaction. Your product's customer satisfaction needs to be more than 4.3/5 to get more than marginal market share.

How can RelEng improve metrics?

  • simple dashboards
  • actionable metrics - people need to know how to move the needle
  • passive - use existing data. everybody's stretched thin, so requiring other teams to add more metadata for your metrics isn't going to work.
  • standardized quality metrics across the company
  • informing engineering teams about risk
  • correlation with customer experience.

Interestingly, handling the backlog of bugs has minimal impact on customer satisfaction. In addition, there's substantial risk introduced whenever bugs are fixed late in a release cycle. There's an exponential relationship between new lines of code added and # of defects introduced, and therefore customer satisfaction.

Another good indicator of customer satisfaction is the number of "Customer found defects" - i.e. the number of bugs found and reported by their customers vs. bugs found internally.

Pete's data shows that if they can find more than 80% of the bugs in a release prior to it being shipped, then the remaining bugs are very unlikely to impact customers. He uses lines of code added for previous releases, and historical bug counts per version to estimate number of bugs introduced in the current version given the new lines of code added. This 80% figure represents one of their "Release Criteria". If less than 80% of predicted bugs have been found, then the release is considered risky.
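
To make the 80% criterion concrete, here is a small Python sketch. It assumes a simple linear defect density (bugs per kloc) derived from past releases, which is a simplification and not Pete's actual model; the function names and the numbers are made up for illustration.

```python
def predicted_bugs(history, new_kloc):
    """Estimate the bugs introduced by new_kloc of new code.

    history is a list of (kloc_added, bugs_found) pairs for past releases.
    Assumption: bugs scale linearly with lines of code added.
    """
    total_kloc = sum(kloc for kloc, _ in history)
    total_bugs = sum(bugs for _, bugs in history)
    return new_kloc * total_bugs / total_kloc


def is_risky(found_so_far, predicted, threshold=0.8):
    """A release is risky if fewer than 80% of predicted bugs were found."""
    return found_so_far < threshold * predicted


history = [(400, 2000), (600, 3000)]     # made-up past releases
estimate = predicted_bugs(history, 500)  # 500 kloc of new code
risky = is_risky(1900, estimate)         # 1900 bugs found so far
```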

Another release criterion Pete discussed was the weekly rate of fixing bugs. Data shows that good quality releases have the weekly bug fix rate drop to 43% of the maximum rate by the end of the testing cycle. This data demonstrates that changes late in the cycle have a negative impact on software quality. You really want to be fixing fewer and fewer bugs as you get closer to release.
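
A check for this criterion could look like the following Python sketch. It reads "drop to 43%" as "the final week's fix count is at most 43% of the peak week", which is my interpretation rather than something stated in the talk; the function name and the sample data are hypothetical.

```python
def fix_rate_ok(weekly_fixes, target=0.43):
    """Return True if the last week's fix count is <= target * peak week."""
    peak = max(weekly_fixes)
    return weekly_fixes[-1] <= target * peak


winding_down = [10, 40, 100, 80, 40]   # fixes per week, made-up data
still_churning = [10, 40, 100, 90, 70]

good_release = fix_rate_ok(winding_down)
bad_release = fix_rate_ok(still_churning)
```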

I really enjoyed Pete's talk! There are definitely a lot of things to think about, and how we might apply them at Mozilla.

May 26, 2015 06:51 PM

May 25, 2015

Aki Sasaki (aki)

introducing scriptharness

I found myself missing mozharness at various points over the past 10 months. Several things kept me from using it at my then-new job:

I had wanted to address these issues for years, but never had time to devote fully to harness-specific development.

Now I do.

Introducing scriptharness 0.1.0:

I'm proud of this. I'm also aware it's not mature [yet], and it's currently missing some functionality.

There are some ideas I'd love to explore before 1.0.0:

I already have 0.2.0 on the brain. I'd love any feedback or patches.


May 25, 2015 08:32 PM

mozharness turns 5

Five years ago today, I landed the first mozharness commit in my user repo. (github)

starting something, or wasting my time. + a scratch trunk_nightly.json

The project had three initial goals:

Multi-locale Fennec became a reality, and then we started adding projects to mozharness, one by one.

As of last July, mozharness was the client-side engine for the majority of Mozilla's CI and release infrastructure. I still see plenty of activity in bugmail and IRC these days. I'll be the first to point out its shortcomings, but I think overall it has been a success.

Happy birthday, mozharness!


May 25, 2015 08:21 PM

May 15, 2015

Armen Zambrano G. (@armenzg)

mozci 0.6.0 - Trigger based on Treeherder filters, Windows support, flexible and encrypted password management

In this release of mozci we have a lot of developer-facing improvements, such as Windows support and more flexible password management.
We also have our latest experimental script, mozci-triggerbyfilters.

How to update

Run "pip install -U mozci" to update.


We have moved all scripts from scripts/ to mozci/scripts/.
Note that you can now use "pip install" and have all scripts available as mozci-name_of_script_here in your PATH.


We want to welcome @KWierso as our latest contributor!
Our gratitude to @Gijs for reporting the Windows issues and for all his feedback.
Congratulations to @parkouss for making the first project using mozci as its dependency.
In this release we had @adusca and @vaibhavmagarwal as our main and very active contributors.

Major highlights

  • Added script to trigger jobs based on Treeherder filters
    • This allows using filters like --include "web-platform-tests" and that will trigger all matching builders
    • You can also use --exclude to exclude builders you don't want
  • With the new trigger by filters script you can preview what will be triggered:
233 jobs will be triggered, do you wish to continue? y/n/d (d=show details) d
05/15/2015 02:58:17 INFO: The following jobs will be triggered:
Android 4.0 armv7 API 11+ try opt test mochitest-1
Android 4.0 armv7 API 11+ try opt test mochitest-2
  • Remove storing passwords in plain-text (Sorry!)
    • We now prompt users on whether they want to store their password encrypted
  • When you use "pip install" we will also install the main scripts as mozci-name_of_script_here binaries
    • This makes it easier to use the binaries in any location
  • Windows issues
    • The python module is incapable of decompressing large binaries
    • Do not store buildjson on a temp file and then move
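
The --include/--exclude filtering mentioned in the highlights above can be pictured with this minimal Python sketch. It assumes plain substring matching, where every include term must match and any exclude term rejects a builder; the real script may behave differently, and `filter_builders` is a hypothetical name. The last builder name in the sample list is made up.

```python
def filter_builders(buildernames, include=(), exclude=()):
    """Keep builders matching all include terms and no exclude term."""
    selected = []
    for name in buildernames:
        if not all(term in name for term in include):
            continue  # missing a required term
        if any(term in name for term in exclude):
            continue  # contains an unwanted term
        selected.append(name)
    return selected


builders = [
    "Android 4.0 armv7 API 11+ try opt test mochitest-1",
    "Android 4.0 armv7 API 11+ try opt test mochitest-2",
    "Example platform try opt test web-platform-tests-1",  # made-up name
]

mochitests = filter_builders(builders, include=["mochitest"])
without_second = filter_builders(
    builders, include=["mochitest"], exclude=["mochitest-2"]
)
```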

Minor improvements

  • Updated docs
  • Improve wording when triggering a build instead of a test job
  • Loosened up the python requirements from == to >=
  • Added filters to

All changes

You can see all changes in here:


May 15, 2015 08:13 PM

May 08, 2015

Armen Zambrano G. (@armenzg)

mozci 0.5.0 released - Store password in keyring, prevent corrupted data, progress bar and many small improvements

In this release we have many small improvements that help with issues we have found.

The main improvement is that we no longer store credentials in plain text (sorry!) but use keyring to store them encrypted.

We also prevent partial downloads (which leave corrupted data) and added a progress bar to downloads.

Congrats to @chmanchester as our latest contributor!
Our usual and very appreciated contributions are by @adusca @jmaher and @vaibhavmagarwal

Minor improvements:
  • Lots of test changes and increased coverage
  • Do not use the root logger but a mozci logger
  • Allow passing custom files to a triggered job
  • Work around buildbot status corruptions (Issue 167)
  • Allow passing buildernames with lower case and removing trailing spaces (since we sometimes copy/paste from TH)
  • Added support to build a buildername based on trychooser syntax
  • Allow passing extra properties when scheduling a job on Buildbot
You can see all changes in here:

Link to official release notes.


May 08, 2015 10:15 PM

May 05, 2015

Chris Cooper (coop)

Automated reconfigs and you

In an effort to offload yet more work from buildduty, today I deployed scripts to automatically reconfig masters when relevant repos are updated. The process works as follows:

Pretty simple, right?

So how does this change affect you?

In practice, it doesn’t.

Unless you already have an environment setup to run the script, you’ll probably still want to ask buildduty to perform the merge for you. The script has been changed to *only* merge repos by default, but the script also updates the wiki and bugzilla, which is important for maintaining an audit trail.

However, if you *are* willing to perform these extra steps, you can have your changes automatically deployed within the hour.

The eventual goal is to become comfortable enough with our travis coverage that we can move the production tags automatically when the tests pass.

Our tests seem pretty solid now to me, but maybe others have thoughts about how aggressive or cautious we should be here.

May 05, 2015 10:23 PM

May 02, 2015

Morgan Phillips (mrrrgn)

To Serve Developers

The neatest thing about release engineering is the fact that our pipeline forms the primary bridge between users and developers. On one end, we maintain the CI infrastructure that engineers rely on for thorough testing of their code, and, on the other end, we build stable releases and expose them for the public to download. Being in this position means that we have the opportunity to impact the experiences of both contributors and users by improving our systems (it also makes working on them a lot of fun).

Lately, I've become very interested in improving the developer experience by bringing our CI infrastructure closer to contributors. In short, I would like developers to have access to the same environments that we use to test/build their code. This will make it:
[The release pipeline from 50,000ft]


The first part of my plan revolves around integrating release engineering's CI system with a tool that developers are already using: mach; starting with a utility called mozbootstrap -- a system that detects its host operating system and invokes a package manager for installing all of the libraries needed to build Firefox for desktop or Firefox for Android.

The first step here was to make it possible to automate the bootstrapping process (see bug: 1151834 "allow users to bootstrap without any interactive prompts"), and then integrate it into the standing up of our own systems. Luckily, at the moment I'm also porting some of our Linux builds from buildbot to TaskCluster (see bug: 1135206), which necessitates scrapping our old chroot based build environments in favor of docker containers. This fresh start has given me the opportunity to begin this transition painlessly.

This simple change alone strengthens the interface between RelEng and developers, because now we'll be using the same packages (on a given platform). It also means that our team will be actively maintaining a tool used by contributors. I think it's a huge step in the right direction!

What platforms/distributions are you supporting?

Right now, I'm only focusing on Linux, though in the future I expect to support OSX as well. The bootstrap utility supports several distributions (Debian/Ubuntu/CentOS/Arch), though, I've been trying to base all of release engineering's new docker containers on Ubuntu 14.04 -- as such, I'd consider this our canonical distribution. Our old builders were based on CentOS, so it would have been slightly easier to go with that platform, but I'd rather support the platform that the majority of our contributors are using.

What about developers who don't use Ubuntu 14.04, and/or have a bizarre environment?

One fabulous side effect of using TaskCluster is that we're forced to create docker containers for running our jobs, in fact, they even live in mozilla-central. That being the case, I've started a conversation around integrating our docker containers into mozbootstrap, giving it the option to pull down a releng docker container in lieu of bootstrapping a host system.

On my own machine, I've been mounting my src directory inside of a builder and running ./mach build, then ./mach run within it. All of the source, object files, and executables live on my host machine, but the actual building takes place in a black box. This is a very tidy development workflow that's easy to replicate and automate with a few bash functions [which releng should also write/support].
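
The workflow above could be wrapped in a small helper. The Python sketch below only builds the `docker run` command line that mounts a source directory into a container and runs mach inside it. The image name, the `/src` mount point and the function name are assumptions for illustration, not the names of any real releng image or tool.

```python
def mach_in_container(src_dir, image, mach_args, mount_point="/src"):
    """Build a docker command that runs ./mach inside a container.

    src_dir should be an absolute path on the host; it is mounted at
    mount_point so sources and object files stay on the host machine.
    """
    return [
        "docker", "run", "--rm", "-it",
        "-v", "%s:%s" % (src_dir, mount_point),  # share the source tree
        "-w", mount_point,                       # run from the mounted tree
        image,
        "./mach",
    ] + list(mach_args)


# Hypothetical image name and path, for illustration only.
build_cmd = mach_in_container(
    "/home/me/mozilla-central", "example/releng-builder", ["build"]
)
```

Passing `build_cmd` to `subprocess.call` would then run the build inside the container while leaving all artifacts in the host's source directory.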

[A simulation of how I'd like to see developers interacting with our docker containers.]

Lastly, as the final nail in the coffin of hard to reproduce CI bugs, I'd like to make it possible for developers to run our TaskCluster based test/build jobs on their local machines. Either from mach, or a new utility that lives in /testing.

If you'd like to follow my progress toward creating this brave new world -- or heckle me in bugzilla comments -- check out these tickets:

May 02, 2015 06:54 AM

May 01, 2015

Kim Moir (kmoir)

Mozilla pushes - April 2015

Here's April 2015's monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a JSON file.

The number of pushes decreased from the previous month, to a total of 8894. This is because gaia-try is now managed by TaskCluster, so its jobs no longer appear in the buildbot scheduling databases that this report tracks.


General Remarks


I've changed the graphs to only track 2015 data. Last month they were tracking 2014 data as well but it looked crowded so I updated them. Here's a graph showing the number of pushes over the last few years for comparison.

May 01, 2015 04:44 PM

April 28, 2015

Kim Moir (kmoir)

Releng 2015 program now available

Releng 2015 will take place in concert with ICSE in Florence, Italy on May 19, 2015. The program is now available. Register here!

via romana in firenze by ©pinomoscato, Creative Commons by-nc-sa 2.0

April 28, 2015 07:54 PM

Less testing, same great Firefox taste!

Running a large continuous integration farm forces you to deal with many dynamic inputs coupled with capacity constraints. The number of pushes increases. People add more tests. We build and test on a new platform. If the number of machines available remains static, the computing time associated with a single push will increase. You can scale this for platforms that you build and test in the cloud (for us - Linux and Android on emulators), but this costs more money. Adding hardware for other platforms such as Mac and Windows in data centres is also costly and time consuming.

Do we really need to run every test on every commit? If not, which tests should be run? How often do they need to be run in order to catch regressions in a timely manner (i.e. able to bisect where the regression occurred)?

Several months ago, jmaher and vaibhav1994 wrote code to analyze the test data and determine the minimum number of tests required to run to identify regressions. They named their software SETA (search for extraneous test automation). They used historical data to determine the minimum set of tests that needed to be run to catch historical regressions. Previously, we coalesced tests on a number of platforms to mitigate too many jobs being queued for too few machines. However, this was not the best way to proceed because it reduced the number of times we ran all tests, not just the less useful ones. SETA allows us to run a subset of tests on every commit that historically have caught regressions. We still run all the test suites, but at a specified interval.

SETI – The Search for Extraterrestrial Intelligence by ©encouragement, Creative Commons by-nc-sa 2.0
In the last few weeks, I've implemented SETA scheduling in our buildbot configs to use the data from the analysis that Vaibhav and Joel implemented. Currently, it's implemented on the mozilla-inbound and fx-team branches, which in aggregate represent around 19.6% (March 2015 data) of total pushes to the trees. The platforms configured to run fewer pushes for both opt and debug are

As we gather more SETA data for newer platforms, such as Android 4.3, we can implement SETA scheduling for them as well and reduce our test load. We continue to run the full suite of tests on all platforms on branches other than m-i and fx-team, such as mozilla-central, try, and the beta and release branches. If we did miss a regression by reducing the tests, it would appear on other branches such as mozilla-central. We will continue to update our configs to incorporate SETA data as it changes.

How does SETA scheduling work?
We specify the tests that we would like to run on a reduced schedule in our buildbot configs.  For instance, this specifies that we would like to run these debug tests on every 10th commit or if we reach a timeout of 5400 seconds between tests.

Previously, catlee had implemented a scheduler in buildbot that allowed us to coalesce jobs on a certain branch and platform using EveryNthScheduler. However, as it was originally implemented, it didn't allow us to specify individual tests to skip, such as mochitest-3 debug on MacOSX 10.10 on mozilla-inbound. It would only allow us to skip all the debug or opt tests for a certain platform and branch.

I modified to parse the configs and create a dictionary for each test specifying the interval at which the test should be skipped and the timeout interval. If a test has these parameters specified, it is scheduled using the EveryNthScheduler instead of the default scheduler.
There are still some quirks to work out but I think it is working out well so far. I'll have some graphs in a future post on how this reduced our test load. 
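
The decision described above, run a test on every Nth commit or once a timeout has elapsed, can be sketched in Python as follows. This is not the actual EveryNthScheduler code; the config layout, the function name and the test names are assumptions made for the example, using the "every 10th commit or 5400 seconds" values from the text.

```python
def should_run(config, test_name, commits_since_run, seconds_since_run):
    """Decide whether a test should be scheduled for the current commit.

    config maps a test name to (skip_count, skip_timeout_seconds).
    Tests without an entry run on every commit.
    """
    params = config.get(test_name)
    if params is None:
        return True  # no reduced schedule configured
    skip_count, skip_timeout = params
    return (commits_since_run >= skip_count
            or seconds_since_run >= skip_timeout)


# Hypothetical config: run on every 10th commit or after 5400 seconds.
config = {"mochitest-3 debug": (10, 5400)}

skipped = should_run(config, "mochitest-3 debug", 3, 600)
nth_commit = should_run(config, "mochitest-3 debug", 10, 600)
timed_out = should_run(config, "mochitest-3 debug", 3, 5400)
```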

Further reading
Joel Maher: SETA – Search for Extraneous Test Automation

April 28, 2015 06:47 PM

April 27, 2015

Armen Zambrano G. (@armenzg)

mozci hackday - Friday May 1st, 2015

I recently blogged about mozci and I was pleasantly surprised that people are curious about it.

I want to spend Friday fixing some issues on the tool and I wonder if you would like to join me to learn more about it and help me fix some of them.

I will be available as armenzg_mozci from 9am to 5pm EDT on IRC (#ateam channel).
I'm happy to jump on Vidyo to give you a hand understanding mozci.

I hand picked some issues that I could get a hand with.
Documentation and definition of the project in readthedocs.


April 27, 2015 05:30 PM

April 24, 2015

Armen Zambrano G. (@armenzg)

What Mozilla CI tools is and what it can do for you (aka mozci)

Mozci (Mozilla CI tools) is a python library, scripts and package which allows you to trigger jobs on
Not all jobs can be triggered, only those that run on Release Engineering's Buildbot setup. Most (if not all) Firefox desktop and Firefox for Android jobs can be triggered. I believe some B2G jobs can still be triggered.

NOTE: Most B2G jobs are not supported yet since they run on TaskCluster. Support for them will be added this quarter.

Using it

Once you check out the code:
git clone
python develop
you can run scripts like this one (click here for other scripts):
python scripts/ \
  --buildername "Rev5 MacOSX Yosemite 10.10 fx-team talos dromaeojs" \
  --rev e16054134e12 --times 10
which would trigger a specific job 10 times.

NOTE: This works regardless of whether a build job already exists to trigger the test job from. mozci will trigger everything which is required to get you what you need.

Another of the many options lets you trigger the same job for the last X revisions; to do so, use --back-revisions X.

There are many use cases and options listed in here.

A use case for developers

One use case which could be useful to developers (thanks @mike_conley!) is if you pushed to try and used this try syntax: "try: -b o -p win32 -u mochitests -t none". Unfortunately, you later determine that you really need this one: "try: -b o -p linux64,macosx64,win32 -u reftest,mochitests -t none".

In normal circumstances you would go and push again to the try server. However, with mozci (once someone implements this), we could simply pass the new syntax to a script (or use ./mach) and trigger everything that you need, rather than having to push again and waste resources and your time!
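
Since this feature does not exist yet, the following Python sketch only illustrates the first step such a script might take: parsing the two try syntax strings and working out what the new one adds. The function names are hypothetical, only the -b, -p, -u and -t flags from the example are handled, and a real implementation would also have to map the result to buildernames.

```python
import argparse


def parse_try_syntax(message):
    """Parse a 'try: ...' string into a dict of sets, one per flag."""
    prefix = "try:"
    if not message.startswith(prefix):
        raise ValueError("not a try syntax string: %r" % message)
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-b", dest="build_types", default="")
    parser.add_argument("-p", dest="platforms", default="")
    parser.add_argument("-u", dest="unit_tests", default="")
    parser.add_argument("-t", dest="talos", default="")
    args = parser.parse_args(message[len(prefix):].split())
    # Split comma-separated values and drop empty entries.
    return {name: set(value.split(",")) - {""}
            for name, value in vars(args).items()}


def added_by(old, new):
    """Return, per flag, the values present in new but not in old."""
    return {name: new[name] - old[name] for name in new}


pushed = parse_try_syntax("try: -b o -p win32 -u mochitests -t none")
wanted = parse_try_syntax(
    "try: -b o -p linux64,macosx64,win32 -u reftest,mochitests -t none"
)
extra = added_by(pushed, wanted)
```

Here `extra` shows that the second syntax adds two platforms and one test suite. Note that every wanted suite would have to be triggered on the new platforms, not just the newly added suite.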

If you have other use cases, please file an issue in here.

If you want to read about the definition of the project, vision, use cases or FAQ please visit the documentation.


April 24, 2015 08:23 PM

Firefox UI update testing

We currently trigger UI update tests for Firefox releases manually. There are automated headless update verification tests but they don't test the UI of Firefox.

The goal is to integrate this UI update testing as part of the Firefox releases.
This will require changes to firefox-ui-tests, buildbot scheduling, Marionette and other Mozbase packages. The ultimate goal is to speed up our turnaround on releases.

The update testing code was recently ported from Mozmill to use Marionette to drive the testing.

I've already written some documentation on how to run the update verification using Release Engineering configuration files. You can use my tools repository until the code lands (update_testing is the branch to be used).

My deliverable is to ensure that the update testing works reliably on Release Engineering infrastructure and that scheduling code exists for it.

You can read more about this project in bug 1148546.


April 24, 2015 02:42 PM

April 21, 2015

Nick Thomas (nthomas)

Changes coming to has been around for a long time in the world of Mozilla, dating back to original source release in 1998. Originally it was a single server, but it’s grown into a cluster storing more than 60TB of data, and serving more than a gigabit/s in traffic. Many projects store their files there, and there must be a wide range of ways that people use the cluster.

This quarter there is a project in the Cloud Services team to move (and related systems) to the cloud, which Release Engineering is helping with. It would be very helpful to know what functionality people are relying on, so please complete this survey to let us know. Thanks!

April 21, 2015 02:47 AM

April 20, 2015

Chris AtLee (catlee)

RelEng Retrospective - Q1 2015

RelEng had a great start to 2015. We hit some major milestones on projects like Balrog and were able to turn off some old legacy systems, which is always an extremely satisfying thing to do!

We also made some exciting new changes to the underlying infrastructure, got some projects off the drawing board and into production, and drastically reduced our test load!

Firefox updates



All Firefox update queries are now being served by Balrog! Earlier this year, we switched all Firefox update queries off of the old update server to the new update server, codenamed Balrog.

Already, Balrog has enabled us to be much more flexible in handling updates than the previous system. As an example, in bug 1150021, the About Firefox dialog was broken in the Beta version of Firefox 38 for users with RTL locales. Once the problem was discovered, we were able to quickly disable updates just for those users until a fix was ready. With the previous system it would have taken many hours of specialized manual work to disable the updates for just these locales, and to make sure they didn't get updates for subsequent Betas.

Once we were confident that Balrog was able to handle all previous traffic, we shut down the old update server (aus3). aus3 was also one of the last systems relying on CVS (!! I know, rite?). It's a great feeling to be one step closer to axing one more old system!


When we started the quarter, we had an exciting new plan for generating partial updates for Firefox in a scalable way.

Then we threw out that plan and came up with an EVEN MOAR BETTER plan!

The new architecture for funsize relies on Pulse for notifications about new nightly builds that need partial updates, and uses TaskCluster for doing the generation of the partials and publishing to Balrog.

The current status of funsize is that we're using it to generate partial updates for nightly builds, but they are not yet published to the regular nightly update channel.

There's lots more to say here...stay tuned!

FTP & S3

Brace yourselves... ftp is going away, in its current incarnation at least.

Expect to hear MUCH more about this in the coming months.

tl;dr is that we're migrating as much of the Firefox build/test/release automation to S3 as possible.

The existing machinery behind will be going away near the end of Q3. We have some ideas of how we're going to handle migrating existing content, as well as handling new content. You should expect that you'll still be able to access nightly and CI Firefox builds, but you may need to adjust your scripts or links to do so.

Currently we have most builds and tests doing their transfers to/from S3 via the TaskCluster index in addition to doing parallel uploads to ftp. We're aiming to shut off most uploads to ftp this quarter.

Please let us know if you have particular systems or use cases that rely on the current host or directory structure!

Release build promotion

Our new Firefox release pipeline got off the drawing board, and the initial proof-of-concept work is done.

The main idea here is to take an existing build based on a push to mozilla-beta, and to "promote" it to a release build. So we need to generate all the l10n repacks, partner repacks, generate partial updates, publish files to CDNs, etc.

The big win here is that it cuts our time-to-release nearly in half, and also simplifies our codebase quite a bit!

Again, expect to hear more about this in the coming months.


In addition to all those projects in development, we also tackled quite a few important infrastructure projects.

OSX test platform

10.10 is now the most widely used Mac platform for Firefox, and it's important to test what our users are running. We performed a rolling upgrade of our OS X testing environment, migrating from 10.8 to 10.10 while spending nearly zero capital, and with no downtime. We worked jointly with the Sheriffs and A-Team to green up all the tests, and shut coverage off on the old platform as we brought it up on the new one. We have a few 10.8 machines left riding the trains that will join our 10.10 pool with the release of ESR 38.1.

Got Windows builds in AWS

We saw the first successful builds of Firefox for Windows in AWS this quarter as well! This paves the way for greater flexibility, on-demand burst capacity, faster developer prototyping, and disaster recovery and resiliency for Windows Firefox builds. We'll be working on making these virtualized instances more performant and on being able to do large-scale automation before we roll them out into production.

Puppet on Windows

RelEng uses puppet to manage our Linux and OS X infrastructure. Presently, we use a very different tool chain, Active Directory and Group Policy Object, to manage our Windows infrastructure. This quarter we deployed a prototype Windows build machine which is managed with puppet instead. Our goal here is to increase visibility and hackability of our Windows infrastructure. A common deployment tool will also make it easier for RelEng and community to deploy new tools to our Windows machines.

New Tooltool Features

We've redesigned and deployed a new version of tooltool, the content-addressable store for large binary files used in build and test jobs. Tooltool is now integrated with RelengAPI and uses S3 as a backing store. This gives us scalability and a more flexible permissioning model that, in addition to serving public files, will allow the same access outside the releng network as inside. That means that developers as well as external automation like TaskCluster can use the service just like Buildbot jobs. The new implementation also boasts a much simpler HTTP-based upload mechanism that will enable easier use of the service.
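For readers unfamiliar with the term, a content-addressable store names each file by a digest of its bytes rather than by a path. The following is a minimal in-memory sketch of that idea only; it is not tooltool's implementation, and the `ContentStore` class and the choice of SHA-512 here are assumptions made for illustration.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressable store keyed by SHA-512 digest."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        """Store data and return the digest that now identifies it."""
        digest = hashlib.sha512(data).hexdigest()
        self._blobs[digest] = data  # identical content always lands on the same key
        return digest

    def get(self, digest: str) -> bytes:
        """Fetch data by digest, verifying it still matches its name."""
        data = self._blobs[digest]
        if hashlib.sha512(data).hexdigest() != digest:
            raise ValueError("stored content does not match its digest")
        return data


store = ContentStore()
key = store.put(b"large binary toolchain archive")
```

Because the key is derived from the content, uploading the same file twice is harmless, and a job that pins a digest always gets exactly the bytes it asked for.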

Centralized POSIX System Logging

Using syslogd/rsyslogd and Papertrail, we've set up centralized system logging for all our POSIX infrastructure. Now that all our system logs are going to one location and we can see trends across multiple machines, we've been able to quickly identify and fix a number of previously hard-to-discover bugs. We're planning on adding additional logs (like Windows system logs) so we can do even greater correlation. We're also in the process of adding more automated detection and notification of some easily recognizable problems.

Security work

Q1 included some significant effort to avoid serious security exploits like GHOST, escalation of privilege bugs in the Linux kernel, etc. We manage 14 different operating systems, some of which are fairly esoteric and/or no longer supported by the vendor, and we worked to backport some code and patches to some platforms while upgrading others entirely. Because of the way our infrastructure is architected, we were able to do this with minimal downtime or impact to developers.

API to manage AWS workers

As part of our ongoing effort to automate the loaning of releng machines when required, we created an API layer to facilitate the creation and loan of AWS resources, which was previously, and perhaps ironically, one of the bigger time-sinks for buildduty when loaning machines.

Cross-platform worker for task cluster

Release engineering is in the process of migrating from our stalwart, buildbot-driven infrastructure, to a newer, more purpose-built solution in taskcluster. Many FirefoxOS jobs have already migrated, but those all conveniently run on Linux. In order to support the entire range of release engineering jobs, we need support for Mac and Windows as well. In Q1, we created what we call a "generic worker," essentially a base class that allows us to extend taskcluster job support to non-Linux operating systems.


Last, but not least, we deployed initial support for SETA, the search for extraneous test automation!

This means we've stopped running all tests on all builds. Instead, we use historical data to identify the tests that have been catching the most regressions, and we run those. Other tests are run less frequently.
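One way to picture "use historical data to pick the tests that catch regressions" is a greedy cover: keep choosing the job that caught the most not-yet-covered regressions until every known regression is covered by some chosen job. This is a sketch under that assumption, not a description of how SETA is actually implemented; the `select_jobs` name, the job names and the regression ids are hypothetical.

```python
def select_jobs(regressions):
    """Greedily pick jobs so every past regression is caught by a chosen job.

    regressions: dict mapping a regression id to the set of job names
    that failed (i.e. caught it) when it landed.
    """
    uncovered = {reg for reg, jobs in regressions.items() if jobs}
    selected = []
    while uncovered:
        # Count how many still-uncovered regressions each job caught.
        counts = {}
        for reg in uncovered:
            for job in regressions[reg]:
                counts[job] = counts.get(job, 0) + 1
        # Most regressions first; ties broken alphabetically for determinism.
        best = min(counts, key=lambda job: (-counts[job], job))
        selected.append(best)
        uncovered = {reg for reg in uncovered if best not in regressions[reg]}
    return selected


history = {
    "regression-a": {"mochitest-1", "xpcshell"},
    "regression-b": {"mochitest-1"},
    "regression-c": {"reftest"},
}
high_value = select_jobs(history)
```

In this toy history, `xpcshell` never caught anything that `mochitest-1` missed, so it would be a candidate for running less often.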

April 20, 2015 11:00 AM

April 15, 2015

Kim Moir (kmoir)

Mozilla pushes - March 2015

Here's March 2015's  monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.

The number of pushes increased from those recorded in the previous month with a total of 10943. 


General Remarks


April 15, 2015 02:18 PM

April 12, 2015

Massimo Gervasini (mgerva)

Buildduty April 2015

This month I am on Buildduty with Callek. We cover the European and the East Coast time zones.

What is “buildduty”?

From the buildduty wiki page

“Every month, there is one person from the Release Engineering (releng) team dedicated to helping out developers with releng-related issues. This person will be available during his or her regular work hours for the whole month. This is similar to the sheriff role that rotates through the sheriffing team. To avoid confusion, the releng sheriff position is known as “buildduty.””

My impressions after two weeks

I haven’t covered the buildduty role for the last year, and I have to admit the idea of taking this role for a month was giving me some headaches. Buildduty used to be a source of stress, and this is the reason why each person was assigned to the buildduty role for a single week.
My feelings were hugely exaggerated; buildduty is not as painful as it was in the past. Covering this role for just two weeks made me appreciate the incredible work done to make buildduty a better experience. A huge “thank you” to all who contributed to this!

What to expect

Recent changes have created a virtuous cycle, where we can spend less time on manual work and we can invest more time to push the automation further. In particular we are working on two specific areas for the near future:


April 12, 2015 08:59 AM

April 07, 2015

Justin Wood (Callek)

Find our footing on python best practices, of yesteryear.

In the beginning there was fire buildbot. This was Wed, 13 Feb 2008 for the first commit in the repository buildbot-configs.

For context, at this time:

In picking buildbot as our tool we were improving vastly on the decade old technology we had at the time (tinderbox) which was also written in oft-confusing and not-as-shiny perl (we love to hate it now, but it was a good language) [see relevant: image of then-new cutting edge technology but strung together in clunky ways]

As such, we at Mozilla Release Engineering, while just starting to realize the benefits of CI for tests in our main products (like Firefox), were not accustomed to it.

We were writing our buildbot-related code in 3 main repositories at the time (buildbot-configs, buildbotcustom, and tools) all of which we still use today.

Fast forward 5 years and you would have seen some common antipatterns in large codebases… (over 203k lines of code!) It was hard to even read most code, let alone hack on it. Each patch required lots of headspace, and we would consistently break things with patches that were not well tested (even when we tried).

It was at a workweek here in 2013 that catlee got our group agreement on trying to improve that situation by continually running autopep8 over the codebase until there was no (or few) changes with each pass.
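The "keep running the formatter until a pass changes nothing" approach is a fixed-point loop. Here is a small sketch of that loop with the formatter passed in as a function; `format_until_stable` and the toy `collapse_spaces` formatter are hypothetical stand-ins, and a real run would plug in a call to autopep8 instead.

```python
def format_until_stable(source, formatter, max_passes=10):
    """Apply formatter repeatedly until its output stops changing.

    Returns (final_source, passes), where passes counts every formatter
    call, including the last one that confirmed nothing changed.
    """
    for passes in range(1, max_passes + 1):
        formatted = formatter(source)
        if formatted == source:
            return source, passes
        source = formatted
    raise RuntimeError("formatter did not converge in %d passes" % max_passes)


def collapse_spaces(text):
    """Toy formatter: one pass halves each run of spaces."""
    return text.replace("  ", " ")


result, passes = format_until_stable("a    b", collapse_spaces)
```

The `max_passes` guard matters in practice: a formatter whose fixes undo each other would otherwise loop forever.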

Thus began our first attempt at bringing our processes to what we call our modern practices.

This reduced our pep8 error rate, in buildbotcustom and tools alone, from ~7,139 to ~1,999. (In contrast, our current rate for those two repos is ~1,485.)

(NOTE: This is a good contributor piece, to drive pep8 errors/warnings down to 0 for any of our repos, such as these. We can then make our current tests fail if pep8 fails. Though newer repos started with pep8 compliance, older ones did not. See List of Repositories to pick some if you want to try. It’s not glorious work, but it makes everyone more productive once it’s done.)

The one place where we agreed pep8 wasn’t for us was line length: we have had many cases where a single line (or even a URL) barely fits in 80 characters for legitimate reasons, and we felt that arbitrarily limiting variable names or nesting depth just to satisfy that restriction was going to reduce readability. Therefore we generally use --max-line-length of ~159 when validating against pep8. (The above numbers do not account for --max-line-length.)

Around this time we had also set up an internal-only Jenkins instance as a test for validating at least pep8 and its trends; we have since found Jenkins not to be suitable for what we wanted.

Stay tuned to this blog for more history and how we arrived at some best practices that most don’t take for granted these days.

April 07, 2015 01:52 AM

March 31, 2015

Rail Alliev (rail)

Taskcluster: First Impression

Good news. We decided to redesign Funsize a little and now it uses Taskcluster!

The nature of Funsize is that we may start hundreds of jobs at the same time, then stop sending new jobs and wait for hours. In other words, the service is very bursty. Elastic Beanstalk is not ideal for this use case. Scaling up and down very fast is hard to configure using EB-only tools. Also, running zero instances is not easy.

I tried using Terraform, Cloud Formation and Auto Scaling, but they were also not well suited. There were too many constraints (e.g. Terraform doesn't support all needed AWS features) and they required considerable bespoke setup/maintenance to auto-scale properly.

The next option was Taskcluster, and I was pleased that its design fitted our requirements very well! I was impressed by the simplicity and flexibility offered.

I have implemented a service which consumes Pulse messages for particular buildbot jobs. For nightly builds, it schedules a task graph with three tasks:

  • generate a partial MAR
  • sign it (at the moment a dummy task)
  • publish to Balrog

All tasks are run inside Docker containers which are published on the registry (other registries can also be used). The task definition essentially consists of the Docker image name and a list of commands it should run (usually this is a single script inside the Docker image). In the same task definition you can specify what artifacts should be published by Taskcluster. The artifacts can be public or private.
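As a rough picture of the three-task graph described above, here is a sketch that builds such a graph as a plain JSON-serializable dictionary, with task IDs generated locally before anything is submitted. The field names (`taskId`, `requires`, `payload`, and so on), the image name and the script paths are assumptions for illustration and should not be read as the exact Taskcluster schema.

```python
import base64
import json
import uuid


def new_task_id():
    """Generate a 22-character URL-safe ID locally from a random UUID."""
    raw = base64.urlsafe_b64encode(uuid.uuid4().bytes)
    return raw.decode("ascii").rstrip("=")


def build_graph(image, steps):
    """Build a linear graph: each step depends on the one before it.

    steps: list of (name, command) tuples, where command is a list of
    strings to run inside the Docker image.
    """
    tasks = []
    previous_id = None
    for name, command in steps:
        task_id = new_task_id()
        tasks.append({
            "taskId": task_id,
            "requires": [previous_id] if previous_id else [],
            "task": {
                "metadata": {"name": name},
                "payload": {"image": image, "command": command},
            },
        })
        previous_id = task_id
    return {"tasks": tasks}


# Hypothetical image and script names, for illustration only.
graph = build_graph("example/funsize-worker", [
    ("generate partial MAR", ["/scripts/generate.sh"]),
    ("sign partial MAR", ["/scripts/sign.sh"]),
    ("publish to Balrog", ["/scripts/publish.sh"]),
])
document = json.dumps(graph, indent=2)
```

Because the IDs exist before submission, the publish step can reference the signing task (or its artifact URL) without waiting for any API response.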

Things that I really liked

  • Predefined task IDs. This is a great idea! There is no need to talk to the Taskcluster APIs to get the ID (or multiple IDs for task graphs), nor to parse the response. Fire and forget! The task IDs can be used in different places, like artifact URLs, dependent tasks, etc.
  • Task graphs. This is basically a collection of tasks that can be run in parallel and can depend on each other. This is a nice way to declare your jobs and know them in advance. If needed, the task graphs can be extended by its tasks (decision tasks) dynamically.
  • Simplicity. All you need is to generate a valid JSON document and submit it using HTTP API to Taskcluster.
  • User defined docker images. One of the downsides of Buildbot is that you have a predefined list of slaves with predefined environment (OS, installed software, etc). Taskcluster leverages Docker by default to let you use your own images.

Things that could be improved

  • Encrypted variables. I spent 2-3 days fighting with the encrypted variables. My scheduler was written in Python, so I tried to use a half dozen different Python PGP libraries, but for some reason all of them were generating an incompatible OpenPGP format that Taskcluster could not understand. This forced me to rewrite the scheduling part in Node.js using openpgpjs. There is a bug to address this problem globally. Also, using ISO time stamps would have saved me hours of time. :)
  • It would be great to have a generic scheduler that doesn't require third party Taskcluster consumers writing their own daemons watching for changes (AMQP, VCS, etc) to generate tasks. This would lower the entry barrier for beginners.


There are many other things that can be improved (and I believe they will!) - Taskcluster is still a new project. Regardless of this, it is very flexible, easy to use and develop. I would recommend using it!

Many thanks to garndt, jonasfj and lightsofapollo for their support!

March 31, 2015 12:47 PM

March 28, 2015

Jordan Lund (jlund)

Mozharness is moving into the forest

Since its beginnings, Mozharness has been living in its own world (repo). That's about to change. Next quarter we are going to be moving it in-tree.

what's Mozharness?

it's a configuration driven script harness

why in tree?

  1. First and foremost: transparency.
    • There is an overarching goal to provide developers the keys to manage and stand up their own builds & tests (AKA self-serve). Having the automation step logic side by side to the compile and test step logic provides developers transparency and a sense of determinism. Which leads to reason number 2.
  2. deterministic builds & tests
    • This is somewhat already in place thanks to Armen's work on pinning specific Mozharness revisions to in-tree revisions. However, the pins can fall behind the latest Mozharness revisions, so we often end up landing multiple Mozharness changes at once against one in-tree revision.
  3. Mozharness automated build & test jobs are not just managed by Buildbot anymore. Taskcluster is starting to take the weight off Buildbot's hands and, because of its own behaviour, Mozharness is better suited in-tree.
  4. ateam is going to put effort this quarter into unifying how we run tests locally vs automation. Having mozharness in-tree should make this easier

this sounds great. why wouldn't we want to do this?

There are downsides. It arguably puts extra strain on Release Engineering for managing infra health. Though issues will be more isolated, it does become trickier to have a higher view of when and where Mozharness changes land.

In addition, there is going to be more friction for deployments. This is because a number of our Mozharness scripts are not directly related to continuous integration jobs: e.g. releases, vcs-sync, b2g bumper, and merge tasks.

why wasn't this done yester-year?

Mozharness now handles > 90% of our build and test jobs. Its internal components: config, script, and log logic, are starting to mature. However, this wasn't always the case.

When it was being developed and its uses were unknown, it made sense to develop on the side and tie itself close to buildbot deployments.

okay. I'm sold. can we just simply hg add mozharness?

Integrating Mozharness in-tree comes with a few challenges:

  1. chicken and egg issue

    • currently, for build jobs, Mozharness is in charge of managing version control of the tree itself. How can Mozharness checkout a repo if it itself lives within that repo?
  2. test jobs don't require the src tree

    • test jobs only need a binary and a It doesn't make sense to keep a copy of our branches on each machine that runs tests. In line with that, putting mozharness inside also leads us back to a similar 'chicken and egg' issue.
  3. which branch and revisions do our release engineering scripts use?

  4. how do we handle releases?

  5. how do we not cause extra load on hg.m.o?

  6. what about integrating into Buildbot without interruption?

it's easy!

This shouldn't be too hard to solve. Here is a basic outline of my plan of action and road map for this goal:

This is a loose outline of the integration strategy. What I like about this:

  1. no code change required within Mozharness' code
  2. there is very little code change within Buildbot
  3. allows Taskcluster to use Mozharness in whatever way it likes
  4. no chicken and egg problem as (in Buildbot world), Mozharness will exist before the tree exists on the slave
  5. no need to manage multiple repos and keep them in sync

I'm sure I am not taking into account many edge cases and I look forward to hitting those edges head on as I start this in Q2. Stay tuned for further developments.

One day, I'd like to see Mozharness (at least its internal parts) be made into isolated python packages installable by pip. However, that's another problem for another day.

Questions? Concerns? Ideas? Please comment here or in the tracking bug

March 28, 2015 11:10 PM

March 26, 2015

Morgan Phillips (mrrrgn)

Whoop, Whoop: Pull Up!

Since December 1st 1975, by FAA mandate, no plane has been allowed to fly without a "Ground Proximity Warning System" GPWS (or one of its successors).[1] For good reason too, as it's been figured that 75% of the fatalities just one year prior (1974) could have been prevented using the system.[2]

In a slew of case studies, reviewers reckoned that a GPWS may have prevented crashes by giving pilots additional time to act before they smashed into the ground. Often, the GPWS's signature "Whoop, Whoop: Pull Up!" would have sounded a full fifteen seconds before any other alarms triggered.[3]

Instruments like this are indispensable to aviation because pilots operate in an environment outside of any realm where human intuition is useful. Lacking augmentation, our bodies and minds are simply not suited to the task of flying airliners.

For the same reason, thick layers of instrumentation and early warning systems are necessary for managing technical infrastructure. Like pilots, without proper tooling, system administrators often plow their vessels into the earth....

The St. Patrick's Day Massacre

Case in point, on Saint Patrick's Day we suffered two outages which could have likely been avoided via some additional alerts and a slightly modified deployment process.

The first outage was caused by the accidental removal of a variable from a config file which one of our utilities depends on. Our utilities are all managed by a dependency system called runner, and when any task fails the machine is prevented from doing work until it succeeds. This all-or-nothing behavior is correct, but should not lead to closed trees....

On our runner dashboards, the whole event looked like this (the smooth decline on the right is a fix being rolled out with ansible):

The second, and most severe, outage was caused by an insufficient wait time between retries upon failing to pull from our mercurial repositories.

There was a temporary disruption in service, and a large number of slaves failed to clone a repository. When this herd of machines began retrying the task it became the equivalent of a DDoS attack.
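A common way to keep a herd of retrying machines from hammering a recovering server is exponential backoff with random jitter, so that clients spread their retries out instead of retrying in lockstep. The sketch below shows the general technique only; it is not the change that was made to runner, and the function name and default values are assumptions.

```python
import random


def backoff_delays(attempts, base=5.0, cap=600.0, rng=None):
    """Return one randomized wait (in seconds) per retry attempt.

    The upper bound doubles with each attempt, up to cap, and the actual
    wait is drawn uniformly between 0 and that bound ("full jitter").
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


# A seeded generator makes the example reproducible.
delays = backoff_delays(8, rng=random.Random(0))
```

A caller would sleep for `delays[i]` before retry `i`. Since every machine draws its own random waits, a few hundred slaves that fail at the same moment no longer come back at the same moment.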

From the repository's point of view, the explosion looked like this:

Then, from runner's point of view, the retrying task:

In both of these cases, despite having the data (via runner logging), we missed the opportunity to catch the problem before it caused system downtime. Furthermore, especially in the first case, we could have avoided the issue even earlier by testing our updates and rolling them out gradually.

Avoiding Future Massacres

After these fires went out, I started working on a RelEng version of the Ground Proximity Warning System, to keep us from crashing in the future. Here's the plan:

1.) Bug 1146974 - Add automated alerting for abnormally high retries (in runner).

In both of the above cases, we realized that things had gone amiss based on job backlog alerts. The problem is, once we have a large enough backlog to trigger those alarms, we're already hosed.

The good news is, the backlog is preceded by a spike in runner retries. Setting up better alerting here should buy us as much as an extra hour to respond to trouble.

We're already logging all task results to influxdb, but, alerting via that data requires a custom nagios script. Instead of stringing that together, I opted to write runner output to syslog where it's being aggregated by papertrail.

Using papertrail, I can grep for runner retries and build alarms from the data. Below is a screenshot of our runner data in the papertrail dashboard:
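The alerting idea above boils down to counting retry events in a sliding time window and firing once the count passes a threshold. Here is a minimal sketch of that logic; the `RetryAlarm` class, the log line format, and the window and threshold values are all hypothetical, and this is not how papertrail's own alerts are configured.

```python
from collections import deque


class RetryAlarm:
    """Fire when more than `threshold` retries occur within the window."""

    def __init__(self, window_seconds=300, threshold=50):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = deque()  # timestamps of recent retry lines

    def record(self, timestamp, line):
        """Feed one log line (timestamps must be non-decreasing).

        Returns True when the alarm should fire.
        """
        if "retry" in line.lower():
            self._events.append(timestamp)
        # Drop retries that have fallen out of the window.
        while self._events and self._events[0] <= timestamp - self.window_seconds:
            self._events.popleft()
        return len(self._events) > self.threshold


alarm = RetryAlarm(window_seconds=60, threshold=3)
fired = [alarm.record(t, "runner: task checkout failed, retry") for t in (0, 10, 20, 30)]
```

In this run the alarm stays quiet for the first three retries and fires on the fourth, since four retries inside 60 seconds exceeds the threshold of three.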

2.) Add automated testing, and tiered roll-outs to golden ami generation

Finally, when we update our slave images the new version is not rolled out in a precise fashion. Instead, as old images die (3 hours after the new image releases) new ones are launched on the latest version. Because of this, every deploy is an all-or-nothing affair.

By the time we notice a problem, almost all of our hosts are using the bad instance and rolling back becomes a huge pain. We also do rollbacks by hand. Nein, nein, nein.

My plan here is to launch new instances with a weighted chance of picking up the latest ami. As we become more confident that things aren't breaking -- by monitoring the runner logs in papertrail/influxdb -- we can increase the percentage.

The new process will work like this:

Lastly, if we want to roll back, we can just lower the percentage down to zero while we figure things out. This also means that we can create sanity checks which roll back bad amis without any human intervention whatsoever.

The intention being, any failure within the first 90 minutes will trigger a rollback and keep the doors open....
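The weighted roll-out can be sketched in a few lines: each new instance picks the latest ami with some probability, that probability is raised while things look healthy, and it drops to zero on any failure. This is an illustration of the plan as described, not deployed code; the function names, the step size and the ami ids are made up.

```python
import random


def choose_ami(stable_ami, latest_ami, latest_weight, rng=random):
    """Pick the latest ami with probability latest_weight, else the stable one."""
    if not 0.0 <= latest_weight <= 1.0:
        raise ValueError("latest_weight must be between 0 and 1")
    return latest_ami if rng.random() < latest_weight else stable_ami


def next_weight(current, failures_seen, step=0.25):
    """Raise the weight gradually; drop to zero (roll back) on any failure."""
    if failures_seen:
        return 0.0
    return min(1.0, current + step)


rng = random.Random(42)  # seeded so the example is reproducible
picks = [choose_ami("ami-old", "ami-new", 0.25, rng) for _ in range(10000)]
share_on_latest = picks.count("ami-new") / len(picks)
```

With a weight of 0.25, roughly a quarter of new instances land on the latest image, so a bad image affects only a fraction of the pool before the weight is pulled back to zero.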

March 26, 2015 11:55 PM

Armen Zambrano G. (@armenzg)

mozci 0.4.0 released - Many bug fixes and improved performance

For the release notes with all their hyperlinks go here.

NOTE: I did a 0.3.1 release but the right number should have been 0.4.0

This release does not add any major features, however, it fixes many issues and has much better performance.

Many thanks to @adusca, @jmaher and @vaibhavmagarwal for their contributions.



For all changes visit: 0.3.0...0.4.0


March 26, 2015 08:36 PM

March 20, 2015

Kim Moir (kmoir)

Scaling Yosemite

We migrated most of our Mac OS X 10.8 (Mountain Lion) test machines to 10.10.2 (Yosemite) this quarter.

This project had two major constraints:
1) Use the existing hardware pool (~100 r5 mac minis)
2) Keep wait times sane[1]. (The machines are constantly running tests most of the day due to the distributed nature of the Mozilla community and this had to continue during the migration.)

So basically upgrade all the machines without letting people notice what you're doing!

Yosemite Valley - Tunnel View Sunrise by ©jeffkrause, Creative Commons by-nc-sa 2.0

Why didn't we just buy more minis and add them to the existing pool of test machines?
  1. We run performance tests and thus need to have all the machines running the same hardware within a pool so performance comparisons are valid.  If we buy new hardware, we need to replace the entire pool at once.  Machines with different hardware specifications = useless performance test comparisons.
  2. We tried to purchase some used machines with the same hardware specs as our existing machines.  However, we couldn't find a source for them.  As Apple stops production of old mini hardware each time they announce a new one, they are difficult and expensive to source.
Apple Pi by ©apionid, Creative Commons by-nc-sa 2.0

Given that Yosemite was released last October, why are we only upgrading our test pool now? We wait until the population of users running a new platform[2] surpasses that of the old one before switching.

Mountain Lion -> Yosemite is an easy upgrade on your laptop.  It's not as simple when you're updating production machines that run tests at scale.

The first step was to pull a few machines out of production and verify the Puppet configuration was working. In Puppet, you can specify that commands run only on certain operating system versions, so we implemented several commands to accommodate changes for Yosemite: for instance, changing the default scrollbar behaviour, disabling new services that interfere with test runs, and configuring the new Apple security permissions that debug tests required.

Once the Puppet configuration was stable, I updated our configs so that people could run tests on Try, and allocated a few machines to this pool. We opened bugs for tests that failed on Yosemite but passed on other platforms. This was a very iterative process. Run tests on Try. Look at failures, file bugs, fix test manifests. Once we had the opt (functional) tests in a green state on Try, we could start the migration.

Migration strategy
We currently have 14 machines left on Mountain Lion for mozilla-beta and mozilla-release branches.

As I mentioned earlier, the two constraints with this project were to use the existing hardware pool that constantly runs tests in production and keep the existing wait times sane. We encountered two major problems that impeded that goal:

It's a compliment when people say things like "I didn't realize that you updated a platform" because it means the upgrade did not cause large scale fires for all to see. So it was nice to hear that from one of my colleagues this week.

Thanks to philor, RyanVM and jmaher for opening bugs with respect to failing tests and greening them up. Thanks to coop for many code reviews. Thanks to dividehex for reimaging all the machines in batches and to arr for her valiant attempts to source new-to-us minis!

[1] Wait times represent the time from when a job is added to the scheduler database until it actually starts running. We usually try to keep this to under 15 minutes but this really varies on how many machines we have in the pool.
[2] We run tests for our products on a matrix of operating systems and operating system versions. The terminology for operating system x version in many release engineering shops is a platform. To add to this, the list of platforms we support varies across branches. For instance, if we're going to deprecate a platform, we'll let this change ride the trains to release.

Further reading
Bug 1121175: [Tracking] Fix failing tests on Mac OSX 10.10 
Bug 1121199: Green up 10.10 tests currently failing on try 
Bug 1126493: rollout 10.10 tests in a way that doesn't impact wait times
Bug 1144206: investigate what is causing frequent talos failures on 10.10
Bug 1125998: Debug tests initially took 1.5-2x longer to complete on Yosemite

Why don't you just run these tests in the cloud?
  1. The Apple EULA severely restricts virtualization on Mac hardware. 
  2. I don't know of any major cloud vendors that offer the Mac as a platform.  Those that claim they do are actually renting racks of Macs on a dedicated per host basis.  This does not have the inherent scaling and associated cost saving of cloud computing.  In addition, the APIs to manage the machines at scale aren't there.
  3. We manage ~350 Mac minis.  We have more experience scaling Apple hardware than many vendors. Not many places run CI at Mozilla scale :-) Hopefully this will change and we'll be able to scale testing on Mac products like we do for Android and Linux in a cloud.

March 20, 2015 06:50 PM

March 17, 2015

Kim Moir (kmoir)

Mozilla pushes - February 2015

Here's February 2015's monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.

Although February is a shorter month, the number of pushes was close to that recorded in the previous month. We had a higher average number of daily pushes (358) than in January (348).

10015 pushes
358 pushes/day (average)
Highest number of pushes/day: 574 pushes on Feb 25, 2015
23.18 pushes/hour (highest)
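As a quick sanity check of the daily average quoted above, the arithmetic is simply the monthly total divided by the 28 days of February 2015 (the variable names here are only illustrative):

```python
pushes = 10015
days_in_february_2015 = 28

average_per_day = pushes / days_in_february_2015
print(round(average_per_day))  # 358
```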

General Remarks
Try had around 46% of all the pushes
The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 22% of all the pushes

August 2014 was the month with most pushes (13090  pushes)
August 2014 has the highest pushes/day average with 422 pushes/day
July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
October 8, 2014 had the highest number of pushes in one day with 715 pushes 

March 17, 2015 03:54 PM

March 10, 2015

Hal Wine (hwine)

Docker at Vungle

Tonight I attended the San Francisco Dev Ops meetup at Vungle. The topic was one we often discuss at Mozilla - how to simplify a developer’s life. In this case, the solution they have migrated to is one based on Docker, although I guess the title already gave that away.

Long (but interesting - I’ll update with a link to the video when it becomes available) story short, they are having much more success using DevOps managed Docker containers for development than their previous setup of Virtualbox images built & maintained with Vagrant and Chef.

Vungle’s new hire setup:
  • install Boot2Docker (they are an all Mac dev shop)
  • clone the repository. [1]
  • run script which pulls all the base images from DockerHub. This one time image pull gives the new hire time to fill out HR paperwork ;)
  • launch the app in the container and start coding.

Sigh. That’s nice. When you come back from PTO, just re-run the script to get the latest updates - it won’t take nearly as long as only the container deltas need to come down. Presto - back to work!

A couple of other highlights – I hope to do a more detailed post later.

  • They follow the ‘each container has a single purpose’ approach.
  • They use “helper containers” to hold recent (production) data.
  • Devs have a choice in front end development: inside the container (limited tooling) or in the local filesystem (dev’s choice of IDE, etc.). [2]
  • Currently, Docker containers are only being used in development. They are looking down the road to deploying containers in production, but it’s not a major focus at this time.


[1] Thanks to BFG for clarifying that docker-foo is kept in a separate repository from source code. The script is in the main source code repository. [Updated 2015-03-11]
[2] More on this later. There are some definite tradeoffs.

March 10, 2015 07:00 AM

March 06, 2015

Armen Zambrano G. (@armenzg)

How to generate data potentially useful to a dynamically generated trychooser UI

If you're interested in generating an up-to-date trychooser, I would love to hear from you.
adusca has helped me generate data similar to what a dynamic trychooser UI could use.
If you would like to help, please visit bug 983802 and let us know.

In order to generate the data all you have to do is:
git clone
cd mozilla_ci_tools
python develop
python scripts/misc/

That's it! You will then have a graphs.json dictionary with some of the pieces needed. Once we have an idea on how to generate the UI and what we're missing we can modify this script.

Here's some of the output:
    "android": [

Here are the remaining keys:
[u'android', u'android-api-11', u'android-api-9', u'android-armv6', u'android-x86', u'emulator', u'emulator-jb', u'emulator-kk', u'linux', u'linux-pgo', u'linux32_gecko', u'linux64', u'linux64-asan', u'linux64-cc', u'linux64-mulet', u'linux64-pgo', u'linux64_gecko', u'macosx64', u'win32', u'win32-pgo', u'win64', u'win64-pgo']

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

March 06, 2015 04:41 PM

March 05, 2015

Armen Zambrano G. (@armenzg)

mozci 0.3.0 - Support for backfilling jobs on treeherder added

Sometimes on treeherder, jobs get coalesced (a.k.a. we run the tests only on the most recent revision) in order to handle load. This is good because it lets us catch up when many pushes are committed on a tree.

However, when a job run on the most recent code comes back failing, we need to find out which revision introduced the regression. This is when we need to backfill up to the last good run.

In this release of mozci we have added the ability to --backfill:
python scripts/ --buildername "b2g_ubuntu64_vm cedar debug test gaia-js-integration-5" --dry-run --revision 2dea8b3c6c91 --backfill
This should be especially useful for sheriffs.
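
The idea behind backfilling can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not mozci's actual implementation: the function name, the list of pushes and the has_good_run callback are all hypothetical.

```python
def revisions_to_backfill(pushes, failing_revision, has_good_run):
    """Return the revisions to trigger, newest first, stopping at the last good run.

    pushes: revisions ordered from oldest to newest (hypothetical input).
    has_good_run: callable telling whether a revision already has a passing job.
    """
    end = pushes.index(failing_revision)
    to_trigger = []
    # Walk backwards from the push just before the failing one.
    for revision in reversed(pushes[:end]):
        if has_good_run(revision):
            break  # last good run found; nothing older needs to run
        to_trigger.append(revision)
    return to_trigger


pushes = ["r1", "r2", "r3", "r4", "r5"]
good = {"r2"}  # r3 and r4 were coalesced and never ran the job
backfill = revisions_to_backfill(pushes, "r5", lambda rev: rev in good)
print(backfill)
```

With r5 failing and r2 being the last good run, only the coalesced pushes in between need a job.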

You can start using mozci as long as you have LDAP credentials. Follow these steps to get started:
git clone
python develop (or install)

Release notes

Thanks again to vaibhav1994 and adusca for their many contributions in this release.

Major changes
  • Issue #75 - Added the ability to backfill changes until last good is found
  • No need to use --repo-name anymore
  • Issue #83 - Look for request_ids from a better place
  • Add interface to get status information instead of scheduling info
Minor fixes:
  • Fixes to make livehtml documentation
  • Make determine_upstream_builder() case insensitive
      Release notes:
      PyPi package:

      Creative Commons License
      This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

      March 05, 2015 04:19 PM

      March 03, 2015

      Armen Zambrano G. (@armenzg)

      mozci 0.2.5 released - major bug fixes + many improvements

      Big thanks again to vaibhav1994, adusca and valeriat for their many contributions in this release.

      Release notes

      Major bug fixes:
      • Bug fix: Sort pushid_range numerically rather than alphabetically
      • Calculation of hours_ago would not take days into consideration
      • Added coveralls/coverage support
      • Added "make livehtml" for live documentation changes
      • Improved FAQ
      • Updated roadmap
      • Large documentation refactoring
      • Automatically document scripts
      • Added partial testing of mozci.mozci
      • Streamed fetching of allthethings.json and verify integrity
      • Clickable treeherder links
      • Added support for zest.releaser
        Release notes:
        PyPi package:

        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        March 03, 2015 10:46 PM

        How to generate allthethings.json

        It's this easy!
            hg clone
            cd braindump/community

        allthethings.json is generated based on data from buildbot-configs.
        It contains data about builders, schedulers, masters and slavepools.

        If you want to extract information from allthethings.json feel free to use mozci to help you!

        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        March 03, 2015 04:54 PM

        February 27, 2015

        Chris AtLee (catlee)

        Diving into python logging

        Python has a very rich logging system. It's very easy to add structured or unstructured log output to your python code, and have it written to a file, or output to the console, or sent to syslog, or to customize the output format.

        We're in the middle of re-examining how logging works in mozharness to make it easier to factor-out code and have fewer mixins.

        Here are a few tips and tricks that have really helped me with python logging:

        There can be more than one

        Well, there can be only one logger with a given name. There is a special "root" logger with no name. Multiple getLogger(name) calls with the same name will return the same logger object. This is an important property because it means you don't need to explicitly pass logger objects around in your code. You can retrieve them by name if you wish. The logging module is maintaining a global registry of logging objects.

        You can have multiple loggers active, each specific to its own module or even class or instance.

        Each logger has a name, typically the name of the module it's being used from. A common pattern you see in python modules is this:

        # in module
        import logging
        log = logging.getLogger(__name__)

        This works because inside the module, __name__ is equal to the module's name ("foo" in this example). So inside this module the log object is specific to this module.
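
A quick way to convince yourself of the "only one logger per name" property is to compare the objects returned by repeated lookups. This is a small sketch; the logger names are just examples.

```python
import logging

# Two lookups with the same name give back the very same object.
first = logging.getLogger("foo")
second = logging.getLogger("foo")
same_object = first is second

# A different name gives a different logger.
other = logging.getLogger("foo.baz")
different_object = other is not first

print(same_object, different_object)
```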

        Loggers are hierarchical

        The names of the loggers form their own namespace, with "." separating levels. This means that if you have two loggers under foo (foo.baz and a sibling), you can do things on logger foo that will impact both of the children. In particular, you can set the logging level of foo to show or ignore debug messages for both submodules.

        # Let's enable all the debug logging for all the foo modules
        import logging
        logging.getLogger("foo").setLevel(logging.DEBUG)

        Log messages are like events that flow up through the hierarchy

        Let's say we have a module

        import logging
        log = logging.getLogger(__name__)  # __name__ is "" here
        def make_widget():
            log.debug("made a widget!")

        When we call make_widget(), the code generates a debug log message. Each logger in the hierarchy has a chance to output something for the message, ignore it, or pass the message along to its parent.

        The default configuration for loggers is to have their levels unset (or set to NOTSET). This means the logger will just pass the message on up to its parent. Rinse & repeat until you get up to the root logger.

        So if this module's logger hasn't specified a level, the message will continue up to the foo logger. If the foo logger hasn't specified a level, the message will continue up to the root logger.

        This is why you typically configure the logging output on the root logger; it typically gets ALL THE MESSAGES!!! Because this is so common, there's a dedicated method for configuring the root logger: logging.basicConfig()

        This also allows us to use mixed levels of log output depending on where the messages are coming from:

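        import logging
        # Enable debug logging for all the foo modules
        logging.getLogger("foo").setLevel(logging.DEBUG)
        # Configure the root logger to log only INFO calls, and output to the console
        # (the default)
        logging.basicConfig(level=logging.INFO)
        # This will output the debug message
        make_widget()  # defined in the module above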

        If you comment out the setLevel(logging.DEBUG) call, you won't see the message at all.
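
Here is a self-contained sketch of the same behaviour that can be run as a single file. The logger names "demo", "demo.widgets" and "other" are made up for the example, and the output is captured in memory instead of going to the console.

```python
import io
import logging

# Capture output in memory so the result is easy to inspect.
stream = io.StringIO()
handler = logging.StreamHandler(stream)

root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.INFO)

# "demo" and "demo.widgets" are hypothetical logger names.
logging.getLogger("demo").setLevel(logging.DEBUG)
child = logging.getLogger("demo.widgets")  # level left unset

child.debug("made a widget!")             # shown: "demo" allows DEBUG
logging.getLogger("other").debug("nope")  # dropped: root only allows INFO

captured = stream.getvalue()
root.removeHandler(handler)
print(captured)
```

Only the message from under "demo" reaches the handler; the one from "other" is filtered by the root logger's INFO level.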

        exc_info is teh awesome

        All the built-in logging calls support a keyword called exc_info which, if it isn't false, causes the current exception information to be logged in addition to the log message, e.g.:

        import logging
        log = logging.getLogger(__name__)
        try:
            assert False
        except AssertionError:
            log.info("surprise! got an exception!", exc_info=True)

        There's a special case for this, log.exception(), which is equivalent to log.error(..., exc_info=True)

        Python 3.2 introduced a new keyword, stack_info, which will output the current stack to the current code. Very handy to figure out how you got to a certain point in the code, even if no exceptions have occurred!
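
A minimal sketch of stack_info in action, assuming Python 3.2 or newer. The logger name and the two helper functions are invented for the example.

```python
import io
import logging

stream = io.StringIO()
log = logging.getLogger("stack_demo")  # hypothetical logger name
log.addHandler(logging.StreamHandler(stream))
log.setLevel(logging.INFO)
log.propagate = False  # keep the example's output out of the root logger

def inner():
    # No exception is active here; stack_info still records how we got here.
    log.info("where am I?", stack_info=True)

def outer():
    inner()

outer()
output = stream.getvalue()
print(output)
```

The logged text contains the message followed by a "Stack (most recent call last):" section listing the calls that led to inner().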

        "No handlers found..."

        You've probably come across this message, especially when working with 3rd party modules. What this means is that you don't have any logging handlers configured, and something is trying to log a message. The message has gone all the way up the logging hierarchy and fallen off of the chain (maybe I need a better metaphor).

        import logging
        log = logging.getLogger()
        log.error("no log for you!")


        No handlers could be found for logger "root"

        There are two things that can be done here:

        1. Configure logging in your module with basicConfig() or similar

        2. Library authors should add a NullHandler at the root of their module to prevent this. See the cookbook and this blog for more details here.
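
For the second option, a library would do something along these lines in its top-level module. This is a sketch; "mylib" is a hypothetical package name.

```python
import logging

# In a library's top-level module ("mylib" is a hypothetical package name):
lib_log = logging.getLogger("mylib")
lib_log.addHandler(logging.NullHandler())

def do_work():
    # Safe even when the application never configured logging.
    lib_log.warning("something the application may care about")
    return True

result = do_work()
has_null_handler = any(isinstance(h, logging.NullHandler) for h in lib_log.handlers)
print(result, has_null_handler)
```

The NullHandler swallows the message unless the application installs its own handlers, in which case the message still propagates to them as usual.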

        Want more?

        I really recommend that you read the logging documentation and cookbook which have a lot more great information (and are also very well written!) There's a lot more you can do, with custom log handlers, different output formats, outputting to many locations at once, etc. Have fun!

        February 27, 2015 09:09 PM

        February 25, 2015

        Massimo Gervasini (mgerva)

        on mixins

        We use mixins quite a lot in mozharness.

        Mixins are a powerful pattern that allows you to extend your objects, reusing your code (more here). Think of mixins as “plugins”: you can create your custom class and import features just by inheriting from a Mixin class, for example:

        class B2GBuild(LocalesMixin, PurgeMixin, B2GBuildBaseScript,
                       GaiaLocalesMixin, SigningMixin, MapperMixin, BalrogMixin):

        B2GBuild manages FirefoxOS builds and it knows how to:
        * manage locales (LocalesMixin)
        * deal with repositories (PurgeMixin)
        * sign the code (SigningMixin)
        * and more…

        This is just from the class definition! At this point we haven’t added a single method or property, but we already know how to do a lot of tasks, and it’s almost for free!

        So should we use mixins everywhere? Short answer: No.
        Long answer: mixins are powerful, but they can also lead to some unexpected behavior.

        Objects C and D have exactly the same parents and the same methods, but their behavior is different: it depends on how the parents are declared.

        This is a side effect of the way Python implements inheritance. Having an object inherit from too many mixins can lead to unexpected failures (MRO – method resolution order) when the object is instantiated or, even worse, at runtime when a method does something that is not expected.
        When the inheritance becomes obscure, it also becomes difficult to write appropriate tests.
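
To make the C and D case concrete, here is a small sketch. The classes A and B and their greet() method are invented for the illustration and are not taken from mozharness.

```python
class A(object):
    def greet(self):
        return "hello from A"

class B(object):
    def greet(self):
        return "hello from B"

# Same parents, same methods, different declaration order.
class C(A, B):
    pass

class D(B, A):
    pass

c_result = C().greet()
d_result = D().greet()
c_order = [cls.__name__ for cls in C.__mro__]
d_order = [cls.__name__ for cls in D.__mro__]
print(c_result, c_order)
print(d_result, d_order)

# Some orderings cannot be linearized at all and fail at class creation.
try:
    type("E", (A, C), {})
    mro_error = None
except TypeError as exc:
    mro_error = str(exc)
print(mro_error)
```

The first parent listed wins the method lookup, which is why two classes with identical parents can behave differently.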

        How can we write a mozharness module without using mixins? Let’s try to write a generic module that provides some disk information. For example, we could create the mozharness.base.diskutils module that provides useful information about the disk size. Our first approach would be to write something like:

        class DiskInfoMixin():
            def get_size(self, path):
                self.info('calculating disk size')
                <code here>
            def other_methods(self):
                <code here>

        and then use it in the final class

        from mozharness.base.diskutils import DiskInfoMixin
        class BuildRepackages(ScriptMixin, LogMixin, ..., DiskInfoMixin):
            disk_info = self.get_size(path)

        Easy! But why are we using a mixin here? Because we need to log some operations and, to do so, we need to interact with the LogMixin. This mixin provides everything we need to log messages with mozharness: it provides an abstraction layer to make logging consistent among all the mozharness scripts and it’s very easy to use. Just import the LogMixin and start logging!
        The same code, without using the LogMixin, would more or less be:

        import logging
        def get_size(path):
            logging.info('calculating disk size')
            return disk_size

        Just a function. Even easier.

        … and the final script becomes:

        from mozharness.base.diskutils import get_size
        class BuildRepackages(ScriptMixin, LogMixin, ...,):
             disk_info = get_size(path)

        One less mixin!
        There’s a problem though. Messages logged by get_size() will be inconsistent with the rest of the logging. How can we use the mozharness logging style in other modules?
        The LogMixin is a complex class with many methods but, at the end of the day, it’s a wrapper around the logging module, so behind the scenes it must call the logging module. What if we could just ask our module to use the Python logging facilities already configured by mozharness?
        The getLogger() method is what we need here!

        import logging
        mhlog = logging.getLogger('Multi')
        def get_size(path):
            mhlog.info('calculating disk size')
            return disk_size

        Mozharness by default uses this ‘Multi’ logger for its messages, so we have just hooked up our logger into the mozharness one. Now every logger call will follow the mozharness style!
        We are halfway through the logging issues for our brand new module. What if we want to log at an arbitrary log level? A quite common pattern in mozharness is to let the caller of a function decide at what level we want to log, so let’s add a log_level parameter…

        import logging
        mhlog = logging.getLogger('Multi')
        def get_size(path, log_level=logging.INFO):
            mhlog.log(log_level, 'calculating disk size')
            return disk_size

        This will work fine for a generic module, but we want to use this module in mozharness, so there’s only one more thing to change: mozharness log levels are strings, while logging module levels are integers, so we need a function to convert between the two formats.
        For convenience, in mozharness.base.log we will explicitly expose the mozharness log levels and add a function that converts mozharness log levels to standard log levels.

        LOG_LEVELS = {
            DEBUG: logging.DEBUG,
            INFO: logging.INFO,
            WARNING: logging.WARNING,
            ERROR: logging.ERROR,
            CRITICAL: logging.CRITICAL,
            FATAL: FATAL_LEVEL
        }
        def numeric_log_level(level):
            """Converts a mozharness log level (string) to the corresponding logger
               level (number). This function makes it possible to set the log level
               in functions that do not inherit from LogMixin
            """
            return LOG_LEVELS[level]

        our final module becomes:

        import logging
        from mozharness.base.log import INFO, numeric_log_level
        # use mozharness log
        mhlog = logging.getLogger('Multi')
        def get_size(path, unit, log_level=INFO):
            lvl = numeric_log_level(log_level)
            mhlog.log(lvl, "calculating disk size")

        This is just an example of how to use the standard Python logging modules.
        A real diskutils module is about to land in mozharness (bug 1130336), and it shouldn’t be too difficult to follow the same pattern to create new modules with no dependencies on LogMixin.
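
For readers who want to try the pattern without a mozharness checkout, here is a standalone sketch. It is not the real diskutils module: the string levels and the LOG_LEVELS table are stand-ins with assumed values, and the size calculation with os.walk is my own filler for the part the example leaves out.

```python
import logging
import os
import tempfile

# Stand-ins for the mozharness string levels (assumed values).
INFO, DEBUG = "info", "debug"
LOG_LEVELS = {INFO: logging.INFO, DEBUG: logging.DEBUG}

mhlog = logging.getLogger("Multi")

def numeric_log_level(level):
    """Convert a string level to the numeric level used by logging."""
    return LOG_LEVELS[level]

def get_size(path, log_level=INFO):
    """Return the total size in bytes of the regular files below path."""
    mhlog.log(numeric_log_level(log_level), "calculating disk size")
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total

# Usage: a temporary directory holding one 5-byte file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "data.bin"), "wb") as handle:
        handle.write(b"12345")
    size = get_size(tmp, log_level=DEBUG)
print(size)
```

The function needs no base class at all; the caller picks the log level, and the module only depends on the standard library.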

        This is a first step in the direction of removing some mixins from the mozharness code (see bug 1101183).
        Mixins are not an absolute evil, but they must be used carefully. From now on, if I have to write or modify anything in a mozharness module I will try to enforce the following rules:

        February 25, 2015 05:00 PM

        Kim Moir (kmoir)

        Release Engineering special issue now available

        The release engineering special issue of IEEE software was published yesterday. (Download pdf here).  This issue focuses on the current state of release engineering, from both an industry and research perspective. Lots of exciting work happening in this field!

        I'm interviewed in the roundtable article on the future of release engineering, along with Chuck Rossi of Facebook and Boris Debic of Google.  Interesting discussions on the current state of release engineering at organizations that scale to large numbers of builds and tests, and release frequently.  As well, the challenges with mobile releases versus web deployments are discussed. And finally, a discussion of how to find good release engineers, and what the future may hold.

        Thanks to the other guest editors on this issue - Stephany Bellomo, Tamara Marshall-Klein, Bram Adams, Foutse Khomh and Christian Bird - for all their hard work that made this happen!

        As an aside, when I opened the issue, the image on the front cover made me laugh.  It's reminiscent of the cover on a mid-century science fiction anthology.  I showed Mr. Releng and he said "Robot birds? That is EXACTLY how I pictured working in releng."  Maybe it's meant to represent that we let software fly free.  In any case, I must go back to tending the flock of robotic avian overlords.

        February 25, 2015 03:26 PM

        February 24, 2015

        Armen Zambrano G. (@armenzg)

        Listing builder differences for a buildbot-configs patch improved

        Up until now, we updated the buildbot-configs repository to the "default" branch instead of "production" since we normally write patches against that branch.

        However, there is a problem with this: buildbot-configs always has to be on the same branch as buildbotcustom. Otherwise, we can have changes land in one repository which require changes on the other one.

        The fix was to simply make sure that both repositories are either on default or their associated production branches.

        Besides this fix, I have landed two more changes:

        1. Use the production branches instead of 'default'
          • Use -p
        2. Clobber our whole set up (e.g. ~/.mozilla/releng)
          • Use -c

        Here are the two changes:

        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        February 24, 2015 09:45 PM

        February 23, 2015

        Nick Thomas (nthomas)

        FileMerge bug

        FileMerge is a nice diff and merge tool for OS X, and I use it a lot for larger code reviews where lots of context is helpful. It also supports intra-line diff, which comes in pretty handy.

        filemerge screenshot

        However in recent releases, at least in v2.8 which comes as part of XCode 6.1, it assumes you want to be merging and shows that bottom pane. Adjusting it away doesn’t persist to the next time you use it, *gnash gnash gnash*.

        The solution is to open a terminal and offer this incantation:

        defaults write MergeHeight 0

        Unfortunately, if you use the merge pane then you’ll have to do that again. Dear Apple, pls fix!

        February 23, 2015 09:23 AM

        February 15, 2015

        Rail Alliev (rail)

        Funsize hacking


        The idea of using a service which can generate partial updates for Firefox has been around for years. We actually used to have a server called Prometheus that was responsible for generating updates for nightly builds and the generation was done as a separate process from actual builds.

        Scaling that solution wasn't easy and we switched to build-time update generation. Generating updates as a part of builds helped with load distribution, but lacked flexibility: there is no easy way to generate updates after the build, because the update generation process is directly tied to the build or repack process.

        Funsize will solve the problems listed above: it distributes load and is flexible.

        Last year Anhad started and Mihai continued working on this project. They have done a great job and created a solution that can easily be scaled.

        Funsize is split into several pieces:

        • REST API frontend powered by Flask. It's responsible for accepting partial generation requests, forwarding them to the queue and returning generated partials.
        • Celery-based workers to generate partial updates and upload them to S3.
        • SQS or RabbitMQ to coordinate Celery workers.

        One of the biggest gains of Funsize is that it uses a global cache to speed up partial generation. For example, after we build an en-US Windows build, we ask Funsize to generate a partial. Then a swarm of L10N repacks (almost a hundred of them per platform) tries to do a similar job. Every single one asks for a partial update. All L10N builds have something in common, and xul.dll is one of the biggest files. Since the files are identical there is no reason to not reuse the previously generated binary patch for that file. Repeat 100 times for multiple files. PROFIT!
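
The caching idea can be illustrated with a toy example. This is a sketch under stated assumptions, not Funsize's actual code: the PatchCache class, the choice of SHA-256 digests as the cache key and the fake_diff helper are all hypothetical.

```python
import hashlib

def digest(data):
    return hashlib.sha256(data).hexdigest()

class PatchCache(object):
    """In-memory stand-in for a shared cache of binary patches."""

    def __init__(self):
        self._patches = {}
        self.hits = 0
        self.misses = 0

    def get_or_create(self, old, new, make_patch):
        # Identical (old, new) file pairs map to the same key, whatever the locale.
        key = (digest(old), digest(new))
        if key in self._patches:
            self.hits += 1
        else:
            self.misses += 1
            self._patches[key] = make_patch(old, new)
        return self._patches[key]


def fake_diff(old, new):
    # Placeholder for an expensive binary diff.
    return b"patch:" + digest(old + new).encode("ascii")

cache = PatchCache()
old_dll, new_dll = b"xul.dll v1", b"xul.dll v2"
# 1 en-US build followed by 99 repacks that ship the same file.
patches = [cache.get_or_create(old_dll, new_dll, fake_diff) for _ in range(100)]
print(cache.misses, cache.hits)
```

The expensive diff runs once and every later request for the same pair of files is served from the cache.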

        The first prototype of Funsize lives at github. If you are interested in hacking, read the docs on how to set up your developer environment. If you don't have an AWS account, it will use a local cache.

        Note: this prototype may be redesigned and switch to using TaskCluster. TaskCluster is going to simplify the initial design and reduce the dependency on always-online infrastructure.

        February 15, 2015 04:32 AM

        February 13, 2015

        Armen Zambrano G. (@armenzg)

        Mozilla CI tools 0.2.1 released - Trigger multiple jobs for a range of revisions

        Today I have released a major release of mozci which includes the following:


        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        February 13, 2015 04:14 PM

        Kim Moir (kmoir)

        Mozilla pushes - January 2015

        Here's January 2015's monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.

        We're back to regular volume after the holidays. Also, it's really cold outside in some parts of the Mozilla world.  Maybe committing code > going outside.

        10798 pushes
        348 pushes/day (average)
        Highest number of pushes/day: 562 pushes on Jan 28, 2015
        18.65 pushes/hour (highest)

        General Remarks
        Try had around 42% of all the pushes
        The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 24% of all of the pushes

        August 2014 was the month with most pushes (13,090  pushes)
        August 2014 has the highest pushes/day average with 422 pushes/day
        July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
        October 8, 2014 had the highest number of pushes in one day with 715 pushes 

        February 13, 2015 04:13 PM

        February 09, 2015

        Morgan Phillips (mrrrgn)

        Gödel, Docker, Bach: Containers Building Containers

        As Docker continues to mature, many organizations are striving to run as much of their infrastructure as possible within containers. Of course, this investment results in a lot of docker-centric tooling for deployment, development, etc...

        Given that, I think it makes a lot of sense for docker containers themselves to be built within other docker containers. Otherwise, you'll introduce a needless exception into your automation practices. Boo to that!

        There are a few ways to run docker from within a container, but here's a neat way that leaves you with access to your host's local images: just mount the docker from your host system.

        ** note: in cases where arbitrary users can push code to your containers, this would be a dangerous thing to do **
        Et voila!

        February 09, 2015 10:11 PM

        Introducing RelEng Containers: Build Firefox Consistently (For A Better Tomorrow)

        From time to time, Firefox developers encounter errors which only appear on our build machines. Meaning -- after they've likely already failed numerous times to coax the failure from their own environment -- they must resort to requesting RelEng to pluck a system from our infrastructure so they can use it for debugging: we call this a slave loan, and they happen frequently.

        Case in point: bug #689291

        Firefox is a huge open source project: slave loans can never scale enough to serve our community. So, this weekend I took a whack at solving this problem with Docker. So far, five [of an eventual fourteen] containers have been published, which replicate the following aspects of our in house build environments:

        As usual, you can find my scratch work on GitHub: mozilla/build-environments

        What Are These Environments Based On?

        For a long time, builds have taken place inside of chroots built with Mock. We have three bare bones mock configs which I used to bake some base platform images:

        On top of our Mock configs, we further specialize build chroots via build scripts powered by Mozharness. The specifications of each environment are laid out in these mozharness configs. To make use of these, I wrote a simple script which converts a mozharness config into a Dockerfile.

        The environments I've published so far:

        The next step, before I publish more containers, will be to write some documentation for developers so they can begin using them for builds with minimal hassle. Stay tuned!

        February 09, 2015 06:09 AM

        February 06, 2015

        Hal Wine (hwine)

        Kaizen the low tech way

        On Jan 29, I treated myself to a seminar on Successful Lean Teams, with an emphasis on Kanban & Kaizen techniques. I’d read about both, but found the presentation useful. Many of the other attendees were from the Health Care industry and their perspectives were very enlightening!

        Hearing how successful they were in such a high risk, multi-disciplinary, bureaucratic, and highly regulated environment is inspiring. I’m inclined to believe that it would also be achievable in a simple-by-comparison low risk environment of software development. ;)

        What these hospitals are using is a light weight, self managed process which:

        • ensures visibility of changes to all impacted folks
        • outlines the expected benefits
        • includes a “trial” to ensure the change has the desired impact
        • has a built in feedback system

        That sounds achievable. In several of the settings, the traditional paper and bulletin board approach was used, with 4 columns labeled “New Ideas”, “To Do”, “Doing”, and “Done”. (Not a true Kanban board for several reasons, but Trello would be a reasonable visual approximation; CAB uses spreadsheets.)

        Cards move left to right, and could cycle back to “New Ideas” if iteration is needed. “New Ideas” is where things start, and they transition from there (I paraphrase a lot in the following):

        1. Everyone can mark up cards in New Ideas & add alternatives, etc.
        2. A standup is held to select cards to move from “New Ideas” to “To Do”
        3. The card stays in “To Do” for a while to allow concerns to be expressed by other stake holders. Also a team needs to sign up to move the change through the remaining steps. Before the card can move to “Doing”, a “test” (pilot or checkpoints) is agreed on to ensure the change can be evaluated for success.
        4. The team moves the card into “Doing”, and performs PDSA cycles (Plan, Do, Study, Adjust) as needed.
        5. Assuming the change yields the projected results, the change is implemented and the card is moved to “Done”. If the results aren’t as anticipated, the card gets annotated with the lessons learned, and either goes to “Done” (abandon) or back to “New Ideas” (try again) as appropriate.

        For me, I’m drawn to the 2nd and 3rd steps. That seems to be the change from current practice in teams I work on. We already have a gazillion bugs filed (1st step). We also can test changes in staging (4th step) and update production (5th step). Well, okay, sometimes we skip the staging run. Occasionally that *really* bites us. (Foot guns, foot guns – get your foot guns here!)

        The 2nd and 3rd steps help focus on changes. And make the set of changes happening “nowish” more visible. Other stakeholders then have a small set of items to comment upon. Net result - more changes “stick” with less overall friction.

        Painting with a broad brush, this Kaizen approach is essentially the CAB process that Mozilla IT implemented successfully. I have experienced the CAB reduce the amount of stress, surprises, and self inflicted damage both inside and outside of IT. Over time, the velocity of changes has increased and backlogs have been reduced. In short, it is a “Good Thing(tm)”.

        So, I’m going to see if there is a way to “right size” this process for the smaller teams I’m on now. Stay tuned....

        February 06, 2015 08:00 AM

        February 04, 2015

        Rail Alliev (rail)

        Deploying your code from github to AWS Elastic Beanstalk using Travis

        I have been playing with Funsize a lot recently. One of the goals was iterating faster:

        I have hit some challenges with both Travis and Elastic Beanstalk.

        The first challenge was to run the integration (actually end-to-end) tests in the same environment. Funsize uses Docker for both hacking and production environments. Unfortunately it's not possible to create Docker images as a part of a Travis job (there is an option to run jobs inside Docker, but this is a different beast).

        A simple bash script works around this problem. It starts all services we need in background and runs the end-to-end tests. The end-to-end test asks Funsize to generate several partial MAR files, downloads identical files from Mozilla's FTP server and compares their content skipping the cryptographic signature (Funsize does not sign MAR files).

        The next challenge was deploying the code. We use Elastic Beanstalk as convenient way to run simple services. There is a plan to use something else for Funsize, but at the moment it's Elastic Beanstalk.

        Travis has support for Elastic Beanstalk, but it's still experimental and at the time of writing this post there was no documentation on the official website. The .travis.yml file looks straightforward and worked fine. The only minor issue I hit was a long commit message.

        # .travis.yml snippet
        deploy:
            - provider: elasticbeanstalk
              app: funsize # Elastic Beanstalk app name
              env: funsize-dev-rail # Elastic Beanstalk env name
              bucket_name: elasticbeanstalk-us-east-1-314336048151 # S3 bucket used by Elastic Beanstalk
              region: us-east-1
              access_key_id:
                secure: "encrypted key id"
              secret_access_key:
                secure: "encrypted key"
              on:
                  repo: rail/build-funsize # Deploy only using my user repo for now
                  all_branches: true
                  # deploy only if particular jobs in the job matrix pass, not any
                  condition: $FUNSIZE_S3_UPLOAD_BUCKET = mozilla-releng-funsize-travis

        Having the credentials in a public version control system, even if they are encrypted, makes me very nervous. To minimize possible harm in case something goes wrong I created a separate user in AWS IAM. I couldn't find any decent docs on what permissions a user should have to be able to deploy something to Elastic Beanstalk. It took a while to figure out this minimal set of permissions. Even with these permissions the user looks very powerful, with limited access to EB, S3, EC2, Auto Scaling and CloudFormation.

        Conclusion: using Travis for Elastic Beanstalk deployments is quite stable and easy to use (after the initial setup) unless you are paranoid about some encrypted credentials being available on github.

        February 04, 2015 02:09 AM

        February 03, 2015

        Armen Zambrano G. (@armenzg)

        What the current list of buildbot builders is

        This becomes very easy with mozilla_ci_tools (aka mozci):
        >>> from mozci import mozci
        >>> builders = mozci.list_builders()
        >>> len(builders)
        >>> builders[0]
        u'Linux x86-64 mozilla-inbound leak test build'
        This and many other ways to interact with our CI will be showing up in the repository.

        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        February 03, 2015 07:48 PM

        Morgan Phillips (mrrrgn)

        shutdown -r never part deux

        In my last post, I wrote about how runner and cleanslate were being leveraged by Mozilla RelEng to try to eliminate the need for rebooting after each test/build job -- thus reclaiming a good deal of wasted time. Since then, I've had the opportunity to outfit all of our hosts with better logging, and collect live data which highlights the progress that's been made. It's been bumpy, but the data suggests that we have reduced reboots (across all tiers) by around 40% -- freeing up over 72,000 minutes of compute time per day, with an estimated savings of $51,000 per year.

        Note: this figure excludes decreases in end-to-end times, which are still waiting to be accurately measured.

        Collecting Data

        With Runner managing all of our utilities, an awesome opportunity for logging was presented: the ability to create something like a distributed ps. To take advantage of this, I wrote a "task hook" feature which passes task state to an external script. From there, I wrote a hook script which logs all of our data to an influxdb instance. With the influxdb hook in place, we can query to find out which jobs are currently running on hosts and what the results were of any jobs that have previously finished. We can also use it to detect rebooting.

        Having this state information has been a real game changer with regards to understanding the pain points of our infrastructure, and debugging issues which arise. Here are a few of the dashboards I've been able to create:

        * a started buildbot task generally indicates that a job is active on a machine *

        * a global ps! *

        * spikes in task retries almost always correspond to a new infra problem; seeing it here first allows us to fix it and cut down on job backlogs *

        * we reboot after certain kinds of tests and anytime a job fails, thus testers reboot a lot more often *

        Costs/Time Saved Calculations

        To calculate "time saved" I used influxdb data to figure the time between a reboot and the start of a new round of tasks. Once I had this figure, I subtracted the number of reboots from the total number of completed buildbot tasks over a given period, then multiplied by the average reboot gap period. This isn't an exact method, but it gives a ballpark idea of how much time we're saving.
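
As a rough sketch of that arithmetic, reading the estimate as (completed tasks - reboots) x average reboot gap: the function name and the input numbers below are hypothetical placeholders, not the measured influxdb values.

```python
def minutes_saved(completed_tasks, reboots, avg_reboot_gap_minutes):
    """Ballpark estimate: each completed task that was not followed by a
    reboot is counted as saving one average reboot gap."""
    avoided_reboots = completed_tasks - reboots
    return avoided_reboots * avg_reboot_gap_minutes


# Placeholder inputs for illustration only (not real data).
example = minutes_saved(completed_tasks=1000, reboots=600, avg_reboot_gap_minutes=3.0)
print(example)
```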

        The data I'm using here was taken from a single 24-hour period (01/22/15 - 01/23/15). Spot checks have confirmed that this is representative of a typical day.

        I used Mozilla's AWS billing statement from December 2014 to calculate the average cost of spot/non-spot instances per hour:

        (non-spot) cost: $6802.03 time: 38614hr avg: $0.18/hr

        (spot) cost: $14277.72 time: 875936hr avg: $0.02/hr
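
The hourly averages above are simply total cost divided by total hours. A quick check using the billing figures quoted in this post (the function name is made up for illustration):

```python
def avg_hourly_cost(total_cost, total_hours):
    """Average cost per instance-hour."""
    return total_cost / total_hours


non_spot = avg_hourly_cost(6802.03, 38614)
spot = avg_hourly_cost(14277.72, 875936)
print(f"non-spot: ${non_spot:.2f}/hr, spot: ${spot:.2f}/hr")
```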

        Finding opex/capex is not easy; however, I did discover the price of adding 200 additional OSX machines in 2015. Based on that, each mac's capex would be just over $2200.

        To calculate the "dollars saved" I broke the time saved into AWS (spot/non-spot) and OSX then multiplied it by the appropriate dollar/hour ratio. The results being: $6621.10 per year for AWS and a bit over 20 macs worth of increased throughput, valued at just over $44,000.

        You can see all of my raw data, queries, and helper scripts at this github repo:

        Why Are We Only Saving 40%?

        The short answer: not rebooting still breaks most test jobs. Turning off reboots without cleanslate resulted in nearly every test failing (thanks to ports being held onto by utilities used in previous jobs, lack of free memory, etc...). However, even with processes being reset, some types of state persist between jobs in places which are proving more difficult to debug and clean. Namely, anything which interacts with a display server.

        To take advantage of the jobs which are already working, I added a task which decides whether or not to reboot a system after each runner loop. The decision is based partly on some "blacklists" for job/host names which always require a reboot, and partly on whether or not the previous test/build completed successfully. For instance, if I want all linux64 systems to reboot, I just add ".*linux64.*" to the hostname blacklist; if I want all mochitests to coerce a reboot I add ".*mochitest.*" to the job name blacklist.
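
A minimal sketch of that decision logic, under assumptions: this is not the actual runner task, and the function names and example host/job names are hypothetical. Only the two regular expressions come from the text above.

```python
import re

# Patterns taken from the examples in the post.
HOSTNAME_BLACKLIST = [r".*linux64.*"]
JOB_NAME_BLACKLIST = [r".*mochitest.*"]


def matches_any(patterns, name):
    """True if any blacklist pattern matches the given name."""
    return any(re.match(pattern, name) for pattern in patterns)


def should_reboot(hostname, job_name, job_succeeded):
    """Reboot after a failed job, or when the host or job is blacklisted."""
    if not job_succeeded:
        return True
    return (matches_any(HOSTNAME_BLACKLIST, hostname)
            or matches_any(JOB_NAME_BLACKLIST, job_name))


# Hypothetical host and job names, for illustration only.
print(should_reboot("tst-linux64-001", "xpcshell", True))     # blacklisted host
print(should_reboot("tst-linux32-001", "mochitest-1", True))  # blacklisted job
print(should_reboot("tst-linux32-001", "xpcshell", True))     # no reboot needed
```

Removing a pattern from either list is then all it takes to stop rebooting for that class of hosts or jobs.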

        Via blacklisting, I've been able to whittle away at breaking jobs in a controlled manner. Over time, as I/we figure out how to properly clean up after more complicated jobs I should be able to remove them from the blacklist and increase our savings.

        Why Not Use Containers?

        First of all, we have to support OSX and Windows (10-XP), where modern containers are not really an option. Second, there is a lot of technical inertia behind our buildbot centric model (nearly a decade's worth to be precise). That said, a new container centric approach to building and testing has been created: TaskCluster. Another big part of my work will be porting some of our current builds to that system.

        What About Windows?

        If you look closely at the runner dashboard screenshots you'll notice a "WinX" legend entry, but no line. It's also not included in my cost savings estimates. The reason for this is that our windows puppet deployment is still in beta; while runner works on Windows, I can't tweak it. For now, I've handed runner deployment off to another team so that we can at least use it for logging. For the state of that issue see: bug 1055794

        Future Plans

        Of course, continuing effort will be put into removing test types from the "blacklists," to further decrease our reboot percentage. Though, I'm also exploring some easier wins which revolve around optimizing our current suite of runner tasks: using less frequent reboots to perform expensive cleanup operations in bulk (i.e. only before a reboot), decreasing end-to-end times, etc...

        Concurrent to runner/no reboots I'm also working on containerizing Linux build jobs. If this work can be ported to tests it will sidestep the rebooting problem altogether -- something I will push to take advantage of asap.

        Trying to reverse the entropy of a machine which runs dozens of different job types in random order is a bit frustrating; but worthwhile in the end. Every increase in throughput means more money for hiring software engineers instead of purchasing tractor trailers of Mac Minis.

        February 03, 2015 05:53 PM

        January 27, 2015

        Justin Wood (Callek)

        Release Engineering does a lot…

        Hey Everyone,

        I spent a few minutes a week over the last month or two working on compiling a list of Release Engineering work areas. Included in that list is identifying which repositories we “own” and work in, as well as where these repositories are mirrored. (We have copies in hg.m.o, git.m.o and github, some exclusively in their home).

        This is all while we transition to a more uniform and modern design style and philosophy.

        My major takeaway here is we have A LOT of things that we do. (this list is explicitly excluding repositories that are obsolete and unused)

        So without further ado, I present our page ReleaseEngineering/Repositories

        You’ll notice a few things about this: we have a column for Mirrors and one for RoR (Repository of Record). “Committable Location” was requested by Hal and is explicitly for cases where “Where we consider our important location the RoR, it may not necessarily be where we allow commits to”.

        The other interesting thing is we have automatic population of travis and coveralls urls/status icons. This comes for free thanks to some magic wiki templates I wrote.

        The other piece of note here is that the table is generated from a list of pages, using “SemanticMediaWiki”, so the links to the repositories can be populated with things like “where are the docs”, “what applications use this repo”, “who are suitable reviewers” etc. (all those are TODO on the releng side so far).

        I’m hoping to be putting together a blog post at some point about how I chose to do much of this with mediawiki. In the meantime, should any team at Mozilla find this enticing and wish to have one for themselves, much of the work I did here can be easily replicated for your team, even if you don’t need/like the multiple repo location magic of our table. I can help get you set up to add your own repos to the mix.

        Remember, the only fields that are necessary are a repo name, the repo location, and owner(s). The last field can even be automatically filled in by a form on your page (see the end of Release Engineering's page for an example of that form)

        Reach out to me on IRC or E-mail (information is on my mozillians profile) if you desire this for your team and we can talk. If you don’t have a need for your team, you can stare at all the stuff Releng is doing and remember to thank one of us next time you see us. (or inquire about what we do, point contributors our way, we’re a friendly group, I promise.)

        January 27, 2015 11:11 PM

        January 22, 2015

        Armen Zambrano G. (@armenzg)

        Backed out - Pinning for Mozharness is enabled for the fx-team integration tree

        EDIT=We had to back out this change since it caused issues for PGO talos jobs. We will try again after further testing.

        Pinning for Mozharness [1] has been enabled for the fx-team integration tree.
        Nothing should be changing. This is a no-op change.

        We're still using the default mozharness repository and the "production" branch is what is being checked out. This has been enabled on Try and Ash for almost two months and all issues have been ironed out. You can know if a job is using pinning of Mozharness if you see "" in its log.

        If you notice anything odd please let me know in bug 1110286.

        If by Monday we don't see anything odd happening, I would like to enable it for mozilla-central for a few days before enabling it on all trunk trees.

        Again, this is a no-op change, however, I want people to be aware of it.

        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        January 22, 2015 08:57 PM

        January 21, 2015

        Kim Moir (kmoir)

        Reminder: Releng 2015 submissions due Friday, January 23

        Just a reminder that submissions for the Releng 2015 conference are due this Friday, January 23. 

        It will be held on May 19, 2015 in Florence, Italy.

        If you've done recent work like
        we'd love to hear from you.  Please consider submitting a talk!

        In addition, if you have colleagues that work in this space that might have interesting topics to discuss at this workshop, please forward this information. I'm happy to talk to people about the submission process or possible topics if there are questions.

        Il Duomo di Firenze by ©eddi_07, Creative Commons by-nc-sa 2.0

        Sono nel comitato che organizza la conferenza Releng 2015 che si terrà il 19 Maggio 2015 a Firenze. La scadenza per l’invio dei paper è il 23 Gennaio 2015.

        se avete competenze in:
        e volete discutere della vostra esperienza, inviateci una proposta di talk!

        Per favore inoltrate questa richiesta ai vostri colleghi e alle persone interessate a questi argomenti. Nel caso ci fossero domande sul processo di invio o sui temi di discussione, non esitate a contattarmi.

        (Thanks Massimo for helping with the Italian translation).

        More information
        Releng 2015 web page
        Releng 2015 CFP now open

        January 21, 2015 08:36 PM

        January 16, 2015

        Nick Thomas (nthomas)

        Plans for 2015 – Revamping the Release Automation

        Mozilla’s Release Engineering team has been through several major iterations of our “release automation”, which is how we produce the bits for Firefox betas and releases. With each incarnation, the automation has become more reliable, supported more functionality, and end-to-end time has reduced. If you go back a few years to Firefox 2.0 it took several days to prepare 40 or so locales and three platforms for a release; now it’s less than half a day for 90 locales and four platforms. The last major rewrite was some time ago so it’s time to embark on a big revamp – this time we want to reduce the end-to-end time significantly.

        Currently, when a code change lands in the repository (eg mozilla-beta) a large set of compile and test jobs are started. It takes about 5 hours for the slowest platform to complete an optimized build and run the tests, in part because we’re using Profile-Guided Optimization (PGO) and need to link XUL twice. Assuming the tests have passed, or been recognized as an intermittent failure, a Release Manager will kick off the release automation. It will tag the gecko and localization repositories, and a second round of compilation will start, using the official branding and other release-specific settings. Accounting for all the other release work (localized builds, source tarballs, updates, and so on) the automation takes 10 or more hours to complete.

        The first goal of the revamp is to avoid the second round of compilation, with all the loss of time and test coverage it brings. Instead, we’re looking at ‘promoting’ the builds we’ve already done (in the sense of rank, not marketing). By making some other improvements along the way, eg fast generation of partial updates using funsize, we may be able to save as much as 50% from the current wall time. So we’ll be able to ship fixes to beta users more often than twice a week, get feedback earlier in the cycle, and be more confident about shipping a new release. It’ll help us to ship security fixes faster too.

        We’re calling this ‘Build Promotion’ for short, and you can follow progress in Bug 1118794 and dependencies.

        January 16, 2015 10:08 AM

        January 10, 2015

        Hal Wine (hwine)

        ChatOps Meetup

        This last Wednesday, I went to a meetup on ChatOps organized by SF DevOps, hosted by Geekdom (who also made recordings available), and sponsored by TrueAbility.

        I had two primary goals in attending: I wanted to understand what made ChatOps special, and I wanted to see how much was applicable to my current work at Mozilla. The two presentations helped me accomplish the first. I’m still mulling over the second. (Ironically, I had to shift focus during the event to clean up a deployment-gone-wrong that was very close to one of the success stories mentioned by Dan Chuparkoff.)

        My takeaway on why ChatOps works is that it is less about the tooling (although modern web services make it a lot easier), and more about the process. Like a number of techniques, it appears to be more successful when teams fully embrace their vision of ChatOps, and make implementation a top priority. Success is enhanced when the tooling supports the vision, and that appears to be what all the recent buzz is about – lots of new tools, examples, and lessons learned make it easier to follow the pioneers.

        What are the key differentiators?

        Heck, many teams use irc for operational coordination. There are scripts which automate steps (some workflows can be invoked from the web even). We’ve got automated configuration, logging, dashboards, and wikis – are we doing ChatOps?

        Well, no, we aren’t.

        Here are the differences I noted:
        • ChatOps requires everyone to both agree and commit to a single interface to all operations. (The opsbot, like hubot, lita or Err.) Technical debt (non-conforming legacy systems) will be reworked to fit into ChatOps.
        • ChatOps requires focus and discipline. There are a small number of channels (chat rooms, MUC) that have very specific uses - and folks follow that. High signal to noise ratio. (No animated gifs in the deploy channel - that’s what the lolcat channel is for.)
        • A commitment to explicitly documenting all business rules as executable code.

        What do you get for giving up all those options and flexibility? Here were the “aha!” concepts for me:

        1. Each ChatOps room is a “shared console” everyone can see and operate. No more screen sharing over video, or “refresh now” coordination!

        2. There is a bot which provides the “facts” about the world. One view accessible by all.

        3. The bot is also the primary way folks interact and modify the system. And it is consistent in usage across all commands. (The bot extensions perform the mapping to whatever the backend needs. The code adapts, not the human!)

        4. The bot knows all and does all:
          • Where’s the documentation?
          • How do I do X?
          • Do X!
          • What is the status of system Y?
        5. The bot is “fail safe” - you can’t bypass the rules. (If you code in a bypass, well, you loaded that foot gun!)
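
To make the "single, consistent interface" idea concrete, here is a toy sketch. It is not hubot, lita or Err code; the class, command names and replies are all invented for illustration. Every operation is registered with the bot, and the bot refuses anything that was not registered.

```python
class OpsBot:
    """Toy chat bot: one consistent entry point for every operation."""

    def __init__(self):
        self._commands = {}

    def command(self, name, help_text):
        """Decorator that registers an extension under a chat command name."""
        def register(func):
            self._commands[name] = (func, help_text)
            return func
        return register

    def handle(self, message):
        """Dispatch a chat message of the form '<command> <argument>'."""
        name, _, argument = message.strip().partition(" ")
        if name == "help":
            return "\n".join(
                f"{cmd}: {text}" for cmd, (_, text) in sorted(self._commands.items())
            )
        if name not in self._commands:
            # "Fail safe": there is no way around the registered rules.
            return f"unknown command: {name}"
        func, _ = self._commands[name]
        return func(argument)


bot = OpsBot()


@bot.command("status", "report the state of a system")
def status(system):
    # A real extension would query the backend here.
    return f"{system}: ok"


@bot.command("deploy", "deploy a service")
def deploy(service):
    # A real extension would map this to whatever the backend needs.
    return f"deploying {service}"


print(bot.handle("status webheads"))
print(bot.handle("help"))
```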

        Thus everything is consistent and familiar for users, which helps during those 03:00 forays into a system you aren’t as familiar with. Nirvana ensues (remember, everyone did agree to drink the koolaid above).

        Can you get there from here?

        The speaker selection was great – Dan was able to speak to the benefits of committing to ChatOps early in a startup’s life. James Fryman (from StackStorm) showed a path for migrating existing operations to a ChatOps model. That pretty much brackets the range, so yeah, it’s doable.

        The main hurdle, imo, would be getting the agreement to a total commitment! There are some tensions in deploying such a system at a highly open operation like Mozilla: ideally ChatOps is open to everyone, and business rules ensure you can’t do or see anything improper. That means the bot has (somewhere) the credentials to do some very powerful operations. (Dan hopes to get their company to the “no one uses ssh, ever” point.)

        My next steps? Still thinking about it a bit – I may load Err onto my laptop and try doing all my local automation via that.

        January 10, 2015 08:00 AM

        January 09, 2015

        Chris AtLee (catlee)

        Upcoming hotness from RelEng

        To kick off the new year, I'd like to share some of the exciting projects we have underway in Release Engineering.


        First off we have Balrog, our next generation update server. Work on Balrog has been underway for quite some time. Last fall we switched beta users to use it. Shortly after, we did some additional load testing to see if we were ready to flip over release traffic. The load testing revealed some areas that needed optimization, which isn't surprising since almost no optimization work had been done up to that point!

        Ben and Nick added the required caching, and our subsequent load testing was a huge success. We're planning on flipping the switch to divert release users over on January 19th. \o/


        Next up we have Funsize. (Don't ask about the name; it's all Laura's fault). Funsize is a web service to generate partial updates between two versions of Firefox. There are a number of places where we want to generate these partial updates, so wrapping the logic up into a service makes a lot of sense, and also affords the possibility of faster generation due to caching.

        We're aiming to have nightly builds use funsize for partial update generation this quarter.

        I'd really like to see us get away from the model where the "nightly build" job is responsible for not only the builds, but generating and publishing the complete and partial updates. The problem with this is that the single job is responsible for too many deliverables, and touches too many systems. It's hard to make and test changes in isolation.

        The model we're trying to move to is where the build jobs are responsible only for generating the required binaries. It should be the responsibility of a separate system to generate partials and publish updates to users. I believe splitting up these functions into their own systems will allow us to be more flexible in how we work on changes to each piece independently.

        S3 uploads from automation

        This quarter we're also working on migrating build and test files off our aging file server infrastructure (aka "FTP", which is a bit of a misnomer...) and onto S3. All of our build and test binaries are currently uploaded and downloaded via a central file server in our data center. It doesn't make sense to do this when most of our builds and tests are being generated and consumed inside AWS now. In addition, we can get much better cost-per-GB by moving the storage to S3.

        No reboots

        Morgan has been doing awesome work with runner. One of the primary aims here is to stop rebooting build and test machines between every job. We're hoping that by not rebooting between builds, we can get a small speedup in build times since a lot of the build tree should be cached in memory already. Also, by not rebooting we can have shorter turnaround times between jobs on a single machine; we can effectively save 3-4 minutes of overhead per job by not rebooting. There's also the opportunity to move lots of machine maintenance work from inside the build/test jobs themselves to instead run before buildbot starts.

        Release build promotion

        Finally I'd like to share some ideas we have about how to radically change how we do release builds of Firefox.

        Our plan is to create a new release pipeline that works with already built binaries and "promotes" them to the release/beta channel. The release pipeline we have today creates a fresh new set of release builds that are distinct from the builds created as part of continuous integration.

        This new approach should cut the amount of time required to release nearly in half, since we only need to do one set of builds instead of two. It also has the benefit of aligning the release and continuous-integration pipelines, which should simplify a lot of our code.

        ... and much more!

        This is certainly not an exhaustive list of the things we have planned for this year. Expect to hear more from us over the coming weeks!

        January 09, 2015 06:35 PM

        Ben Hearsum (bhearsum)

        UPDATED: New update server is going live for release channel users on Tuesday, January **20th**

        (This post has been updated with the new go-live date.)

        Our new update server software (codenamed Balrog) has been in development for quite a while now. In October of 2013 we moved Nightly and Aurora to it. This past September we moved Beta users to it. Finally, we’re ready to switch the vast majority of our users over. We’ll be doing that on the morning of Tuesday, January 20th. Just like when we switched nightly/aurora/beta over, this change should be invisible, but please file a bug or swing by #releng if you notice any issues with updates.

        Stick around if you’re interested in some of the load testing we did.

        Shortly after switching all of the Beta users to Balrog we did a load test to see if Balrog could handle the amount of traffic that the release channel would throw at it. With just 10% of the release traffic being handled, it blew up:

        We were pulling more than 150MBit/sec per web head from the database server, and saturating the CPUs completely. This caused very slow requests, to the point where many were just timing out. While we were hoping that it would just work, this wasn’t a complete surprise given that we hadn’t implemented any form of caching yet. After implementing a simple LRU cache on Balrog’s largest objects, we did another load test. Here’s what the load looked like on one web head:

        Once caching was enabled the load was practically non-existent. As we ramped up release channel traffic the load grew, but in a more or less linear (and very gradual) fashion. At around 11:35 on this graph we were serving all of the release channel traffic, and each web head was using a meager 50% of its CPU:

        I’m not sure what to call that other than winning.
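
For readers unfamiliar with the technique, a least-recently-used (LRU) cache can be sketched in a few lines. This is only an illustration of the idea, not Balrog's code: the class name, the loader function and the object names are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Keep at most max_items values; evict the least recently used one."""

    def __init__(self, max_items):
        self.max_items = max_items
        self._items = OrderedDict()

    def get(self, key, load):
        """Return the cached value, calling load(key) only on a miss."""
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        value = load(key)
        self._items[key] = value
        if len(self._items) > self.max_items:
            self._items.popitem(last=False)  # drop the oldest entry
        return value


db_reads = []


def fake_db_load(name):
    """Stand-in for an expensive database read."""
    db_reads.append(name)
    return f"contents of {name}"


cache = LRUCache(max_items=2)
for name in ["blob-a", "blob-b", "blob-a", "blob-a"]:
    cache.get(name, fake_db_load)

# Four requests, but only two of them reached the "database".
print(db_reads)
```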

        January 09, 2015 04:39 PM

        January 08, 2015

        Kim Moir (kmoir)

        Mozilla pushes - December 2014

        Here's December 2014's monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.

        There was a low number of pushes this month.  I expect this is due to the Mozilla all-hands in Portland in early December where we were encouraged to meet up with other teams instead of coding :-) and the holidays at the end of the month for many countries.
        As a side note, in 2014 we had a total of 124423 pushes, compared to 79233 in 2013, which represents a growth rate of 57% this year.

        7836 pushes
        253 pushes/day (average)
        Highest number of pushes/day: 706 pushes on Dec 17, 2014
        15.25 pushes/hour (highest)

        General Remarks
        Try had around 46% of all the pushes
        The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 23% of all of the pushes

        August 2014 was the month with most pushes (13,090  pushes)
        August 2014 has the highest pushes/day average with 422 pushes/day
        July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
        October 8, 2014 had the highest number of pushes in one day with 715 pushes 

        January 08, 2015 05:14 PM

        January 06, 2015

        Armen Zambrano G. (@armenzg)

        Tooltool fetching can now use LDAP credentials from a file

        You can now fetch tooltool files by using an authentication file.
        All you have to do is append "--authentication-file file" to your tooltool fetching command.

        This is important if you want to use automation to fetch files from tooltool on your behalf.
        This was needed to allow Android test jobs to run locally since we need to download tooltool files for it.

        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        January 06, 2015 04:45 PM

        January 05, 2015

        Armen Zambrano G. (@armenzg)

        Run Android test jobs locally

        You can now run Android test jobs on your local machine with Mozharness.

        As with any other developer capable Mozharness script, all you have to do is:

        An example for this is:
        python scripts/ --cfg android/
        --test-suite mochitest-gl-1 --blob-upload-branch try
        --download-symbols ondemand --cfg

        Here's the bug where the work happened.
        Here's the documentation on how to run Mozharness as a developer.

        Please file a bug under Mozharness if you find any issues.

        Here are some other related blog posts:


        Bug 1117954 - I think that a different SDK or emulator version is needed to run Android API 10 jobs.

        I wish we ran all of our jobs in proper isolation!

        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        January 05, 2015 08:47 PM

        December 22, 2014

        Armen Zambrano G. (@armenzg)

        Run mozharness talos as a developer (Community contribution)

        Thanks to our contributor Simarpreet Singh from Waterloo you can now run a talos job through mozharness on your local machine (bug 1078619).

        All you have to add is the following:

        To read more about running Mozharness locally go here.

        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        December 22, 2014 08:10 PM

        December 11, 2014

        Kim Moir (kmoir)

        Releng 2015 CFP now open

        Florence, Italy.  Home of beautiful architecture.

        Il Duomo di Firenze by ©runner310, Creative Commons by-nc-sa 2.0

        Delicious food and drink.

        Panzanella by © Pete Carpenter, Creative Commons by-nc-sa 2.0

        Caffè ristretto by © Marcelo César Augusto Romeo, Creative Commons by-nc-sa 2.0

        And next May, release engineering :-)

        The CFP for Releng 2015 is now open.  The deadline for submissions is January 23, 2015.  It will be held on May 19, 2015 in Florence Italy and co-located with ICSE 2015.   We look forward to seeing your proposals about the exciting work you're doing in release engineering!

        If you have questions about the submission process or anything else, please contact any of the program committee members. My email is kmoir and I work at

        December 11, 2014 09:00 PM

        December 09, 2014

        Armen Zambrano G. (@armenzg)

        Running Mozharness in developer mode will only prompt once for credentials

        Thanks to Mozilla's contributor kartikgupta0909 we now only have to enter LDAP credentials once when running the developer mode of Mozharness.

        He accomplished it in bug 1076172.

        Thank you Kartik!

        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        December 09, 2014 09:43 PM

        December 08, 2014

        Armen Zambrano G. (@armenzg)

        Test mozharness changes on Try

        You can now push to your own mozharness repository (even a specific branch) and have it be tested on Try.

        A few weeks ago we developed mozharness pinning (aka mozharness.json) and recently we have enabled it for Try. Read the blog post to learn how to make use of it.

        NOTE: This currently only works for desktop, mobile and b2g test jobs. More to come.
        NOTE: We only support named branches, tags or specific revisions. Do not use bookmarks as they don't work.

        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        December 08, 2014 06:59 PM

        December 04, 2014

        Morgan Phillips (mrrrgn)

        shutdown -r never

        For the past month I've worked on achieving the effects of a reboot without actually doing one. Sort of a "virtual" reboot. This isn't a usual optimization; but in Mozilla's case it's likely to create a huge impact on performance.

        Mozilla build/test infrastructure is complex. The jobs can be expensive and messy. So messy that, for a while now, machines have been rebooted after completing tasks to ensure that environments remain fresh.

        This strategy works marvelously at preventing unnecessary failures, but wastes a lot of resources. In particular, with reboots taking something like two minutes to complete, and at around 100k jobs per day, we spend a whopping 200,000 minutes of machine time on reboots every day. That's nearly five months - yikes! [1]
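
The back-of-the-envelope arithmetic, spelled out. The only assumption beyond the two figures quoted above is a 30-day month.

```python
JOBS_PER_DAY = 100_000   # "around 100k jobs per day"
REBOOT_MINUTES = 2       # "something like two minutes" per reboot

reboot_minutes_per_day = JOBS_PER_DAY * REBOOT_MINUTES
reboot_days = reboot_minutes_per_day / (60 * 24)
reboot_months = reboot_days / 30  # assuming 30-day months

print(reboot_minutes_per_day, round(reboot_days, 1), round(reboot_months, 1))
```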

        Yesterday I began rolling out these "virtual" reboots for all of our Linux hosts, and it seems to be working well [edit: after a few rollbacks]. By next month I should also have it turned on for OSX and Windows machines.

        What does a "virtual" reboot look like?

        For starters [pun intended], each job requires a good amount of setup and teardown, so, a sort of init system is necessary. To achieve this a utility called runner has been created. Runner is a project that manages starting tasks in a defined order. If tasks fail, the chain can be retried, or halted. Many tasks that once lived in /etc/init.d/ are now managed by runner including buildbot itself.
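        To make the "ordered tasks, retry or halt" idea concrete, here is a minimal sketch. It is not runner's actual code: the function name run_chain, the max_retries parameter and the convention that a task returns True on success are all assumptions made for illustration.

```python
from typing import Callable, Sequence


def run_chain(tasks: Sequence[Callable[[], bool]], max_retries: int = 2) -> bool:
    """Run tasks in order; if one fails, restart the chain from the top.

    Returns True once every task succeeds in a single pass, or False
    (halt) when the retries are used up.
    """
    for _attempt in range(1 + max_retries):
        for task in tasks:
            if not task():
                break  # this pass failed; retry the chain from the first task
        else:
            return True  # no task failed during this pass
    return False
```

        A caller would pass the cleanup and service-start steps as callables in the order they must run, and only take a job once run_chain returns True.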

        Among runner's tasks are various scripts for cleaning up temporary files, starting/restarting services, and also a utility called cleanslate. Cleanslate resets a user's running processes to a previously recorded state.

        At boot, cleanslate takes a snapshot of all running processes, then, before each job it kills any processes (by name) which weren't running when the system was fresh. This particular utility is key to maintaining stability and may be extended in the future to enforce other kinds of system state as well.
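        The snapshot-and-compare logic can be sketched as follows. This is an illustration of the idea only, not cleanslate's implementation: the function names are hypothetical, and the process names are passed in as plain lists so the sketch does not depend on any particular way of listing or killing processes.

```python
def snapshot(process_names):
    """Record the set of process names seen on a freshly booted system."""
    return frozenset(process_names)


def processes_to_kill(baseline, current_names):
    """Return the names running now that were absent from the baseline.

    Matching is by name, as described in the post, so several instances
    of the same stray process collapse into one entry.
    """
    return sorted(set(current_names) - baseline)
```

        Before each job, everything returned by processes_to_kill would be terminated, which brings the host back to its recorded state.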

        The end result is this:

        old work flow

        Boot + init -> Take Job -> Reboot (2-5 min)

        new work flow

        Boot + Runner -> Take Job -> Shutdown Buildslave
        (runner loops and restarts slave)

        [1] What's more, this estimate does not take into account the fact that jobs run faster on a machine that's already "warmed up."

        December 04, 2014 06:54 PM

        December 03, 2014

        Kim Moir (kmoir)

        Mozilla pushes - November 2014

        Here's November's monthly analysis of the pushes to our Mozilla development trees.  You can load the data as an HTML page or as a json file.

        Not a record-breaking month; in fact, we are down over 2000 pushes from last month.

        10376 pushes
        346 pushes/day (average)
        Highest number of pushes/day: 539 pushes on November 12
        17.7 pushes/hour (average)

        General Remarks
        Try had around 38% of all the pushes, and gaia-try had about 30%. The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 23% of all the pushes.

        August 2014 was the month with most pushes (13,090  pushes)
        August 2014 has the highest pushes/day average with 422 pushes/day
        July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
        October 8, 2014 had the highest number of pushes in one day with 715 pushes    

        December 03, 2014 09:41 PM

        November 24, 2014

        Armen Zambrano G. (@armenzg)

        Pinning mozharness from in-tree (aka mozharness.json)

        Since mozharness came around 2-3 years ago, we have had the same issue where we test a mozharness change against the trunk trees, land it and get it backed out because we regress one of the older release branches.

        This is due to the nature of the mozharness setup where once a change is landed all jobs start running the same code and it does not matter on which branch that job is running.

        I have recently landed some code that is now active on Ash (and soon on Try) that will read a manifest file that points your jobs to the right mozharness repository and revision. We call this process "pinning mozharness". In other words, what we do is fix an external factor of our job execution.

        This will allow you to point your Try pushes to your own mozharness repository.

        In order to pin your jobs to a repository/revision of mozharness you have to change a file called mozharness.json which indicates the following two values:
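        Going by the description above, the two values are a repository and a revision. A minimal sketch of reading such a manifest follows; the key names "repo" and "revision", the example URL and the branch name are assumptions made for illustration and may not match the real file.

```python
import json

# Hypothetical manifest contents; key names and values are illustrative only.
EXAMPLE_MANIFEST = """
{
    "repo": "https://hg.example.org/users/you/mozharness",
    "revision": "my-feature-branch"
}
"""


def read_pin(manifest_text):
    """Return the (repository, revision) pair a job should check out."""
    data = json.loads(manifest_text)
    return data["repo"], data["revision"]


repo, revision = read_pin(EXAMPLE_MANIFEST)
print(f"would check out {revision} from {repo}")
```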

        This is a similar concept to the one talos.json introduced, which locks every job to a specific revision of talos. The original version of it landed in 2011.

        Even though we have a similar concept since 2011, that doesn't mean that it was as easy to make it happen for mozharness. Let me explain a bit why:

        Coming up:
        • Enable on Try
        • Free up Ash and Cypress
          • They have been used to test custom mozharness patches and the default branch of Mozharness (pre-production)
        Long term:
        • Enable the feature on all remaining Gecko trees
          • We would like to see this run at scale for a bit before rolling it out
          • This will allow mozharness changes to ride the trains
        If you are curious, the patches are in bug 791924.

        Thanks to Rail for all his patch reviews and to Jordan for spurring me to tackle it.

        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        November 24, 2014 05:35 PM

        November 12, 2014

        Kim Moir (kmoir)

        Scaling capacity while saving cash

        There was a very interesting release engineering summit this Monday held in concert with LISA in Seattle.  I was supposed to fly there this past weekend so I could give a talk on Monday, but late last week I became ill and was unable to go.  That was very disappointing because the summit looked really great and I was looking forward to meeting the other release engineers and learning about the challenges they face.

        Scale in the Market  ©Clint Mickel, Creative Commons by-nc-sa 2.0

        Although I didn't have the opportunity to give the talk in person, the slides for it are available on slideshare and my mozilla people account.  The talk describes how we scaled our continuous integration infrastructure on AWS to handle double the amount of pushes it handled in early 2013, all while reducing our AWS monthly bill by 2/3.

        Cost per push from October 2012 until October 2014: our monthly AWS bill divided by the number of monthly pushes (commits).  This does not include costs for on-premise equipment.

        Thank you to Dinah McNutt and the other program committee members for organizing this summit.  I look forward to watching the talks once they are online.

        November 12, 2014 07:34 PM

        Mozilla pushes - October 2014

        Here's the October 2014 monthly analysis of the pushes to our Mozilla development trees.  You can load the data as an HTML page or as a json file.

        We didn't have a record-breaking month in terms of the number of pushes; however, we did have a daily record on October 8 with 715 pushes.

        12821 pushes, up slightly from the previous month
        414 pushes/day (average)
        Highest number of pushes/day: 715 pushes on October 8
        22.5 pushes/hour (average)

        General Remarks
        Try had around 39% of all the pushes, and gaia-try had about 31%. The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 21% of all the pushes.

        August 2014 was the month with most pushes (13,090  pushes)
        August 2014 has the highest pushes/day average with 422 pushes/day
        July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
        October 8, 2014 had the highest number of pushes in one day with 715 pushes

        November 12, 2014 03:45 PM

        Morgan Phillips (mrrrgn)

        AirMozilla: Distrusting Our Own [Build] Infrastructure

        If you missed last week's AirMozilla broadcast: Why and How of Reproducible Builds: Distrusting Our Own Infrastructure For Safer Software Releases by the Tor Project, consider checking it out.

        The talk is an in-depth look at how one can protect release pipelines from being owned by attacks which target build systems, particularly attacks where compromised compilers may be used to create unsafe binaries from safe source code.

        Meanwhile RelEng is underway, putting these ideas into practice.

        November 12, 2014 08:54 AM

        A Simple Trusting Trust Attack


        November 12, 2014 07:54 AM

        November 10, 2014

        Morgan Phillips (mrrrgn)

        A Note on Deterministic Builds

        Since I joined Mozilla's Release Engineering team I've had the opportunity to put my face into a firehose of interesting new knowledge and challenges. Maintaining a release pipeline for binary installers and updates used by a substantial portion of the Earth's population is a whole other kind of beast from ops roles where I've focused on serving some kind of SaaS or internal analytics infrastructure. It's really exciting!

        One of the most interesting problems I've seen getting attention lately is deterministic builds, that is, builds that produce the same sequence of bytes from source on a given platform at any time.

        What good are deterministic builds?

        For starters, they aid in detecting "Trusting Trust" attacks. That's where a compromised compiler produces malicious binaries from perfectly harmless source code by replacing certain patterns during compilation. It sort of defeats the whole security advantage of open source when you download binaries, right?

        Luckily for us users, a fellow named David A. Wheeler rigorously proved a method for circumventing this class of attacks altogether via a technique he coined "Diverse Double-Compiling" (DDC). The gist of it is, you compile a project's source code with a trusted tool chain then compare a hash of the result with some potentially malicious binary. If the hashes match you're safe.

        DDC also detects the less clever scenario where an adversary patches, otherwise open, source code during the build process and serves up malwareified packages. In either case, it's easy to see that this works if and only if builds are deterministic.
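        The final comparison step described above can be sketched in a few lines. This covers only the "hash both and compare" part, not the full DDC procedure; the function names are hypothetical, and SHA-256 is chosen here as an assumption since the post does not name a hash.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest of a build artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


def matches_trusted_build(trusted_build: bytes, distributed_binary: bytes) -> bool:
    """True when the distributed binary is byte-for-byte what the trusted
    tool chain produced from the same source.

    This only tells us anything if the build is deterministic; otherwise
    two honest builds would also differ.
    """
    return sha256_hex(trusted_build) == sha256_hex(distributed_binary)
```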

        Aside from security, they can also help projects that support many platforms take advantage of cross building with less stress. That is, one could compile arm packages on an x86_64 host then compare the results to a native build and make sure everything matches up. This can be a huge win for folks who want to cut back on infrastructure overhead.

        How can I make a project more deterministic?

        One bit of good news is, most compilers are already pretty deterministic (on a given platform). Take hello.c for example:

        #include <stdio.h>
        int main() {
            printf("Hello World!");
        }

        Compile that a million times and take the md5sum. Chances are you'll end up with a million identical md5sums. Scale that up to a million lines of code, and there's no reason why this won't hold true.

        However, take a look at this doozy:

        #include <stdio.h>
        int main() {
            printf("Hello from %s! @ %s", __FILE__, __TIME__);
        }

        Having timestamps and other platform specific metadata baked into source code is a huge no-no for creating deterministic builds. Compile that a million times, and you'll likely get a million different md5sums.

        In fact, in an attempt to make Linux more deterministic, all __TIME__ macros were removed and the makefile specifies a compiler option (-Werror=date-time) that turns any use of them into an error.

        Unfortunately, removing all traces of such metadata in a mature code base could be all but impossible. However, a fantastic tool called gitian will allow you to compile projects within a virtual environment where timestamps and other metadata are controlled.

        Definitely check gitian out and consider using it as a starting point.

        Another trouble spot to consider is static linking. Here, unless you're careful, determinism sits at the mercy of third parties. Be sure that your build system has access to identical libraries from anywhere it may be used. Containers and pre-baked VMs seem like a good choice for fixing this issue, but remember that you could also be passing around a tainted compiler!

        Scripts that automate parts of the build process are also a potent breeding ground for non-deterministic behaviors. Take this python snippet for example:

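        import os
        with open('manifest', 'w') as manifest:
            for dirpath, dirnames, filenames in os.walk("."):
                for filename in filenames:
                    # (assumed body) record paths in whatever order os.walk yields them
                    manifest.write(os.path.join(dirpath, filename) + "\n")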

        The problem here is that os.walk will not always yield filenames in the same order. :(
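        One way to remove that source of non-determinism is to sort at every level of the walk. The sketch below is a suggestion rather than anything from an existing build system; the function name manifest_lines and the choice of forward-slash relative paths are assumptions.

```python
import os


def manifest_lines(root="."):
    """List every file under root in an order that does not depend on
    the filesystem's directory listing order."""
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Sorting dirnames in place controls the order in which
        # os.walk (top-down, the default) descends into subdirectories.
        dirnames.sort()
        for filename in sorted(filenames):
            rel = os.path.relpath(os.path.join(dirpath, filename), root)
            lines.append(rel.replace(os.sep, "/"))
    return lines
```

        Writing "\n".join(manifest_lines(".")) to the manifest then gives the same bytes on every run over the same tree.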

        One also has to keep in mind that certain data structures become very dangerous in such scripts. Consider this pseudo-python that auto generates some sort of source code in a compiled language:

        weird_mapping = dict(file_a=99, file_b=1)
        things_in_a_set = set([thing_a, thing_b, thing_c])
        for k, v in weird_mapping.items():
            ... generate some code ...
        for thing in things_in_a_set:
            ... generate some code ...

        A pattern like this would dash any hope that your project had of being deterministic because it makes use of unordered data structures.

        Beware of unordered data structures in build scripts and/or sort all the things before writing to files.
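        A minimal sketch of the "sort all the things" advice, under the assumption that the script emits one line of generated source per item. The function name and the emitted C-style lines are hypothetical.

```python
def generate_lines(mapping, things):
    """Emit generated source lines in an order that depends only on the
    contents of the inputs, never on their iteration order."""
    lines = []
    # sorted() over items() orders by key, regardless of insertion order.
    for key, value in sorted(mapping.items()):
        lines.append(f"#define {key.upper()} {value}")
    # Sets have no guaranteed order, so sort before iterating.
    for thing in sorted(things):
        lines.append(f"extern int {thing};")
    return lines
```

        Two runs that build the same mapping and set in a different order now produce identical output.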

        Enforcing determinism from the beginning of a project's life cycle is the ideal situation, so, I would highly recommend incorporating it into CI flows. When a developer submits a patch it should include a hash of their latest build. If the CI system builds and the hashes don't match, reject that non-deterministic code! :)


        Of course, this hardly scratches the surface on why deterministic builds are important; but I hope this is enough for a person to get started on. It's a very interesting topic with lots of fun challenges that need solving. :) If you'd like to do some further reading, I've listed a few useful sources below.

        November 10, 2014 07:54 PM

        Justin Wood (Callek)

        Firefox Launches Developer Edition (Minor Papercut Issues)

        So, as you may have heard, Firefox is launching a dev edition.

        This post does not attempt to elaborate on that too much; it’s more to identify some issues I hit in early testing and the solutions to them.


        While I do admire the changes of the Developer Edition Theme, I’m a guy who likes to stick with “what I know” more than a drastic change like that. What I didn’t realize was that this is possible out of the box in developer edition.

        After the Tour you get, you’ll want to open the Customize panel and then deselect “Use Firefox Developer Edition Theme” (see the following image — arrow added) and that will get you back to what you know.



        As a longtime user, I had “Old Firefox Sync” enabled; this was the one that very few users enabled and even fewer used across devices.

        Firefox Developer Edition, however, creates a new profile (so you can use it alongside whatever Firefox version you want) and supports setting up only the “New” sync features. Due to creating a new profile, it also leaves you without history or saved passwords.

        To sync my old profile with developer edition, I had to:

        1. Unlink my Desktop Firefox from old sync
        2. Unlink my Android Firefox from old sync
        3. Create a new sync account
        4. Link my old Firefox profile with new sync
        5. Link my Android with new sync
        6. Link Dev Edition with new sync
        7. Profit

        Now other than steps 6 and 7 (yea, how DO I profit?) this is all covered quite well in a SuMo article on the subject. I will happily help guide people through this process, especially in the near future, as I’ve just gone through it!

        (Special Thanks to Erik for helping to copy-edit this post)

        November 10, 2014 04:30 PM

        I’m a wordpress newbie

        If this is on, and so is a “content is password protected” post below it, I’m sorry.

        The post is merely that way because it’s unfinished but I wanted to share it with a few others for early feedback.

        I’ll delete this post, and unhide that one once things are ready. (Sorry for any confusion)

        November 10, 2014 05:18 AM

        November 06, 2014

        Armen Zambrano G. (@armenzg)

        Setting buildbot up a-la-releng (Create your own local masters and slaves)

        buildbot is what Mozilla's Release Engineering uses to run the infrastructure behind
        buildbot assigns jobs to machines (aka slaves) through hosts called buildbot masters.

        All the different repositories and packages needed to setup buildbot are installed through Puppet and I'm not aware of a way of setting my local machine through Puppet (I doubt I would want to do that!).
        I managed to set this up a while ago by hand [1][2] (it was even more complicated in the past!), however, these one-off attempts were not easy to keep up-to-date and isolated.

        I recently landed a few scripts that make it trivial to set up as many buildbot environments as you want, all isolated from each other.

        All the scripts have been landed under the "community" directory under the "braindump" repository:

        The main two scripts:

        If you call with -w /path/to/your/own/workdir you will have everything set up for you. From there on, all you would have to do is this:
        • cd /path/to/your/own/workdir
        • source venv/bin/activate
        • buildbot start masters/test_master (for example)
        • buildslave start slaves/test_slave
        Each paired master and slave has been set up to talk to each other.

        I hope this is helpful for people out there. It's been great for me when I contribute patches for buildbot (bug 791924).

        As always at Mozilla, contributions are welcome!

        PS 1 = Only tested on Ubuntu. If you want to port this to other platforms please let me know and I can give you a hand.

        PS 2 = I know that there is a repository that has docker images called "tupperware"; however, I had been working on this set of scripts for a while. Perhaps someone wants to figure out how to set up a similar process through the docker images.

        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        November 06, 2014 02:02 PM

        November 05, 2014

        Massimo Gervasini (mgerva)

        Sign the hash of the bundle, not the full bundle!

        With Bug 1083683, we are stopping direct processing of .bundle and .source files by our signing servers. This means that in the near future we will not have new *.bundle.asc and *.source.tar.bz2.asc files on the ftp server.
        Bundles and source files have grown quite a bit, and getting them signed sometimes ends in retries and failed jobs, disrupting and delaying the release process. There’s also no benefit in having them signed directly: the source-package job already calculates the MD5/SHA1/SHA512 hashes of the bundle/source files and includes them in the .checksum file, which is signed with the release automation key.
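        The idea is that only a small text file of digests needs to go through signing, however large the bundle is. Here is a sketch of producing those digests; the function name and the "digest algorithm size name" line layout are assumptions made for illustration and are not necessarily the real .checksum format. The signing step itself is left out.

```python
import hashlib

# The three digest types named in the post.
ALGORITHMS = ("md5", "sha1", "sha512")


def checksum_lines(name: str, data: bytes) -> list:
    """Return one line per algorithm describing the given file contents.

    Line layout (assumed): "<hex digest> <algorithm> <size in bytes> <name>".
    """
    lines = []
    for algorithm in ALGORITHMS:
        digest = hashlib.new(algorithm, data).hexdigest()
        lines.append(f"{digest} {algorithm} {len(data)} {name}")
    return lines
```

        Signing the resulting few hundred bytes is quick, and anyone who verifies that signature and recomputes the digests has verified the bundle as well.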

        November 05, 2014 04:57 PM

        October 27, 2014

        Kim Moir (kmoir)

        Mozilla pushes - September 2014

        Here's September 2014's monthly analysis of the pushes to our Mozilla development trees.
        You can load the data as an HTML page or as a json file.

        Surprise!  No records were broken this month.

        12267 pushes
        409 pushes/day (average)
        Highest number of pushes/day: 646 pushes on September 10, 2014
        22.6 pushes/hour (average)

        General Remarks
        Try has around 36% of pushes and Gaia-Try comprises about 32%.  The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 22% of all the pushes.

        August 2014 was the month with most pushes (13,090  pushes)
        August 2014 has the highest pushes/day average with 422 pushes/day
        July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
        August 20, 2014 had the highest number of pushes in one day with 690 pushes

        October 27, 2014 09:11 PM

        Release Engineering in the classroom

        The second week of October, I had the pleasure of presenting lectures on release engineering to university students in Montreal as part of the PLOW lectures at École Polytechnique de Montréal.    Most of the students were MSc or PhD students in computer science, with a handful of postdocs and professors in the class as well. The students came from Montreal area universities and many were international students. The PLOW lectures consisted of several invited speakers from various universities and industry spread over three days.

        View looking down from the university

        Université de Montréal administration building

        École Polytechnique building.  Each floor is painted a different colour to represent a different layer of the earth.  So the ground floor is red, the next orange and finally green.

        The first day, Jack Jiang from York University gave a talk about software performance engineering.
        The second day, I gave a lecture on release engineering in the morning.  The rest of the day we did a lot of labs to configure a Jenkins server to build and run tests on an open source project. Earlier that morning, I had set up m3.large instances for the students on Amazon that they could ssh into to conduct their labs.  Along the way, I talked about some release engineering concepts.  It was really interesting and I learned a lot from their feedback.  Many of the students had not been exposed to release engineering concepts so it was fun to share the information.

        Several students came up to me during the breaks and said "So, I'm doing my PhD in release engineering, and I have several questions for you" which was fun.  Also, some of the students were making extensive use of code bases for Mozilla or other open source projects so that was interesting to learn more about.  For instance, one research project was looking at the evolution of multi-threading in Mozilla code bases, and another student was conducting bugzilla comment sentiment analysis.  Are angry bug comments correlated with fewer bug fixes?  Looking forward to the results of this research!

        I ended the day by providing two challenge exercises to the students that they could submit answers to.  One exercise was to set up a build pipeline in Jenkins for another open source project.  The other challenge was to use the Jenkins REST API to query the Apache project's Jenkins server and present some statistics on their build history.  The results were pretty impressive!

        My slides are on GitHub and the readme file describes how I set up the Amazon instances so Jenkins and some other required packages were installed beforehand.  Please use them and distribute them if you are interested in teaching release engineering in your classroom.

        Lessons I learned from this experience:
        The third day there was a lecture by Michel Dagenais of Polytechnique Montréal on tracing heterogeneous cloud instances using (tracing framework for Linux).  The Eclipse Trace Compass project also made an appearance in the talk. I always like to see Eclipse projects highlighted.  One of his interesting points was that none of the companies that collaborate on this project wanted to sign a bunch of IP agreements so they could collaborate on this project behind closed doors.  They all wanted to collaborate via an open source community and source code repository.  Another thing he emphasized was that students should make their work available on the web, via GitHub or other repositories, so they have a portfolio of work available.  It was fantastic to see him promote the idea of students being involved in open source as a way to help their job prospects when they graduate!

        Thank you Foutse and  Bram  for the opportunity to lecture at your university!  It was a great experience!  Also, thanks Mozilla for the opportunity to do this sort of outreach to our larger community on company time!

        Also, I have a renewed respect for teachers and professors.  Writing these slides took so much time.  Many long nights for me, especially in the days leading up to the class.  Kudos to all of you who teach every day.

        October 27, 2014 01:34 PM

        Beyond the Code 2014: a recap

        I started this blog post about a month ago and didn't finish it because, well, life is busy.

        I attended Beyond the Code last September 19.  I heard about it several months ago on twitter.  A one-day conference about celebrating women in computing, in my home town, with a fantastic speaker lineup?  I signed up immediately.   In the opening remarks, we were asked for a show of hands to show how many of us were developers, in design, product management, or students, and there was a good representation from all those categories.  I was especially impressed to see the number of students in the audience; it was nice to see so many of them taking time out of their busy schedule to attend.

        View of the Parliament Buildings and Chateau Laurier from the MacKenzie street bridge over the Rideau Canal
        Ottawa Conference Centre, location of Beyond the Code
        There were seven speakers, three workshop organizers, a lunch time activity, and a panel at the end. The speakers were all women.  The speakers were not all white women or all heterosexual women.  There were many young women, not all industry veterans :-) like me.  To see this level of diversity at a tech conference filled me with joy.  Almost every conference I go to is very homogeneous in the makeup of the speakers and the audience.  To see ~200 tech women at a conference and 10% men (thank you for attending :-) was quite a role reversal.

        I was completely impressed by the caliber of the speakers.  They were simply exceptional.

        The conference started out with Kronda Adair giving a talk on Expanding Your Empathy.  One of the things that struck me from this talk was that she talked about how everyone lives in a bubble, and because of privilege they don't see the things that others do.  She gave the example of how privilege is like a browser, and colours how we see the world.  For a straight white guy a web page looks great when they're running the latest Chrome on Mac OS X.  For a middle class black lesbian, the web page doesn't look as great because it's like she's running IE7.  There is less inherent privilege.  For a "differently abled trans person of color" the world is like running IE6 in quirks mode. This was a great example. She also gave a shout out to the Ascend Project which she and Lukas Blakk are running in Mozilla's Portland office. Such an amazing initiative.

        The next speaker was Bridget Kromhout who gave a talk about Platform Ops in the Public Cloud.
        I was really interested in this talk because we do a lot of scaling of our build infrastructure in AWS and wanted to see if she had faced similar challenges. She works at DramaFever, which she described as Netflix for Asian soap operas.  The most interesting thing to me was the fact that they used all AWS regions to host their instances, because they wanted their users to be able to download from a region as geographically close to them as possible.  At Mozilla, we only use a couple of AWS regions, but more instances than DramaFever, so this was an interesting contrast in the services used. In addition, the monitoring infrastructure they use was quite complex.  Her slides are here.

        I was going to summarize the rest of the speakers but Melissa Jean Clark did an exceptional job on her blog.  You should read it!

        Thank you Shopify for organizing this conference.  It was great to meet so many brilliant women in the tech industry! I hope there is an event next year too!

        October 27, 2014 01:33 PM

        October 14, 2014

        Jordan Lund (jlund)

        This week in Releng - Oct 5th, 2014

        Major highlights:

        Completed work (resolution is 'FIXED'):

        In progress work (unresolved and not assigned to nobody):

        October 14, 2014 04:36 AM