Planet Release Engineering

February 25, 2015

Massimo Gervasini (mgerva)

on mixins

We use mixins quite a lot in mozharness.

Mixins are a powerful pattern that allows you to extend your objects while reusing your code (more here). Think of mixins as “plugins”: you can create your custom class and import features just by inheriting from a Mixin class. For example:

class B2GBuild(LocalesMixin, PurgeMixin, B2GBuildBaseScript,
               GaiaLocalesMixin, SigningMixin, MapperMixin, BalrogMixin):

B2GBuild manages FirefoxOS builds and it knows how to:
* manage locales (LocalesMixin)
* deal with repositories (PurgeMixin)
* sign the code (SigningMixin)
* and more…

This is just from the class definition! At this point we haven’t added a single method or property, but we already know how to do a lot of tasks, and it’s almost for free!

So should we use mixins everywhere? Short answer: No.
Long answer: mixins are powerful, but they can also lead to some unexpected behavior.

Objects C and D have exactly the same parents and the same methods, but their behavior is different: it depends on the order in which the parents are declared.
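A minimal sketch of this effect, assuming two hypothetical parent classes A and B that both define a method called name() (the class names and the method are made up for illustration):

```python
# Hypothetical parents: both define the same method.
class A(object):
    def name(self):
        return "A"


class B(object):
    def name(self):
        return "B"


# Same parents, same methods, different declaration order.
class C(A, B):
    pass


class D(B, A):
    pass


# Python looks up name() by walking the method resolution order (MRO),
# which follows the order in which the parents are declared.
c_result = C().name()
d_result = D().name()
c_mro = [cls.__name__ for cls in C.__mro__]
d_mro = [cls.__name__ for cls in D.__mro__]
print(c_result, c_mro)
print(d_result, d_mro)

# Some parent orders cannot be linearized at all and fail with a TypeError.
try:
    class E(A, C):
        pass
    mro_error = None
except TypeError as exc:
    mro_error = exc
print(mro_error)
```

C answers with the method from A and D with the method from B, even though both classes list exactly the same parents.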

This is a side effect of the way Python implements inheritance. Having an object inherit from too many mixins can lead to unexpected failures related to the MRO (method resolution order) when the object is instantiated, or even worse, at runtime when a method does something that is not expected.
When the inheritance becomes obscure, it also becomes difficult to write appropriate tests.

How can we write a mozharness module without using mixins? Let’s try to write a generic module that provides some disk information. For example, we could create the mozharness.base.diskutils module, which provides useful information about the disk size. Our first approach would be to write something like:

class DiskInfoMixin():
    def get_size(self, path):
        # log through the LogMixin that the final class also inherits from
        self.info('calculating disk size')
        <code here>

    def other_methods(self):
        <code here>

and then use it in the final class

from mozharness.base.diskutils import DiskInfoMixin

class BuildRepackages(ScriptMixin, LogMixin, ..., DiskInfoMixin):
    disk_info = self.get_size(path)

Easy! But why are we using a mixin here? Because we need to log some operations, and to do so we need to interact with the LogMixin. This mixin provides everything we need to log messages with mozharness: it provides an abstraction layer that makes logging consistent among all the mozharness scripts, and it’s very easy to use. Just import the LogMixin and start logging!
The same code without the LogMixin would more or less be:

import logging

def get_size(path):
    logging.info('calculating disk size')
    return disk_size

Just a function. Even easier.

… and the final script becomes:

from mozharness.base.diskutils import get_size
class BuildRepackages(ScriptMixin, LogMixin, ...,):
     disk_info = get_size(path)

One less mixin!
There’s a problem though. Messages logged by get_size() will be inconsistent with the rest of the logging. How can we use the mozharness logging style in other modules?
The LogMixin is a complex class with many methods, but at the end of the day it’s a wrapper around the logging module, so behind the scenes it must call the logging module. What if we could just ask our module to use the Python logging facilities already configured by mozharness?
The getLogger() function is what we need here!

import logging
mhlog = logging.getLogger('Multi')

def get_size(path):
    mhlog.info('calculating disk size')
    return disk_size

By default, mozharness uses this ‘Multi’ logger for its messages, so we have just hooked our logger into the mozharness one. Now every logger call will follow the mozharness style!
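A minimal sketch of why this works, using only the standard logging module: getLogger() returns the same logger object for a given name, so whatever handler and format were attached to 'Multi' elsewhere also apply to messages sent from our module. The handler and format below are stand-ins, not the real mozharness configuration.

```python
import io
import logging

# Stand-in for the setup mozharness does at start-up (assumed format).
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
configured = logging.getLogger("Multi")
configured.addHandler(handler)
configured.setLevel(logging.INFO)
configured.propagate = False

# Elsewhere, e.g. in a diskutils-like module: ask for the same name.
mhlog = logging.getLogger("Multi")
mhlog.info("calculating disk size")

same_logger = mhlog is configured
output = stream.getvalue()
print(same_logger, output)
```

The module never configures anything itself; it only asks for the logger by name and inherits the style.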
We are halfway through the logging issues for our brand new module. What if we want to log at an arbitrary log level? A quite common pattern in mozharness is to let the caller of a function decide at what level to log, so let’s add a log_level parameter…

import logging
mhlog = logging.getLogger('Multi')

def get_size(path, log_level=logging.INFO):
    # Logger.log() takes the numeric level first, then the message
    mhlog.log(log_level, 'calculating disk size')
    return disk_size

This will work fine for a generic module, but we want to use this module in mozharness, so there’s one more thing to change: mozharness log levels are strings, while logging module levels are integers, so we need a function to convert between the two formats.
For convenience, in mozharness.base.log we will explicitly expose the mozharness log levels and add a function that converts mozharness log levels to standard log levels.

LOG_LEVELS = {
    DEBUG: logging.DEBUG,
    INFO: logging.INFO,
    WARNING: logging.WARNING,
    ERROR: logging.ERROR,
}

def numeric_log_level(level):
    """Converts a mozharness log level (string) to the corresponding logger
       level (number). This function makes it possible to set the log level
       in functions that do not inherit from LogMixin
    """
    return LOG_LEVELS[level]

Our final module becomes:

import logging
from mozharness.base.log import INFO, numeric_log_level
# use mozharness log
mhlog = logging.getLogger('Multi')

def get_size(path, unit, log_level=INFO):
    lvl = numeric_log_level(log_level)
    mhlog.log(lvl, "calculating disk size")
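For readers who want to try the pattern outside of mozharness, here is a self-contained sketch. It is not the diskutils module from the bug: the string constants are stand-ins for the ones exposed by mozharness.base.log (their values are assumed), and the size calculation with os.walk() is just one plausible way to fill in the body.

```python
import logging
import os
import tempfile

# Stand-ins for the constants exposed by mozharness.base.log (assumed values).
DEBUG, INFO, WARNING, ERROR = "debug", "info", "warning", "error"
LOG_LEVELS = {
    DEBUG: logging.DEBUG,
    INFO: logging.INFO,
    WARNING: logging.WARNING,
    ERROR: logging.ERROR,
}

mhlog = logging.getLogger("Multi")


def numeric_log_level(level):
    """Convert a string log level to the numeric level used by logging."""
    return LOG_LEVELS[level]


def get_size(path, log_level=INFO):
    """Return the total size in bytes of the regular files under path."""
    mhlog.log(numeric_log_level(log_level), "calculating disk size")
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


# Usage: a temporary directory holding one 1024-byte file.
tmp_dir = tempfile.mkdtemp()
with open(os.path.join(tmp_dir, "blob"), "wb") as handle:
    handle.write(b"\0" * 1024)
size = get_size(tmp_dir, log_level=DEBUG)
print(size)
```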

This is just an example of how to use the standard Python logging module.
A real diskutils module is about to land in mozharness (bug 1130336), and it shouldn’t be too difficult to follow the same pattern to create new modules with no dependencies on LogMixin.

This is a first step in the direction of removing some mixins from the mozharness code (see bug 1101183).
Mixins are not the absolute evil, but they must be used carefully. From now on, if I have to write or modify anything in a mozharness module, I will try to enforce the following rules:

February 25, 2015 05:00 PM

Kim Moir (kmoir)

Release Engineering special issue now available

The release engineering special issue of IEEE Software was published yesterday.  This issue focuses on the current state of release engineering, from both an industry and research perspective. Lots of exciting work happening in this field!

I'm interviewed in the roundtable article on the future of release engineering, along with Chuck Rossi of Facebook and Boris Debic of Google.  Interesting discussions on the current state of release engineering at organizations that scale to large numbers of builds and tests, and release frequently.  As well, the challenges with mobile releases versus web deployments are discussed. And finally, a discussion of how to find good release engineers, and what the future may hold.

Thanks to the other guest editors on this issue - Stephany Bellomo, Tamara Marshall-Klein, Bram Adams, Foutse Khomh and Christian Bird - for all their hard work that made this happen!

As an aside, when I opened the issue, the image on the front cover made me laugh.  It's reminiscent of the cover on a mid-century science fiction anthology.  I showed Mr. Releng and he said "Robot birds? That is EXACTLY how I pictured working in releng."  Maybe it's meant to represent that we let software fly free.  In any case, I must go back to tending the flock of robotic avian overlords.

February 25, 2015 03:26 PM

February 24, 2015

Armen Zambrano G. (@armenzg)

Listing builder differences for a buildbot-configs patch improved

Up until now, we updated the buildbot-configs repository to the "default" branch instead of "production" since we normally write patches against that branch.

However, there is a problem with this: buildbot-configs must always be on the same branch as buildbotcustom. Otherwise, changes can land in one repository that require changes in the other one.

The fix was to simply make sure that both repositories are either on default or their associated production branches.

Besides this fix, I have landed two more changes:

  1. Use the production branches instead of 'default'
    • Use -p
  2. Clobber our whole set up (e.g. ~/.mozilla/releng)
    • Use -c

Here are the two changes:

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

February 24, 2015 09:45 PM

February 23, 2015

Nick Thomas (nthomas)

FileMerge bug

FileMerge is a nice diff and merge tool for OS X, and I use it a lot for larger code reviews where lots of context is helpful. It also supports intra-line diff, which comes in pretty handy.

filemerge screenshot

However in recent releases, at least in v2.8 which comes as part of XCode 6.1, it assumes you want to be merging and shows that bottom pane. Adjusting it away doesn’t persist to the next time you use it, *gnash gnash gnash*.

The solution is to open a terminal and offer this incantation:

defaults write com.apple.FileMerge MergeHeight 0

Unfortunately, if you use the merge pane then you’ll have to do that again. Dear Apple, pls fix!

February 23, 2015 09:23 AM

February 15, 2015

Rail Alliev (rail)

Funsize hacking


The idea of using a service which can generate partial updates for Firefox has been around for years. We actually used to have a server called Prometheus that was responsible for generating updates for nightly builds and the generation was done as a separate process from actual builds.

Scaling that solution wasn't easy and we switched to build-time update generation. Generating updates as a part of builds helped with load distribution, but lacked flexibility: there is no easy way to generate updates after the build, because the update generation process is directly tied to the build or repack process.

Funsize will solve the problems listed above: it aims to distribute load and to be flexible.

Last year Anhad started and Mihai continued working on this project. They have done a great job and created a solution that can easily be scaled.

Funsize is split into several pieces:

  • REST API frontend powered by Flask. It's responsible for accepting partial generation requests, forwarding them to the queue and returning generated partials.
  • Celery-based workers to generate partial updates and upload them to S3.
  • SQS or RabbitMQ to coordinate Celery workers.

One of the biggest gains of Funsize is that it uses a global cache to speed up partial generation. For example, after we build an en-US Windows build, we ask Funsize to generate a partial. Then a swarm of L10N repacks (almost a hundred of them per platform) tries to do a similar job. Every single one asks for a partial update. All L10N builds have something in common, and xul.dll is one of the biggest files. Since the files are identical there is no reason to not reuse the previously generated binary patch for that file. Repeat 100 times for multiple files. PROFIT!
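The caching idea can be sketched as follows. This is an illustration only, not Funsize's actual code: the cache is a plain dict, the key is built from SHA-256 digests of the old and new file contents, and make_patch() is a hypothetical stand-in for the real binary diff step.

```python
import hashlib

patch_cache = {}   # hypothetical in-memory stand-in for the global cache
diff_calls = []    # records how often the expensive step actually runs


def make_patch(old, new):
    """Hypothetical placeholder for an expensive binary diff."""
    diff_calls.append((len(old), len(new)))
    return b"PATCH:" + hashlib.sha256(old + new).digest()


def cached_patch(old, new):
    """Return a patch for old -> new, generating it only on a cache miss."""
    key = (hashlib.sha256(old).hexdigest(), hashlib.sha256(new).hexdigest())
    if key not in patch_cache:
        patch_cache[key] = make_patch(old, new)
    return patch_cache[key]


# The en-US build and 100 repacks all ship the same big file.
old_dll, new_dll = b"xul.dll v1", b"xul.dll v2"
patches = [cached_patch(old_dll, new_dll) for _ in range(101)]
print(len(diff_calls))
```

Keying on content rather than on file name or locale is what lets identical files from different repacks share one patch.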

The first prototype of Funsize lives at github. If you are interested in hacking, read the docs on how to set up your developer environment. If you don't have an AWS account, it will use a local cache.

Note: this prototype may be redesigned and switch to using TaskCluster. Taskcluster is going to simplify the initial design and reduce dependency on always online infrastructure.

February 15, 2015 04:32 AM

February 13, 2015

Armen Zambrano G. (@armenzg)

Mozilla CI tools 0.2.1 released - Trigger multiple jobs for a range of revisions

Today I have released a major release of mozci which includes the following:



February 13, 2015 04:14 PM

Kim Moir (kmoir)

Mozilla pushes - January 2015

Here's January 2015's monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.

We're back to regular volume after the holidays. Also, it's really cold outside in some parts of the Mozilla world.  Maybe committing code > going outside.

10798 pushes
348 pushes/day (average)
Highest number of pushes/day: 562 pushes on Jan 28, 2015
18.65 pushes/hour (highest)

General Remarks
Try had around 42% of all the pushes
The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 24% of all of the pushes

August 2014 was the month with most pushes (13,090  pushes)
August 2014 has the highest pushes/day average with 422 pushes/day
July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
October 8, 2014 had the highest number of pushes in one day with 715 pushes 

February 13, 2015 04:13 PM

February 09, 2015

Morgan Phillips (mrrrgn)

Gödel, Docker, Bach: Containers Building Containers

As Docker continues to mature, many organizations are striving to run as much of their infrastructure as possible within containers. Of course, this investment results in a lot of docker-centric tooling for deployment, development, etc...

Given that, I think it makes a lot of sense for docker containers themselves to be built within other docker containers. Otherwise, you'll introduce a needless exception into your automation practices. Boo to that!

There are a few ways to run docker from within a container, but here's a neat way that leaves you with access to your host's local images: just mount the docker from your host system.
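A sketch of what that mount can look like, written as a small Python helper that only builds the docker run command line and does not execute it. It assumes the host's Docker socket lives at /var/run/docker.sock and that the host's docker client binary can run inside the image; the image name, binary path and function name are placeholders.

```python
import shlex


def docker_in_docker_cmd(image, docker_bin="/usr/bin/docker",
                         socket="/var/run/docker.sock"):
    """Build a `docker run` command that shares the host's Docker daemon."""
    return [
        "docker", "run", "-i", "-t",
        # Share the daemon socket so the container talks to the host daemon.
        "-v", "{0}:{0}".format(socket),
        # Share the client binary (assumes it can run inside the image).
        "-v", "{0}:{0}".format(docker_bin),
        image, "/bin/bash",
    ]


cmd = docker_in_docker_cmd("ubuntu:latest")
print(" ".join(shlex.quote(part) for part in cmd))
```

Because the container talks to the host's daemon, images built inside it show up in the host's local image list.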

** note: in cases where arbitrary users can push code to your containers, this would be a dangerous thing to do **
Et voila!

February 09, 2015 10:11 PM

Introducing RelEng Containers: Build Firefox Consistently (For A Better Tomorrow)

From time to time, Firefox developers encounter errors which only appear on our build machines. Meaning -- after they've likely already failed numerous times to coax the failure from their own environment -- they must resort to asking RelEng to pluck a system from our infrastructure so they can use it for debugging: we call this a slave loan, and they happen frequently.

Case in point: bug #689291

Firefox is a huge open source project: slave loans can never scale enough to serve our community. So, this weekend I took a whack at solving this problem with Docker. So far, five [of an eventual fourteen] containers have been published, which replicate the following aspects of our in house build environments:

As usual, you can find my scratch work on GitHub: mozilla/build-environments

What Are These Environments Based On?

For a long time, builds have taken place inside of chroots built with Mock. We have three bare bones mock configs which I used to bake some base platform images:

On top of our Mock configs, we further specialize build chroots via build scripts powered by Mozharness. The specifications of each environment are laid out in these mozharness configs. To make use of these, I wrote a simple script which converts a mozharness config into a Dockerfile.

The environments I've published so far:

The next step, before I publish more containers, will be to write some documentation for developers so they can begin using them for builds with minimal hassle. Stay tuned!

February 09, 2015 06:09 AM

February 06, 2015

Hal Wine (hwine)

Kaizen the low tech way

On Jan 29, I treated myself to a seminar on Successful Lean Teams, with an emphasis on Kanban & Kaizen techniques. I’d read about both, but found the presentation useful. Many of the other attendees were from the Health Care industry and their perspectives were very enlightening!

Hearing how successful they were in such a high risk, multi-disciplinary, bureaucratic, and highly regulated environment is inspiring. I’m inclined to believe that it would also be achievable in a simple-by-comparison low risk environment of software development. ;)

What these hospitals are using is a light weight, self managed process which:

  • ensures visibility of changes to all impacted folks
  • outlines the expected benefits
  • includes a “trial” to ensure the change has the desired impact
  • has a built in feedback system

That sounds achievable. In several of the settings, the traditional paper and bulletin board approach was used, with 4 columns labeled “New Ideas”, “To Do”, “Doing”, and “Done”. (Not a true Kanban board for several reasons, but Trello would be a reasonable visual approximation; CAB uses spreadsheets.)

Cards move left to right, and could cycle back to “New Ideas” if iteration is needed. “New Ideas” is where things start, and they transition from there (I paraphrase a lot in the following):

  1. Everyone can mark up cards in New Ideas & add alternatives, etc.
  2. A standup is held to select cards to move from “New Ideas” to “To Do”
  3. The card stays in “To Do” for a while to allow concerns to be expressed by other stake holders. Also a team needs to sign up to move the change through the remaining steps. Before the card can move to “Doing”, a “test” (pilot or checkpoints) is agreed on to ensure the change can be evaluated for success.
  4. The team moves the card into “Doing”, and performs PDSA cycles (Plan, Do, Study, Adjust) as needed.
  5. Assuming the change yields the projected results, the change is implemented and the card is moved to “Done”. If the results aren’t as anticipated, the card gets annotated with the lessons learned, and either goes to “Done” (abandon) or back to “New Ideas” (try again) as appropriate.

For me, I’m drawn to the 2nd and 3rd steps. That seems to be the change from current practice in teams I work on. We already have a gazillion bugs filed (1st step). We also can test changes in staging (4th step) and update production (5th step). Well, okay, sometimes we skip the staging run. Occasionally that *really* bites us. (Foot guns, foot guns – get your foot guns here!)

The 2nd and 3rd steps help focus on changes. And make the set of changes happening “nowish” more visible. Other stakeholders then have a small set of items to comment upon. Net result - more changes “stick” with less overall friction.

Painting with a broad brush, this Kaizen approach is essentially what the CAB process is that Mozilla IT implemented successfully. I have experienced the CAB reduce the amount of stress, surprises, and self inflicted damage both inside and outside of IT. Over time, the velocity of changes has increased and backlogs have been reduced. In short, it is a “Good Thing(tm)”.

So, I’m going to see if there is a way to “right size” this process for the smaller teams I’m on now. Stay tuned....

February 06, 2015 08:00 AM

February 04, 2015

Rail Alliev (rail)

Deploying your code from github to AWS Elastic Beanstalk using Travis

I have been playing with Funsize a lot recently. One of the goals was iterating faster:

I have hit some challenges with both Travis and Elastic Beanstalk.

The first challenge was to run the integration (actually end-to-end) tests in the same environment. Funsize uses Docker for both hacking and production environments. Unfortunately it's not possible to create Docker images as a part of a Travis job (there is an option to run jobs inside Docker, but this is a different beast).

A simple bash script works around this problem. It starts all services we need in background and runs the end-to-end tests. The end-to-end test asks Funsize to generate several partial MAR files, downloads identical files from Mozilla's FTP server and compares their content skipping the cryptographic signature (Funsize does not sign MAR files).

The next challenge was deploying the code. We use Elastic Beanstalk as convenient way to run simple services. There is a plan to use something else for Funsize, but at the moment it's Elastic Beanstalk.

Travis has support for Elastic Beanstalk, but it's still experimental and at the time of writing this post there was no documentation on the official website. The .travis.yml file looks straightforward and worked fine. The only minor issue I hit was a long commit message.

# .travis.yml snippet
deploy:
    - provider: elasticbeanstalk
      app: funsize # Elastic Beanstalk app name
      env: funsize-dev-rail # Elastic Beanstalk env name
      bucket_name: elasticbeanstalk-us-east-1-314336048151 # S3 bucket used by Elastic Beanstalk
      region: us-east-1
      access_key_id:
        secure: "encrypted key id"
      secret_access_key:
        secure: "encrypted key"
      on:
          repo: rail/build-funsize # Deploy only using my user repo for now
          all_branches: true
          # deploy only if particular jobs in the job matrix passes, not any
          condition: $FUNSIZE_S3_UPLOAD_BUCKET = mozilla-releng-funsize-travis

Having the credentials in a public version control system, even if they are encrypted, makes me very nervous. To minimize possible harm in case something goes wrong I created a separate user in AWS IAM. I couldn't find any decent docs on what permissions a user should have to be able to deploy something to Elastic Beanstalk. It took a while to figure out this minimal set of permissions. Even with these permissions the user looks very powerful, with limited access to EB, S3, EC2, Auto Scaling and CloudFormation.

Conclusion: using Travis for Elastic Beanstalk deployments is quite stable and easy to use (after the initial setup) unless you are paranoid about some encrypted credentials being available on github.

February 04, 2015 02:09 AM

February 03, 2015

Armen Zambrano G. (@armenzg)

What the current list of buildbot builders is

This becomes very easy with mozilla_ci_tools (aka mozci):
>>> from mozci import mozci
>>> builders = mozci.list_builders()
>>> len(builders)
>>> builders[0]
u'Linux x86-64 mozilla-inbound leak test build'
This and many other ways to interact with our CI will be showing up in the repository.


February 03, 2015 07:48 PM

Morgan Phillips (mrrrgn)

shutdown -r never part deux

In my last post, I wrote about how runner and cleanslate were being leveraged by Mozilla RelEng to try to eliminate the need for rebooting after each test/build job -- thus reclaiming a good deal of wasted time. Since then, I've had the opportunity to outfit all of our hosts with better logging, and collect live data which highlights the progress that's been made. It's been bumpy, but the data suggests that we have reduced reboots (across all tiers) by around 40% -- freeing up over 72,000 minutes of compute time per day, with an estimated savings of $51,000 per year.

Note: this figure excludes decreases in end-to-end times, which are still waiting to be accurately measured.

Collecting Data

With Runner managing all of our utilities, an awesome opportunity for logging was presented: the ability to create something like a distributed ps. To take advantage of this, I wrote a "task hook" feature which passes task state to an external script. From there, I wrote a hook script which logs all of our data to an influxdb instance. With the influxdb hook in place, we can query to find out which jobs are currently running on hosts and what the results were of any jobs that have previously finished. We can also use it to detect rebooting.

Having this state information has been a real game changer with regards to understanding the pain points of our infrastructure, and debugging issues which arise. Here are a few of the dashboards I've been able to create:

* a started buildbot task generally indicates that a job is active on a machine *

* a global ps! *

* spikes in task retries almost always correspond to a new infra problem; seeing it here first allows us to fix it and cut down on job backlogs *

* we reboot after certain kinds of tests and anytime a job fails, thus testers reboot a lot more often *

Costs/Time Saved Calculations

To calculate "time saved" I used influxdb data to figure the time between a reboot and the start of a new round of tasks. Once I had this figure, I subtracted the number of reboots from the total number of completed buildbot tasks over a given period, then multiplied the difference by the average reboot gap period. This isn't an exact method, but it gives a ballpark idea of how much time we're saving.
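As a sketch of that arithmetic, with made-up numbers (none of these figures come from the real data, and the function name is hypothetical): the reboots avoided are the completed tasks that were not followed by a reboot, and each one saves roughly one reboot gap.

```python
def minutes_saved(completed_tasks, reboots, avg_reboot_gap_minutes):
    """Estimate time saved: avoided reboots times the average reboot gap."""
    avoided_reboots = completed_tasks - reboots
    return avoided_reboots * avg_reboot_gap_minutes


# Hypothetical one-day sample: 1000 tasks, 600 reboots, 5 minute gap.
saved = minutes_saved(completed_tasks=1000, reboots=600,
                      avg_reboot_gap_minutes=5)
print(saved)
```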

The data I'm using here was taken from a single 24 hour period (01/22/15 - 01/23/15). Spot checks have confirmed that this is representative of a typical day.

I used Mozilla's AWS billing statement from December 2014 to calculate the average cost of spot/non-spot instances per hour:

(non-spot) cost: $6802.03 time: 38614hr avg: $0.18/hr

(spot) cost: $14277.72 time: 875936hr avg: $0.02/hr

Finding opex/capex is not easy, however, I did discover the price of adding 200 additional OSX machines in 2015. Based on that, each mac's capex would be just over $2200.

To calculate the "dollars saved" I broke the time saved into AWS (spot/non-spot) and OSX then multiplied it by the appropriate dollar/hour ratio. The results being: $6621.10 per year for AWS and a bit over 20 macs worth of increased throughput, valued at just over $44,000.
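The per-hour averages and the yearly total quoted above can be re-derived from the figures in this post. The sketch below only redoes that arithmetic; the mac count of 20 and the $2200 capex are the rounded values from the text, so the total is a ballpark.

```python
# Figures from the December 2014 AWS billing statement quoted above.
non_spot_rate = 6802.03 / 38614      # dollars per hour
spot_rate = 14277.72 / 875936        # dollars per hour

# Yearly savings quoted in the post.
aws_savings = 6621.10
mac_savings = 20 * 2200              # "a bit over 20 macs" at ~$2200 each
total_savings = aws_savings + mac_savings

print(round(non_spot_rate, 2), round(spot_rate, 2), round(total_savings))
```

The total lands close to the $51,000 per year estimate mentioned at the top of the post.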

You can see all of my raw data, queries, and helper scripts at this github repo:

Why Are We Only Saving 40%?

The short answer: not rebooting still breaks most test jobs. Turning off reboots without cleanslate resulted in nearly every test failing (thanks to ports being held onto by utilities used in previous jobs, lack of free memory, etc...). However, even with processes being reset, some types of state persist between jobs in places which are proving more difficult to debug and clean. Namely, anything which interacts with a display server.

To take advantage of the jobs which are already working, I added a task which decides whether or not to reboot a system after each runner loop. The decision is based partly on some "blacklists" for job/host names which always require a reboot, and partly on whether or not the previous test/build completed successfully. For instance, if I want all linux64 systems to reboot, I just add ".*linux64.*" to the hostname blacklist; if I want all mochi tests to coerce a reboot I add ".*mochitest.*" to the job name blacklist.
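The decision logic can be sketched roughly as below. This is not the actual runner task: the function name, its arguments and the host names are assumptions, and only the two example patterns come from the text.

```python
import re

# Patterns from the examples in the text; the real lists may differ.
HOSTNAME_BLACKLIST = [r".*linux64.*"]
JOB_NAME_BLACKLIST = [r".*mochitest.*"]


def should_reboot(hostname, job_name, last_job_succeeded):
    """Decide whether to reboot after a runner loop (hypothetical helper)."""
    if not last_job_succeeded:
        return True  # always reboot after a failed job
    if any(re.match(p, hostname) for p in HOSTNAME_BLACKLIST):
        return True
    if any(re.match(p, job_name) for p in JOB_NAME_BLACKLIST):
        return True
    return False


# Host names below are invented for the example.
decisions = [
    should_reboot("tst-linux64-001", "xpcshell", True),     # host blacklisted
    should_reboot("tst-linux32-001", "mochitest-1", True),  # job blacklisted
    should_reboot("tst-linux32-001", "xpcshell", False),    # job failed
    should_reboot("tst-linux32-001", "xpcshell", True),     # clean: no reboot
]
print(decisions)
```

Removing a pattern from either list is all it takes to stop rebooting for that class of hosts or jobs.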

Via blacklisting, I've been able to whittle away at breaking jobs in a controlled manner. Over time, as I/we figure out how to properly clean up after more complicated jobs I should be able to remove them from the blacklist and increase our savings.

Why Not Use Containers?

First of all, we have to support OSX and Windows (10-XP), where modern containers are not really an option. Second, there is a lot of technical inertia behind our buildbot centric model (nearly a decade's worth to be precise). That said, a new container centric approach to building and testing has been created: task cluster. Another big part of my work will be porting some of our current builds to that system.

What About Windows

If you look closely at the runner dashboard screenshots you'll notice a "WinX" legend entry, but no line. It's also not included in my cost savings estimates. The reason for this, is that our windows puppet deployment is still in beta; while runner works on Windows, I can't tweak it. For now, I've handed runner deployment off to another team so that we can at least use it for logging. For the state of that issue see: bug 1055794

Future Plans

Of course, continuing effort will be put into removing test types from the "blacklists," to further decrease our reboot percentage. Though, I'm also exploring some easier wins which revolve around optimizing our current suite of runner tasks: using less frequent reboots to perform expensive cleanup operations in bulk (i.e. only before a reboot), decreasing end-to-end times, etc...

Concurrent to runner/no reboots I'm also working on containerizing Linux build jobs. If this work can be ported to tests it will sidestep the rebooting problem altogether -- something I will push to take advantage of asap.

Trying to reverse the entropy of a machine which runs dozens of different job types in random order is a bit frustrating; but worthwhile in the end. Every increase in throughput means more money for hiring software engineers instead of purchasing tractor trailers of Mac Minis.

February 03, 2015 05:53 PM

January 27, 2015

Justin Wood (Callek)

Release Engineering does a lot…

Hey Everyone,

I spent a few minutes a week over the last month or two working on compiling a list of Release Engineering work areas. Included in that list is identifying which repositories we “own” and work in, as well as where these repositories are mirrored. (We have copies in hg.m.o git.m.o and github, some exclusively in their home).

While we transition to a more uniform and modern design style and philosophy.

My major takeaway here is we have A LOT of things that we do. (this list is explicitly excluding repositories that are obsolete and unused)

So without further ado, I present our page ReleaseEngineering/Repositories

You’ll notice a few things about this: we have a column for Mirrors and one for RoR (Repository of Record). “Committable Location” was requested by Hal and is explicitly for cases where “Where we consider our important location the RoR, it may not necessarily be where we allow commits to”

The other interesting thing is we have automatic population of travis and coveralls urls/status icons. This is for free using some magic wiki templates I did.

The other piece of note here, is the table is generated by a list of pages, using “SemanticMediaWiki” so the links to the repositories can be populated with things like “where are the docs” “what applications use this repo”, “who are suitable reviewers” etc. (all those are TODO on the releng side so far).

I’m hoping to be putting together a blog post at some point about how I chose to do much of this with mediawiki, however in the meantime should any team at Mozilla find this enticing and wish to have one for themselves, much of the work I did here can be easily replicated for your team, even if you don’t need/like the multiple repo location magic of our table. I can help get you setup to add your own repos to the mix.

Remember, the only fields that are necessary are a repo name, the repo location, and owner(s). The last field can even be automatically filled in by a form on your page (see the end of Release Engineering’s page for an example of that form)

Reach out to me on IRC or E-mail (information is on my mozillians profile) if you desire this for your team and we can talk. If you don’t have a need for your team, you can stare at all the stuff Releng is doing and remember to thank one of us next time you see us. (or inquire about what we do, point contributors our way, we’re a friendly group, I promise.)

January 27, 2015 11:11 PM

January 22, 2015

Armen Zambrano G. (@armenzg)

Backed out - Pinning for Mozharness is enabled for the fx-team integration tree

EDIT=We had to back out this change since it caused issues for PGO talos jobs. We will try again after further testing.

Pinning for Mozharness [1] has been enabled for the fx-team integration tree.
Nothing should be changing. This is a no-op change.

We're still using the default mozharness repository and the "production" branch is what is being checked out. This has been enabled on Try and Ash for almost two months and all issues have been ironed out. You can know if a job is using pinning of Mozharness if you see "" in its log.

If you notice anything odd please let me know in bug 1110286.

If by Monday we don't see anything odd happening, I would like to enable it for mozilla-central for few days before enabling it on all trunk trees.

Again, this is a no-op change, however, I want people to be aware of it.


January 22, 2015 08:57 PM

January 21, 2015

Kim Moir (kmoir)

Reminder: Releng 2015 submissions due Friday, January 23

Just a reminder that submissions for the Releng 2015 conference are due this Friday, January 23. 

It will be held on May 19, 2015 in Florence, Italy.

If you've done recent work like
we'd love to hear from you.  Please consider submitting a talk!

In addition, if you have colleagues that work in this space that might have interesting topics to discuss at this workshop, please forward this information. I'm happy to talk to people about the submission process or possible topics if there are questions.

Il Duomo di Firenze by ©eddi_07, Creative Commons by-nc-sa 2.0

Sono nel comitato che organizza la conferenza Releng 2015 che si terrà il 19 Maggio 2015 a Firenze. La scadenza per l’invio dei paper è il 23 Gennaio 2015.

se avete competenze in:
e volete discutere della vostra esperienza, inviateci una proposta di talk!

Per favore inoltrate questa richiesta ai vostri colleghi e alle persone interessate a questi argomenti. Nel caso ci fossero domande sul processo di invio o sui temi di discussione, non esitate a contattarmi.

(Thanks Massimo for helping with the Italian translation).

More information
Releng 2015 web page
Releng 2015 CFP now open

January 21, 2015 08:36 PM

January 16, 2015

Nick Thomas (nthomas)

Plans for 2015 – Revamping the Release Automation

Mozilla’s Release Engineering team has been through several major iterations of our “release automation”, which is how we produce the bits for Firefox betas and releases. With each incarnation, the automation has become more reliable, supported more functionality, and end-to-end time has reduced. If you go back a few years to Firefox 2.0 it took several days to prepare 40 or so locales and three platforms for a release; now it’s less than half a day for 90 locales and four platforms. The last major rewrite was some time ago so it’s time to embark on a big revamp – this time we want to reduce the end-to-end time significantly.

Currently, when a code change lands in the repository (eg mozilla-beta) a large set of compile and test jobs are started. It takes about 5 hours for the slowest platform to complete an optimized build and run the tests, in part because we’re using Profile-Guided Optimization (PGO) and need to link XUL twice. Assuming the tests have passed, or been recognized as an intermittent failure, a Release Manager will kick off the release automation. It will tag the gecko and localization repositories, and a second round of compilation will start, using the official branding and other release-specific settings. Accounting for all the other release work (localized builds, source tarballs, updates, and so on) the automation takes 10 or more hours to complete.

The first goal of the revamp is to avoid the second round of compilation, with all the loss of time and test coverage it brings. Instead, we’re looking at ‘promoting’ the builds we’ve already done (in the sense of rank, not marketing). By making some other improvements along the way, eg fast generation of partial updates using funsize, we may be able to save as much as 50% from the current wall time. So we’ll be able to ship fixes to beta users more often than twice a week, get feedback earlier in the cycle, and be more confident about shipping a new release. It’ll help us to ship security fixes faster too.

We’re calling this ‘Build Promotion’ for short, and you can follow progress in Bug 1118794 and dependencies.

January 16, 2015 10:08 AM

January 10, 2015

Hal Wine (hwine)

ChatOps Meetup

This last Wednesday, I went to a meetup on ChatOps organized by SF DevOps, hosted by Geekdom (who also made recordings available), and sponsored by TrueAbility.

I had two primary goals in attending: I wanted to understand what made ChatOps special, and I wanted to see how much was applicable to my current work at Mozilla. The two presentations helped me accomplish the first. I’m still mulling over the second. (Ironically, I had to shift focus during the event to clean up a deployment-gone-wrong that was very close to one of the success stories mentioned by Dan Chuparkoff.)

My takeaway on why chatops works is that it is less about the tooling (although modern web services make it a lot easier), and more about the process. Like a number of techniques, it appears to be more successful when teams fully embrace their vision of ChatOps, and make implementation a top priority. Success is enhanced when the tooling supports the vision, and that appears to be what all the recent buzz is about – lots of new tools, examples, and lessons learned make it easier to follow the pioneers.

What are the key differentiators?

Heck, many teams use irc for operational coordination. There are scripts which automate steps (some workflows can be invoked from the web even). We’ve got automated configuration, logging, dashboards, and wikis – are we doing ChatOps?

Well, no, we aren’t.

Here are the differences I noted:
  • ChatOps requires everyone to both agree on and commit to a single interface to all operations. (The opsbot, like hubot, lita or Err.) Technical debt (non-conforming legacy systems) will be reworked to fit into ChatOps.
  • ChatOps requires focus and discipline. There are a small number of channels (chat rooms, MUC) that have very specific uses - and folks follow that. High signal to noise ratio. (No animated gifs in the deploy channel - that’s what the lolcat channel is for.)
  • A commitment to explicitly documenting all business rules as executable code.

What do you get for giving up all those options and flexibility? Here were the “ah ha!” concepts for me:

  1. Each ChatOps room is a “shared console” everyone can see and operate. No more screen sharing over video, or “refresh now” coordination!

  2. There is a bot which provides the “facts” about the world. One view accessible by all.

  3. The bot is also the primary way folks interact and modify the system. And it is consistent in usage across all commands. (The bot extensions perform the mapping to whatever the backend needs. The code adapts, not the human!)

  4. The bot knows all and does all:
    • Where’s the documentation?
    • How do I do X?
    • Do X!
    • What is the status of system Y?
  5. The bot is “fail safe” - you can’t bypass the rules. (If you code in a bypass, well, you loaded that foot gun!)

Thus everything is consistent and familiar for users, which helps during those 03:00 forays into a system you aren’t as familiar with. Nirvana ensues (remember, everyone did agree to drink the koolaid above).

Can you get there from here?

The speaker selection was great – Dan was able to speak to the benefits of committing to ChatOps early in a startup’s life. James Fryman (from StackStorm) showed a path for migrating existing operations to a ChatOps model. That pretty much brackets the range, so yeah, it’s doable.

The main hurdle, imo, would be getting the agreement to a total commitment! There are some tensions in deploying such a system at a highly open operation like Mozilla: ideally chat ops is open to everyone, and business rules ensure you can’t do or see anything improper. That means the bot has (somewhere) the credentials to do some very powerful operations. (Dan hopes to get their company to the “no one uses ssh, ever” point.)

My next steps? Still thinking about it a bit – I may load Err onto my laptop and try doing all my local automation via that.

January 10, 2015 08:00 AM

January 09, 2015

Chris AtLee (catlee)

Upcoming hotness from RelEng

To kick off the new year, I'd like to share some of the exciting projects we have underway in Release Engineering.


First off we have Balrog, our next generation update server. Work on Balrog has been underway for quite some time. Last fall we switched beta users to use it. Shortly after, we did some additional load testing to see if we were ready to flip over release traffic. The load testing revealed some areas that needed optimization, which isn't surprising since almost no optimization work had been done up to that point!

Ben and Nick added the required caching, and our subsequent load testing was a huge success. We're planning on flipping the switch to divert release users over on January 19th. \o/


Next up we have Funsize. (Don't ask about the name; it's all Laura's fault). Funsize is a web service to generate partial updates between two versions of Firefox. There are a number of places where we want to generate these partial updates, so wrapping the logic up into a service makes a lot of sense, and also affords the possibility of faster generation due to caching.

We're aiming to have nightly builds use funsize for partial update generation this quarter.

I'd really like to see us get away from the model where the "nightly build" job is responsible for not only the builds, but generating and publishing the complete and partial updates. The problem with this is that the single job is responsible for too many deliverables, and touches too many systems. It's hard to make and test changes in isolation.

The model we're trying to move to is where the build jobs are responsible only for generating the required binaries. It should be the responsibility of a separate system to generate partials and publish updates to users. I believe splitting up these functions into their own systems will allow us to be more flexible in how we work on changes to each piece independently.

S3 uploads from automation

This quarter we're also working on migrating build and test files off our aging file server infrastructure (aka "FTP", which is a bit of a misnomer...) and onto S3. All of our build and test binaries are currently uploaded and downloaded via a central file server in our data center. It doesn't make sense to do this when most of our builds and tests are being generated and consumed inside AWS now. In addition, we can get much better cost-per-GB by moving the storage to S3.

No reboots

Morgan has been doing awesome work with runner. One of the primary aims here is to stop rebooting build and test machines between every job. We're hoping that by not rebooting between builds, we can get a small speedup in build times since a lot of the build tree should be cached in memory already. Also, by not rebooting we can have shorter turnaround times between jobs on a single machine; we can effectively save 3-4 minutes of overhead per job by not rebooting. There's also the opportunity to move lots of machine maintenance work from inside the build/test jobs themselves to instead run before buildbot starts.

Release build promotion

Finally I'd like to share some ideas we have about how to radically change how we do release builds of Firefox.

Our plan is to create a new release pipeline that works with already built binaries and "promotes" them to the release/beta channel. The release pipeline we have today creates a fresh new set of release builds that are distinct from the builds created as part of continuous integration.

This new approach should cut the amount of time required to release nearly in half, since we only need to do one set of builds instead of two. It also has the benefit of aligning the release and continuous-integration pipelines, which should simplify a lot of our code.

... and much more!

This is certainly not an exhaustive list of the things we have planned for this year. Expect to hear more from us over the coming weeks!

January 09, 2015 06:35 PM

Ben Hearsum (bhearsum)

UPDATED: New update server is going live for release channel users on Tuesday, January **20th**

(This post has been updated with the new go-live date.)

Our new update server software (codenamed Balrog) has been in development for quite a while now. In October of 2013 we moved Nightly and Aurora to it. This past September we moved Beta users to it. Finally, we’re ready to switch the vast majority of our users over. We’ll be doing that on the morning of Tuesday, January 20th. Just like when we switched nightly/aurora/beta over, this change should be invisible, but please file a bug or swing by #releng if you notice any issues with updates.

Stick around if you’re interested in some of the load testing we did.

Shortly after switching all of the Beta users to Balrog we did a load test to see if Balrog could handle the amount of traffic that the release channel would throw at it. With just 10% of the release traffic being handled, it blew up:

We were pulling more than 150MBit/sec per web head from the database server, and saturating the CPUs completely. This caused very slow requests, to the point where many were just timing out. While we were hoping that it would just work, this wasn’t a complete surprise given that we hadn’t implemented any form of caching yet. After implementing a simple LRU cache on Balrog’s largest objects, we did another load test. Here’s what the load looked like on one web head:

Once caching was enabled the load was practically non-existent. As we ramped up release channel traffic the load grew, but in a more or less linear (and very gradual) fashion. At around 11:35 on this graph we were serving all of the release channel traffic, and each web head was using a meager 50% of its CPU:

I’m not sure what to call that other than winning.
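For readers who haven't met the technique, here is a minimal sketch of an LRU (least recently used) cache in Python. It is not Balrog's actual code: the class name, the `loader` callback and the cache size are all assumptions made for illustration. The idea is simply that large objects are kept in memory after the first database read, and the entry that has gone unused the longest is dropped when the cache is full.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache (illustrative only, not Balrog's code)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._items = OrderedDict()

    def get(self, key, loader):
        """Return the cached value, calling loader(key) only on a miss."""
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        value = loader(key)  # e.g. an expensive database read
        self._items[key] = value
        if len(self._items) > self.max_size:
            self._items.popitem(last=False)  # evict the least recently used
        return value


# Hypothetical usage: the lambda stands in for a slow database query.
cache = LRUCache(max_size=2)
blob = cache.get("release-blob", lambda key: "contents of " + key)
print(blob)
```

Repeated requests for the same key are then served from memory, which is the kind of saving that takes pressure off a database server.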

January 09, 2015 04:39 PM

January 08, 2015

Kim Moir (kmoir)

Mozilla pushes - December 2014

Here's December 2014's monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.

There was a low number of pushes this month.  I expect this is due to the Mozilla all-hands in Portland in early December where we were encouraged to meet up with other teams instead of coding :-) and the holidays at the end of the month for many countries.
As a side note, in 2014 we had a total of 124423 pushes, compared to 79233 in 2013, which represents a growth rate of 57% this year.

7836 pushes
253 pushes/day (average)
Highest number of pushes/day: 706 pushes on Dec 17, 2014
15.25 pushes/hour (highest)
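
For anyone who wants to check the arithmetic, here is a small sketch that recomputes the yearly growth rate and the daily average from the totals quoted in this post. The only assumption is that the daily average is taken over the 31 calendar days of December.

```python
# Totals quoted in this post.
pushes_2013 = 79233
pushes_2014 = 124423

# Year-over-year growth as a fraction of the 2013 total.
growth = (pushes_2014 - pushes_2013) / pushes_2013
print(f"Year-over-year growth: {growth:.0%}")

# Daily average for December, assuming all 31 calendar days are counted.
december_pushes = 7836
days_in_december = 31
daily_average = december_pushes / days_in_december
print(f"Average pushes/day: {daily_average:.0f}")
```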

General Remarks
Try had around 46% of all the pushes
The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 23% of all of the pushes

August 2014 was the month with most pushes (13,090  pushes)
August 2014 has the highest pushes/day average with 422 pushes/day
July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
October 8, 2014 had the highest number of pushes in one day with 715 pushes 

January 08, 2015 05:14 PM

January 06, 2015

Armen Zambrano G. (@armenzg)

Tooltool fetching can now use LDAP credentials from a file

You can now fetch tooltool files by using an authentication file.
All you have to do is append "--authentication-file file" to your tooltool fetching command.

This is important if you want to use automation to fetch files from tooltool on your behalf.
This was needed to allow Android test jobs to run locally since we need to download tooltool files for it.

January 06, 2015 04:45 PM

January 05, 2015

Armen Zambrano G. (@armenzg)

Run Android test jobs locally

You can now run Android test jobs on your local machine with Mozharness.

As with any other developer capable Mozharness script, all you have to do is:

An example for this is:
python scripts/ --cfg android/
--test-suite mochitest-gl-1 --blob-upload-branch try
--download-symbols ondemand --cfg

Here's the bug where the work happened.
Here's the documentation on how to run Mozharness as a developer.

Please file a bug under Mozharness if you find any issues.

Here are some other related blog posts:


Bug 1117954 - I think that a different SDK or emulator version is needed to run Android API 10 jobs.

I wish we ran all of our jobs in proper isolation!

January 05, 2015 08:47 PM

December 22, 2014

Armen Zambrano G. (@armenzg)

Run mozharness talos as a developer (Community contribution)

Thanks to our contributor Simarpreet Singh from Waterloo you can now run a talos job through mozharness on your local machine (bug 1078619).

All you have to add is the following:

To read more about running Mozharness locally go here.

December 22, 2014 08:10 PM

December 11, 2014

Kim Moir (kmoir)

Releng 2015 CFP now open

Florence, Italy.  Home of beautiful architecture.

Il Duomo di Firenze by ©runner310, Creative Commons by-nc-sa 2.0

Delicious food and drink.

Panzanella by © Pete Carpenter, Creative Commons by-nc-sa 2.0

Caffè ristretto by © Marcelo César Augusto Romeo, Creative Commons by-nc-sa 2.0

And next May, release engineering :-)

The CFP for Releng 2015 is now open.  The deadline for submissions is January 23, 2015.  It will be held on May 19, 2015 in Florence Italy and co-located with ICSE 2015.   We look forward to seeing your proposals about the exciting work you're doing in release engineering!

If you have questions about the submission process or anything else, please contact any of the program committee members. My email is kmoir and I work at

December 11, 2014 09:00 PM

December 09, 2014

Armen Zambrano G. (@armenzg)

Running Mozharness in developer mode will only prompt once for credentials

Thanks to Mozilla's contributor kartikgupta0909 we now only have to enter LDAP credentials once when running the developer mode of Mozharness.

He accomplished it in bug 1076172.

Thank you Kartik!

December 09, 2014 09:43 PM

December 08, 2014

Armen Zambrano G. (@armenzg)

Test mozharness changes on Try

You can now push to your own mozharness repository (even a specific branch) and have it be tested on Try.

A few weeks ago we developed mozharness pinning (aka mozharness.json) and recently we have enabled it for Try. Read the blog post to learn how to make use of it.

NOTE: This currently only works for desktop, mobile and b2g test jobs. More to come.
NOTE: We only support named branches, tags or specific revisions. Do not use bookmarks, as they don't work.

December 08, 2014 06:59 PM

December 04, 2014

Morgan Phillips (mrrrgn)

shutdown -r never

For the past month I've worked on achieving the effects of a reboot without actually doing one. Sort of a "virtual" reboot. This isn't a usual optimization; but in Mozilla's case it's likely to create a huge impact on performance.

Mozilla build/test infrastructure is complex. The jobs can be expensive and messy. So messy that, for a while now, machines have been rebooted after completing tasks to ensure that environments remain fresh.

This strategy works marvelously at preventing unnecessary failures, but it wastes a lot of resources. In particular, with reboots taking something like two minutes to complete, and at around 100k jobs per day, we spend a whopping 200,000 minutes of machine time on reboots every day. That's nearly five months - yikes! [1]
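
A quick back-of-the-envelope check of those numbers, using the two-minute reboot and 100k jobs per day quoted above and assuming 30-day months:

```python
jobs_per_day = 100_000
reboot_minutes = 2  # rough per-reboot cost quoted above

# Machine time spent rebooting across the whole pool in one day.
wasted_minutes = jobs_per_day * reboot_minutes
wasted_days = wasted_minutes / (60 * 24)
wasted_months = wasted_days / 30  # assumption: 30-day months

print(f"{wasted_minutes} minutes = {wasted_days:.0f} days "
      f"= {wasted_months:.1f} months of machine time per day")
```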

Yesterday I began rolling out these "virtual" reboots for all of our Linux hosts, and it seems to be working well [edit: after a few rollbacks]. By next month I should also have it turned on for OSX and Windows machines.

What does a "virtual" reboot look like?

For starters [pun intended], each job requires a good amount of setup and teardown, so, a sort of init system is necessary. To achieve this a utility called runner has been created. Runner is a project that manages starting tasks in a defined order. If tasks fail, the chain can be retried, or halted. Many tasks that once lived in /etc/init.d/ are now managed by runner including buildbot itself.
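
To make the idea concrete, here is a minimal sketch of an ordered task chain with retries. It is not runner's actual code: the function name, the retry policy and the task names are assumptions for illustration only.

```python
def run_chain(tasks, max_retries=2):
    """Run callables in order; on failure retry the whole chain, then halt.

    Returns True if one full pass succeeded, False once retries run out.
    """
    for attempt in range(1 + max_retries):
        try:
            for task in tasks:
                task()
            return True
        except Exception as exc:
            print(f"attempt {attempt + 1} failed: {exc}")
    return False


# Hypothetical tasks standing in for cleanup scripts and starting buildbot.
def clean_tmp():
    print("cleaning temporary files")


def start_buildbot():
    print("starting buildbot")


run_chain([clean_tmp, start_buildbot])
```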

Among runner's tasks are various scripts for cleaning up temporary files, starting/restarting services, and also a utility called cleanslate. Cleanslate resets a user's running processes to a previously recorded state.

At boot, cleanslate takes a snapshot of all running processes, then, before each job it kills any processes (by name) which weren't running when the system was fresh. This particular utility is key to maintaining stability and may be extended in the future to enforce other kinds of system state as well.
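
The core of that idea can be sketched in a few lines. This is not cleanslate's real implementation: the function name and the process table below are made up, and the sketch only decides which processes to kill rather than killing anything.

```python
def processes_to_kill(snapshot_names, current_processes):
    """Return PIDs whose process name was not in the boot-time snapshot.

    snapshot_names: set of process names recorded when the system was fresh.
    current_processes: dict mapping PID -> process name right now.
    """
    return sorted(pid for pid, name in current_processes.items()
                  if name not in snapshot_names)


# Hypothetical data: names recorded at boot vs. what is running after a job.
snapshot = {"init", "sshd", "runner"}
running = {1: "init", 40: "sshd", 77: "runner", 812: "firefox", 900: "xpcshell"}

# A real tool would now signal these PIDs, e.g. with os.kill().
print(processes_to_kill(snapshot, running))
```

Matching by name rather than PID is what lets the snapshot stay valid across service restarts, at the cost of sparing a leftover process that happens to share a name with a boot-time one.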

The end result is this:

old work flow

Boot + init -> Take Job -> Reboot (2-5 min)

new work flow

Boot + Runner -> Take Job -> Shutdown Buildslave
(runner loops and restarts slave)

[1] What's more, this estimate does not take into account the fact that jobs run faster on a machine that's already "warmed up."

December 04, 2014 06:54 PM

December 03, 2014

Kim Moir (kmoir)

Mozilla pushes - November 2014

Here's November's monthly analysis of the pushes to our Mozilla development trees.  You can load the data as an HTML page or as a json file.

Not a record breaking month, in fact we are down over 2000 pushes since the last month.

10376 pushes
346 pushes/day (average)
Highest number of pushes/day: 539 pushes on November 12
17.7 pushes/hour (average)

General Remarks
Try had around 38% of all the pushes, and gaia-try has about 30%. The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 23% of all the pushes.

August 2014 was the month with most pushes (13,090  pushes)
August 2014 has the highest pushes/day average with 422 pushes/day
July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
October 8, 2014 had the highest number of pushes in one day with 715 pushes    

December 03, 2014 09:41 PM

November 24, 2014

Armen Zambrano G. (@armenzg)

Pinning mozharness from in-tree (aka mozharness.json)

Since mozharness came around 2-3 years ago, we have had the same issue where we test a mozharness change against the trunk trees, land it and get it backed out because we regress one of the older release branches.

This is due to the nature of the mozharness setup where once a change is landed all jobs start running the same code and it does not matter on which branch that job is running.

I have recently landed some code that is now active on Ash (and soon on Try) that will read a manifest file that points your jobs to the right mozharness repository and revision. We call this process "pinning mozharness". In other words, what we do is fix an external factor of our job execution.

This will allow you to point your Try pushes to your own mozharness repository.

In order to pin your jobs to a repository/revision of mozharness you have to change a file called mozharness.json which indicates the following two values:

This is a similar concept to the one talos.json introduced, which locks every job to a specific revision of talos. The original version of it landed in 2011.

Even though we have had a similar concept since 2011, that doesn't mean that it was as easy to make it happen for mozharness. Let me explain a bit why:

Coming up:
  • Enable on Try
  • Free up Ash and Cypress
    • They have been used to test custom mozharness patches and the default branch of Mozharness (pre-production)
Long term:
  • Enable the feature on all remaining Gecko trees
    • We would like to see this run at scale for a bit before rolling it out
    • This will allow mozharness changes to ride the trains
If you are curious, the patches are in bug 791924.

Thanks to Rail for all his patch reviews and Jordan for sparking me to tackle it.

November 24, 2014 05:35 PM

November 12, 2014

Kim Moir (kmoir)

Scaling capacity while saving cash

There was a very interesting release engineering summit this Monday held in concert with LISA in Seattle.  I was supposed to fly there this past weekend so I could give a talk on Monday, but late last week I became ill and was unable to go. That was very disappointing because the summit looked really great and I was looking forward to meeting the other release engineers and learning about the challenges they face.

Scale in the Market  ©Clint Mickel, Creative Commons by-nc-sa 2.0

Although I didn't have the opportunity to give the talk in person, the slides for it are available on slideshare and my mozilla people account. The talk describes how we scaled our continuous integration infrastructure on AWS to handle double the amount of pushes it handled in early 2013, all while reducing our AWS monthly bill by 2/3.

Cost per push from October 2012 until October 2014: our monthly AWS bill divided by the number of monthly pushes (commits). This does not include costs for on-premises equipment.

Thank you to Dinah McNutt and the other program committee members for organizing this summit.  I look forward to watching the talks once they are online.

November 12, 2014 07:34 PM

Mozilla pushes - October 2014

Here's the October 2014 monthly analysis of the pushes to our Mozilla development trees.  You can load the data as an HTML page or as a json file.

We didn't have a record breaking month in terms of the number of pushes, however we did have a daily record on October 8 with 715 pushes.

12821 pushes, up slightly from the previous month
414 pushes/day (average)
Highest number of pushes/day: 715 pushes on October 8
22.5 pushes/hour (average)

General Remarks
Try had around 39% of all the pushes, and gaia-try has about 31%. The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 21% of all the pushes

August 2014 was the month with most pushes (13,090  pushes)
August 2014 has the highest pushes/day average with 422 pushes/day
July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
October 8, 2014 had the highest number of pushes in one day with 715 pushes

November 12, 2014 03:45 PM

Morgan Phillips (mrrrgn)

AirMozilla: Distrusting Our Own [Build] Infrastructure

If you missed last week's AirMozilla broadcast: Why and How of Reproducible Builds: Distrusting Our Own Infrastructure For Safer Software Releases by the Tor Project, consider checking it out.

The talk is an in depth look at how one can protect release pipelines from being owned by attacks which target build systems. Particularly, attacks where compromised compilers may be used to create unsafe binaries from safe source code.

Meanwhile RelEng is underway, putting these ideas into practice.

November 12, 2014 08:54 AM

A Simple Trusting Trust Attack


November 12, 2014 07:54 AM

November 10, 2014

Morgan Phillips (mrrrgn)

A Note on Deterministic Builds

Since I joined Mozilla's Release Engineering team I've had the opportunity to put my face into a firehose of interesting new knowledge and challenges. Maintaining a release pipeline for binary installers and updates used by a substantial portion of the Earth's population is a whole other kind of beast from ops roles where I've focused on serving some kind of SaaS or internal analytics infrastructure. It's really exciting!

One of the most interesting problems I've seen getting attention lately is deterministic builds, that is, builds that produce the same sequence of bytes from source on a given platform at any time.

What good are deterministic builds?

For starters, they aid in detecting "Trusting Trust" attacks. That's where a compromised compiler produces malicious binaries from perfectly harmless source code by replacing certain patterns during compilation. It sort of defeats the whole security advantage of open source when you download binaries, right?

Luckily for us users, a fellow named David A. Wheeler rigorously proved a method for circumventing this class of attacks altogether via a technique he coined "Diverse Double-Compiling" (DDC). The gist of it is, you compile a project's source code with a trusted tool chain then compare a hash of the result with some potentially malicious binary. If the hashes match you're safe.

DDC also detects the less clever scenario where an adversary patches, otherwise open, source code during the build process and serves up malwareified packages. In either case, it's easy to see that this works if and only if builds are deterministic.
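
The comparison step can be sketched in a few lines of Python. This only illustrates the "hash both builds and compare" idea, not Wheeler's full procedure; the function names and the choice of SHA-256 are my own assumptions.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large binaries need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def builds_match(trusted_build, candidate_build):
    """True when the binary built with the trusted tool chain is
    byte-for-byte identical to the binary being checked."""
    return sha256_of(trusted_build) == sha256_of(candidate_build)
```

A call such as `builds_match("my-build/firefox", "downloaded/firefox")` (both paths hypothetical) only tells you something useful if the build is deterministic. Otherwise a mismatch is expected even when nothing is wrong.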

Aside from security, they can also help projects that support many platforms take advantage of cross building with less stress. That is, one could compile arm packages on an x86_64 host then compare the results to a native build and make sure everything matches up. This can be a huge win for folks who want to cut back on infrastructure overhead.

How can I make a project more deterministic?

One bit of good news is, most compilers are already pretty deterministic (on a given platform). Take hello.c for example:

#include <stdio.h>
int main() {
    printf("Hello World!");
}

Compile that a million times and take the md5sum. Chances are you'll end up with a million identical md5sums. Scale that up to a million lines of code, and there's no reason why this won't hold true.

However, take a look at this doozy:

#include <stdio.h>
int main() {
    printf("Hello from %s! @ %s", __FILE__, __TIME__);
}

Having timestamps and other platform specific metadata baked into source code is a huge no-no for creating deterministic builds. Compile that a million times, and you'll likely get a million different md5sums.

In fact, in an attempt to make Linux more deterministic all __TIME__ macros were removed and the makefile specifies a compiler option (-Werror=date-time) that turns any use of it into an error.

Unfortunately, removing all traces of such metadata in a mature code base could be all but impossible. However, a fantastic tool called gitian will allow you to compile projects within a virtual environment where timestamps and other metadata are controlled.

Definitely check gitian out and consider using it as a starting point.

Another trouble spot to consider is static linking. Here, unless you're careful, determinism sits at the mercy of third parties. Be sure that your build system has access to identical libraries from anywhere it may be used. Containers and pre-baked vms seem like a good choice for fixing this issue, but remember that you could also be passing around a tainted compiler!

Scripts that automate parts of the build process are also a potent breeding ground for non-deterministic behaviors. Take this python snippet for example:

import os
with open('manifest', 'w') as manifest:
    for dirpath, dirnames, filenames in os.walk("."):
        for filename in filenames:
            manifest.write(os.path.join(dirpath, filename) + "\n")  # assumed body: record each file

The problem here is that os.walk will not always yield filenames in the same order. :(
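
One way to tame this, sketched under the assumption that the goal is a manifest listing one relative path per line: sort both the directory names and the file names before using them. The function name and arguments here are made up for the example.

```python
import os


def write_manifest(root, manifest_path):
    """Write one relative file path per line, in a stable sorted order."""
    with open(manifest_path, "w") as manifest:
        for dirpath, dirnames, filenames in os.walk(root):
            # Sorting in place controls the order in which os.walk descends.
            dirnames.sort()
            for filename in sorted(filenames):
                rel = os.path.relpath(os.path.join(dirpath, filename), root)
                manifest.write(rel + "\n")
```

The in-place `dirnames.sort()` matters: with the default top-down walk, os.walk visits subdirectories in whatever order that list ends up in. Keep the manifest outside the tree being walked (or skip it explicitly) so it does not list itself.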

One also has to keep in mind that certain data structures become very dangerous in such scripts. Consider this pseudo-python that auto generates some sort of source code in a compiled language:

weird_mapping = dict(file_a=99, file_b=1)
things_in_a_set = set(["thing_a", "thing_b", "thing_c"])
for k, v in weird_mapping.items():
    pass  # ... generate some code ...
for thing in things_in_a_set:
    pass  # ... generate some code ...

A pattern like this would dash any hope that your project had of being deterministic because it makes use of unordered data structures.

Beware of unordered data structures in build scripts and/or sort all the things before writing to files.
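
Here is a small sketch of the "sort all the things" approach. The generated output is invented purely for illustration. The point is only that iterating over `sorted(...)` gives the same order on every run, regardless of how the dict or set happens to be laid out in memory.

```python
weird_mapping = dict(file_b=1, file_a=99)
things_in_a_set = {"thing_c", "thing_a", "thing_b"}

lines = []
# sorted() fixes the iteration order, so the output is reproducible.
for k, v in sorted(weird_mapping.items()):
    lines.append(f"#define {k.upper()} {v}")
for thing in sorted(things_in_a_set):
    lines.append(f"extern int {thing};")

generated = "\n".join(lines)
print(generated)
```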

Enforcing determinism from the beginning of a project's life cycle is the ideal situation, so, I would highly recommend incorporating it into CI flows. When a developer submits a patch it should include a hash of their latest build. If the CI system builds and the hashes don't match, reject that non-deterministic code! :)


Of course, this hardly scratches the surface on why deterministic builds are important; but I hope this is enough for a person to get started on. It's a very interesting topic with lots of fun challenges that need solving. :) If you'd like to do some further reading, I've listed a few useful sources below.

November 10, 2014 07:54 PM

Justin Wood (Callek)

Firefox Launches Developer Edition (Minor Papercut Issues)

So, as you may have heard, Firefox is launching a dev edition.

This post does not attempt to elaborate on that specifically too much, but it’s more to identify some issues I hit in early testing and the solutions to them.


While I do admire the changes of the Developer Edition Theme, I’m a guy who likes to stick with “what I know” more than a drastic change like that. What I didn’t realize was that this is possible out of the box in developer edition.

After the Tour you get, you’ll want to open the Customize panel and then deselect “Use Firefox Developer Edition Theme” (see the following image — arrow added) and that will get you back to what you know.



As a longtime user, I had “Old Firefox Sync” enabled; this was the one that very few users enabled and even fewer used across devices.

Firefox Developer Edition, however, creates a new profile (so you can use it alongside whatever Firefox version you want) and supports setting up only the “New” sync features. Due to creating a new profile, it also leaves you without history or saved passwords.

To sync my old profile with developer edition, I had to:

  1. Unlink my Desktop Firefox from old sync
  2. Unlink my Android Firefox from old sync
  3. Create a new sync account
  4. Link my old Firefox profile with new sync
  5. Link my Android with new sync
  6. Link Dev Edition with new sync
  7. Profit

Now other than steps 6 and 7 (yea, how DO I profit?) this is all covered quite well in a SuMo article on the subject. I will happily help guide people through this process, especially in the near future, as I’ve just gone through it!

(Special Thanks to Erik for helping to copy-edit this post)

November 10, 2014 04:30 PM

I’m a wordpress newbie

If this is on, and so is a “content is password protected” post below it, I’m sorry.

The post is merely that way because it’s unfinished, but I wanted to share it with a few others for early feedback.

I’ll delete this post, and unhide that one once things are ready. (Sorry for any confusion)

November 10, 2014 05:18 AM

November 06, 2014

Armen Zambrano G. (@armenzg)

Setting buildbot up a-la-releng (Create your own local masters and slaves)

buildbot is what Mozilla's Release Engineering uses to run the infrastructure behind
buildbot assigns jobs to machines (aka slaves) through hosts called buildbot masters.

All the different repositories and packages needed to set up buildbot are installed through Puppet, and I'm not aware of a way of setting up my local machine through Puppet (I doubt I would want to do that!).
I managed to set this up a while ago by hand [1][2] (it was even more complicated in the past!), however, these one-off attempts were not easy to keep up-to-date and isolated.

I recently landed a few scripts that make it trivial to set up as many buildbot environments as you want, all isolated from each other.

All the scripts have been landed under the "community" directory under the "braindump" repository:

The main two scripts:

If you call with -w /path/to/your/own/workdir you will have everything set up for you. From there on, all you would have to do is this:
  • cd /path/to/your/own/workdir
  • source venv/bin/activate
  • buildbot start masters/test_master (for example)
  • buildslave start slaves/test_slave
Each paired master and slave has been set up to talk to each other.

I hope this is helpful for people out there. It's been great for me when I contribute patches for buildbot (bug 791924).

As always in Mozilla, contributions are always welcome!

PS 1 = Only tested on Ubuntu. If you want to port this to other platforms, please let me know and I can give you a hand.

PS 2 = I know that there is a repository with docker images called "tupperware"; however, I had been working on this set of scripts for a while. Perhaps someone wants to figure out how to set up a similar process through the docker images.

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

November 06, 2014 02:02 PM

November 05, 2014

Massimo Gervasini (mgerva)

Sign the hash of the bundle, not the full bundle!

With Bug 1083683, we are stopping direct processing of .bundle and .source files by our signing servers. This means that in the near future we will not have new *.bundle.asc and *.source.tar.bz2.asc files on the ftp server.
Bundles and source files have grown quite a bit, and getting them signed sometimes ends in retries and failed jobs, disrupting and delaying the release process. There’s also no benefit in having them signed directly; the source-package job already calculates the hashes of the bundle/source files, and their MD5/SHA1/SHA512 hashes get included in the .checksum file, which is signed with the release automation key.

November 05, 2014 04:57 PM

October 31, 2014

Chris Cooper (coop)

10.8 testing disabled by default on Try

If you’ve recently submitted patches to the Mozilla Try server, you may have been dismayed by the turnaround time for your test results. Indeed, last week we had reports from some developers that they were waiting more than 24 hours to get results for a single Try push in the face of backlogs caused by tree closures.

The chief culprit here was Mountain Lion, or OS X 10.8, which is our smallest pool (99) of test machines. It was not uncommon for there to be over 2,000 pending test jobs for Mountain Lion at any given time last week. Once we reach a pending count that high, we cannot make headway until the weekend when check-in volume drops substantially.

In the face of these delays, developers started landing some patches on mozilla-inbound before the corresponding jobs had finished on Try, and worse still, not killing the obsolete pending jobs on Try. That’s just bad hygiene and practice. Sheriffs had to actively look for the duplicate jobs and kill them to help decrease load.

We cannot easily increase the size of the Mountain Lion pool. Apple does not allow you to install older OS X versions on new hardware, so our pool size here is capped at the number of machines we bought when 10.8 was released over 2 years ago or what we can scrounge from resellers.

To improve the situation, we made the decision this week to disable 10.8 testing by default on Try. Developers must now select 10.8 explicitly from the “Restrict tests to platform(s)” list on TryChooser if they want to run Mountain Lion tests. If you have an existing Mac Try build that you’d like to back-fill with 10.8 results, please ping the sheriff on duty (sheriffduty) in #developers or #releng and they can help you out *without* incurring another full Try run.

Please note that we do plan to stand up Yosemite (10.10) testing as a replacement for Mountain Lion early in 2015. This is a stop-gap measure until we’re able to do so.

October 31, 2014 08:25 PM

October 27, 2014

Kim Moir (kmoir)

Mozilla pushes - September 2014

Here's September 2014's monthly analysis of the pushes to our Mozilla development trees.
You can load the data as an HTML page or as a json file.

Surprise!  No records were broken this month.

12267 pushes
409 pushes/day (average)
Highest number of pushes/day: 646 pushes on September 10, 2014
22.6 pushes/hour (average)

General Remarks
Try has around 36% of pushes and Gaia-Try comprises about 32%.  The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 22% of all the pushes.

August 2014 was the month with most pushes (13,090  pushes)
August 2014 has the highest pushes/day average with 620 pushes/day
July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
August 20, 2014 had the highest number of pushes in one day with 690 pushes

October 27, 2014 09:11 PM

Release Engineering in the classroom

The second week of October, I had the pleasure of presenting lectures on release engineering to university students in Montreal as part of the PLOW lectures at École Polytechnique de Montréal.    Most of the students were MSc or PhD students in computer science, with a handful of postdocs and professors in the class as well. The students came from Montreal area universities and many were international students. The PLOW lectures consisted of several invited speakers from various universities and industry spread over three days.

View looking down from the university

Université de Montréal administration building

École Polytechnique building.  Each floor is painted a different colour to represent a different layer of the earth.  So the ground floor is red, the next orange and finally green.

The first day, Jack Jiang from York University gave a talk about software performance engineering.
The second day, I gave a lecture on release engineering in the morning.  The rest of the day we did a lot of labs to configure a Jenkins server to build and run tests on an open source project. Earlier that morning, I had setup m3.large instances for the students on Amazon that they could ssh into and conduct their labs.  Along the way, I talked about some release engineering concepts.  It was really interesting and I learned a lot from their feedback.  Many of the students had not been exposed to release engineering concepts so it was fun to share the information.

Several students came up to me during the breaks and said "So, I'm doing my PhD in release engineering, and I have several questions for you" which was fun.  Also, some of the students were making extensive use of code bases for Mozilla or other open source projects so that was interesting to learn more about.  For instance, one research project was looking at the evolution of multi-threading in a Mozilla code base, and another student was conducting bugzilla comment sentiment analysis.  Are angry bug comments correlated with fewer bug fixes?  Looking forward to the results of this research!

I ended the day by providing two challenge exercises to the students that they could submit answers to.  One exercise was to set up a build pipeline in Jenkins for another open source project.  The other challenge was to use the Jenkins REST API to query the Apache project's Jenkins server and present some statistics on their build history.  The results were pretty impressive!

My slides are on GitHub and the readme file describes how I setup the Amazon instances so Jenkins and some other required packages were installed before hand.  Please use them and distribute them if you are interested in teaching release engineering in your classroom.

Lessons I learned from this experience:
The third day there was a lecture by Michel Dagenais of Polytechnique Montréal on tracing heterogeneous cloud instances using (tracing framework for Linux).  The Eclipse trace compass project also made an appearance in the talk. I always like to see Eclipse projects highlighted.  One of his interesting points was that none of the companies that collaborate on this project wanted to sign a bunch of IP agreements so they could work on it behind closed doors.  They all wanted to collaborate via an open source community and source code repository.  Another thing he emphasized was that students should make their work available on the web, via GitHub or other repositories, so they have a portfolio of work available.  It was fantastic to see him promote the idea of students being involved in open source as a way to help their job prospects when they graduate!

Thank you Foutse and  Bram  for the opportunity to lecture at your university!  It was a great experience!  Also, thanks Mozilla for the opportunity to do this sort of outreach to our larger community on company time!

Also, I have a renewed respect for teachers and professors.  Writing these slides took so much time.  Many long nights for me especially in the days leading up to the class.  Kudos to you all who do teach everyday.

The slides are on GitHub and the readme file describes how I setup the Amazon instances for the labs

October 27, 2014 01:34 PM

Beyond the Code 2014: a recap

I started this blog post about a month ago and didn't finish it because well, life is busy.  

I attended Beyond the Code last September 19.  I heard about it several months ago on twitter.  A one-day conference about celebrating women in computing, in my home town, with a fantastic speaker line up?  I signed up immediately.   In the opening remarks, we were asked for a show of hands to show how many of us were developers, in design,  product management, or students and there was a good representation from all those categories.  I was especially impressed to see the number of students in the audience; it was nice to see so many of them taking time out of their busy schedule to attend.

View of the Parliament Buildings and Chateau Laurier from the MacKenzie street bridge over the Rideau Canal
Ottawa Conference Centre, location of Beyond the Code
There were seven speakers, three workshop organizers, a lunch time activity, and a panel at the end. The speakers were all women.  The speakers were not all white women or all heterosexual women.  There were many young women, not all industry veterans :-) like me.  To see this level of diversity at a tech conference filled me with joy.  Almost every conference I go to is very homogeneous in the make up of the speakers and the audience.  To see ~200 tech women at a conference and 10% men (thank you for attending:-) was quite a role reversal.

I was completely impressed by the caliber of the speakers.  They were simply exceptional.

The conference started out with Kronda Adair giving a talk on Expanding Your Empathy.  One of the things that struck me from this talk was that she talked about how everyone lives in a bubble, and they don't see things that everyone else does due to privilege.  She gave the example of how privilege is like a browser, and colours how we see the world.  For a straight white guy a web page looks great when they're running the latest Chrome on Mac OS X.  For a middle class black lesbian, the web page doesn't look as great because it's like she's running IE7.  There is less inherent privilege.  For a "differently abled trans person of color" the world is like running IE6 in quirks mode. This was a great example. She also gave a shout out to the Ascend Project which she and Lukas Blakk are running in Mozilla's Portland office. Such an amazing initiative.

The next speaker was Bridget Kromhout who gave a talk about Platform Ops in the Public Cloud.
I was really interested in this talk because we do a lot of scaling of our build infrastructure in AWS and wanted to see if she had faced similar challenges. She works at DramaFever, which she described as Netflix for Asian soap operas.  The most interesting thing to me was the fact that they use all AWS regions to host their instances, because they want their users to be able to download from a region as geographically close to them as possible.  At Mozilla, we only use a couple of AWS regions, but more instances than DramaFever, so this was an interesting contrast in the services used. In addition, the monitoring infrastructure they use was quite complex.  Her slides are here.

I was going to summarize the rest of the speakers but Melissa Jean Clark did an exceptional job on her blog.  You should read it!

Thank you Shopify for organizing this conference.  It was great to meet so many brilliant women in the tech industry! I hope there is an event next year too!

October 27, 2014 01:33 PM

October 14, 2014

Jordan Lund (jlund)

This week in Releng - Oct 5th, 2014

Major highlights:

Completed work (resolution is 'FIXED'):

In progress work (unresolved and not assigned to nobody):

October 14, 2014 04:36 AM

October 07, 2014

Ben Hearsum (bhearsum)

Redo 1.3 is released – now with more natural syntax!

We’ve been using the functions packaged in Redo for a few years now at Mozilla. One of the things we’ve been striving for with it is the ability to write the most natural code possible. With its simplest function, retry, the callable that may raise, the exceptions to retry on, and the callable to run to clean up before another attempt are all passed in as arguments. As a result, we have a number of code blocks like this, which don’t feel very Pythonic:

retry(self.session.request, sleeptime=5, max_sleeptime=15,
      kwargs=dict(method=method, url=url, data=data,
                  config=self.config, timeout=self.timeout,
                  auth=self.auth, params=params))

It’s particularly unfortunate that you’re forced to let retry do your exception handling and cleanup – I find that it makes the code a lot less readable. It’s also not possible to do anything in a finally block, unless you wrap the retry in one.

Recently, Chris AtLee discovered a new method of doing retries that results in much cleaner and more readable code. With it, the above block can be rewritten as:

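for attempt in retrier(attempts=self.retries):
    try:
        self.session.request(method=method, url=url, data=data,
                             timeout=self.timeout, auth=self.auth,
                             params=params)
        break  # success, so stop retrying
    except (requests.HTTPError, requests.ConnectionError) as e:
        pass  # handle or log e here; retrier sleeps before the next attempt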

retrier simply handles the mechanics of tracking attempts and sleeping, leaving your code to do all of its own exception handling and cleanup – just as if you weren’t retrying at all. Note that the break at the end of the try block is important; otherwise self.session.request would run again even after it had succeeded.
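To make the control flow concrete, here is a small self-contained sketch of the same pattern. simple_retrier is a hypothetical stand-in written for this example, not Redo's actual implementation, and the ValueError it retries on is arbitrary. It only illustrates how a generator can own the attempt counting and sleeping while the loop body keeps its own try/except.

```python
import time


def simple_retrier(attempts=3, sleeptime=0.0):
    """Yield attempt numbers, sleeping between attempts (illustrative stand-in only)."""
    for attempt in range(1, attempts + 1):
        yield attempt
        # Only reached if the caller asks for another attempt (i.e. did not break).
        if attempt < attempts:
            time.sleep(sleeptime)


def call_with_retries(func, attempts=3):
    """Call func until it succeeds; raise RuntimeError if every attempt fails."""
    result = None
    for attempt in simple_retrier(attempts=attempts):
        try:
            result = func()
            break  # success: without this, func would be called again
        except ValueError as error:
            print("attempt %d failed: %s" % (attempt, error))
    else:
        # The for/else branch runs only when the loop finished without a break.
        raise RuntimeError("all %d attempts failed" % attempts)
    return result
```

The for/else at the end is one way to detect that every attempt was used up; whether to raise, log, or return a default at that point is left to the calling code.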

I released Redo 1.3 with this new functionality this morning – enjoy!

October 07, 2014 12:48 PM

October 02, 2014

Hal Wine (hwine)

bz Quick Search

October 02, 2014 07:00 AM

September 29, 2014

Jordan Lund (jlund)

This Week In Releng - Sept 21st, 2014

Major Highlights:

Completed work (resolution is 'FIXED'):

In progress work (unresolved and not assigned to nobody):

September 29, 2014 06:08 PM

This Week In Releng - Sept 7th, 2014

Major Highlights

Completed work (resolution is 'FIXED'):

In progress work (unresolved and not assigned to nobody):

September 29, 2014 05:44 PM

September 25, 2014

Armen Zambrano G. (@armenzg)

Making mozharness easier to hack on and try support

Yesterday, we presented a series of proposed changes to Mozharness at the bi-weekly meeting.

We're mainly focused on making it easier for developers and allowing for further flexibility.
We will initially focus on the testing side of the automation and lay the groundwork for further improvements down the line.

The set of changes discussed for this quarter are:

  1. Move remaining set of configs to the tree - bug 1067535
    • This makes it easier to test harness changes on try
  2. Read more information from the in-tree configs - bug 1070041
    • This increases the number of harness parameters we can control from the tree
  3. Use structured output parsing instead of regular expressions where it applies - bug 1068153
    • This is part of a larger goal where we make test reporting more reliable, easier to consume and less burdensome on infrastructure
    • It's to establish uniform criteria for setting a job status based on a test result that depends on structured log data (json) rather than regex-based output parsing
    • "How does a test turn a job red or orange?" 
    • We will then have a simple answer that is the same for all test harnesses
  4. Mozharness try support - bug 791924
    • This will allow us to lock which repo and revision of mozharness is checked out
    • This isolates mozharness changes to a single commit in the tree
    • This gives us try support for user repos (freedom to experiment with mozharness on try)
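To illustrate the idea behind #3, here is a minimal sketch of deriving a job status from structured log lines instead of regexes. The record fields ("action", "status", "expected") and the red/orange/green rule are assumptions made up for this example; they are not the actual mozharness criteria, which were still being worked out in the bug.

```python
import json


def job_status(log_lines):
    """Derive a job status from structured (JSON) log lines.

    Hypothetical rule: a "crash" record turns the job red, a test whose
    status differs from its expected status turns it orange, else green.
    """
    status = "green"
    for line in log_lines:
        record = json.loads(line)
        if record.get("action") == "crash":
            return "red"
        if record.get("action") == "test_end":
            # Assume a missing "expected" field means the result was expected.
            expected = record.get("expected", record.get("status"))
            if record.get("status") != expected:
                status = "orange"
    return status
```

Because the rule reads named fields rather than matching text, every harness that emits the same record shape gets the same answer.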

Even though we feel the pain of #4, we decided that #1 & #2 give developers immediate value, while for #4 we at least know our painful workarounds.
I don't know if we'll complete #4 in this quarter, however, we are committed to the first three.

If you want to contribute to the longer term vision on that proposal please let me know.

In the following weeks we will have more updates with regards to implementation details.

Stay tuned!

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

September 25, 2014 07:42 PM

September 23, 2014

Ben Hearsum (bhearsum)

Stop stripping (OS X builds), it leaves you vulnerable

While investigating some strange update requests on our new update server, I discovered that we have thousands of update requests from Beta users on OS X that aren’t getting an update, but should. After some digging I realized that most, if not all of these are coming from users who have installed one of our official Beta builds and subsequently stripped out the architecture they do not need from it. In turn, this causes our builds to report in such a way that we don’t know how to serve updates for them.

We’ll look at ways of addressing this, but the bottom line is that if you want to be secure: Stop stripping Firefox binaries!

September 23, 2014 05:38 PM

September 19, 2014

Ben Hearsum (bhearsum)

New update server has been rolled out to Firefox/Thunderbird Beta users

Yesterday marked a big milestone for the Balrog project when we made it live for Firefox and Thunderbird Beta users. Those with a good long term memory may recall that we switched Nightly and Aurora users over almost a year ago. Since then, we’ve been working on and off to get Balrog ready to serve Beta updates, which are quite a bit more complex than our Nightly ones. Earlier this week we finally got the last blocker closed and we flipped it live yesterday morning, pacific time. We have significantly (~10x) more Beta users than Nightly+Aurora, so it’s no surprise that we immediately saw a spike in traffic and load, but our systems stood up to it well. If you’re into this sort of thing, here are some graphs with spikey lines:
The load average on 1 (of 4) backend nodes:

The rate of requests to 1 backend node (requests/second):

Database operations (operations/second):

And network traffic to the database (MB/sec):

Despite hitting a few new edge cases (mostly around better error handling), the deployment went very smoothly – it took less than 15 minutes to be confident that everything was working fine.

While Nick and I are the primary developers of Balrog, we couldn’t have gotten to this point without the help of many others. Big thanks to Chris and Sheeri for making the IT infrastructure so solid, to Anthony, Tracy, and Henrik for all the testing they did, and to Rail, Massimo, Chris, and Aki for the patches and reviews they contributed to Balrog itself. With this big milestone accomplished we’re significantly closer to Balrog being ready for Release and ESR users, and retiring the old AUS2/3 servers.

September 19, 2014 02:31 PM

September 17, 2014

Kim Moir (kmoir)

Mozilla Releng: The ice cream

A week or so ago, I was commenting in IRC that I was really impressed that our interns had such amazing communication and presentation skills.  One of the interns, John Zeller said something like "The cream rises to the top", to which I replied "Releng: the ice cream of CS".  From there, the conversation went on to discuss what would be the best ice cream flavour to make that would capture the spirit of Mozilla releng.  The consensus at the end was that Irish Coffee (coffee with whisky) with cookie dough chunks was the favourite.  Because a lot of people on the team like coffee, whisky makes it better, and who doesn't like cookie dough?

I made this recipe over the weekend with some modifications.  I used the coffee recipe from the Perfect Scoop.  After it was done churning in the ice cream maker,  instead of whisky, which I didn't have on hand, I added Kahlua for more coffee flavour.  I don't really like cookie dough in ice cream but cooked chocolate chip cookies cut up with a liberal sprinkling of Kahlua are tasty.

Diced cookies sprinkled with Kahlua

Ice cream ready to put in freezer

Finished product
I have to say, it's quite delicious :-) If open source ever stops being fun, I'm going to start a dairy empire.  Not really. Now back to bugzilla...

September 17, 2014 01:43 PM

September 16, 2014

Armen Zambrano G. (@armenzg)

Which builders get added to buildbot?

To add/remove jobs on, we have to modify buildbot-configs.

Making changes can be learnt by looking at previous patches, however, there's a bit of an art to it to get it right.

I just landed a script that sets up buildbot for you inside of a virtualenv and you can pass a buildbot-config patch and determine which builders get added/removed.

You can run this by checking out braindump and running something like this:
buildbot-related/ -j path_to_patch.diff

NOTE: This script does not check that the job has all the right parameters once live (e.g. you forgot to specify the mozharness config for it).

Happy hacking!

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

September 16, 2014 03:26 PM

September 11, 2014

Armen Zambrano G. (@armenzg)

Run tbpl jobs locally with Http authentication ( - take 2

Back in July, we deployed the first version of Http authentication for mozharness, however, under some circumstances, the initial version could fail and affect production jobs.

This time around we have:

If you read How to run Mozharness as a developer you should see the new changes.

As a quick reminder, it only takes 3 steps:

  1. Find the command from the log. Copy/paste it.
  2. Append --cfg
  3. Append --installer-url/--test-url with the right values
To see a real example visit this

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

September 11, 2014 12:45 PM

September 10, 2014

Kim Moir (kmoir)

Mozilla pushes - August 2014

Here's August 2014's monthly analysis of the pushes to our Mozilla development trees.  You can load the data as an HTML page or as a json file.

It was another record breaking month.  No surprise here!


General Remarks
Both Try and Gaia-Try have about 36% each of the pushes.  The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 21% of all the pushes.

August 2014 was the month with most pushes (13,090  pushes)
August 2014 has the highest pushes/day average with 620 pushes/day
July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
August 20, 2014 had the highest number of pushes in one day with 690 pushes

September 10, 2014 02:09 PM

September 09, 2014

Nick Thomas (nthomas)

ZNC and Mozilla IRC

ZNC is great for having a persistent IRC connection, but it’s not so great when the IRC server or network has a blip. Then you can end up failing to rejoin with

nthomas (…) has joined #releng
nthomas has left … (Max SendQ exceeded)

over and over again.

The way to fix this is to limit the number of channels ZNC can connect to simultaneously. In the Web UI, change the ‘Max Joins’ preference to something like 5. In the config file, use ‘MaxJoins = 5’ in a <User foo> block.

September 09, 2014 10:19 AM

September 08, 2014

Jordan Lund (jlund)

This Week In Releng - Sept 1st, 2014

Major Highlights:

Completed work (resolution is 'FIXED'):

In progress work (unresolved and not assigned to nobody):

September 08, 2014 04:58 AM

September 06, 2014

Hal Wine (hwine)

New Hg Server Status Page

Just a quick note to let folks know that the Developer Services team continues to make improvements on Mozilla’s Mercurial server. We’ve set up a status page to make it easier to check on current status.

As we continue to improve monitoring and status displays, you’ll always find the “latest and greatest” on this page. And we’ll keep the page updated with recent improvements to the system. We hope this page will become your first stop whenever you have questions about our Mercurial server.

September 06, 2014 07:00 AM

September 01, 2014

Nick Thomas (nthomas)

Deprecating our old rsync modules

We’ve removed the rsync modules mozilla-current and mozilla-releases today, after calling for comment a few months ago and hearing no objections. Those modules were previously used to deliver Firefox and other Mozilla products to end users via a network of volunteer mirrors but we now use content delivery networks (CDN). If there’s a use case we haven’t considered then please get in touch in the comments or on the bug.

September 01, 2014 10:09 PM

August 26, 2014

Chris AtLee (catlee)

Gotta Cache 'Em All


Waaaaaaay back in February we identified overall network bandwidth as a cause of job failures on TBPL. We were pushing too much traffic over our VPN link between Mozilla's datacentre and AWS. Since then we've been working on a few approaches to cope with the increased traffic while at the same time reducing our overall network load. Most recently we've deployed HTTP caches inside each AWS region.

Network traffic from January to August 2014

The answer - cache all the things!

Obligatory XKCD

Caching build artifacts

The primary target for caching was downloads of build/test/symbol packages by test machines from file servers. These packages are generated by the build machines and uploaded to various file servers. The same packages are then downloaded many times by different machines running tests. This was a perfect candidate for caching, since the same files were being requested by many different hosts in a relatively short timespan.
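The effect is easy to see in a toy model. The sketch below is a bare-bones read-through cache, written only to illustrate why repeated downloads of the same package stop hitting the file servers. The class name and the fetch callback are invented for this example, and a real HTTP cache also has to deal with expiry, eviction and disk space, which this ignores.

```python
class ReadThroughCache:
    """Serve repeated requests for the same URL from memory (toy illustration)."""

    def __init__(self, fetch_upstream):
        self._fetch_upstream = fetch_upstream  # callable that takes a URL, returns content
        self._store = {}
        self.upstream_requests = 0

    def get(self, url):
        if url not in self._store:
            # Cache miss: go to the origin server once and remember the result.
            self.upstream_requests += 1
            self._store[url] = self._fetch_upstream(url)
        return self._store[url]
```

With many test machines asking for the same build package, only the first request crosses the slow link; every later one is answered from inside the region.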

Caching tooltool downloads

Tooltool is a simple system RelEng uses to distribute static assets to build/test machines. While the machines do maintain a local cache of files, the caches are often empty because the machines are newly created in AWS. Having the files in local HTTP caches speeds up transfer times and decreases network load.

Results so far - 50% decrease in bandwidth

Initial deployment was completed on August 8th (end of week 32 of 2014). You can see by the graph above that we've cut our bandwidth by about 50%!

What's next?

There are a few more low hanging fruit for caching. We have internal pypi repositories that could benefit from caches. There's a long tail of other miscellaneous downloads that could be cached as well.

There are other improvements we can make to reduce bandwidth as well, such as moving uploads from build machines to be outside the VPN tunnel, or perhaps to S3 directly. Additionally, a big source of network traffic is doing signing of various packages (gpg signatures, MAR files, etc.). We're looking at ways to do that more efficiently. I'd love to investigate more efficient ways of compressing or transferring build artifacts overall; there is a ton of duplication between the build and test packages between different platforms and even between different pushes.

I want to know MOAR!

Great! As always, all our work has been tracked in a bug, and worked out in the open. The bug for this project is 1017759. The source code lives in, and we have some basic documentation available on our wiki. If this kind of work excites you, we're hiring!

Big thanks to George Miroshnykov for his work on developing proxxy.

August 26, 2014 02:21 PM

August 18, 2014

Jordan Lund (jlund)

This week in Releng - Aug 11th 2014

Completed work (resolution is 'FIXED'):

In progress work (unresolved and not assigned to nobody):

August 18, 2014 06:38 AM

August 12, 2014

Ben Hearsum (bhearsum)

Upcoming changes to Mac package layout, signing

Apple recently announced changes to how OS X applications must be packaged and signed in order for them to function correctly on OS X 10.9.5 and 10.10. The tl;dr version of this is “only mach-O binaries may live in .app/Contents/MacOS, and signing must be done on 10.9 or later”. Without any changes, future versions of Firefox will cease to function out-of-the-box on OS X 10.9.5 and 10.10. We do not have a release date for either of these OS X versions yet.

Changes required:
* Move all non-mach-O files out of .app/Contents/MacOS. Most of these will move to .app/Contents/Resources, but files that could legitimately change at runtime (eg: everything in defaults/) will move to .app/MozResources (which can be modified without breaking the signature): This work is in progress, but no patches are ready yet.
* Add new features to the client side update code to allow partner repacks to continue to work. (
* Create and use 10.9 signing servers for these new-style apps. We still need to use our existing 10.6 signing servers for any builds without these changes. ( and
* Update signing server code to support new v2 signatures.

We are intending to ship the required changes with Gecko 34, which ships on November 25th, 2014. The changes required are very invasive, and we don’t feel that they can be safely backported to any earlier version quickly enough without major risk of regressions. We are still looking at whether or not we’ll backport to ESR 31. To this end, we’ve asked that Apple whitelist Firefox and Thunderbird versions that will not have the necessary changes in them. We’re still working with them to confirm whether or not this can happen.

This has been cross posted a few places – please send all follow-ups to the newsgroup.

August 12, 2014 05:05 PM

August 11, 2014

Jordan Lund (jlund)

This Week In Releng - Aug 4th, 2014

Major Highlights:

Completed work (resolution is 'FIXED'):

In progress work (unresolved and not assigned to nobody):

August 11, 2014 01:09 AM

August 08, 2014

Kim Moir (kmoir)

Mozilla pushes - July 2014

Here's the July 2014 monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.
Like every month for the past while, we had a new record number of pushes. In reality, given that July is one day longer than June, the numbers are quite similar.


General remarks
Try continues to account for around 38% of all pushes. Gaia-Try is in second place with around 31% of pushes. The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 22% of all the pushes.

July 2014 was the month with most pushes (12,755 pushes)
June 2014 has the highest pushes/day average with 418 pushes/day
July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
June 4th, 2014 had the highest number of pushes in one day with 662 pushes
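
The remark that July and June are quite similar once month length is taken into account can be checked with a few lines of Python. This is only a sketch: it uses the two monthly totals quoted in these reports (12,755 for July 2014 and 12,534 for June 2014), and the helper name pushes_per_day is made up for illustration.

```python
import calendar

# Monthly push totals as quoted in the July and June 2014 reports.
PUSHES = {(2014, 6): 12534, (2014, 7): 12755}


def pushes_per_day(year, month, total):
    """Average pushes per calendar day for the given month."""
    days_in_month = calendar.monthrange(year, month)[1]
    return total / days_in_month


for (year, month), total in sorted(PUSHES.items()):
    average = pushes_per_day(year, month, total)
    print(f"{calendar.month_name[month]} {year}: {average:.1f} pushes/day")
```

July has more pushes in total, but June still has the higher daily average because it is one day shorter.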

August 08, 2014 06:16 PM

August 07, 2014

Kim Moir (kmoir)

Scaling mobile testing on AWS

Running tests for Android at Mozilla has typically meant running on reference devices: physical devices that run jobs on our continuous integration farm via test harnesses. However, this leads to the same problem that we have for other tests that run on bare metal. We can't scale up our capacity without buying new devices, racking them, configuring them for the network and updating our configurations. In addition, reference cards, rack mounted or not, are rather delicate creatures and have higher retry rates (tests fail due to infrastructure issues and need to be rerun) than those running on emulators (tests run on an Android emulator in a VM on bare metal or cloud).

Do Android's Dream of Electric Sheep?  ©Bill McIntyre, Creative Commons by-nc-sa 2.0
Recently, we started running Android 2.3 tests on emulators in AWS. This works well for unit tests (correctness tests). It's not really appropriate for performance tests, but that's another story. The impetus behind this change was so we could decommission Tegras, the reference devices we used for running Android 2.2 tests.

We run many Linux-based tests, including Android emulators, on AWS spot instances. Spot instances are AWS excess capacity that you can bid on. If someone outbids the price you have paid for your spot instance, your instance can be terminated. But that's okay because we retry jobs if they fail for infrastructure reasons. The overall percentage of spot instances that are terminated is quite small. The huge advantage of using spot instances is price. They are much cheaper than on-demand instances, which has allowed us to increase our capacity while continuing to reduce our AWS bill.

We have a wide variety of unit tests that run on emulators for mobile on AWS. We encountered an issue where some of the tests wouldn't run on the default instance type (m1.medium) that we use for our spot instances. Given the number of jobs we run, we want to run on the cheapest AWS instance type where the tests will complete successfully. At the time we first tested it, we couldn't find an instance type where certain CPU/memory intensive tests would run. So when I first enabled Android 2.3 tests on emulators, I separated the tests so that some would run on AWS spot instances and the ones that needed a more powerful machine would run on our in-house Linux capacity. But this change consumed all of the capacity of that pool and we had a very high number of pending jobs in that pool. This meant that people had to wait a long time for their test results. Not good.

To reduce the pending counts, we needed to buy more in-house Linux capacity, run only a selected subset of the tests that need more resources, or find a new AWS instance type where they would complete successfully. Geoff from the ATeam ran the tests on the c3.xlarge instance type he had tried before and now it seemed to work. In his earlier work the tests did not complete successfully on this instance type. We are unsure as to the reasons why. One of the things about working with AWS is that we don't have a window into the bugs that they fix at their end. So this particular instance type didn't work before, but it does now.

The next steps for me were to create a new AMI (Amazon machine image) that would serve as the "golden" version for instances that would be created in this pool. Previously, we used Puppet to configure our AWS test machines, but now we just regenerate the AMI every night via cron and this is the version that's instantiated. The AMI was a copy of the existing Ubuntu64 image that we have but it was configured to run on the c3.xlarge instance type instead of m1.medium. This was a bit tricky because I had to exclude regions where the c3.xlarge instance type was not available. For redundancy (to still have capacity if an entire region goes down) and cost (some regions are cheaper than others), we run instances in multiple AWS regions.

Once I had the new AMI up that would serve as the template for our new slave class, I created a slave with the AMI and verified running the tests we planned to migrate on my staging server.  I also enabled two new Linux64 buildbot masters in AWS to service these new slaves, one in us-east-1 and one in us-west-2.  When enabling a new pool of test machines, it's always good to look at the load on the current buildbot masters and see if additional masters are needed so the current masters aren't overwhelmed with too many slaves attached.

After the tests were all green, I modified our configs to run this subset of tests on a branch (ash), enabled the slave platform in Puppet and added a pool of devices to this slave platform in our production configs. After the reconfig deployed these changes into production, I landed a regular expression in watch_pending.cfg so that the new tst-emulator64-spot pool of machines would be allocated to the subset of tests and branch I enabled them on. The script watches the number of pending jobs on AWS and creates instances as required. We also have scripts to terminate or stop idle instances when we don't need them. Why pay for machines when you don't need them now? After the tests ran successfully on ash, I enabled running the tests on the other relevant branches.
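
To make the idea of routing pending jobs to a pool by regular expression more concrete, here is a small sketch. It is not the real watch_pending script or its config format: the builder names, the patterns and the function names are all hypothetical, and only the two pool names come from this post.

```python
import re
from collections import Counter

# Hypothetical routing rules; the first matching pattern wins.
# Builder-name patterns are invented for illustration.
POOL_RULES = [
    (re.compile(r"Android 2\.3 Emulator ash .*crashtest"), "tst-emulator64-spot"),
    (re.compile(r"Android 2\.3 Emulator .*"), "tst-linux64-spot"),
]


def pool_for_builder(builder_name):
    """Return the pool that should service a builder, or None if no rule matches."""
    for pattern, pool in POOL_RULES:
        if pattern.search(builder_name):
            return pool
    return None


def pending_by_pool(pending_builders):
    """Count pending jobs per pool so that enough instances can be started."""
    counts = Counter()
    for name in pending_builders:
        pool = pool_for_builder(name)
        if pool is not None:
            counts[pool] += 1
    return dict(counts)
```

Putting the more specific pattern first is what lets a single branch and test subset be moved to the new pool while everything else keeps using the existing one.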

Royal Border Bridge.  Also, release engineers love to see green builds and tests.  ©Jonathan Combe, Creative Commons by-nc-sa 2.0
The end result is that some Android 2.3 tests, such as mochitests, run on m1.medium (tst-linux64-spot instances).

And some Android 2.3 tests, such as crashtests, run on c3.xlarge (tst-emulator64-spot instances).


In enabling this slave class within our configs, we were also able to reuse it for some b2g tests which also faced the same problem where they needed a more powerful instance type for the tests to complete.

Lessons learned:
Use the minimum (cheapest) instance type required to complete your tests
As usual, test on a branch before full deployment
Scaling mobile tests doesn't mean more racks of reference cards

Future work:
Bug 1047467 c3.xlarge instance types are expensive, let's test running those tests on a range of instance types that are cheaper

Further reading:
AWS instance types 
Chris Atlee wrote about how we Now Use AWS Spot Instances for Tests
Taras Glek wrote How Mozilla Amazon EC2 Usage Got 15X Cheaper in 8 months
Rail Aliiev 
Bug 980519 Experiment with other instance types for Android 2.3 jobs 
Bug 1024091 Address high pending count in in-house Linux64 test pool 
Bug 1028293 Increase Android 2.3 mochitest chunks, for aws 
Bug 1032268 Experiment with c3.xlarge for Android 2.3 jobs
Bug 1035863 Add two new Linux64 masters to accommodate new emulator slaves
Bug 1034055 Implement c3.xlarge slave class for Linux64 test spot instances
Bug 1031083 Buildbot changes to run selected b2g tests on c3.xlarge
Bug 1047467 c3.xlarge instance types are expensive, let's try running those tests on a range of instance types that are cheaper

August 07, 2014 06:24 PM

August 04, 2014

Jordan Lund (jlund)

This Week In Releng - July 28th, 2014

Major Highlights:

Completed Work (marked as resolved):

In progress work (unresolved and not assigned to nobody):

August 04, 2014 04:22 PM

July 28, 2014

Kim Moir (kmoir)

2014 USENIX Release Engineering Summit CFP now open

The CFP for the 2014 Release Engineering summit (Western edition) is now open.  The deadline for submissions is September 5, 2014 and speakers will be notified by September 19, 2014.  The program will be announced in late September.  This one day summit on all things release engineering will be held in concert with LISA, in Seattle on November 10, 2014. 

Seattle skyline © Howard Ignatius, Creative Commons by-nc-sa 2.0

From the CFP

"Suggestions for topics include (but are not limited to):
URES '14 West is looking for relevant and engaging speakers and workshop facilitators for our event on November 10, 2014, in Seattle, WA. URES brings together people from all areas of release engineering—release engineers, developers, managers, site reliability engineers, and others—to identify and help propose solutions for the most difficult problems in release engineering today."

War and horror stories. I like to see that in a CFP. Stories describing how you overcame problems with infrastructure and tooling to ship software are the best kind. They make people laugh. Maybe cry as they realize they are currently living in that situation. Good times. Also, I think talks around scaling high volume continuous integration farms will be interesting. Scaling issues are a lot of fun and expose many issues you don't see when you're only running a few builds a day.

If you have any questions surrounding the CFP, I'm happy to help as I'm on the program committee.   (my irc nick is kmoir (#releng) as is my email id at

July 28, 2014 09:28 PM

July 25, 2014

Aki Sasaki (aki)

on leaving mozilla

Today's my last day at Mozilla. It wasn't an easy decision to move on; this is the best team I've been a part of in my career. And working at a company with such idealistic principles and the capacity to make a difference has been a privilege.

Looking back at the past five-and-three-quarter years:

I will stay a Mozillian, and I'm looking forward to seeing where we can go from here!

July 25, 2014 07:26 PM

July 18, 2014

Kim Moir (kmoir)

Reminder: Release Engineering Special Issue submission deadline is August 1, 2014

Just a friendly reminder that the deadline for the Release Engineering Special Issue is August 1, 2014. If you have any questions about the submission process or a topic that you'd like to write about, the guest editors, including myself, are happy to help you!

July 18, 2014 10:03 PM

Mozilla pushes - June 2014

Here's June 2014's analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.

This was another record breaking month with a total of 12534 pushes. As a note of interest, this is over double the number of pushes we had in June 2013. So big kudos to everyone who helped us scale our infrastructure and tooling. (Actually we had 6,433 pushes in April 2013, which would make this slightly less than double; June 2013 was a bit of a dip. But still impressive :-)


General Remarks
The introduction of Gaia-try in April has been very popular and comprised around 30% of pushes in June compared to 29% last month.
The Try branch itself consisted of around 38% of pushes.
The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 21% of all the pushes, compared to 22% in the previous month.

June 2014 was the month with most pushes (12534 pushes)
June 2014 has the highest pushes/day average with 418 pushes/day
June 2014 has the highest average of "pushes-per-hour" with 23.17 pushes/hour
June 4th, 2014 had the highest number of pushes in one day with 662 pushes

July 18, 2014 09:46 PM

July 15, 2014

Armen Zambrano G. (@armenzg)

Developing with GitHub and remote branches

I have recently started contributing using Git by using GitHub for the Firefox OS certification suite.

It has been interesting switching from Mercurial to Git. I honestly believed it would be more straightforward, but I have to re-read things again and again until the new ways sink in.

jgraham shared some notes with me (Thanks!) about what his workflow looks like, and I want to document it for my own sake and perhaps yours:
git clone

# Time passes

# To develop something on master
# Pull in all the new commits from master

git fetch origin

# Create a new branch (this will track master from origin,
# which we don't really want, but that will be fixed later)

git checkout -b my_new_thing origin/master

# Edit some stuff

# Stage it and then commit the work

git add -p
git commit -m "New awesomeness"

# Push the work to a remote branch
git push --set-upstream origin HEAD:jgraham/my_new_thing

# Go to the GH UI and start a pull request

# Fix some review issues
git add -p
git commit -m "Fix review issues" # or use --fixup

# Push the new commits
git push

# Finally, the review is accepted
# We could rebase at this point, however,
# we tend to use the Merge button in the GH UI
# Working off a different branch is basically the same,
# but you replace "master" with the name of the branch you are working off.

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

July 15, 2014 09:04 PM

July 11, 2014

Armen Zambrano G. (@armenzg)

Introducing HTTP authentication for Mozharness.

A while ago, I asked a colleague (you know who you are! :P) of mine how to run a specific type of test job on tbpl on my local machine and he told me with a smirk, "With mozharness!"

I wanted to punch him (HR: nothing to see here! This is not a literal punch, a figurative one), however he was right. He had good reason to say that, and I knew why he was smiling. I had to close my mouth and take it.

Here's the explanation on why he said that: most jobs running inside of tbpl are being driven by Mozharness, however they're optimized to run within the protected network of Release Engineering. This is good. This is safe. This is sound. However, when we try to reproduce a job outside of the Releng network, it becomes problematic for various reasons.

Many times we have had to guide people who are unfamiliar with mozharness until they manage to run it locally. (Docs: How to run Mozharness as a developer). However, on other occasions, when it comes to binaries stored on private web hosts, it becomes necessary to loan a machine. A loaned machine can reach those files through internal domains since it is hosted within the Releng network.

Today, I have landed a piece of code that does two things:
This change, plus the recently-introduced developer configs for Mozharness, makes it much easier to run mozharness outside of continuous integration infrastructure.
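
As a rough illustration of what authenticating a download involves, here is a minimal sketch that assumes plain HTTP Basic authentication and uses only the Python standard library. It is not the code that landed in mozharness; the function name and the example URL are hypothetical.

```python
import base64
import urllib.request


def build_authenticated_request(url, username, password):
    """Build a request carrying an HTTP Basic Authorization header.

    Hypothetical helper for illustration; not part of mozharness.
    """
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


# Example (placeholder URL); the download itself would be:
#   with urllib.request.urlopen(request) as response:
#       data = response.read()
request = build_authenticated_request(
    "https://example.invalid/private/build.zip", "user", "secret"
)
```

In practice the password would be prompted for (for example with getpass) rather than hard-coded, and Basic authentication should only be sent over HTTPS.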

I hope this will help developers have a better experience reproducing the environments used in the tbpl infrastructure. One less reason to loan a machine!

This makes me *very* happy (see below) since I don't have VPN access anymore.

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

July 11, 2014 07:42 PM

Using developer configs for Mozharness

To help developers run mozharness, I have landed some configs that can be appended to the command appearing on tbpl.

All you have to do is:
  • Find the mozharness script line in a log from tbpl (search for "script/scripts")
  • Look for the --cfg parameter and add it again but it should end with ""
    • e.g. --cfg android/ --cfg android/
  • Also add the --installer-url and --test-url parameters as explained in the docs
Developer configs have these things in common:
  • They have the same name as the production one but instead end in ""
  • They overwrite the "exes" dict with an empty dict
    • This allows you to use the binaries in your personal $PATH
  • They overwrite the "default_actions" list
    • The main reason is to remove the action called read-buildbot-configs
  • They fix URLs to point to the right publicly reachable domains
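
The effect of appending a developer config can be sketched as a dictionary merge. This assumes that each config is a plain Python dict and that a config given later with --cfg overrides keys from an earlier one; the paths and action names below are hypothetical, except read-buildbot-configs, which is the action mentioned above.

```python
# Hypothetical production config; values are illustrative only.
production_config = {
    "exes": {"python": "/tools/buildbot/bin/python"},
    "default_actions": [
        "clobber",
        "read-buildbot-configs",
        "download-and-extract",
        "run-tests",
    ],
}

# Hypothetical developer config: empty "exes" so binaries come from $PATH,
# and a "default_actions" list without read-buildbot-configs.
developer_config = {
    "exes": {},
    "default_actions": ["clobber", "download-and-extract", "run-tests"],
}


def merge_configs(*configs):
    """Merge configs in order; keys from later configs replace earlier ones."""
    merged = {}
    for cfg in configs:
        merged.update(cfg)
    return merged


effective = merge_configs(production_config, developer_config)
```

Because whole keys are replaced rather than merged, the developer config has to restate the full list of actions it wants to keep.
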
Here are the currently available developer configs:
You can help by adding more of them!

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

July 11, 2014 07:15 PM

July 04, 2014

Kim Moir (kmoir)

This week in Mozilla Releng - July 4, 2014

This is a special double issue of this week in releng. I was so busy in the last week that I didn't get a chance to post this last week.  Despite the fireworks for Canada Day and Independence Day,  Mozilla release engineering managed to close some bugs. 

Major highlights:
Completed work (resolution is 'FIXED'):
In progress work (unresolved and not assigned to nobody):

July 04, 2014 09:39 PM

July 03, 2014

Armen Zambrano G. (@armenzg)

Tbpl's blobber uploads are now discoverable

What is blobber? Blobber is a set of server and client side tools that allows Releng's test infrastructure to upload files without requiring ssh keys to be deployed on the test machines.

This is useful since it allows uploads of screenshots, crashdumps and any other file needed to debug what failed on a test job.

Up until now, if you wanted your scripts to determine the files uploaded in a job, you would have to download the log and parse it to find the TinderboxPrint lines for Blobber uploads, e.g.
15:21:18 INFO - (blobuploader) - INFO - TinderboxPrint: Uploaded 70485077-b08a-4530-8d4b-c85b0d6f9bc7.dmp to
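
For reference, the old log-parsing approach could look roughly like the sketch below. The regular expression is inferred from the single log line shown above, so treat the pattern and the function name as assumptions rather than a documented format.

```python
import re

# Pattern inferred from the example log line; the URL may be absent.
UPLOAD_LINE = re.compile(
    r"TinderboxPrint: Uploaded (?P<name>\S+) to\s*(?P<url>\S*)"
)


def find_blobber_uploads(log_text):
    """Return (filename, url) pairs for every Blobber upload line in a log."""
    uploads = []
    for line in log_text.splitlines():
        match = UPLOAD_LINE.search(line)
        if match:
            uploads.append((match.group("name"), match.group("url")))
    return uploads
```

Every script that cared about uploads had to download the full log and repeat this kind of parsing.
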
Now, you can look for the set of files uploaded by looking at the uploaded_files.json that we upload at the end of all uploads. This can be discovered by inspecting the buildjson files or by listening to the pulse events. The key used is called "blobber_manifest_url" e.g.
"blobber_manifest_url": "",
In the future, this feature will be useful when we start uploading structured logs. It will help us not to download logs to extract meta-data about the jobs!

No, your uploads are not this ugly
This work was completed in bug 986112. Thanks to aki, catlee, mtabara and rail for helping me get this out the door. You can read more about Blobber by visiting: "Blobber is live - upload ALL the things!" and "Blobber - local environment setup".

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

July 03, 2014 12:02 PM