Planet Release Engineering

November 26, 2015

Armen Zambrano G. (@armenzg)

Mozhginfo/Pushlog client released

If you've ever spent time trying to query metadata from hg about revisions, you can now use a Python library we've released to do so.

In bug 1203621 [1], our community contributor @MikeLing has helped us release the module we had written for Mozilla CI tools.

You can find the pushlog_client package here [3] and the code here [4].
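
As a rough idea of the kind of query the library wraps, here's a minimal sketch that hits the json-pushes endpoint on hg.mozilla.org directly with requests (the repository and revision below are placeholders, not anything specific to this release):

    import requests

    REPO = "https://hg.mozilla.org/mozilla-central"  # any hg.mozilla.org repository works
    REVISION = "abcdef123456"                        # placeholder revision

    def pushes_for_revision(repo, revision):
        """Return the raw json-pushes payload for a single changeset."""
        url = "{}/json-pushes".format(repo)
        response = requests.get(url, params={"changeset": revision, "full": 1})
        response.raise_for_status()
        return response.json()

    # Each entry is keyed by push id and carries the pusher, date and changesets.
    for push_id, push in pushes_for_revision(REPO, REVISION).items():
        print(push_id, push["user"], push["date"], len(push["changesets"]))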

Thanks MikeLing!


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

November 26, 2015 03:17 PM

November 24, 2015

Armen Zambrano G. (@armenzg)

Welcome F3real, xenny and MikeLing!

As described by jmaher, this week we started our first week of mozci's quarter of contribution.

I want to personally welcome Stefan, Vaibhav and Mike to mozci. We hope you learn a lot, and we thank you for helping Mozilla move forward in this corner of our automation systems.

I also want to thank Alice for committing to mentoring. This would not be possible without her help.

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

November 24, 2015 05:58 PM

Mozilla CI tools meet up

In order to help the contributors in mozci's quarter of contribution, we have set up a mozci meet-up this Friday.

If you're interested in learning about Mozilla's CI, how to contribute, or how to build your own scheduling with mozci, come and join us!

9am ET -> other time zones
Vidyo room:

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

November 24, 2015 05:52 PM

Kim Moir (kmoir)

USENIX Release Engineering Summit 2015 recap

On November 13th, I attended the USENIX Release Engineering Summit in Washington, DC. This summit was held alongside the larger LISA conference at the same venue. Thanks to Dinah McNutt, Gareth Bowles, Chris Cooper, Dan Tehranian and John O'Duinn for organizing.

I gave two talks at the summit.  One was a long talk on how we have scaled our Android testing infrastructure on AWS, as well as a look back at how it evolved over the years.

Picture by Tim Norris - Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0)

Scaling mobile testing on AWS: Emulators all the way down from Kim Moir

I gave a second lightning talk in the afternoon on the problems we face with our large distributed continuous integration, build and release pipeline, and how we are working to address the issues. The theme of this talk was that managing a large distributed system is like being the caretaker for the water, or some days, the sewer system for a city. We are constantly looking for system leaks and implementing system monitoring. And we will probably have to replace it with something new while keeping the existing one running.

Picture by Korona Lacasse - Creative Commons 2.0 Attribution 2.0 Generic

Distributed Systems at Scale: Reducing the Fail from Kim Moir

In preparation for this talk, I did a lot of reading on complex systems design and designing for recovery from failure in distributed systems. In particular, I read Donella Meadows' book Thinking in Systems. (Cate Huston reviewed the book here.) I also watched several talks by people about the challenges they face managing their distributed systems.
I'd also like to thank all the members of Mozilla releng/ateam who reviewed my slides and provided feedback before I gave the presentations.
The attendees of the summit attended the same keynote as the LISA attendees. Jez Humble, well known for his Continuous Delivery and Lean Enterprise books, provided a keynote on Lean Configuration Management which I really enjoyed. (Older versions of the slides from other conferences are available here and here.)

In particular, I enjoyed his discussion of the cultural aspects of devops. I especially liked that he stated that "You should not have to have planned downtime or people working outside business hours to release". He also talked a bit about how many of the leaders who are looked up to as visionaries in the tech industry are known for not treating people very well, and that this is not a good example to set for others who believe it to be the key to their success. For instance, he said something like "what more could Steve Jobs have accomplished had he treated his employees less harshly".

Another concept he discussed which I found interesting was that of the strangler application. When moving from a large monolithic application, the goal is to split out the existing functionality into services until the original application is left with nothing. This is exactly what Mozilla releng is doing as we migrate from Buildbot to TaskCluster.

At the release engineering summit itself, Lukas Blakk from Pinterest gave a fantastic talk, Stop Releasing off Your Laptop—Implementing a Mobile App Release Management Process from Scratch in a Startup or Small Company. This included a grumpy cat picture to depict how Lukas thought the rest of the company felt when a more structured release process was implemented.

Lukas also included a timeline of the tasks she implemented in her first six months working at Pinterest. Very impressive to see the transition!

Another talk I enjoyed was Chaos Patterns - Architecting for Failure in Distributed Systems by Jos Boumans of Krux. (Similar slides from an earlier conference here). He talked about some high profile distributed systems that failed and how chaos engineering can help illuminate these issues before they hit you in production.

For instance, it is impossible for Netflix to model their entire system outside of production, given that they account for around one third of nightly downstream bandwidth consumption in the US.

Evan Willey and Dave Liebreich from Pivotal Cloud Foundry gave a talk entitled "Pivotal Cloud Foundry Release Engineering: Moving Integration Upstream Where It Belongs". I found this talk interesting because they talked about how they built Concourse, a CI system that is more scalable and natively supports pipelines. Travis and Jenkins are good for small projects, but they simply don't scale for large numbers of commits, platforms to test, or complicated pipelines. We followed a similar path that led us to develop Taskcluster.

There were many more great talks, hopefully more slides will be up soon!

November 24, 2015 03:57 PM

November 19, 2015

Chris Cooper (coop)

Clarification about our “Build and Release Intern - Toronto” position

We’ve had lots of interest already in our advertised internship position, and that’s great. However, many of the applications I’ve looked at won’t pan out because they overlooked a key line in the posting:

*Only local candidates will be considered for this role.*

That’s right, we’re only able to accept interns who are legally able to work in Canada.

The main reason behind this is that all of our potential mentors are in Toronto, and having an engaged, local mentor is one of the crucial determinants of a successful internship. In the past, it was possible for Mozilla to sponsor foreign students to come to Canada for internships, but recent changes to visa and international student programs have made the bureaucratic process (and concomitant costs) a nightmare to manage. Many applicants simply aren't eligible any more under the new rules either.

I’m not particularly happy about this, but it’s the reality of our intern hiring landscape. Some of our best former interns have come from abroad, and I’ve already seen some impressive resumes this year from international students. Hopefully one of the non-Toronto-based positions will still appeal to them.

November 19, 2015 07:17 PM

Armen Zambrano G. (@armenzg)

Buildapi client released - thanks F3real!

When we wrote Mozilla CI tools, we created a module called buildapi in order to schedule Buildbot jobs via Self-serve/Buildapi.

We recently ported it to a standalone Python package and released it.

This was all thanks to F3real, who joined us from Mozilla's community and released his first Python package. He also brought over the integration tests we wrote for it. Here's the issue and PR if you're curious.

F3real will now be looking at removing the buildapi module from mozci and making use of the python package instead.
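
For a rough feel of what the package does: scheduling a Buildbot job through Self-serve/Buildapi boils down to an authenticated POST naming a branch, builder and revision. This is only an illustrative sketch; the endpoint path, builder name and revision below are assumptions for the example, not the package's actual API:

    import requests

    SELF_SERVE = "https://secure.pub.build.mozilla.org/buildapi/self-serve"  # assumed base URL
    BRANCH = "try"
    REVISION = "0123456789ab"                  # placeholder revision
    BUILDER = "Linux x86-64 try build"         # placeholder builder name

    def trigger_build(credentials):
        # credentials is a (username, password) tuple accepted by Self-serve
        url = "{}/{}/builders/{}/{}".format(SELF_SERVE, BRANCH, BUILDER, REVISION)
        response = requests.post(url, auth=credentials)
        response.raise_for_status()
        return response.json()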

Thanks F3real!

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

November 19, 2015 04:31 PM

Chris AtLee (catlee)

MozFest 2015

I had the privilege of attending MozFest last week. Overall it was a really great experience. I met lots of really wonderful people, and learned about so many really interesting and inspiring projects.

My biggest takeaway from MozFest was how important it is to provide good APIs and data for your systems. You can't predict how somebody else will be able to make use of your data to create something new and wonderful. But if you're not making your data available in a convenient way, nobody can make use of it at all!

It was a really good reminder for me. We generate a lot of data in Release Engineering, but it's not always exposed in a way that's convenient for other people to make use of.

The rest of this post is a summary of various sessions I attended.


Friday night started with a Science Fair. Lots of really interesting stuff here. Some of the projects that stood out for me were:

  • naturebytes - a DIY wildlife camera based on the Raspberry Pi, with an added bonus of aiding conservation efforts.
  • histropedia - really cool visualizations of timelines, based on data in Wikipedia and Wikidata. This was the first time I'd heard of Wikidata, and the possibilities were very exciting to me! More on this later, as I attended a whole session on Wikidata.
  • Several projects related to the Internet-of-Things (IOT)


On Saturday, the festival started with some keynotes. Mark Surman spoke about how MozFest was a bit chaotic, but this was by design. Just as the web is an open platform you can use to build your own ideas, MozFest should be an open platform where you can meet, brainstorm, and work on your ideas. This means it can seem a bit disorganized, but that's a good thing :) You get what you want out of it.

I attended several good sessions on Saturday as well:

  • Ending online tracking. We discussed various methods currently used to track users, such as cookies and fingerprinting, and what can be done to combat these. I learned, or re-learned, about a few interesting Firefox extensions as a result:

    • Privacy Badger. Similar to Firefox's tracking protection, except it doesn't rely on a central blacklist. Instead, it tries to automatically identify third-party domains that are setting cookies, etc. across multiple websites. Once identified, these third-party domains are blocked.
    • HTTPS Everywhere. Makes it easier to use HTTPS by default everywhere.
  • Intro to D3JS. d3js is a JS data visualization library. It's quite powerful, but something I learned is that you're expected to do quite a bit of work up-front to make sure it's showing you the things you want. It's not great as a data exploration library, where you're not sure exactly what the data means, and want to look at it from different points of view. The nvd3 library may be more suitable for first time users.

  • 6 kitchen cases for IOT. We discussed the proposed IOT design manifesto briefly, and then split up into small groups to try and design a product, using the principles outlined in the manifesto. Our group was tasked with designing some product that would help connect hospitals with amateur chefs in their local area, to provide meals for patients at the hospital. We ended up designing a "smart cutting board" with a built-in display, that would show you your recipes as you prepared them, but also collect data on the frequency of your meal preparations, and what types of foods you were preparing.

    Going through the exercise of evaluating the product with each of the design principles was fun. You could be pretty evil going into this and try and collect all customer data :)


  • How to fight an internet shutdown - we role played how we would react if the internet was suddenly shut down during some political protests. What kind of communications would be effective? What kind of preparation can you have done ahead of time for such an event?

    This session was run by Deji from accessnow. It was really eye opening to see how internet shutdowns happen fairly regularly around the world.

  • Data is beautiful: Introduction to Wikidata. Wikidata is like Wikipedia, but for data. An open database of...stuff. Anybody can edit and query the database. One of the really interesting features of Wikidata is that localization is kind of built-in as part of the design. Each item in the database is assigned an id (prefixed by "Q"). E.g. Q42 is Douglas Adams. The description for each item is simply a table of locale -> localized description. There's no inherent bias towards English, or any other language. The beauty of this is that you can reference the same piece of data from multiple languages, only having to focus on localizing the various descriptions. You can imagine different translations of the same Wikipedia page right now being slightly inconsistent due to each one having to be updated separately. If they could instead reference the data in Wikidata, then there's only one place to update the data, and all the other places that reference that data would automatically benefit from it. (A minimal sketch of querying the Wikidata API follows at the end of this list.)

    The query language is quite powerful as well. A simple demonstration was "list all the works of art in the same room in the Louvre as the Mona Lisa."

    It really got me thinking about how powerful open data is. How can we in Release Engineering publish our data so others can build new, interesting and useful tools on top of it?

  • Local web. Various options for purely local webs / networks were discussed. There are some interesting mesh network options available; Commotion was demoed. These kinds of distributions give you file exchange, messaging, email, etc. on a local network that's not necessarily connected to the internet.
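
As promised above, here is a minimal sketch of the locale -> description idea, fetching Douglas Adams (Q42) through the public Wikidata API. It uses the standard wbgetentities action; error handling is omitted and the language list is just an example:

    import requests

    API = "https://www.wikidata.org/w/api.php"

    def describe(qid, languages=("en", "fr", "de")):
        params = {
            "action": "wbgetentities",
            "ids": qid,
            "props": "labels|descriptions",
            "languages": "|".join(languages),
            "format": "json",
        }
        entity = requests.get(API, params=params).json()["entities"][qid]
        for lang in languages:
            label = entity["labels"].get(lang, {}).get("value", "?")
            desc = entity["descriptions"].get(lang, {}).get("value", "?")
            print("{}: {} -- {}".format(lang, label, desc))

    describe("Q42")  # Q42 is Douglas Adams, as mentioned above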

November 19, 2015 01:35 PM

November 18, 2015

Chris Cooper (coop)

Welcome back, Mihai!

Mr. Kotter. This is *not* Mihai.

I’ve been remiss in (re)introducing our latest hire in release engineering here at Mozilla.

Mihai Tabara is a two-time former intern who joins us again, now in a full-time capacity, after a stint as a release engineer at Hortonworks. He’s in Toronto this week with some other members of our team to sprint on various aspects of release promotion.

After a long hiring drought for releng, it’s great to be able to welcome someone new to the team, and even better to be able to welcome someone back. Welcome, Mihai!

November 18, 2015 10:44 PM

November 16, 2015

Nick Thomas (nthomas)

The latest on firefox/releases/latest

The primary way to download Firefox is the main download page on mozilla.org, but Mozilla's Release Engineering team has also maintained directories like firefox/releases/latest/ to provide a stable location for scripted downloads. There are similar links for betas and extended support releases for organisations. Read on to learn how these directories have changed, and how you can continue to download the latest releases.

Until recently these directories were implemented using a symlink to the current version, for example firefox/releases/42.0/. The storage backend has now changed to Amazon S3 and this is no longer possible. To implement the same functionality we’d need a duplicate set of keys, which incurs more maintenance overhead. And we already have a mechanism for delivering files independent of the current shipped version – our download redirector Bouncer. For example, here’s the latest release for Windows 32bit, U.S. English:

Modifying the product, os, and/or lang parameters allows other combinations. This is described in the README.txt files for beta, release, and esr, as well as the Thunderbird equivalents release and beta.
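
As a rough sketch of how a download script might build such a Bouncer link instead of scraping directory listings, assuming the standard product/os/lang query parameters (the values shown are just examples, see the README.txt files above for the full list):

    try:
        from urllib.parse import urlencode   # Python 3
    except ImportError:
        from urllib import urlencode         # Python 2

    BOUNCER = "https://download.mozilla.org/"

    def latest_firefox_url(product="firefox-latest", os="win", lang="en-US"):
        # Bouncer redirects this to the current release, so no version scraping is needed.
        return BOUNCER + "?" + urlencode({"product": product, "os": os, "lang": lang})

    print(latest_firefox_url())                # Windows 32-bit, U.S. English
    print(latest_firefox_url(os="linux64"))    # 64-bit Linux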

Please adapt your scripts to use Bouncer links. We hope it will help you simplify them at the same time, as scraping to determine the current version is no longer necessary.

PS. We've also removed some latest- directories which were old and crufty, e.g. firefox/releases/latest-3.6.

November 16, 2015 11:37 PM

November 13, 2015

Hal Wine (hwine)

Complexity & * Practices


I was fortunate enough to be able to attend Dev Ops Days Silicon Valley this year. One of the main talks was given by Jason Hand, and he made some great points. I wanted to highlight two of them in this post:

  1. Post Mortems are really learning events, so you should hold them when things go right, right? RIGHT!! (Seriously, why wouldn’t you want to spot your best ideas and repeat them?)
  2. Systems are hard – if you’re pushing the envelope, you’re teetering on the line between complexity and chaos. And we’re all pushing the envelope these days - either by getting fancy or getting lean.

Post Mortems as Learning Events

Our industry has talked a lot about “Blameless Post Mortems”, and techniques for holding them. Well, we can call them “blameless” all we want, but if we only hold them when things go wrong, folks will get the message loud and clear.

If they are truly blameless learning events, then you would also hold them when things go right. And go meh. Radical idea? Not really - why else would sports teams study game films when they win? (This point was also made in a great Ignite by Katie Rose: GridIronOps - go read her slides.)

My $0.02 is - this would also give us a chance to celebrate success. That is something we do not do enough, and we all know the dedication and hard work it takes to not have things go sideways.

And, by the way, terminology matters during the learning event. The person who is accountable for an operation is just that: capable of giving an account of the operation. Accountability is not responsibility.

Terminology and Systems – Setting the right expectations

Part way through Jason’s talk, he has this awesome slide about how system complexity relates to monitoring which relates to problem resolution. Go look at slide 19 - here’s some of what I find amazing in that slide:

  • It is not a straight line with a destination. Your most stable system can suddenly display inexplicable behavior due to any number of environmental reasons. And you’re back in the chaotic world with all that implies.
  • Systems can progress out of chaos, but that is an uphill battle. Knowing which stage a system is in (roughly) informs the approach to problem resolution.
  • Note the wording choices: “known” vs “unknowable” – for all but the “obvious” case, it will be confusing. That is a property of the system, not a matter of staff competency.

While not in his slide, Jason spoke to how each level really has different expectations. Or should have, but often the appropriate expectation is not set. Here’s how he related each level to industry terms.

Best Practices:

The only level with enough certainty to be able to expect the “best” is the known and familiar one. This is the “obvious” one, because we’ve all done exactly this before over a long enough time period to fully characterize the system, its boundaries, and abnormal behavior.

Here, cause and effect are tightly linked. Automation (in real time) is possible.

Good Practices:

Once we back away from such certainty, it is only realistic to have less certainty in our responses. With the increased uncertainty, the linkage of cause and effect is more tenuous.

Even if we have all the event history and logs in front of us, more analysis is needed before appropriate corrective action can be determined. Even with automation, there is a latency to the response.

Emergent Practices:

Okay, now we are pushing the envelope. The system is complex, and we are still learning. We may not have all the data at hand, and may need to poke the system to see what parts are stuck.

Cause and effect should be related, but how will not be visible until afterwards. There is much to learn.

Novel Practices:
For chaotic systems, everything is new. A lot is truly unknowable because that situation has never occurred before. Many parts of the system are effectively black boxes. Thus resolution will often be a process of trying something, waiting to see the results, and responding to the new conditions.
Next Steps

There is so much more in that diagram I want to explore. The connecting of problem resolution behavior to complexity level feels very powerful.

<hand_waving caffeine_level=”deprived”>

My experience tells me that many of these subjective terms are highly context sensitive, and in no way absolute. Problem resolution at 0300 local with a bad case of the flu just has a way of making “obvious” systems appear quite complex or even chaotic.

By observing the behavior of someone trying to resolve a problem, you may be able to get a sense of how that person views that system at that time. If that isn’t the consensus view, then there is a gap. And gaps can be bridged with training or documentation or experience.


November 13, 2015 08:00 AM

October 30, 2015

Chris Cooper (coop)

RelEng & RelOps Weekly Highlights - October 30, 2015

Much of Q4 is spent planning and budgeting for the next year, so there's been lots of discussion about which efforts will need support next year, which things we can accommodate with existing resources, and which will need additional resources.

And if planning and budgeting doesn’t scare some Hallowe'en spirit into you, I don’t know what will.

Modernize infrastructure: Q got most of the automated installation of w10 working and is now focusing on making sure that jobs run.

Improve CI pipeline: Andrew (with help from Dustin) will be running a bunch of test suites in TaskCluster (based on TaskCluster-built binaries) at Tier 2.

Release: Callek built v1.5 of the OpenH264 Plugin, and pushed it out to the testing audience. Expecting to go live to our users in the next few weeks.

Callek managed to get “final verify” (an update test with all live urls) working on taskcluster in support of the “release promotion” project.

Firefox 42, our big moment-in-time release for the second half of 2015, gets released to the general public next week. Fingers are crossed.

Operational: Kim and Amy decommissioned about 50% of the remaining panda infrastructure (physical mobile testing boards) after shifting the load to AWS.

We repurposed 30 of our Linux64 talos machines for Windows 7 testing in preparation for turning on some e10s tests.

Kim turned WinXP tests off by default on try to reduce some of our Windows backlog.

Kim implemented some changes to SETA which would allow us to configure the SETA parameters on a per-platform basis.

Rail performed the mozilla-central train uplifts a week early when the Release Management plans shifted, turning Nightly into Gecko 45. The FirefoxOS v2.5 branch, based on Gecko 44, was created as part of the uplift.

Callek investigated a few hours of nothing being scheduled on try on Tuesday, and learned there was an issue with a unicode character in the commit message which broke JSON importing of the pushlog. He then did the work to reschedule all of those jobs.

Industry News: In addition to the work we do at Mozilla, a number of our people are leaders in industry and help organize, teach, and speak. These are some of the upcoming events people are involved with:

Until next week, remember: don't take candy from strangers!

October 30, 2015 09:04 PM

October 23, 2015

Nick Thomas (nthomas)

Updates disabled for Android Nightly and Aurora

Due to a bug with the new ftp server we've had to disable updates for Android Nightly and Aurora.

They’ll resume just as soon as we can get the fix landed.

Update (Oct 25th): Updates are re-enabled, thanks to Mike Shal for the fix.

October 23, 2015 08:39 AM

October 21, 2015

Nick Thomas (nthomas)

Try Server – please use up-to-date code to avoid upload failures

Today we started serving an important set of directories using Amazon S3; more details on that are over in the newsgroups. Some configuration changes landed in the tree to make that happen.

Please rebase your try pushes to use revision 0ee21e8d5ca6 or later, currently on mozilla-inbound. Otherwise your builds will fail to upload, which means they won’t run any tests. No fun for anyone.

October 21, 2015 10:02 AM

October 15, 2015

Morgan Phillips (mrrrgn)

Better Secret Injection with TaskCluster Secrets

Many secret injection services simply store a private key and encrypt data for users. The users then add those encrypted payloads to a job, and the payload is decrypted using the private key associated with their account at run time. I see a few problems with this system.

Today we deployed TaskCluster Secrets, a new service I've been working on for the last two weeks which stores encrypted JSON payloads on behalf of TaskCluster clients. I'm excited about this service because it's going to form the foundation for a new method of secret injection which solves those problems.
How does it work?

In TaskCluster Secrets, each submitted payload (encrypted at rest) is associated with an OAuth scope. The scope also defines which actions a user may take against the secret. For example, to write a secret named 'someNamespace:foo' you'd need the OAuth scope 'secrets:set:someNamespace:foo', to read it you'd need 'secrets:get:someNamespace:foo', and so on.

By tying each secret to a scope, we're able to create an interesting workflow for access from within tasks. In short, we can generate and inject temporary credentials with read-only access. This forces secrets to be accessed via the API and yields several benefits. What's more, we can store the temporary OAuth credentials in an HTTP proxy running alongside a task instead of within it, so that even the credentials are not exposed by default. This way someone could have a snapshot of your task at startup and not gain access to any private data. \o/

Case Study: Setting/Getting a secret for TaskCluster GitHub jobs

1.) Submit a write request to the TaskCluster GitHub API: PUT {githubToken: xxxx, secret: {password: xxxx}, expires: '2015-12-31'}

2.) GitHub OAuth token is used to verify org/repo ownership. Once verified, we generate a temporary account and write the following secret on behalf of our repo owner: myOrg/myRepo:myKey {secret: {password: xxxx}, expires: '2015-12-31'}

3.) CI jobs are launched alongside HTTP proxies which will attach an OAuth header to outgoing requests to the secrets service. The attached token will have the scope secrets:get:myOrg/myRepo:*, which allows any secret set by the owner of the myOrg/myRepo repository to be accessed.

4.) Within a CI task, a secret may be retrieved by simple HTTP calls such as: curl $SECRETS_PROXY_URL/v1/secret/myOrg/myRepo:myKey
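
For completeness, here is the same retrieval sketched in Python, assuming (as in the curl example) that the proxy URL is exposed to the task as SECRETS_PROXY_URL and that the response body mirrors the payload written in step 2:

    import os
    import requests

    proxy_url = os.environ["SECRETS_PROXY_URL"]   # provided alongside the task, per step 3
    response = requests.get("{}/v1/secret/myOrg/myRepo:myKey".format(proxy_url))
    response.raise_for_status()
    # The stored payload from step 2 comes back under the "secret" key.
    password = response.json()["secret"]["password"]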

Easy, secure, and 100% logged.

October 15, 2015 08:40 PM

October 13, 2015

Armen Zambrano G. (@armenzg)

mozci 0.15.1: Buildbot Bridge support + bug fixes

It's been a while since our last announced release and I would like to highlight some of the significant changes since then.
A major highlight of the latest release is support for scheduling Buildbot jobs through TaskCluster, making use of the Buildbot Bridge. For more information about this please read here.


@nikkisquared is our latest contributor who landed more formalized exceptions and error handling. Thanks!
As always, many thanks to @adusca and @vaibhavmagarwal for their endless contributions.

How to update

Run "pip install -U mozci" to update

New Features

  • Query Treeherder for hidden jobs
  • BuildbotBridge support
    • It allows submitting a graph of Buildbot jobs with dependencies
    • The graph is submitted to TaskCluster

Minor changes

  • Fix backfilling issue
  • Password prompt improvements
  • More specific exceptions
  • Skip Windows 10 jobs for talos

All changes

You can see all changes in here:

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

October 13, 2015 07:55 PM

October 09, 2015

Chris Cooper (coop)

RelEng & RelOps Weekly Highlights - October 9, 2015

The beginning of October means autumn in the Northern hemisphere. Animals get ready for winter as the leaves change colour, and managers across Mozilla struggle with deliverables for Q4. Maybe we should just investigate that hibernating thing instead.

Modernize infrastructure: Releng, Taskcluster, and A-team sat down a few weeks ago to hash out an updated roadmap for the buildbot-to-taskcluster migration. As you can see from the document, our nominal goal this quarter is to have 64-bit linux builds *and* tests running side-by-side with the buildbot equivalents, with a stretch goal to actually turn off the buildbot versions entirely. We're still missing some big pieces to accomplish this, but Morgan and the Taskcluster team are tackling some key elements like hooks and coalescing schedulers over the coming weeks.

Aside from Taskcluster, the most pressing releng concern is release promotion. Release promotion entails taking an existing set of builds that have already been created and passed QA and "promoting" them to be used as a release candidate. This represents a fundamental shift in how we deliver Firefox to end users, and as such is both very exciting and terrifying at the same time. Much of the team will be working on this in Q4 because it will greatly simplify a future transition of the release process to Taskcluster.

Improve CI pipeline: Vlad and Alin have 10.10.5 tests running on try and are working on greening up the tests.

Kim started a discussion on dev.planning regarding reducing the frequency of linux32 builds and tests.

Windows tests take a long time, in case you hadn't noticed. This is largely due to e10s, which has effectively doubled the number of tests we need to run per push. We've been able to absorb this extra testing on other platforms, but Windows 7 and Windows 8 have been particularly hard hit by the increased demand, often taking more than 24 hours to work through backlog from the try server. While e10s is a product decision and ultimately in the best interest of Firefox, we realize the current situation is terrible in terms of turnaround time for developer changes. Releng will be investigating updating our hardware pool for Windows machines in the new year. In the interim, please be considerate with your try usage, i.e. don't test on Windows unless you really need to. If you can help fix e10s bugs to make e10s the default on beta/release ASAP, that would be awesome.

Release: The big “moment-in-time” release of Firefox 42 approaches. Rail is on the hook for releaseduty for this cycle, and is overseeing beta 5 builds currently.

Operational: Kim increased the size of the tst-emulator64 spot pool, so we'll be able to enable additional Android 4.3 tests on debug once we have SETA data for them.

Coop (me) spent last week in Romania getting to know our Softvision contractors in person. Everyone was very hospitable and took good care of me. Alin and Vlad took full advantage of the visit to get better insight into how the various releng systems are interconnected. Hopefully this will pay off with them being able to take on more challenging bugs to advance the state of buildduty. Already they’re starting to investigate how they could help contribute to the slave loan tool. Alin and Vlad will also be joining us for Mozlando in December, so look forward to more direct interaction with them there.

See you next week!

October 09, 2015 09:58 PM

October 02, 2015

Hal Wine (hwine)

duo MFA & viscosity no-cell setup


The Duo application is nice if you have a supported mobile device, and it's usable via TOTP even when you have no cell connection. However, getting Viscosity to allow both choices took some work for me.

For various reasons, I don't want to always use the Duo application, so I would like Viscosity to always prompt for a password. (I had already saved a password - a fresh install likely would not have that issue.) That took a bit of work, and some web searches.

  1. Disable any saved passwords for Viscosity. On a Mac, this means opening up “Keychain Access” application, searching for “Viscosity” and deleting any associated entries.

  2. Ask Viscosity to save the “user name” field (optional). I really don’t need this, as my setup uses a certificate to identify me. So it doesn’t matter what I type in the field. But, I like hints, so I told Viscosity to save just the user name field:

    defaults write com.viscosityvpn.Viscosity RememberUsername -bool true

With the above, you’ll be prompted every time. You have to put “something” in the user name field, so I chose to put “push or TOTP” to remind me of the valid values. You can put anything there, just do not check the “Remember details in my Keychain” toggle.

October 02, 2015 07:00 AM

September 25, 2015

Kim Moir (kmoir)

The mystery of high pending counts

In September, Mozilla release engineering started experiencing high pending counts on our test pools, notably Windows, but also Linux (and consequently Android). High pending counts mean that there are thousands of jobs queued up while the machines are busy running other jobs. The time developers have to wait for their test results is longer than ideal.

Usually, pending counts clear overnight as less code is pushed during the night (in North America), which invokes fewer builds and tests. However, as you can see from the graph above, the Windows test pending counts were flat last night. They did not clear up overnight. You will also note that try, which usually comprises 63% of our load, has the highest pending counts compared to other branches. This is because many people land on try before pushing to other branches, and tests aren't coalesced on try.

The work to determine the cause of high pending counts is always an interesting mystery.
Mystery by ©Stuart Richards, Creative Commons by-nc-sa 2.0

Joel Maher and I looked at the data for this last week and discovered what we believe to be the source of the problem. We have determined that since the end of August a number of new test jobs were enabled that increased the compute time per push on Windows by 13%, or 2.5 hours per push. Most of these new test jobs are for e10s.
Increase in seconds that new jobs added to the total compute time per push. (Some existing jobs also reduced their compute time, for a total difference of about 2.5 more hours per push on Windows.)
The e10s initiative is an important effort for Mozilla to make Firefox performance and security even better. However, since new e10s tests and old tests will continue to run in parallel, we need to get creative about how to have acceptable wait times given the limitations of our current Windows test pools. (All of our Windows tests run on bare metal in our datacentre, not on Amazon.)
Release engineering is working to reduce these pending counts given our current hardware constraints with the following initiatives:

To reduce Linux pending counts:
  • Added 200 new instances to the tst-emulator64 pool (run Android test jobs on Linux emulators) (bug 1204756)
  • In process of adding more Linux32 and Linux64 buildbot masters (bug 1205409) which will allow us to expand our capacity more

Ongoing work to reduce the Windows pending counts:

How can you help? 

Please be considerate when invoking try pushes and only select the platforms that you explicitly require to test.  Each try push for all platforms and all tests invokes over 800 jobs.

September 25, 2015 07:47 PM

September 22, 2015

Hal Wine (hwine)

Using Password Store


Password Store (aka “pass”) is a very handy wrapper for dealing with pgp encrypted secrets. It greatly simplifies securely working with multiple secrets. This is still true even if you happen to keep your encrypted secrets in non-password-store managed repositories, although that setup isn’t covered in the docs. I’ll show my setup here. (See the Password Store page for usage: “pass show -c <spam>” & “pass search <eggs>” are among my favorites.)

Short version:
  1. Have gpg installed on your machine.

  2. Install Password Store on your machine. There are OS specific instructions. Be sure to enable tab completion for your shell!

  3. Setup a local password store. Scroll down in the usage section to “Setting it up” for instructions.

  4. Clone your secrets repositories to your normal location. Do not clone inside of ~/.password-store/.

  5. Set up symlinks inside of ~/.password-store/ to directories inside your clone of the secrets repository. I did:

    ln -s ~/path/to/secrets-git/passwords rePasswords
    ln -s ~/path/to/secrets-git/keys reKeys
  6. Enjoy command line search and retrieval of all your secrets. (Use the regular method for your separate secrets repository to add and update secrets.)


  • By using symlinks, pass will not allow me to create or update secrets in the other repositories. That prevents mistakes, as the process is different for each of those alternate stores.
  • I prefer to have just one tree of secrets to search, rather than the “multiple configuration” approach documented on the Password Store site.
  • By using symlinks, I can control the global namespace, and use names that make sense to me.
  • I’ve migrated from using KeePassX to using pass for my personal secret management. That is my “main” password-store setup (backed by a git repo).


  • If you’d prefer a GUI, there’s qtpass which also works with the above setup.

September 22, 2015 07:00 AM

September 21, 2015

Armen Zambrano G. (@armenzg)

Minimal job scheduling

One of the biggest benefits of moving the scheduling into the tree is that you can adjust the decisions on what to schedule from within the tree.

As chmanchester and I were recently discussing, this is very important as we believe we can do much better on deciding what to schedule on try.

Currently, too many developers push to try with -p all -u all (which schedules all platforms and all tests). It is understandable: it is the easiest way to reduce the risk of your change being backed out when it lands on one of the integration trees (e.g. mozilla-inbound).

In-tree scheduling analysis

What if your changes were analysed and we determined a best educated guess at the set of platforms and test jobs required to test your changes so that they would not be backed out on an integration tree?

For instance, when I push Mozharness changes to mozilla-inbound, I wish I could tell the system that I only need this set of platforms and not those other ones.

If everyone had the minimum number of jobs added to their pushes, our systems would be able to return results faster (less load) and no one would need to take shortcuts.

This would be the best approximation, and we would need to fine-tune the logic over time to get things as right as possible. We would need to find the right balance between some changes being backed out because we did not get the right scheduling on try, and getting results faster for everyone.

Prioritized tests

There is already some code that chmanchester landed where we can tell the infrastructure to run a small set of tests based on the files changed. In this case we hijack one of the jobs (e.g. mochitest-1) to run the tests most relevant to your changes, which would normally be tested in different chunks. Once the prioritized tests are run, we can run the remaining tests as we would normally do. Prioritized tests also apply to suites that are not chunked (running a subset of the tests instead of all of them).
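
As a purely illustrative sketch (the path prefixes and suite names below are made up, not taken from the actual implementation), the heart of such a heuristic is a mapping from changed paths to the suites worth running first:

    # Illustrative only: guessing which suites to prioritize from the files a push touches.
    PRIORITIES = {
        "testing/mozharness/": ["marionette", "mochitest-1"],
        "dom/":                ["mochitest-1", "web-platform-tests-1"],
        "js/src/":             ["jsreftest", "jittest"],
    }

    def prioritized_suites(changed_files):
        suites = set()
        for path in changed_files:
            for prefix, candidates in PRIORITIES.items():
                if path.startswith(prefix):
                    suites.update(candidates)
        return sorted(suites)

    print(prioritized_suites(["dom/base/nsDocument.cpp"]))
    # ['mochitest-1', 'web-platform-tests-1']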

There are some UI problems in here that we would need to figure out with Treeherder and Buildbot.

Tiered testing

Soon, we will have all the technological pieces to create a multi-tiered job scheduling system.

For instance, we could run things in this order (just a suggestion):

This has the advantage of using prioritized tests as a canary job which would prevent running the remaining tests if we fail the canary (shorter) job.

Post minimal run (automatic) precise scheduling (manual)

This is not specifically about scheduling the right thing automatically but about extending what gets scheduled automatically.
Imagine that you're not satisfied with what gets scheduled automatically and you would like to add more jobs (e.g. missing platforms or missing suites).
You will be able to add those missing jobs later directly from Treeherder by selecting which jobs are missing.
This will be possible once bug 1194830 lands.

NOTE: Mass scheduling (e.g. all mochitests across all platforms) would be a bit of a pain to do through Treeherder. We might want to do a second version of try-extender.

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

September 21, 2015 01:15 PM

September 18, 2015

Chris Cooper (coop)

RelEng & RelOps Weekly Highlights - September 18, 2015

Pending job numbers continued to be a concern this week. Investigations are underway to look for slowdowns unrelated to the enabling of e10s tests, which on its own has doubled the number of tests run in many cases. More information below.

Modernize infrastructure: Dustin participated in the TaskCluster work-week, discussing plans for TaskCluster itself and for Releng’s work to port the CI and release processes to run on the TaskCluster platform.

Morgan gave a fantastic presentation on Air Mozilla describing how GitHub / TaskCluster integration works.

Improve CI pipeline: We’re ready to un-hide OS X and Linux64 builds via TaskCluster in TreeHerder, elevating them to “tier 2” status. This is a necessary precursor to replacing the buildbot-generated versions of these builds.

Jordan landed a patch to enable bundleclone for mock-based builds, which may help fix problems with the Android nightly builds.

Alin and Vlad are working on releng configs to add new 10.10 hardware to the test pool.

Release: Ben continues to work out a plan to cope with SHA-1 certificate deprecation.

We are entering the end-game for Firefox 41. Release candidate builds are underway.

Operational: Kim and Vlad increased the size of the tst-emulator-64 pool by 200 instances, which has significantly reduced the wait times for Android tests that use this instance type.

Kim is also in the process of bringing up four new buildbot masters to serve these expanding pools and reduce some of the buildbot lag we have seen in our monitoring tools.

We have had high pending counts for the past few weeks which have significantly increased wait times, especially for Windows tests on Try. Joel Maher (from the Developer Productivity team) and Kim analyzed the data for the end-to-end test times for Windows for the past month. They discovered that total compute time per push has increased by around 13%, or 2.5 compute hours on Windows, primarily driven by the addition of new e10s tests. Given that our pool of Windows machines has a fixed size, we are looking at ways to reduce the wait times given existing hardware constraints.

See you again next week!

September 18, 2015 09:00 PM

Armen Zambrano G. (@armenzg)

Mozharness' support for Buildbot Bridge triggered test jobs

Today I landed [1] some code which allows Buildbot *test* jobs triggered through the Buildbot Bridge (BBB) to work properly.

In my previous post I explained a bit about how mozci works with the Buildbot Bridge.
In this post I will only explain what we fixed on the Mozharness side.

Buildbot Changes

If a Buildbot test job is scheduled through TaskCluster (the Buildbot Bridge supports this), then the generated Buildbot Change associated with the test job does not have the installer and test URLs that Mozharness needs in order to download the files for a test job.

What is a Buildbot Change? It is an object which represents the files touched by a code push. For build jobs, this value gets set as part of the process of polling the Mercurial repositories; test jobs, however, are triggered via a "buildbot sendchange" step that is part of the build job.
This sendchange creates the Buildbot Change for the test job, which Mozharness can then use.
The BBB does not listen for sendchanges, hence jobs triggered via the BBB have an empty changes object. Therefore, we can't download the files needed to run a test job, and the job fails to execute.

In order to overcome this limitation, we have to detect whether a Buildbot job was triggered normally or through the Buildbot Bridge.
Buildbot Bridge triggered jobs have a 'taskId' property defined (this represents the task associated with this Buildbot job). Through this 'taskId' we can determine the parent task and find a file called properties.json [2], which is uploaded by every BBB-triggered job.
In that file, we can find both the installer and test URLs.
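
As a rough sketch of that lookup (the queue endpoint shapes are assumptions about the TaskCluster API of the time, and the property names are only illustrative, not the actual fix):

    import requests

    QUEUE = "https://queue.taskcluster.net/v1"   # assumed queue base URL

    def installer_and_test_urls(task_id):
        # Fetch the task definition for the BBB-triggered test job.
        task = requests.get("{}/task/{}".format(QUEUE, task_id)).json()
        # Assumption: the parent build task can be reached via the task group.
        parent_id = task["taskGroupId"]
        # properties.json is uploaded by every BBB-triggered job (see above).
        artifact_url = "{}/task/{}/artifacts/public/properties.json".format(QUEUE, parent_id)
        properties = requests.get(artifact_url).json()
        # Illustrative property names; the real names come from the build job.
        return properties.get("packageUrl"), properties.get("testPackagesUrl")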


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

September 18, 2015 07:26 PM

Mozilla CI tools: Upcoming support for Buildbot Bridge

What is the Buildbot Bridge?

The Buildbot Bridge (BBB) allows scheduling Buildbot jobs through TaskCluster.
In other words, you can have taskcluster tasks which represent Buildbot jobs.
This allows having TaskCluster graphs composed of tasks which will be executed either on Buildbot or TaskCluster, hence allowing *all* relationships between tasks to happen in TaskCluster.

Read my recent post on the benefits of scheduling every job via TaskCluster.

The next Mozilla CI tools (mozci) release will have support for BBB.

Brief explanation

You can see in this try push both types of Buildbot jobs [1].
One set of jobs was triggered through Buildbot's analysis of the try syntax in the commit message, while two of the jobs should not have been scheduled by that syntax.

Those two jobs were triggered off-band via Mozci submitting a task graph.
You can see the TaskCluster graph representing them in here [2].

These jobs were triggered using this simple data structure:
  {
      'Linux x86-64 try build': [
          'Ubuntu VM 12.04 x64 try opt test mochitest-1'
      ]
  }

Mozci turns this simple graph into a TaskCluster graph.
The graph is composed of tasks which follow this structure [3]

Notes about the Buildbot Bridge

bhearsum's original post, recording and slides:

Some notes which Selena took about the topic:

The repository is in here:


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

September 18, 2015 07:16 PM

September 17, 2015

Armen Zambrano G. (@armenzg)

Platform Operations lightning talks (Whistler 2015)

You can read about and watch the Platform Operations lightning talks here:

Here are the landing pages for the various Platform Operations teams:

PS: I don't know everything that composes Platform Operations, so feel free to add your team if I'm missing it.

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

September 17, 2015 03:57 PM

September 11, 2015

Chris Cooper (coop)

RelEng & RelOps Weekly highlights - September 11, 2015

Not sure how it was for you, but that was a deceptively busy week.

Modernize infrastructure: Amy created a new OS X 10.10.5 deployment procedure and installed the first 64 of our 200 new mac minis (for Firefox/Thunderbird testing). Further work needs to be done to validate the move to new hardware and upgrade to 10.10.5 and to rebase the timing tests.

Jonas rolled out support for remote signature validation and

Jordan is working on adding some Android variant builds to TaskCluster (TC). As part of that process, he’s also documenting his efforts to create a HOWTO for devs so they can self-serve in TC in the future.

Ted hooked up cross-compiled Mac builds running in TC to try. This is the first step to moving Mac build load off of physical hardware. This is huge.

Improve CI pipeline: Our intern, Anthony, gave his end-of-internship presentation on Thursday with details about the various improvements he made to TC over the summer. If you missed it, you can watch it over on Air Mozilla.

Release: Firefox 41.0 beta 9 is in the pipe this week, along with Thunderbird 41.0 beta 1 (build #2).

Operational: Amy tracked down a bunch of configuration warnings on our puppet servers, filed bugs to get them fixed, and set up some notifications from our log hosts so that we learn about such known problems within 10 minutes.

Greg is rolling out a change to taskcluster-vcs to reduce parallelization for "repo", and hopefully improve TaskCluster's behavior when 500s are thrown. So far, performance changes appear to be a wash, with some jobs taking slightly longer and others finishing slightly faster.

Some faulty puppet changes this week caused tree closures on two separate days: the initial landing caused all POSIX systems to loop indefinitely in runner, and then that same change propagated into the new AMIs for spot instances the next day. Morgan has been working on a way to do tiered roll-outs of new AMIs using "canary" instances to avoid this kind of cascading puppet failure in the future.

See you next week!

September 11, 2015 11:21 PM

Morgan Phillips (mrrrgn)

TaskCluster GitHub Has Landed

TaskCluster-based CI has landed for Mozilla developers. One can begin using the system today by simply dropping a .taskcluster.yml file into the base of their repository. For an example configuration file and other documentation, please see:

To get started ASAP, steal this config file and replace the npm install . && npm test section with whatever commands will run your project's test suite. :)

September 11, 2015 08:55 PM

September 10, 2015

Armen Zambrano G. (@armenzg)

The benefits of moving per-push Buildbot scheduling into the tree

Some of you may be aware of the Buildbot Bridge (aka BBB) work that bhearsum worked on during Q2 of this year. This system allows scheduling TaskCluster graphs for Buildbot builders. For every Buildbot job, there is a TaskCluster task that represents it.
This is very important as it will help to transition the release process piece by piece to TaskCluster without having to move large pieces of code at once. You can have graphs of tasks that run on both Buildbot and TaskCluster.

I recently added to Mozilla CI tools the ability to schedule Buildbot jobs by submitting a TaskCluster graph (the BBB makes this possible).

Even though the initial work for the BBB is intended for Release tasks, I believe there are various benefits if we moved the scheduling into the tree (currently TaskCluster works like this; look for the gecko decision task in Treeherder).

To read another great blog post around try syntax and scheduling, please visit ahal's post "Looking beyond Try Syntax".

NOTE: Try scheduling might not have try syntax in the future so I will not talk much about trychooser and try syntax. Read ahal's post to understand a bit more.

Benefits of in-tree scheduling:

There are various parts that will need to be in place before we can do this. Here are some that I can think of:
  • TaskCluster's big-graph scheduling
    • This is important since it will allow for the concept of coalescing to exist in TaskCluster
  • Task prioritization
    • This is important if we're to have different levels of priority for jobs on TaskCluster
    • On Buildbot we have release repositories with the highest priority and the try repo having the lowest
    • We also currently have the ability to raise/decrease task priorities through self-serve/buildapi. This is used by developers, especially on try, to allow their jobs to be picked up sooner.
  • Treeherder to support LDAP authentication
    • It is a better security model for scheduling changes
    • If we want to move away from self-serve/buildapi we need this
  • Allow test jobs to find installer and test packages
    • Currently test jobs scheduled through the BBB cannot find the Firefox installer and the test packages they need
Can you think of other benefits? Can you think of problems with this model? Are you aware of other pieces needed before moving forward to this model? Please let me know!

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

September 10, 2015 02:13 PM

September 08, 2015

Chris Cooper (coop)

URES’15 Call for Participation Extended

By request, the deadline for submissions for the third USENIX Release Engineering Summit (URES ‘15) has been extended. URES ‘15 will take place during LISA15, on November 13, 2015, in Washington, D.C.

If you would like to present a full-length or lightning talk on a release engineering topic, you can find more details on the submission process here:

September 08, 2015 06:15 PM

September 04, 2015

Chris Cooper (coop)

RelEng & RelOps Weekly highlights - September 04, 2015

catlee and coop returned from PTO, and lo, there was much rejoicing, mostly from Amy who held down the fort while the rest of her management team was off doing other stuff.

Modernize infrastructure: Thanks to Morgan, we can now trigger TaskCluster (TC) jobs based on GitHub pushes and pull requests. See the TC docs for more info:

Dustin started a discussion thread about changing how we do Linux builds in TC.

Improve CI pipeline: We created a development environment for the Buildbot Taskcluster Bridge, which will make development of it faster and safer.

Various improvements to the Buildbot Taskcluster Bridge

We made some upgrades to releng systems to allow them to take advantage of mercurial clones being served from the CDN. See gps's blog post for more details.

Release: We put the finishing touches on a couple of releases this week, namely Thunderbird 38.2.0 and Firefox 41.0b7. Jordan just stepped into the releaseduty role for the Firefox 41 cycle and is doing great work by all accounts.

Operational: The tree-closing window (TCW) came and eventually went over the weekend. A few things went sideways:

Thanks to everyone who helped wrestle our systems back under control: Hal, Rail, Jordan, and especially Nick who spent a substantial portion of his New Zealand weekend getting things working again.

See you again next week!

September 04, 2015 09:22 PM


September 03, 2015

Chris Cooper (coop)

RelEng & RelOps Weekly highlights - August 23, 2015

Welcome to a double issue of our weekly highlights email, covering the last two action-packed weeks in RelEng and RelOps!

Modernize infrastructure: Rob debugged the spot deployment process and checked in changes to allow cloud tools to allocate Windows spot instances: b-2008-spot (build) & y-2008-spot (try). Testing is ongoing to validate that these instances are capable of performing build work.

Greg and Jonas have been working closely with Treeherder folks and Ed Morley pushed a change that dramatically improves UX for Sheriffs looking at TC jobs.



Load testing for separating auth into its own service for TaskCluster is complete. See Jonas for details.

Mike rolled out the new indexing to unify the routes between Taskcluster and Buildbot builds.

Callek continues to make progress on getting Windows 10 tests green on our infra. Special thanks to the developer support he has received thus far in getting the various issues addressed. At the same time, Q has been reworking our infra to support Windows 10 in our deployment toolchains (which will enable us to bring up more Windows 10 machines to begin to meet capacity needs).

Improve release pipeline: Ben and Rail made a breakthrough in how to test the new Release Promotion code, which will let us move faster and with more confidence on that project.

The Ship It/release runner development environment is now fully functional, which lets us easily do end to end testing of Release Promotion.

Improve CI pipeline: Callek disabled the jetpack tests running on try that have been broken since addon signing landed.

Kim disabled the Android 4.3 debug robocop tests on try, which were broken and hidden.

Kim changed the instance type that Android 4.3 mochitests run on to c3.xlarge, which will allow us to run media tests on emulators.

Release: Released Firefox 40.0, 40.0.2, 41.0b1, and 41.0b2 for both Desktop and Android (we built 40.0.1 but did not release it). We also built Thunderbird 38.2.

Operational: Amy reimaged the remaining MacOSX 10.8 machines as 10.10 and Kim deployed them to our Yosemite pool, which increased the size of this pool by 10%. Kim also removed the remaining 10.8 configs from our code base.

Rail fixed the bug that was preventing us from reimaging linux32 talos machines, and Alin was able to reimage these machines and add them to our production pool.

Jordan performed the merge-day [uplift-day] config changes.

Callek removed support for building esr31 now that it is EOL, allowing us to clean up a bunch of now-obsolete complexity.

After discussions with the A-team and sheriffs, Kim increased the parameters for SETA so that tests that historically don’t identify regressions will only run on every 7th push or every 60 minutes on m-i and m-c. We hope to revisit this in a few weeks to increase it back to every 10th push and every 90 minutes. Thanks to Alice and Armen for writing mozci tools to allow the sheriffs to backfill jobs to avoid the backout issues on merges that SETA revealed when it was first implemented.

See you next week!

September 03, 2015 10:38 PM

August 07, 2015

Chris Cooper (coop)

RelEng & RelOps Weekly highlights - August 07, 2015

Wow, what a week. Between tree closures and some unexpected security releases, release engineering was stretched pretty thin. Here’s hoping for a more “normal” week next week as we try to release Firefox 40.

Modernize infrastructure: Greg Arndt deployed a fix to mozilla-taskcluster to eliminate noisy ‘deadline-exceeded’ dependant tasks whose parent tasks fail. This improves sheriffing and is part of the work to make TaskCluster and TreeHerder the best of pals.

Jake Watkins implemented timezone and w32time configurations via puppet and compiled ntpdate for Windows (for stepping the clock on systems where w32time has been disabled). This allows us to control time on Windows machines without needing the AD domain.

Q and Callek have managed to get one Windows 10 host connected to the Try server with buildbot. The machine has successfully run a selection of tests so far, some of which are even passing! We still have more work to do in order to green up the other tests and need to create more machines for the pool before we can open the platform up for general use.

Dustin got a Linux build running in a CentOS 6.6 docker image within TaskCluster. There are lots of things to fix, but this will produce much more compatible builds than the earlier attempts with an Ubuntu 14.04 docker image.

Improve release pipeline: Nick increased the l10n and update test chunking, shaving multiple hours off of the release process. This helped immensely as we prepared the release builds for Firefox 40.

Improve CI pipeline: Morgan tweeted about her latest work on .taskclusterrc, bringing WebMaker CI over:

Release: Urgent Firefox security releases shipped on multiple branches in under 24 hours! This all happened in the shadow of next week’s milestone release of Firefox 40, and the regular parade of beta builds. Kudos to those on release duty, specifically catlee and Nick, for getting them all out the door without tripping over each other.

Operational: John Ford and Jonas Jensen debugged a frustrating problem in the TC provisioner causing it to fail unexpectedly. Many yaks were shaved. The cause was an obscure bit of logic calling a buggy library in one of our dependencies. Since deploying the fix last week, the TC provisioner has been stable.

Greg blogged about TaskCluster and try:

Dustin deployed a new version of relengapi that includes support for database migrations (with Alembic), improvements to Archiver, and a new (but not-yet-used) implementation of treestatus.

The buildduty contractors continue to make strides as we knock down access hurdles for them. They are now able to handle slave loans with minimal intervention from releng or IT, and can now update python packages on our internal servers when requested by developers.

See you next week!

August 07, 2015 09:35 PM

July 31, 2015

Chris Cooper (coop)

RelEng & RelOps Weekly highlights - July 31, 2015

Welcome back to the weekly releng Friday update! Here’s what we’ve been up to this week.

Modernize infrastructure: Rob checked in code to integrate Windows with our AWS cloud-tools software so that we now have userdata for deploying spot instances ( as well as creating the golden AMI for those instances.

Mark checked in code to update our puppet-managed Windows machines with a newer version of mercurial, working around some installation oddities.

Now that Windows 10 has been officially released, Q can more easily tackle the GPOs that configure our test machines, verifying which don’t need changes, and which will need an overhaul. Callek is working to get buildbot set up on Windows 10 so we can start figuring out which suites are failing and engage developers for help.

Improve CI pipeline: With the last security blockers resolved and a few weeks of testing under his belt, Rail is planning to enable Funsize on mozilla-central next Tuesday.

Release: Uplift starts next week, followed by the official go-to-build for Firefox 40. Beta 9 is out now.

Operational: Buildduty contractors started this week! Alin (aselagea) and Vlad (vladC) from Softvision are helping releng with buildduty tasks. Kim and Coop are trying to get them up-to-speed as quickly as possible. They’re finding lots of undocumented assumptions built into our existing release engineering documentation.

Dustin has migrated our celery backend for relengapi to mysql since we were seeing reliability issues on the rabbit cluster we had been using.

Our intern, Anthony Miyaguchi, added database upgrade/downgrade ability to relengapi via alembic, making future schema changes painless.

Amy has finished replacing the two DeployStudio servers with newer hardware, OS, and deployment software, and we are now performing local Time Machine backups of their data. Offsite backups will follow once Bacula releases a new version of their software that correctly supports TLS 1.2.

The new Windows machines we set up last week are now in production, increasing capacity by 10 machines each in the Windows XP, Windows 7, and Windows 8 test pools.

See you next week!

July 31, 2015 07:33 PM

July 30, 2015

Hal Wine (hwine)

Decoding Hashed known_hosts Files

Decoding Hashed known_hosts Files

tl;dr: You might find this gist handy if you enable HashKnownHosts

Modern ssh comes with the option to obfuscate the hosts it can connect to, by enabling the HashKnownHosts option. Modern server installs have that as a default. This is a good thing.

The obfuscation occurs by hashing the first field of the known_hosts file - this field contains the hostname, port, and IP address used to connect to a host. Presumably, there is a private ssh key on the host used to make the connection, so this process makes it harder for an attacker to utilize those private keys if the server is ever compromised.

Super! Nifty! Now how do I audit those files? Some services have multiple IP addresses that serve a host, so some updates and changes are legitimate. But which ones? It’s a one-way hash, so you can’t decode it.

Well, if you had an unhashed copy of the file, you could match host keys and determine the host name & IP. [1] You might just have such a file on your laptop (at least I don’t hash keys locally). [2] (Or build a special file by connecting to the hosts you expect with the options “-o HashKnownHosts=no -o UserKnownHostsFile=/path/to/new_master”.)

I threw together a quick python script to do the matching, and it's at this gist. I hope it's useful - as I find bugs, I'll keep it updated.
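
For reference, here's a minimal sketch of the matching idea in Python (my own illustration, not the gist itself), assuming the standard OpenSSH hashed format of |1|<base64 salt>|<base64 HMAC-SHA1 of the hostname, keyed with the salt>:

import base64
import hashlib
import hmac

def hashed_field_matches(hashed_field, hostname):
    # hashed_field is the first token of a hashed known_hosts line,
    # e.g. "|1|<base64 salt>|<base64 digest>"
    _, magic, salt_b64, digest_b64 = hashed_field.split("|")
    if magic != "1":
        raise ValueError("unexpected known_hosts hash format")
    salt = base64.b64decode(salt_b64)
    expected = base64.b64decode(digest_b64)
    computed = + hostname.encode(), hashlib.sha1).digest()
    return hmac.compare_digest(computed, expected)

# Compare against each host name (or IP) taken from your unhashed reference file:
# hashed_field_matches(hashed_field, "")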

Bonus Tip:

Is a very nice way to manage gists from the command line.


[1] A lie - you’ll only get the host names and IPs that you have connected to while building your reference known_hosts file.
[2] I use other measures to keep my local private keys unusable.

July 30, 2015 07:00 AM

July 27, 2015

Chris Cooper (coop)

The changing face of buildduty, Summer 2015 edition


Buildduty is the Mozilla release engineering (releng) equivalent of front-line support. It’s made up of a multitude of small tasks, none of which on their own are particularly complex or demanding, but which taken in aggregate can amount to a lot of work.

It’s also non-deterministic. One of the most important buildduty tasks is acting as information brokers during tree closures and outages, making sure sheriffs, developers, and IT staff have the information they need. When outages happen, they supersede all other work. You may have planned to get through the backlog of buildduty tasks today, but congratulations, now you’re dealing with a network outage instead.

Releng has struggled to find a sustainable model for staffing buildduty. The struggle has been two-fold: finding engineers to do the work, and finding a duration for a buildduty rotation that doesn’t keep the engineer out of their regular workflow for too long.

I’m a firm believer that engineers *need* to be exposed to the consequences of the software they write and the systems they design.

I also believe that it’s a valuable skill to be able to design a system and document it sufficiently so that it can be handed off to someone else to maintain.

Starting this week, we’re trying something new. We’re shifting at least part of the burden to management: I am now managing a pair of contractors who will be responsible for buildduty for the rest of 2015.

Alin and Vlad are our new contractors, and are both based in Romania. Their offset from Mozilla Standard Time (aka PST) will allow them to tackle the asynchronous activities of buildduty, namely slave loans, non-urgent developer requests, and maintaining the health of the machine pools.

It will take them a few weeks to find their feet since they are unfamiliar with any of the systems. You can find them on IRC in the usual places (#releng and #buildduty). Their IRC nicks are aselagea and vladC. Hopefully they will both be comfortable enough to append |buildduty to those nicks soon. :)

While Alin and Vlad get up to speed, buildduty continues as usual in #releng. If you have an issue that needs buildduty assistance, please ask in #releng, and someone from releng will assist you as quickly as possible. For less urgent requests, please file a bug.

July 27, 2015 10:07 PM

July 24, 2015

Chris Cooper (coop)

RelEng & RelOps Weekly highlights - July 24, 2015

Welcome back. When we last left our heroes, they were battling the combined forces of technical debt and a lack of self-service options. We join the fight already in progress…

Kapow!

Modernize infrastructure: To pave the way for creating continuous integration (CI) automation for Windows 10, Q is auditing all of our Windows 8 GPOs to determine which will work as-is on Windows 10, which will no longer be needed, and which will require rewriting to work on the new platform.

Dustin has completed the taskcluster scope handling audit, reported his findings back to the team, and filed bugs for remediation.

Rail has deployed a change that allows us to specify docker images by their sha256 in TaskCluster, reducing the risk of MITM attacks. This was one of the hard security blockers for Funsize.

After much discussion, we’ve chosen to move forward with installing hg as an EXE on Windows for the time being. Mark is implementing this method so we can continue progress towards moving Windows 2008 builds into AWS.

Morgan has a prototype of some github/TaskCluster integration: if your project lives in github/mozilla, you can drop a .taskclusterrc file in the base of your repository and the jobs will just start running after each pull request.

Dustin is migrating treestatus to relengapi, removing one of the blockers to exiting the PHX1 datacenter and centralizing another of our many web apps.

Amy is working on replacing the servers we use to image mac builders and testers. This will allow us to perform backups of critical information and will prepare us for the new OS X 10.10 hardware that’s in the purchasing pipeline now.

Kim disabled Android 4.0 test jobs by default on Try as another step toward disabling Pandas as a test platform as we move Android 4.3 test jobs to emulators on AWS.

Today is the last day of Anhad’s internship. :( His end-of-internship presentation is now available on Air Mozilla. This week he met with Anthony to hand off his work on getting Windows builds working with the generic worker in TaskCluster.

Improve release pipeline: Ben worked with OpSec to generate a new GPG signing key (replacing our expired one) and deploy it to our Nightly and Release signing servers. We are also working to improve the monitoring around signing key expiry to avoid future fire drills.

Improve CI pipeline: Jordan has re-deployed the change that switches over all future CI and release jobs to using a Gecko-based copy of Mozharness.

Ben has been working towards stopping automated builds of XULRunner for the Firefox 42.0 cycle, starting September 22nd, 2015.

Release: Firefox 40 is currently in beta. We’re up to b7 now.

Operational: Amy and Coop have worked with DCops to re-balance the Linux/Windows test pools, removing 30 machines from the Linux talos pools and increasing capacity by 10 machines each in the Windows XP, Windows 7, and Windows 8 test pools.

We giveth: This week we enabled the B2G 2.2r branch.

…and we taketh away: We also disabled many obsolete B2G builds/branches to improve throughput and reclaim capacity.

vcs-sync is now running in AWS! Hal made the official switch this week after running both setups in parallel for a while. This allows us to retire some ancient hardware in the datacenter.

Callek touched over 55 bugs this week as buildduty, many of them during triage and resolution of machine loans.

Will our heroes emerge victorious? Tune in next week!

July 24, 2015 04:57 PM

July 23, 2015

Morgan Phillips (mrrrgn)

git push origin taskcluster

If you've been around Mozilla during the past two years, you've probably heard talk about TaskCluster - the fancy task execution engine that a few awesome folks built for B2G - which will be used to power our entire CI/Build infrastructure soonish.

Up until now, there have been a few ways for developers to schedule custom CI jobs in TaskCluster; but nothing completely general. However, today I'd like to give sneak peak at a project I've been working on to make the process of running jobs on our infra extremely simple: TaskCluster GitHub.

Why Should You Care?

1.) The service watches an entire organization at once: if your project lives in github/mozilla, drop a .taskclusterrc file in the base of your repository and the jobs will just start running after each pull request - dead simple.

2.) TaskCluster will give you more control over the platform/environment: you can choose your own docker container by default, but thanks to the generic worker we'll also be able to run your jobs on Windows XP-10 and OSX.

3.) Expect integration with other Mozilla services: For a mozilla developer, using this service over travis or circle should make sense, since it will continue to evolve and integrate with our infrastructure over time.

It's Not Ready Yet: Why Did You Post This? :(

Because today the prototype is working, and I'm very excited! I also feel that there's no harm in spreading the word about what's coming.

When this goes into production I'll do something more significant than a blog post, to let developers know they can start using the system. In the meantime, here it is handling a replay of this pull request. \o/ Note: the finished version will do nice things, like automatically leave a comment with a link to the job and its status.

July 23, 2015 04:26 AM

July 22, 2015

Armen Zambrano G. (@armenzg)

Few mozci releases + reduced memory usage

While I was away, adusca shipped a few releases of mozci.

From the latest release I want to highlight that we're replacing the standard json library with ijson since it solves some memory leak issues we were facing in pulse_actions (bug 1186232).

This was important to fix since our Heroku instance for pulse_actions has an upper limit of 1GB of RAM.
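
To illustrate the difference (this is just a sketch, not the actual pulse_actions code, and it assumes a top-level "builds" array like the one in the buildjson files): json.load() materializes the whole document in memory at once, while ijson streams one item at a time, keeping the resident set small.

import json
import ijson  # third-party: pip install ijson

def builds_all_at_once(path):
    with open(path) as f:
        return json.load(f)["builds"]  # entire document resident in memory

def builds_streamed(path):
    with open(path, "rb") as f:
        for build in ijson.items(f, "builds.item"):  # one build dict at a time
            yield build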

Here are the release notes and the highlights of them:

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

July 22, 2015 02:21 PM

July 21, 2015

Jordan Lund (jlund)

Mozharness now lives in Gecko

What's changed?

Continuous-integration and release jobs that use Mozharness will now get Mozharness from the Gecko repo that the job is running against.


Whether the job is a build (requires a full gecko checkout) or a test (only requires a Firefox/Fennec/Thunderbird/B2G binary), automation will first grab a copy of Mozharness from the gecko tree, even before checking out the rest of the tree. This effectively minimizes changes to our current infra.

This is thanks to a new relengapi endpoint, Archiver, and hg.mozilla.org's subdirectory archiving abilities. Essentially, Archiver will get a tarball of Mozharness from a target gecko repo, rev, and sub-repo directory and upload it to Amazon's S3.

What's nice about Archiver is that it is not restricted to just grabbing Mozharness. You could, for example, put in the Gecko tree or, improving on our model, simply grab subdirectories from within the testing/* part of the tree and request them on a suite-by-suite basis.

What does this mean for you?

it depends. if you are...

1) developing on Mozharness

You will need to check out gecko, and patches will now land like any other gecko patch: 1) land on a development tree branch (e.g. mozilla-inbound), 2) ride the trains. This now means:

This also means:

2) just needing to deploy Mozharness or get a copy of it without gecko

As in the Archiver usage docs linked above, you could hit the API directly. But I recommend using the client that buildbot uses. The client will wait until the api call is complete, download the archive from a response location, and unpack it to a specified destination.

Let's take a look at that in action: say you want to download and unpack a copy of mozharness based on mozilla-beta at 93c0c5e4ec30 to some destination.

python mozharness --repo releases/mozilla-beta --rev 93c0c5e4ec30 --destination /home/jlund/downloads/mozharness  

Note: if that was the first time Archiver was polled for that repo + rev, it might take a few seconds as it has to download Mozharness from hgmo and then upload it to S3. Subsequent calls will happen near instantly

Note 2: if your --destination path already exists with a copy of Mozharness or something else, the client won't rm that path, it will merge (just like unpacking a tarball behaves)

3) a Release Engineering service that is still using

Not all Mozharness scripts are used for continuous integration / release jobs. There are a number of Releng services that are based on Mozharness: e.g. Bumper, vcs-sync, and merge_day. Until these services transition to using Archiver, they will continue to use hgmo/build/mozharness as the Repository of Record (RoR).

If certain services can not use gecko-based Mozharness, then we can fork Mozharness and set up a separate repo. That will of course mean such services won't receive upstream changes from the gecko copy, so we should avoid this if possible.

If you are an owner or major contributor to any of these releng services, we should meet and talk about such a transition. Archiver and its client should make deployments pretty painless in most cases.

Have something that may benefit from Archiver?

If you want to move something into a larger repository or be able to pull something out of such a repository for lightweight deployments, feel free to chat to me about Archiver and Relengapi.

As always, please leave your questions, comments, and concerns below

July 21, 2015 11:28 PM

July 20, 2015

Chris Cooper (coop)

RelEng & RelOps Weekly highlights - July 17, 2015

Welcome to the weekly releng Friday update, only this time on a Monday!

I’ve done away with the gory details section. It was basically a thin filter for bugzilla search results, and we all spend enough time in bugzilla already.


TaskCluster: Funsize is generating partial updates for nightly/aurora builds now! We’re generating partial updates for up to 4 days in the past (the TreeHerder results are hidden by default).

You can set your update channel to ‘nightly-funsize’ to test.

This quarter, we’re working on a scopes and authentication/credentials audit of TaskCluster to make sure it’s secure enough to move build/testing load from buildbot to TaskCluster. Hal is leading this effort with the OpSec team.

Our interns are also hard at work on migrations to TaskCluster. Anhad finished his work migrating spidermonkey builds and tests, while Anthony is working on uploading symbols via a separate task.

Modernize infrastructure: Runner is now enabled on all our Windows build machines. One of the biggest benefits of this is that runner is performing most clobber/purge work before buildbot starts and so build jobs don’t need to waste so much time clobbering build directories or freeing up space.

We’re starting to investigate what the requirements are to stand up Windows 10 CI infrastructure. We’re attacking both the build integration side and the OS installation and configuration side simultaneously.

We’ve finished collecting performance data for Windows in AWS and have chosen the c3.2xlarge platform as our base for future 2008 instances.

New proposal for TaskCluster routes for buildbot/TaskCluster uploads: Mike is looking for feedback about how we organize builds in the TaskCluster index. These routes will make it possible to find builds via various parameters like platform, revision, or build date.

Mozharness in-tree: The mozharness archiver was deployed but encountered problems with celery task proliferation. Jordan wrote some code to better track and expire the celery tasks, and deployed it late last week. We hope to resume the in-tree migration this week.

Improve release pipeline: Ben has been working on killing XULRunner builds and replacing them with the Firefox SDK we’re already producing. This will really simplify our release pipeline, and clean up our codebase as well.

Improve CI pipeline: Ted got 64-bit OS X cross-compiling in one of the existing docker containers! He still needs to figure out universal builds, but this is a big step forward.

Release: Firefox 40 is currently in beta. We’re up to b5 now.

Operational: A bad commit that landed on the upstream master for “repo” caused trees to be closed for many hours last Wednesday. We eventually got back in business by stripping commits on the master. There are bugs on file now to improve how we handle these repos in automation going forward to avoid precisely this kind of problem.

I took particular solace in this bug because somewhere, someone decided that naming a git repo “repo” was a good idea. Releng is not the only group that is terrible at naming things.

We’ve fixed some bugs in and bundled Metric Collective, our OS-level metrics collection software on Windows, into an exe for use with our puppet-managed Windows servers.

We’ve gotten a nuget repo set up on our configuration management servers and work is starting to make that the default package manager for puppet-managed Windows hosts.

There was a big, disruptive, tree-closing window (TCW) over the weekend, and everything went smoothly from our perspective.

See you next week!

July 20, 2015 04:31 PM

July 17, 2015

Kim Moir (kmoir)

Learning Data Science and evidence based teaching methods

This spring, I took several online courses on the topic of data science.  I became interested in expanding my skills in this area because as release engineers, we deal with a lot of data.  I wanted to learn new tools to extract useful information from the distributed systems behemoth we manage.

This xkcd reminded me of the challenges of managing our buildfarm some days :-)

I took three courses from Coursera's Data Science track from Johns Hopkins University. As with previous coursera classes I took, all the course material is online (lecture videos and notes). There are quizzes and assignments that are due each week. Each course below was about four weeks long.

The Data Scientist's Toolbox - This course was pretty easy. Basically an introduction to the questions that data scientists deal with, as well as a primer on installing R and RStudio (an IDE for R), and using GitHub.
R Programming - Introduction to R. Most of the quizzes and examples used publicly available data for the programming exercises. I found I had to do a lot of reading in the R API docs or on stackoverflow to finish the assignments; the lectures didn't provide a lot of the material needed to complete them. Lots of techniques for learning how to subset data using R, which I found quite interesting; it reminded me a lot of querying databases with SQL to conduct analysis.
Getting and Cleaning Data - More advanced techniques using R. Using publicly available data sources to clean different data sources in different formats: XML, excel spreadsheets, comma or tab delimited. Given this data, we had to answer many questions and conduct specific analysis by writing R programs. The assignments were pretty challenging and took a long time. Again, the course material didn't really cover all the material you needed to do the assignments, so a lot of additional reading was required.

There are six more courses in the Data Science track that I'll start tackling again in the fall that cover subjects such as reproducible research, statistical inference and machine learning.   My next coursera  class is Introduction to Systems Engineering which I'll start in a couple of weeks.  I've really become interested in learning more about this subject after reading Thinking in Systems.

The other course I took this spring was the Software Carpentry Instructor training course. The Software Carpentry Foundation teaches researchers basic software skills. For instance, if you are a biologist analyzing large data sets it would be useful to learn how to use R, Python, and version control to store the code you wrote to share with others. These are not skills that many scientists acquire in their formal university training, and learning them allows them to work more productively. The instructor course was excellent, thanks Greg Wilson for your work teaching us.

We read two books for this course:
Building a Better Teacher: An interesting overview of how teaching is taught in different countries and how to make it more effective. Most important: have more opportunities for other teachers to observe your classroom and provide feedback, which I found analogous to how code review makes us better software developers.
How Learning Works: Seven Research-Based Principles for Smart Teaching: A book summarizing the research in disciplines such as education, cognitive science and psychology on effective techniques for teaching students new material. How assessing students' prior knowledge can help you better design your lessons, how to ask questions to determine what material students are failing to grasp, how to understand students' motivation for learning, and more. Really interesting research.

For the instructor course, we met every couple of weeks online where Greg would conduct a short discussion on some of the topics on a conference call and we would discuss via etherpad interactively. We would then meet in smaller groups later in the week to conduct practice teaching exercises.  We also submitted example lessons to the course repo on GitHub. The final project for the course was to conduct a short lesson to a group of instructors that gave feedback, and submit a pull request to update an existing lesson with a fix.  Then we are ready to sign up to teach a Software Carpentry course!

In conclusion, data science is a great skill to have if you are managing large distributed systems.  Also, using evidence based teaching methods to help others learn is the way to go!

Other fun data science examples include
Tracking down the Villains: Outlier Detection at Netflix - detecting rogue servers with machine learning
Finding Shoe Stores in 100k Merchants: Using Data to Group All Things - finding out what Shopify merchants sell shoes using Apache Spark and more
Looking Through Camera Lenses: The Application of Computer Vision at Etsy

July 17, 2015 03:41 PM

July 07, 2015

Rail Alliev (rail)

Funsize is ready for testing!

Funsize is very close to being enabled in production! It has undergone a Rapid Risk Assessment procedure, which found a couple of potential issues. Most of them are either resolved or waiting for deployment.

To make sure everything works as expected and to catch some last-minute bugs, I added new update channels for Firefox Nightly and Developer Edition. If you are brave enough, you can use the following instructions and change your update channels to either nightly-funsize (for Firefox Nightly) or aurora-funsize (for Firefox Developer Edition).

TL;DR instruction look like this:

  • Enable update logging. Set app.update.log to true in about:config. This step is optional, but it may help with debugging possible issues.
  • Shut down all running instances of Firefox.
  • Edit defaults/pref/channel-prefs.js in your Firefox installation directory and change the line containing with:
pref("", "aurora-funsize"); // Firefox Developer Edition

or

pref("", "nightly-funsize"); // Firefox Nightly
  • Continue using your Firefox

You can check your channel in the About dialog and it should contain the channel name:

About dialog

Reporting Issues

If you see any issues, please report them to Bugzilla.

July 07, 2015 02:44 PM

July 06, 2015

Chris Cooper (coop)

Releng & Relops weekly highlights - July 3, 2015

Welcome to the weekly releng Friday update, Whistler hangover edition.

Half the team took time off after Whistler. With a few national holidays sprinkled in, things were pretty slow last week. Still, those of us who were still around took advantage of the lull to get stuff done.


Taskcluster: Our new intern, Anthony Miyaguchi, started in San Francisco and will be working on crash symbol uploads in TaskCluster. Our other intern, Anhad, has almost finished his work migrating spidermonkey to taskcluster. Morgan and Jonas are investigating task graph creation directly from github. Dustin continues to make efficiency improvements in the Fennec Taskcluster builds.

Modernize infrastructure: Mark, Q, and Rob continue to work on standing up our new Windows build platform in AWS. This includes measuring some unexpected performance improvements.

Improve release pipeline: We’re standing up a staging version of Ship-It to make it easier to iterate. Ben’s working on a new-and-improved checksum builder for S3, and Mike fixed a problem with l10n updates.

Improve CI pipeline: Jordan pushed the archiver relengapi endpoint and client live. They are now being actively used for mozharness on the ash project branch. catlee deployed the hg bundleclone extension to our Mac and Linux platforms, and Rail deployed a new version of funsize with many integrity improvements.

Release: Firefox 39.0 is in the wild!

Tune in again next week!

See you next week!

July 06, 2015 09:23 PM

Armen Zambrano G. (@armenzg)

mozci 0.8.2 - Allow using TreeHerder as a query source

In this release we have added an experimental feature where you can use Treeherder as your source for jobs' information instead of using BuildApi/Buildjson.
My apologies as this should have been a minor release (0.9.0) instead of a security release (0.8.2).


Thanks to @adusca @vaibhavmagarwal and @chmanchester for their contributions.
Our latest new contributor is @priyanklodha - thank you!

How to update

Run "pip install -U mozci" to update

Major highlights

  • Added --query-source option to get data from Treeherder or Buildapi
  • Improved usage of OOP to allow for different data sources seamlessly

Minor improvements

  • Better documentation of --times
  • Cleaning up old builds-*.js files
  • Enforced line character limit

All changes

You can see all changes in here:

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

July 06, 2015 02:58 PM

June 27, 2015

Chris Cooper (coop)

Releng & Relops weekly highlights - June 26, 2015

Friday, foxyeah!

It’s been a very busy and successful work week here in beautiful Whistler, BC. People are taking advantage of being in the same location to meet, plan, hack, and socialize. A special thanks to Jordan for inviting us to his place in beautiful Squamish for a BBQ!

(Note: No release engineering folks were harmed by bears in the making of this work week.)


Whistler: Keynotes were given by our exec team and we learned we’re focusing on quality, dating our users to get to know them better, and that WE’RE GOING TO SPACE!! We also discovered that at LEGO, Everything is Awesome now that they’re thinking around the box instead of inside or outside of it. Laura’s GoFaster project sounds really exciting, and we got a shoutout from her on the way we manage the complexity of our systems. There should be internal videos of the keynotes up next week if you missed them.

Internally, we talked about Q3 planning and goals, met with our new VP, David, met with our CEO, Chris, presented some lightning talks, and did a bunch of cross-group planning/hacking. Dustin, Kim, and Morgan talked to folks at our booth at the Science Fair. We had a cool banner and some cards (printed by Dustin) that we could hand out to tell people about try. SHIP IT!

Taskcluster: Great news; the TaskCluster team is joining us in Platform! There was lots of evangelism about TaskCluster and interest from a number of groups. There were some good discussions about operationalizing taskcluster as we move towards using it for Firefox automation in production. Pete also demoed the Generic Worker!

Puppetized Windows in AWS: Rob got the nxlog puppet module done. Mark is working on hg and NSIS puppet modules in lieu of upgrading to MozillaBuild 2.0. Jake is working on the metric-collective module. The windows folks met to discuss the future of windows package management. Q is finishing up the performance comparison testing in AWS. Morgan, Mark, and Q deployed runner to all of the try Windows hosts and one of the build hosts.

Operational: Amy has been working on some additional nagios checks. Ben, Rail, and Nick met and came up with a solid plan for release promotion. Rail and Nick worked on releasing Firefox 39 and two versions of Firefox ESR. Hal spent much of the week working with IT. Dustin and catlee did some work on migrating treestatus to relengapi. Hal, Nick, Chris, and folks from IT, sheriffs, and dev-services debugged problems with b2g jobs. Callek deployed a new version of slaveapi. Kim, Jordan, Chris, and Ryan worked on a plan for addons. Kim worked with some new buildduty folks to bring them up to speed on operational procedures.

Thank you all, and have a safe trip home!



See you next week!

June 27, 2015 03:19 PM

June 19, 2015

Chris Cooper (coop)

Releng & Relops weekly highlights - June 19, 2015

Happy Friday once again, releng enthusiasts!

The release engineering and operations teams are heads-down this week trying to get quarterly deliverables done *before* heading off to Whistler for a Mozilla-wide work week. There’s lots of work in-flight, although getting updates has occasionally been like pulling teeth.

Because almost everyone will be in Whistler next week, next week’s update will focus less on completed or in-progress work and more on what releng team members took away from their time together in Whistler.


Taskcluster: Morgan got 32-bit Linux builds working! Rail reports that funsize update generation is ready to go, pending an AWS whitelist update by IT. Ted reproduced mshal’s previous work to get OS X builds cross-compiling in one of Morgan’s existing desktop build containers.

Puppetized Windows in AWS: Jake and Rob are working on additional puppet modules for Windows. Q is running performance tests on jobs in AWS after the networking modifications mentioned last week.

Operational: MozillaBuild 2.0 is out! Mark deployed NSIS 3.0b1 to our windows build/try pools. Kim and Netops have finished up the SCL3 switch upgrades. Jake rolled out changes to enable statsd on our POSIX systems. Dustin’s talk on fwunit was accepted to LISA 15. Dustin merged all of the relengapi blueprints into a single repository and released relengapi 3.0.0.

Whistler: There’s been a bunch of planning around Whistler, and props to catlee, naveed, and davidb for getting our stuff into the sched site (the tag for releng/relops/ateam/relman is platform-ops). Be sure to take a look and pick some planning, presentation, and hacking sessions to go attend!

Thank you all!



See you next week!

June 19, 2015 06:39 PM

June 16, 2015

Kim Moir (kmoir)

Test job reduction by the numbers

In an earlier post, I wrote about how we had reduced the number of test jobs that run on two branches to allow us to scale our infrastructure more effectively. We run the tests that historically identify regressions more often. The ones that don't, we skip on every Nth push. We now have data on how this reduced the number of jobs we run since we began implementation in April.

We run SETA on two branches (mozilla-inbound and fx-team) and on 18 types of builds. Collectively, these two branches represent about 20% of pushes each month. Implementing SETA allowed us to move from ~400 to ~240 jobs per push on these two branches [1]. We run the tests identified as not reporting regressions on every 10th commit, or when 90 minutes have passed since they were last scheduled. We run the critical tests on every commit [2].

Reduction in number of jobs per push on mozilla-inbound as SETA scheduling is rolled out

A graph for the fx-team branch shows a similar trend. It was a staged rollout starting in early April, as I enabled platforms and as the SETA data became available. The dip in early June reflects where I enabled SETA for Android 4.3.
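
The scheduling rule itself is simple. Here's a minimal sketch in Python (an illustration of the rule described above, not the actual buildbot scheduling code), using the every-7th-push / 60-minute parameters currently in place:

def should_schedule(pushes_since_last_run, minutes_since_last_run,
                    every_nth_push=7, max_gap_minutes=60):
    # A suite flagged by SETA as low-value runs only on every Nth push,
    # or once enough time has elapsed since it last ran.
    return (pushes_since_last_run >= every_nth_push
            or minutes_since_last_run >= max_gap_minutes)

# e.g. skipped on the 3rd push after the last run if only 20 minutes have passed:
# should_schedule(3, 20) -> False
# ...but scheduled once an hour has gone by, even with few pushes:
# should_schedule(3, 65) -> True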

This data will continue to be updated in our scheduling configuration as it evolves and is updated by the code that Joel and Vaibhav wrote to analyze regressions. The analysis identifies that there were

Jobs to ignore: 440
Jobs to run: 114
Total number of jobs: 554

which is significant. Our buildbot configurations are updated with the latest SETA data at every reconfig, which usually occurs every couple of days.

The platforms configured to run fewer tests for both opt and debug are

        MacOSX (10.6, 10.10)
        Windows (XP, 7, 8)
        Ubuntu 12.04 for linux32, linux64 and ASAN x64
        Android 2.3 armv7 API 9
        Android 4.3 armv7 API 11+

Additional info
[1] Tests may have been disabled/added at the same time; this is not taken into account.
[2] There are still some scheduling issues to be fixed; see bug 1174870 and bug 1174746 for further details.

June 16, 2015 08:52 PM

Armen Zambrano G. (@armenzg)

mozci 0.8.0 - New feature -- Trigger coalesced jobs + improved performance

Beware! This release is full of awesome!! However, at the same time new bugs might pop up, so please let us know :)
You can now trigger all coalesced jobs on a revision:
mozci-trigger --coalesced --repo-name REPO -r REVISION


Thanks to @adusca @glandium @vaibhavmagarwal @chmanchester for their contributions on this release.

How to update

Run "pip install -U mozci" to update

Major highlights

  • #259 - New feature - Allow triggering coalesced jobs
  • #227 - Cache files as gzip files instead of uncompressed
    • Smaller disk footprint
  • #227 - Allow using multiple builders
  • 1e591bf - Make sure that we do not fetch files if they are not newer
    • We were failing to double check that the last modification date of a file was the same as the one in the server
    • Hence, we were downloading files more often than needed
  • Caching builds-4hr on memory for improved performance

Minor improvements

  • f72135d - Backfilling did not accept dry_run or allow triggering more than once
  • More tests and documents
  • Support for split tests (test packages json file)
  • Some OOP refactoring

All changes

You can see all changes in here:

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

June 16, 2015 05:38 PM

Chris AtLee (catlee)

RelEng 2015 (Part 2)

This is the second part of my report on the RelEng 2015 conference that I attended a few weeks ago. Don't forget to read part 1!

I've put up links to the talks whenever I've been able to find them.

Defect Prediction

Davide Giacomo Cavezza gave a presentation about his research into defect prediction in rapidly evolving software. The question he was trying to answer was whether it's possible to predict if any given commit is defective. He compared static models against dynamic models that adapt over the lifetime of a project. The models look at various metrics about the commit such as:

  • Number of files the commit is changing
  • Entropy of the changes
  • The amount of time since the files being changed were last modified
  • Experience of the developer

His simulations showed that a dynamic model yields superior results to a static model, but that even a dynamic model doesn't give sufficiently accurate results.

I wonder about adopting something like this at Mozilla. Would a warning that a given commit looks particularly risky be a useful signal from automation?

Continuous Deployment and Schema Evolution

Michael de Jong spoke about a technique for coping with SQL schema evolution in continuously deployed applications. There are a few major issues with changing SQL schemas in production including:

  • schema changes typically block other reads/writes from occurring
  • application logic needs to be synchronized with the schema change. If the schema change takes non-trivial time, which version of the application should be running in the meanwhile?

Michael's solution was to essentially create forks of all tables whenever the schema changes, and to identify each by a unique version. Applications need to specify which version of the schema they're working with. There's a special DB layer that manages the double writes to both versions of the schema that are live at the same time.
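
A toy sketch of that idea (my own illustration, not Michael's implementation): two versions of a users table are live at once, the write layer mirrors inserts into both, and each application version reads from the table version it understands.

import sqlite3

LIVE_VERSIONS = {
    "v1": "users_v1",  # old schema: id, name
    "v2": "users_v2",  # new schema: id, name, email
}

def setup(db):
    db.execute("CREATE TABLE users_v1 (id INTEGER PRIMARY KEY, name TEXT)")
    db.execute("CREATE TABLE users_v2 (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")

def insert_user(db, user_id, name, email=None):
    # Double-write: every insert is mirrored into all live table versions.
    db.execute("INSERT INTO users_v1 (id, name) VALUES (?, ?)", (user_id, name))
    db.execute("INSERT INTO users_v2 (id, name, email) VALUES (?, ?, ?)",
               (user_id, name, email))

def get_user(db, schema_version, user_id):
    # Each application version only reads the table version it was written for.
    table = LIVE_VERSIONS[schema_version]
    return db.execute("SELECT * FROM %s WHERE id = ?" % table, (user_id,)).fetchone()

db = sqlite3.connect(":memory:")
setup(db)
insert_user(db, 1, "alice", "alice@example.com")
print(get_user(db, "v1", 1))  # (1, 'alice')
print(get_user(db, "v2", 1))  # (1, 'alice', 'alice@example.com')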

Securing a Deployment Pipeline

Paul Rimba spoke about securing deployment pipelines.

One of the areas of focus was on the "build server". He started with the observation that a naive implementation of a Jenkins infrastructure has the master and worker on the same machine, and so the workers have the opportunity to corrupt the state of the master, or of other job types' workspaces. His approach to hardening this part of the pipeline was to isolate each job type into its own docker worker. He also mentioned using a microservices architecture to isolate each task into its own discrete component - easier to secure and reason about.

We've been moving this direction at Mozilla for our build infrastructure as well, so it's good to see our decision corroborated!

Makefile analysis

Shurui Zhou presented some data about extracting configuration data from build systems. Her goal was to determine if all possible configurations of a system were valid.

The major problem here is make (of course!). Expressions such as $(shell uname -r) make static analysis impossible.

We had a very good discussion about build systems following this, and how various organizations have moved themselves away from using make. Somebody described these projects as requiring a large amount of "activation energy". I thought this was a very apt description: lots of upfront cost for little visible change.

However, people unanimously agreed that make was not optimal, and that the benefits of moving to a more modern system were worth the effort.

June 16, 2015 03:39 PM

June 15, 2015

Chris Cooper (coop)

Releng & Relops weekly highlights - June 12, 2015

Happy Monday!

Release Engineering has a lot going on. To help spread good news and keep everyone informed, we’re trying an experiment in communication.

Managers put together a list of what we all have been working on, highlighting wins in the last week or so. That list gets sent out to the public release-engineering mailing list and then gets reblogged here.

Please send feedback - did you learn anything, what else should we have included, and what topics might need some additional explanation or context.


Taskcluster: Morgan now has working opt, debug, PGO and ASan 64-bit Linux builds for TaskCluster. This work enables developers to experiment with linux try jobs on their local systems! Dustin debugged and confirmed that inter-region S3 transfers are capped at 1 MB/sec; he also stood up a relengapi proxy for accessing private files in tooltool. Rail just finished work on deploying signing workers for TaskCluster, important for deploying funsize.

Puppetized Windows in AWS: Mark debugged and worked around some annoying ACL and netsh bugs in puppet that were blocking forward progress on Windows puppetization. Amy and Rob are generating 2008R2 puppetized AMIs in AWS via cloud-tools.


Amy, Hal, Nick, Ben, and Kim responded to the many issues caused by the NetApp outage, including unexpected buildbot database corruption. Hal, Ben, Nick and Rail have been working hard to enable 38.0.6, in support of the spring release which included a bunch of features we needed to get out the door to users.

Q worked on making S3 uploads from Windows complete successfully and worked with Sheriffs to debug and fix a Start Screen problem on Windows 8 that was causing test failures. Jake is saving us money by retiring diamond (shout out to shutting things off!). Rob & Q ensured that all systems now send logs to Papertrail, EVEN XP!

Mike removed one of the last blockers to getting off FTP and over to S3: porting Android builds to mozharness. In addition to freeing us from the buildbot factories, this also uses TaskCluster’s index service for uploading artifacts to S3. Hal and Anhad (our amazing intern!) now have vcs-sync conversions all running in parallel from AWS. As part of moving mozharness in-tree, Jordan is doing the work to allow consumers to create and fetch mozharness bundles automatically from relengapi.

Thank you all!



June 15, 2015 05:04 PM

Brace yourselves

June 15, 2015 04:34 PM

June 12, 2015

Kim Moir (kmoir)

Mozilla pushes - May 2015

Here's May 2015's monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.


The number of pushes decreased from those recorded in the previous month (8894) with a total of 8363. 




June 12, 2015 07:28 PM

June 10, 2015

Morgan Phillips (mrrrgn)

service "Dockerized Firefox Builds" status

After 160 [work] days, 155 patches, and approximately 8,977 cups of coffee, I'm now wrapping up my third quarter with Mozilla -- the time has flown by! In these first eight months I've written some fun code, shaved enough yaks to knit a new wool sweater, and become acquainted with the ghosts of Mozilla RelEng past, present, and future; to great effect.

"Behold child, 'tis buildbot"
Quarter one was spent getting familiar with Mozilla RelEng's legacy: re-writing/taking ownership of existing services like clobberer. The second was all about optimizing our existing infrastructure: principally by rolling out runner. Then, this one has been dedicated to new beginnings: porting jobs from our old buildbot-based CI infra to the shiny new TaskCluster-based one.

In moving [Linux] jobs from buildbot to TaskCluster, I've worked on docker containers which will build Firefox with all of the special options that RelEng needs. This is really cool because it means developers can download our images and work within them as well, thus creating parity between our CI infrastructure and their local environments (making it easier to debug certain bugs). So, what's my status update?

The good news: the container for Linux64 jobs is in tree, and working for both Desktop and Android builds!

The better news: these new jobs are already working in the Try tree! They're hidden in treeherder, but you can reveal them with the little checkbox in the upper right hand corner of the screen. You can also just use this link:

[ The job is under Opt Linux64: Tc(B) ]

# note: These are running alongside the old buildbot jobs for now, and hidden. The container is still changing a few times a week (sometimes breaking jobs), so the training wheels will stay on like this for a little while.

The best news: You can run the same job that the try server runs, in the same environment simply by installing docker and running the bash script below.

Bonus: A sister 32 bit container will be coming along shortly.

#!/bin/bash -e
# WARNING: this is experimental mileage may vary!

# Fetch docker image
docker pull mrrrgn/desktop-build:16

# Find a unique container name
export NAME='task-CCJHSxbxSouwLZE_mZBddA-container';

# Run docker command
docker run -ti \
--name $NAME \
-e TOOLTOOL_CACHE='/home/worker/tooltool-cache' \
-e RELENGAPI_TOKEN='ce-n-est-pas-necessaire' \
-e MH_BUILD_POOL='taskcluster' \
-e MOZHARNESS_SCRIPT='mozharness/scripts/' \
-e MOZHARNESS_CONFIG='builds/' \
-e NEED_XVFB='true' \
mrrrgn/desktop-build:16 \
/bin/bash -c /home/worker/bin/

# Delete docker container
docker rm -v $NAME;

June 10, 2015 02:10 PM

June 03, 2015

Armen Zambrano G. (@armenzg)

mozci 0.7.2 - Support b2g jobs that still run on Buildbot

There are a lot of b2g (aka Firefox OS) jobs that still run on Buildbot.
Interestingly enough, we had not tried to trigger one with mozci before.
This release adds support for it.
This should have been a minor release (0.8.0) rather than a security release (0.7.2). My apologies!
All jobs that start with "b2g_" in all_builders.txt are b2g jobs that still run on Buildbot instead of TaskCluster (docs - TC jobs on treeherder).

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

June 03, 2015 07:23 PM

mozci 0.7.1 - regression fix - do not look for files for running jobs

This release mainly fixes a regression we introduced in the release 0.7.0.
The change (#220) we introduced checked completed and running jobs for files that have been uploaded in order to trigger tests.
The problem is that running jobs do not have any metadata until they actually complete.
We fixed this on #234.


Thanks to @adusca and @glandium for their contributions on this release.

How to update

Run "pip install -U mozci" to update

Major highlights

  • #234 - (bug fix) - Do not try to find files for running jobs
  • #228 - For try, only trigger talos jobs on existing build jobs rather than triggering builds for platforms that were not requested
  • #238 - Read credentials through environment variables

Minor improvements

  • #226 - (bug fix) Properly cache downloaded files
  • #228 - (refactor) Move SCHEDULING_MANAGER
  • #231 - Doc fixes

All changes

You can see all changes in here:

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

June 03, 2015 07:14 PM

May 28, 2015

Armen Zambrano G. (@armenzg)

mozci 0.7.0 - Less network fetches - great speed improvements!

This release is not large in scope but it has many performance improvements.
The main improvement is that we have reduced the number of times we fetch information, using a cache where possible. The network cost was very high.
You can read more about it here:


Thanks to @adusca @parkouss @vaibhavmagarwal for their contributions on this release.

How to update

Run "pip install -U mozci" to update

Major highlights

  • Reduce drastically the number of requests by caching where possible
  • If a failed build has uploaded good files let's use them
  • Added support for retriggering and cancelling jobs
  • Retrigger a job once with a count of N instead of triggering individually N times

Minor improvements

  • Documentation updates
  • Add badge

All changes

You can see all changes in here:

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

May 28, 2015 02:01 PM

May 27, 2015

Armen Zambrano G. (@armenzg)

Welcome adusca!

It is my privilege to announce that adusca (blog) joined Mozilla (since Monday) as an Outreachy intern for the next 4 months.

adusca has an outstanding number of contributions over the last few months including Mozilla CI Tools (which we're working on together).

Here's a bit about herself from her blog:
Hi! I’m Alice. I studied Mathematics in college. I was doing a Master’s degree in Mathematical Economics before getting serious about programming.
She is also a graduate of Hacker School.

Even though Alice has not been programming for many years, she has already shown a lot of potential. For instance, she wrote a script to generate scheduling relations for buildbot; for this and many other reasons I tip my hat.

adusca will initially help me out with creating a generic pulse listener to handle job cancellations and retriggers for Treeherder. The intent is to create a way for Mozilla CI tools to manage scheduling on behalf of TH, pave the way for more sophisticated Mozilla CI actions and allow other people to piggyback on this pulse service and trigger their own actions.

If you have not yet had a chance to welcome her and get to know her, I highly encourage you to do so.

Welcome Alice!

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

May 27, 2015 05:10 PM

May 26, 2015

Chris AtLee (catlee)

RelEng 2015 (part 1)

Last week, I had the opportunity to attend RelEng 2015 - the 3rd International Workshop of Release Engineering. This was a fantastic conference, and I came away with lots of new ideas for things to try here at Mozilla.

I'd like to share some of my thoughts and notes I took about some of the sessions. As of yet, the speakers' slides aren't collected or linked to from the conference website. Hopefully they'll get them up soon! The program and abstracts are available here.

For your sake (and mine!) I've split up my notes into a few separate posts. This post covers the introduction and keynote.


"Continuous deployment" of web applications is basically a solved problem today. What remains is for organizations to adopt best practices. Mobile/desktop applications remain a challenge.

Cisco relies heavily on collecting and analyzing metrics to inform their approach to software development. Statistically speaking, quality is the best driver of customer satisfaction. There are many aspects to product quality, but new lines of code introduced per release gives a good predictor of how many new bugs will be introduced. It's always challenging to find enough resources to focus on software quality; being able to correlate quality to customer satisfaction (and therefore market share, $$$) is one technique for getting organizational support for shipping high quality software. Common release criteria such as bugs found during testing, or bug fix rate, are used to inform stakeholders as to the quality of the release.

Introductory Session

Bram Adams and Foutse Khomh kicked things off with an overview of "continuous deployment" over the last 5 years. Back in 2009 we were already talking about systems where pushing to version control would trigger tens of thousands of tests, and do canary deployments up to 50 times a day.

Today we see companies like Facebook demonstrating that continuous deployment of web applications is basically a solved problem. Many organizations are still trying to implement these techniques. Mobile [and desktop!] applications still present a challenge.


Pete Rotella from Cisco discussed how he and his team measured and predicted release quality for various projects at Cisco. His team is quite focused on data and analytics.

Cisco has relatively long release cycles compared to what we at Mozilla are used to now. They release 2-3 times per year, with each release representing approximately 500kloc of new code. Their customers really like predictable release cycles, and also don't like releases that are too frequent. Many of their customers have their own testing / validation cycles for releases, and so are only willing to update for something they deem critical.

Pete described how he thought software projects had four degrees of freedom in which to operate, and how quality ends up being the one sacrificed most often in order to compensate for constraints in the others:

  • resources (people / money): It's generally hard to hire more people or find room in the budget to meet the increasing demands of customers. You also run into the mythical man month problem by trying to throw more people at a problem.

  • schedule (time): Having standard release cycles means organizations don't usually have a lot of room to push out the schedule so that features can be completed properly.

    I feel that at Mozilla, the rapid release cycle has helped us out to some extent here. The theory is that if your feature isn't ready for the current version, it can wait for the next release which is only 6 weeks behind. However, I do worry that we have too many features trying to get finished off in aurora or even in beta.

  • content (features): Another way to get more room to operate is to cut features. However, it's generally hard to cut content or features, because those are what customers are most interested in.

  • quality: Pete believes this is where most organizations steal resources from to make up for people/schedule/content constraints. It's a poor long-term play, and despite "quality is our top priority" being the Official Party Line, most organizations don't invest enough here. What's working against quality?

    • plethora of releases: lots of projects / products / special requests for releases. Attempts to reduce the # of releases have failed on most occasions.
    • monetization of quality is difficult. Pete suggests framing it in terms of the cost of a poor quality release: how many customers will we lose with a buggy release?
    • having RelEng and QA embedded in Engineering teams is a problem; they should be independent organizations so that their recommendations can have more weight.
    • "control point exceptions" are common. e.g. VP overrides recommendations of QA / RelEng and ships the release.

Why should we focus on quality? Pete's metrics show that it's the strongest driver of customer satisfaction. Your product's customer satisfaction needs to be more than 4.3/5 to get more than marginal market share.

How can RelEng improve metrics?

  • simple dashboards
  • actionable metrics - people need to know how to move the needle
  • passive - use existing data. everybody's stretched thin, so requiring other teams to add more metadata for your metrics isn't going to work.
  • standardized quality metrics across the company
  • informing engineering teams about risk
  • correlation with customer experience.

Interestingly, handling the backlog of bugs has minimal impact on customer satisfaction. In addition, there's substantial risk introduced whenever bugs are fixed late in a release cycle. There's an exponential relationship between new lines of code added and # of defects introduced, and therefore customer satisfaction.

Another good indicator of customer satisfaction is the number of "Customer found defects" - i.e. the number of bugs found and reported by their customers vs. bugs found internally.

Pete's data shows that if they can find more than 80% of the bugs in a release prior to it being shipped, then the remaining bugs are very unlikely to impact customers. He uses lines of code added for previous releases, and historical bug counts per version to estimate number of bugs introduced in the current version given the new lines of code added. This 80% figure represents one of their "Release Criteria". If less than 80% of predicted bugs have been found, then the release is considered risky.
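
Pete didn't share his exact formula, but the style of criterion is easy to sketch. Everything below is illustrative: the growth function and the constants are made up, and only the 80% threshold comes from the talk.

import math

def predicted_defects(new_kloc, base_rate=2.0, growth=0.001):
    """Toy model: defects grow faster than linearly with new lines of code.
    base_rate (bugs per kloc) and growth are invented constants."""
    return base_rate * new_kloc * math.exp(growth * new_kloc)

def release_is_risky(new_kloc, bugs_found_in_testing, threshold=0.80):
    """Release criterion: have we already found at least 80% of the bugs
    the model predicts for this much new code?"""
    return bugs_found_in_testing < threshold * predicted_defects(new_kloc)

# Example: 500 kloc of new code, 1200 bugs found during the test cycle so far.
print(predicted_defects(500))        # ~1649 predicted defects
print(release_is_risky(500, 1200))   # True: below the 80% mark, keep testing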

Another "Release Criteria" Pete discussed was the weekly rate of fixing bugs. Data shows that good quality releases have the weekly bug fix rate drop to 43% of the maximum rate at the end of the testing cycle. This data demonstrates that changes late in the cycle have a negative impact on software quality. You really want to be fixing fewer and fewer bugs as you get closer to release.

I really enjoyed Pete's talk! There are definitely a lot of things to think about, and how we might apply them at Mozilla.

May 26, 2015 06:51 PM

May 15, 2015

Armen Zambrano G. (@armenzg)

mozci 0.6.0 - Trigger based on Treeherder filters, Windows support, flexible and encrypted password management

In this release of mozci we have a lot of developer facing improvements like Windows support or flexibility on password management.
We also have our latest experimental script, mozci-triggerbyfilters.

How to update

Run "pip install -U mozci" to update.


We have moved all scripts from scripts/ to mozci/scripts/.
Note that you can now use "pip install" and have all scripts available as mozci-name_of_script_here in your PATH.


We want to welcome @KWierso as our latest contributor!
Our gratitude to @Gijs for reporting the Windows issues and for all his feedback.
Congratulations to @parkouss for making the first project using mozci as its dependency.
In this release we had @adusca and @vaibhavmagarwal as our main and very active contributors.

Major highlights

  • Added script to trigger jobs based on Treeherder filters
    • This allows using filters like --include "web-platform-tests" and that will trigger all matching builders
    • You can also use --exclude to exclude builders you don't want
  • With the new trigger by filters script you can preview what will be triggered:
233 jobs will be triggered, do you wish to continue? y/n/d (d=show details) d
05/15/2015 02:58:17 INFO: The following jobs will be triggered:
Android 4.0 armv7 API 11+ try opt test mochitest-1
Android 4.0 armv7 API 11+ try opt test mochitest-2
  • Removed storing passwords in plain text (sorry!)
    • We now prompt users to choose whether to store their password encrypted
  • When you use "pip install" we will also install the main scripts as mozci-name_of_script_here binaries
    • This makes it easier to use the binaries in any location
  • Windows issues
    • The python module is incapable of decompressing large binaries
    • Do not store buildjson on a temp file and then move

Minor improvements

  • Updated docs
  • Improve wording when triggering a build instead of a test job
  • Loosened up the python requirements from == to >=
  • Added filters to

All changes

You can see all changes in here:

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

May 15, 2015 08:13 PM

May 08, 2015

Armen Zambrano G. (@armenzg)

mozci 0.5.0 released - Store password in keyring, prevent corrupted data, progress bar and many small improvements

In this release we have many small improvements that help with issues we have found.

The main improvement is that we now don't store credentials in plain-text (sorry!) but use keyring to store them encrypted.

We also now prevent partially downloaded (corrupted) data and have added a progress bar to downloads.

Congrats to @chmanchester as our latest contributor!
Our usual and very appreciated contributions are by @adusca @jmaher and @vaibhavmagarwal

Minor improvements:
  • Lots of test changes and increased coverage
  • Do not use the root logger but a mozci logger
  • Allow passing custom files to a triggered job
  • Work around buildbot status corruptions (Issue 167)
  • Allow passing buildernames with lower case and removing trailing spaces (since we sometimes copy/paste from TH)
  • Added support to build a buildername based on trychooser syntax
  • Allow passing extra properties when scheduling a job on Buildbot
You can see all changes in here:

Link to official release notes.

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

May 08, 2015 10:15 PM

May 05, 2015

Chris Cooper (coop)

Automated reconfigs and you

In an effort to offload yet more work from buildduty, today I deployed scripts to automatically reconfig masters when relevant repos are updated. The process works as follows:

Pretty simple, right?

So how does this change affect you?

In practice, it doesn’t.

Unless you already have an environment set up to run the script, you'll probably still want to ask buildduty to perform the merge for you. The script has been changed to *only* merge repos by default, but the script also updates the wiki and bugzilla, which is important for maintaining an audit trail.

However, if you *are* willing to perform these extra steps, you can have your changes automatically deployed within the hour.

The eventual goal is to become comfortable enough with our travis coverage that we can move the production tags automatically when the tests pass.

Our tests seem pretty solid now to me, but maybe others have thoughts about how aggressive or cautious we should be here.

May 05, 2015 10:23 PM

May 02, 2015

Morgan Phillips (mrrrgn)

To Serve Developers

The neatest thing about release engineering is that our pipeline forms the primary bridge between users and developers. On one end, we maintain the CI infrastructure that engineers rely on for thorough testing of their code, and, on the other end, we build stable releases and expose them for the public to download. Being in this position means that we have the opportunity to impact the experiences of both contributors and users by improving our systems (it also makes working on them a lot of fun).

Lately, I've become very interested in improving the developer experience by bringing our CI infrastructure closer to contributors. In short, I would like developers to have access to the same environments that we use to test/build their code. This will make it:
[The release pipeline from 50,000ft]


The first part of my plan revolves around integrating release engineering's CI system with a tool that developers are already using: mach, starting with a utility called mozbootstrap -- a system that detects its host operating system and invokes a package manager to install all of the libraries needed to build Firefox desktop or Firefox for Android.

The first step here was to make it possible to automate the bootstrapping process (see bug 1151834 "allow users to bootstrap without any interactive prompts"), and then integrate it into how we stand up our own systems. Luckily, at the moment I'm also porting some of our Linux builds from buildbot to TaskCluster (see bug 1135206), which necessitates scrapping our old chroot based build environments in favor of docker containers. This fresh start has given me the opportunity to begin this transition painlessly.

This simple change alone strengthens the interface between RelEng and developers, because now we'll be using the same packages (on a given platform). It also means that our team will be actively maintaining a tool used by contributors. I think it's a huge step in the right direction!

What platforms/distributions are you supporting?

Right now, I'm only focusing on Linux, though in the future I expect to support OSX as well. The bootstrap utility supports several distributions (Debian/Ubuntu/CentOS/Arch), though, I've been trying to base all of release engineering's new docker containers on Ubuntu 14.04 -- as such, I'd consider this our canonical distribution. Our old builders were based on CentOS, so it would have been slightly easier to go with that platform, but I'd rather support the platform that the majority of our contributors are using.

What about developers who don't use Ubuntu 14.04, and/or have a bizarre environment?

One fabulous side effect of using TaskCluster is that we're forced to create docker containers for running our jobs, in fact, they even live in mozilla-central. That being the case, I've started a conversation around integrating our docker containers into mozbootstrap, giving it the option to pull down a releng docker container in lieu of bootstrapping a host system.

On my own machine, I've been mounting my src directory inside of a builder and running ./mach build, then ./mach run within it. All of the source, object files, and executables live on my host machine, but the actual building takes place in a black box. This is a very tidy development workflow that's easy to replicate and automate with a few bash functions [which releng should also write/support].
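
The post mentions wrapping this up in a few bash functions; here is an equivalent sketch in Python instead, purely for illustration. The image name, mount paths and commands are assumptions about one possible setup, not an official workflow.

import os
import subprocess

SRC = os.path.expanduser("~/src/mozilla-central")  # your local checkout
IMAGE = "mrrrgn/desktop-build:16"                  # example builder image

def mach(*args):
    """Run ./mach inside the builder container with the source tree mounted in,
    so objdirs and binaries end up on the host."""
    subprocess.check_call([
        "docker", "run", "--rm", "-ti",
        "-v", "{}:/home/worker/src".format(SRC),
        "-w", "/home/worker/src",
        IMAGE, "./mach",
    ] + list(args))

if __name__ == "__main__":
    mach("build")  # compiles in the container; ./mach run may also need the host X socket shared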

[A simulation of how I'd like to see developers interacting with our docker containers.]

Lastly, as the final nail in the coffin of hard to reproduce CI bugs, I'd like to make it possible for developers to run our TaskCluster based test/build jobs on their local machines. Either from mach, or a new utility that lives in /testing.

If you'd like to follow my progress toward creating this brave new world -- or heckle me in bugzilla comments -- check out these tickets:

May 02, 2015 06:54 AM

May 01, 2015

Kim Moir (kmoir)

Mozilla pushes - April 2015

Here's April 2015's  monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.  

The number of pushes decreased from those recorded in the previous month with a total of 8894. This is because gaia-try is now managed by taskcluster, so those jobs no longer appear in the buildbot scheduling databases that this report tracks.


General Remarks


I've changed the graphs to only track 2015 data.  Last month they were tracking 2014 data as well but it looked crowded so I updated them.  Here's a graph showing the number of pushes over the last few years for comparison.

May 01, 2015 04:44 PM

April 28, 2015

Kim Moir (kmoir)

Releng 2015 program now available

Releng 2015 will take place in concert with ICSE in Florence, Italy on May 19, 2015. The program is now available. Register here!

via romana in firenze by ©pinomoscato, Creative Commons by-nc-sa 2.0

April 28, 2015 07:54 PM

Less testing, same great Firefox taste!

Running a large continuous integration farm forces you to deal with many dynamic inputs coupled with capacity constraints. The number of pushes increase.  People add more tests.  We build and test on a new platform.  If the number of machines available remains static, the computing time associated with a single push will increase.  You can scale this for platforms that you build and test in the cloud (for us - Linux and Android on emulators), but this costs more money.  Adding hardware for other platforms such as Mac and Windows in data centres is also costly and time consuming.

Do we really need to run every test on every commit? If not, which tests should be run? How often do they need to be run in order to catch regressions in a timely manner (i.e. so that we can still bisect where the regression occurred)?

Several months ago, jmaher and vaibhav1994, wrote code to analyze the test data and determine the minimum number of tests required to run to identify regressions.  They named their software SETA (search for extraneous test automation). They used historical data to determine the minimum set of tests that needed to be run to catch historical regressions.  Previously, we coalesced tests on a number of platforms to mitigate too many jobs being queued for too few machines.  However, this was not the best way to proceed because it reduced the number of times we ran all tests, not just less useful ones.  SETA allows us to run a subset of tests on every commit that historically have caught regressions.  We still run all the test suites, but at a specified interval. 

SETI – The Search for Extraterrestrial Intelligence by ©encouragement, Creative Commons by-nc-sa 2.0
In the last few weeks, I've implemented SETA scheduling in our buildbot configs to use the data from the analysis that Vaibhav and Joel implemented. Currently, it's implemented on the mozilla-inbound and fx-team branches, which in aggregate represent around 19.6% (March 2015 data) of total pushes to the trees. The platforms configured to run fewer tests for both opt and debug are

As we gather more SETA data for newer platforms, such as Android 4.3, we can implement SETA scheduling for them as well and reduce our test load. We continue to run the full suite of tests on all platforms on branches other than m-i and fx-team, such as mozilla-central, try, and the beta and release branches. If we did miss a regression by reducing the tests, it would appear on other branches such as mozilla-central. We will continue to update our configs to incorporate SETA data as it changes.

How does SETA scheduling work?
We specify the tests that we would like to run on a reduced schedule in our buildbot configs. For instance, an entry might specify that a set of debug tests should run only on every 10th commit, or when a timeout of 5400 seconds between runs of those tests is reached; a rough sketch of such an entry is below.
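
This is only a hypothetical sketch -- the real buildbot-configs syntax differs -- but an entry carries the same three pieces of information: which suites to thin out, the commit interval, and the timeout.

# hypothetical sketch of a SETA entry; the real buildbot-configs syntax differs
SETA_CONFIG = {
    ("mozilla-inbound", "Ubuntu VM 12.04 x64", "debug"): {   # example branch/platform
        "skip_suites": ["mochitest-3", "reftest"],           # example suite names
        "run_every_nth_commit": 10,   # run these suites on every 10th push...
        "timeout": 5400,              # ...or after 5400 seconds, whichever comes first
    },
}

def should_run(pushes_since_last_run, seconds_since_last_run, entry):
    """Decide whether a thinned-out suite should run on this push."""
    return (pushes_since_last_run >= entry["run_every_nth_commit"]
            or seconds_since_last_run >= entry["timeout"])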

Previously, catlee had implemented a scheduler in buildbot that allowed us to coalesce jobs on a certain branch and platform using EveryNthScheduler. However, as it was originally implemented, it didn't allow us to specify tests to skip, such as mochitest-3 debug on MacOSX 10.10 on mozilla-inbound. It would only allow us to skip all the debug or opt tests for a certain platform and branch.

I modified our config-parsing code to create a dictionary for each test specifying the interval at which the test should be skipped and the timeout interval. If a test has these parameters specified, it is scheduled using the EveryNthScheduler instead of the default scheduler.
There are still some quirks to work out but I think it is working out well so far. I'll have some graphs in a future post on how this reduced our test load. 

Further reading
Joel Maher: SETA – Search for Extraneous Test Automation

April 28, 2015 06:47 PM

April 27, 2015

Armen Zambrano G. (@armenzg)

mozci hackday - Friday May 1st, 2015

I recently blogged about mozci and I was gladly surprised that people have curiosity about it.

I want to spend Friday fixing some issues on the tool and I wonder if you would like to join me to learn more about it and help me fix some of them.

I will be available as armenzg_mozci from 9 to 5pm EDT on IRC (#ateam channel).
I'm happy to jump on Vidyo to give you a hand understanding mozci.

I hand picked some issues that I could get a hand with.
Documentation and definition of the project in readthedocs.

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

April 27, 2015 05:30 PM

April 24, 2015

Armen Zambrano G. (@armenzg)

What Mozilla CI tools is and what it can do for you (aka mozci)

Mozci (Mozilla CI tools) is a python library, scripts and package which allows you to trigger jobs on
Not all jobs can be triggered but those that are run on Release Engineering's Buildbot setup. Most (if not all) Firefox desktop and Firefox for Android jobs can be triggered. I believe some B2G jobs can still be triggered.

NOTE: Most B2G jobs are not supported yet since they run on TaskCluster. Support for it will be given on this quarter.

Using it

Once you check out the code:
git clone
python develop
you can run scripts like this one (click here for other scripts):
python scripts/ \
  --buildername "Rev5 MacOSX Yosemite 10.10 fx-team talos dromaeojs" \
  --rev e16054134e12 --times 10
which would trigger a specific job 10 times.

NOTE: This is independent of whether a build job exists to trigger the test job; mozci will trigger everything which is required to get you what you need.

One of the many other options is if you want to trigger the same job for the last X revisions, this would require you to use --back-revisions X.

There are many use cases and options listed in here.

A use case for developers

One use case which could be useful to developers (thanks @mike_conley!) is if you pushed to try and used this try syntax: "try: -b o -p win32 -u mochitests -t none". Unfortunately, you later determine that you really need this one: "try: -b o -p linux64,macosx64,win32 -u reftest,mochitests -t none".

In normal circumstances you would go and push again to the try server; however, with mozci (once someone implements this), we could simply pass the new syntax to a script (or with ./mach) and trigger everything that you need rather than having to push again and waste resources and your time!

If you have other use cases, please file an issue in here.

If you want to read about the definition of the project, vision, use cases or FAQ please visit the documentation.

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

April 24, 2015 08:23 PM

Firefox UI update testing

We currently trigger UI update tests manually for Firefox releases. There are automated headless update verification tests, but they don't test the UI of Firefox.

The goal is to integrate this UI update testing as part of the Firefox releases.
This will require changes to firefox-ui-tests, buildbot scheduling changes, Marionette changes and other Mozbase packages. The ultimate goal is to speed up our turn around on releases.

The update testing code was recently ported from Mozmill to use Marionette to drive the testing.

I've already written some documentation on how to run the update verification using Release Engineering configuration files. You can use my tools repository until the code lands (update_testing is the branch to be used).

My deliverable is to ensure that the update testing works reliably on Release Engineering infrastructure and there is existing scheduling code for it.

You can read more about this project in bug 1148546.

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

April 24, 2015 02:42 PM

April 21, 2015

Nick Thomas (nthomas)

Changes coming to ftp.mozilla.org

ftp.mozilla.org has been around for a long time in the world of Mozilla, dating back to the original source release in 1998. Originally it was a single server, but it's grown into a cluster storing more than 60TB of data, and serving more than a gigabit/s in traffic. Many projects store their files there, and there must be a wide range of ways that people use the cluster.

This quarter there is a project in the Cloud Services team to move ftp.mozilla.org (and related systems) to the cloud, which Release Engineering is helping with. It would be very helpful to know what functionality people are relying on, so please complete this survey to let us know. Thanks!

April 21, 2015 02:47 AM

April 20, 2015

Chris AtLee (catlee)

RelEng Retrospective - Q1 2015

RelEng had a great start to 2015. We hit some major milestones on projects like Balrog and were able to turn off some old legacy systems, which is always an extremely satisfying thing to do!

We also made some exciting new changes to the underlying infrastructure, got some projects off the drawing board and into production, and drastically reduced our test load!

Firefox updates



All Firefox update queries are now being served by Balrog! Earlier this year, we switched all Firefox update queries off of the old update server, aus3, to the new update server, codenamed Balrog.

Already, Balrog has enabled us to be much more flexible in handling updates than the previous system. As an example, in bug 1150021, the About Firefox dialog was broken in the Beta version of Firefox 38 for users with RTL locales. Once the problem was discovered, we were able to quickly disable updates just for those users until a fix was ready. With the previous system it would have taken many hours of specialized manual work to disable the updates for just these locales, and to make sure they didn't get updates for subsequent Betas.

Once we were confident that Balrog was able to handle all previous traffic, we shut down the old update server (aus3). aus3 was also one of the last systems relying on CVS (!! I know, rite?). It's a great feeling to be one step closer to axing one more old system!


When we started the quarter, we had an exciting new plan for generating partial updates for Firefox in a scalable way.

Then we threw out that plan and came up with an EVEN MOAR BETTER plan!

The new architecture for funsize relies on Pulse for notifications about new nightly builds that need partial updates, and uses TaskCluster for doing the generation of the partials and publishing to Balrog.

The current status of funsize is that we're using it to generate partial updates for nightly builds, but not published to the regular nightly update channel yet.

There's lots more to say here...stay tuned!

FTP & S3

Brace yourselves... ftp is going away... in its current incarnation, at least.

Expect to hear MUCH more about this in the coming months.

tl;dr is that we're migrating as much of the Firefox build/test/release automation to S3 as possible.

The existing machinery behind ftp.mozilla.org will be going away near the end of Q3. We have some ideas of how we're going to handle migrating existing content, as well as handling new content. You should expect that you'll still be able to access nightly and CI Firefox builds, but you may need to adjust your scripts or links to do so.

Currently we have most builds and tests doing their transfers to/from S3 via the task cluster index, in addition to doing parallel uploads to ftp.mozilla.org. We're aiming to shut off most uploads to ftp this quarter.

Please let us know if you have particular systems or use cases that rely on the current host or directory structure!

Release build promotion

Our new Firefox release pipeline got off the drawing board, and the initial proof-of-concept work is done.

The main idea here is to take an existing build based on a push to mozilla-beta, and to "promote" it to a release build. So we need to generate all the l10n repacks, partner repacks, generate partial updates, publish files to CDNs, etc.

The big win here is that it cuts our time-to-release nearly in half, and also simplifies our codebase quite a bit!

Again, expect to hear more about this in the coming months.


In addition to all those projects in development, we also tackled quite a few important infrastructure projects.

OSX test platform

10.10 is now the most widely used Mac platform for Firefox, and it's important to test what our users are running. We performed a rolling upgrade of our OS X testing environment, migrating from 10.8 to 10.10 while spending nearly zero capital, and with no downtime. We worked jointly with the Sheriffs and A-Team to green up all the tests, and shut coverage off on the old platform as we brought it up on the new one. We have a few 10.8 machines left riding the trains that will join our 10.10 pool with the release of ESR 38.1.

Got Windows builds in AWS

We saw the first successful builds of Firefox for Windows in AWS this quarter as well! This paves the way for greater flexibility, on-demand burst capacity, faster developer prototyping, and disaster recovery and resiliency for windows Firefox builds. We'll be working on making these virtualized instances more performant and being able to do large-scale automation before we roll them out into production.

Puppet on windows

RelEng uses puppet to manage our Linux and OS X infrastructure. Presently, we use a very different tool chain, Active Directory and Group Policy Object, to manage our Windows infrastructure. This quarter we deployed a prototype Windows build machine which is managed with puppet instead. Our goal here is to increase visibility and hackability of our Windows infrastructure. A common deployment tool will also make it easier for RelEng and community to deploy new tools to our Windows machines.

New Tooltool Features

We've redesigned and deployed a new version of tooltool, the content-addressable store for large binary files used in build and test jobs. Tooltool is now integrated with RelengAPI and uses S3 as a backing store. This gives us scalability and a more flexible permissioning model that, in addition to serving public files, will allow the same access outside the releng network as inside. That means that developers as well as external automation like TaskCluster can use the service just like Buildbot jobs. The new implementation also boasts a much simpler HTTP-based upload mechanism that will enable easier use of the service.

Centralized POSIX System Logging

Using syslogd/rsyslogd and Papertrail, we've set up centralized system logging for all our POSIX infrastructure. Now that all our system logs are going to one location and we can see trends across multiple machines, we've been able to quickly identify and fix a number of previously hard-to-discover bugs. We're planning on adding additional logs (like Windows system logs) so we can do even greater correlation. We're also in the process of adding more automated detection and notification of some easily recognizable problems.

Security work

Q1 included some significant effort to avoid serious security exploits like GHOST, escalation of privilege bugs in the Linux kernel, etc. We manage 14 different operating systems, some of which are fairly esoteric and/or no longer supported by the vendor, and we worked to backport some code and patches to some platforms while upgrading others entirely. Because of the way our infrastructure is architected, we were able to do this with minimal downtime or impact to developers.

API to manage AWS workers

As part of our ongoing effort to automate the loaning of releng machines when required, we created an API layer to facilitate the creation and loan of AWS resources, which was previously, and perhaps ironically, one of the bigger time-sinks for buildduty when loaning machines.

Cross-platform worker for task cluster

Release engineering is in the process of migrating from our stalwart, buildbot-driven infrastructure, to a newer, more purpose-built solution in taskcluster. Many FirefoxOS jobs have already migrated, but those all conveniently run on Linux. In order to support the entire range of release engineering jobs, we need support for Mac and Windows as well. In Q1, we created what we call a "generic worker," essentially a base class that allows us to extend taskcluster job support to non-Linux operating systems.


Last, but not least, we deployed initial support for SETA, the search for extraneous test automation!

This means we've stopped running all tests on all builds. Instead, we use historical data to determine which tests have been catching the most regressions and run those on every push. Other tests are run less frequently.

April 20, 2015 11:00 AM

April 15, 2015

Kim Moir (kmoir)

Mozilla pushes - March 2015

Here's March 2015's  monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.

The number of pushes increased from those recorded in the previous month with a total of 10943. 


General Remarks


April 15, 2015 02:18 PM

April 07, 2015

Justin Wood (Callek)

Find our footing on python best practices, of yesteryear.

In the beginning there was fire buildbot. This was Wed, 13 Feb 2008 for the first commit in the repository buildbot-configs.

For context, at this time:

In picking buildbot as our tool we were improving vastly on the decade old technology we had at the time (tinderbox) which was also written in oft-confusing and not-as-shiny perl (we love to hate it now, but it was a good language) [see relevant: image of then-new cutting edge technology but strung together in clunky ways]

As such, we at Mozilla Release Engineering, while just starting to realize the benefits of CI for tests in our main products (like Firefox), were not accustomed to it.

We were writing our buildbot-related code in 3 main repositories at the time (buildbot-configs, buildbotcustom, and tools) all of which we still use today.

Fast forward 5 years and you would have seen some common antipatterns in large codebases… (over 203k lines of code!) It was hard to even read most code, let alone hack on it. Each patch required lots of headspace. And we would consistently break things with patches that were not well tested. (even when we tried)

It was at a workweek here in 2013 that catlee got our group's agreement on trying to improve that situation by continually running autopep8 over the codebase until there were no (or few) changes with each pass.

Thus began our first attempt at bringing our processes up to what we call our modern practices.

This reduced, in buildbotcustom and tools alone, our pep8 error rate from ~7,139 to ~1,999. (In contrast our current rate for those two repos is ~1485).

(NOTE: This is a good contributor piece, to drive pep8 errors/warnings down to 0 for any of our repos, such as these. We can then make our current tests fail if pep8 fails. Though newer repos started with pep8 compliance, older ones did not. See List of Repositories to pick some if you want to try. It's not glorious work, but it makes everyone more productive once it's done.)

The one place we decided pep8 wasn't for us was line length: we have had many cases where a single line (or even url) barely fits in 80 characters for legit reasons, and felt that arbitrarily limiting variable names or depth just to satisfy that restriction was going to reduce readability. Therefore we generally use --max-line-length of ~159 when validating against pep8. (The above numbers do not account for --max-line-length.)
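
If you want to try this on one of our repos, a minimal sketch using autopep8's Python API with that relaxed line length looks like this (it assumes autopep8 is installed; the 159 figure is just the convention described above):

import sys
import autopep8

def needs_pep8_fixes(path, max_line_length=159):
    """Return True if autopep8 would change the file under our relaxed line length."""
    with open(path) as f:
        original = f.read()
    fixed = autopep8.fix_code(original, options={"max_line_length": max_line_length})
    return original != fixed

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print("{}: {}".format(path, "needs fixes" if needs_pep8_fixes(path) else "clean"))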

Around this time we had also set up an internal-only jenkins instance as a test for validating at least pep8 and its trends; we have since found jenkins to not be suitable for what we wanted.

Stay tuned to this blog for more history and how we arrived at some best practices that most don’t take for granted these days.

April 07, 2015 01:52 AM

March 31, 2015

Rail Alliev (rail)

Taskcluster: First Impression

Good news. We decided to redesign Funsize a little and now it uses Taskcluster!

The nature of Funsize is that we may start hundreds of jobs at the same time, then stop sending new jobs and wait for hours. In other words, the service is very bursty. Elastic Beanstalk is not ideal for this use case. Scaling up and down very fast is hard to configure using EB-only tools. Also, running zero instances is not easy.

I tried using Terraform, Cloud Formation and Auto Scaling, but they were also not well suited. There were too many constraints (e.g. Terraform doesn't support all needed AWS features) and they required considerable bespoke setup/maintenance to auto-scale properly.

The next option was Taskcluster, and I was pleased that its design fitted our requirements very well! I was impressed by the simplicity and flexibility offered.

I have implemented a service which consumes Pulse messages for particular buildbot jobs. For nightly builds, it schedules a task graph with three tasks:

  • generate a partial MAR
  • sign it (at the moment a dummy task)
  • publish to Balrog

All tasks are run inside Docker containers which are published on the registry (other registries can also be used). The task definition essentially consists of the docker image name and a list of commands it should run (usually this is a single script inside a docker image). In the same task definition you can specify what artifacts should be published by Taskcluster. The artifacts can be public or private.
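
To make that concrete, here is a rough sketch of the shape of a docker-worker task definition. All names, values and fields here are illustrative (and incomplete); the Taskcluster docs describe the real schema.

# rough, incomplete sketch of a task definition for a partial-update task
partial_mar_task = {
    "provisionerId": "aws-provisioner",
    "workerType": "funsize-worker",                       # example worker type
    "payload": {
        "image": "example/funsize-update-generator:0.1",  # docker image from a registry
        "command": ["/runme.sh"],                         # usually a single script in the image
        "maxRunTime": 3600,
        "artifacts": {
            "public/target.partial.mar": {                # public artifact
                "type": "file",
                "path": "/home/worker/artifacts/target.partial.mar",
            },
        },
    },
    "metadata": {
        "name": "partial MAR generation",
        "description": "Generate a partial update and publish it to Balrog",
        "owner": "releng@example.com",
        "source": "https://example.com/funsize",
    },
}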

Things that I really liked

  • Predefined task IDs. This is a great idea! There is no need to talk to the Taskcluster APIs to get the ID (or multiple IDs for task graphs) nor need to parse the response. Fire and forget! The task IDs can be used in different places, like artifact URLs, dependant tasks, etc.
  • Task graphs. This is basically a collection of tasks that can be run in parallel and can depend on each other. This is a nice way to declare your jobs and know them in advance. If needed, the task graphs can be extended by their tasks (decision tasks) dynamically.
  • Simplicity. All you need is to generate a valid JSON document and submit it using HTTP API to Taskcluster.
  • User defined docker images. One of the downsides of Buildbot is that you have a predefined list of slaves with predefined environment (OS, installed software, etc). Taskcluster leverages Docker by default to let you use your own images.

Things that could be improved

  • Encrypted variables. I spent 2-3 days fighting with the encrypted variables. My scheduler was written in Python, so I tried to use a half dozen different Python PGP libraries, but for some reason all of them were generating an incompatible OpenPGP format that Taskcluster could not understand. This forced me to rewrite the scheduling part in Node.js using openpgpjs. There is a bug to address this problem globally. Also, using ISO time stamps would have saved me hours of time. :)
  • It would be great to have a generic scheduler that doesn't require third party Taskcluster consumers writing their own daemons watching for changes (AMQP, VCS, etc) to generate tasks. This would lower the entry barrier for beginners.


There are many other things that can be improved (and I believe they will!) - Taskcluster is still a new project. Regardless of this, it is very flexible, easy to use and develop. I would recommend using it!

Many thanks to garndt, jonasfj and lightsofapollo for their support!

March 31, 2015 12:47 PM

March 28, 2015

Jordan Lund (jlund)

Mozharness is moving into the forest

Since its beginnings, Mozharness has been living in its own world (repo). That's about to change. Next quarter we are going to be moving it in-tree.

what's Mozharness?

it's a configuration driven script harness

why in tree?

  1. First and foremost: transparency.
    • There is an overarching goal to provide developers the keys to manage and stand up their own builds & tests (AKA self-serve). Having the automation step logic side by side to the compile and test step logic provides developers transparency and a sense of determinism. Which leads to reason number 2.
  2. deterministic builds & tests
    • This is somewhat already in place thanks to Armen's work on pinning specific Mozharness revisions to in-tree revisions. However, the pins can end up behind the latest Mozharness revisions, so we often end up landing multiple changes to Mozharness at once against one in-tree revision.
  3. Mozharness automated build & test jobs are not just managed by Buildbot anymore. Taskcluster is starting to take the weight off Buildbot's hands and, because of its own behaviour, Mozharness is better suited in-tree.
  4. ateam is going to put effort this quarter into unifying how we run tests locally vs automation. Having mozharness in-tree should make this easier

this sounds great. why wouldn't we want to do this?

There are downsides. It arguably puts extra strain on Release Engineering for managing infra health. Though issues will be more isolated, it does become trickier to have a higher view of when and where Mozharness changes land.

In addition, there is going to be more friction for deployments. This is because a number of our Mozharness scripts are not directly related to continuous integration jobs: e.g. releases, vcs-sync, b2g bumper, and merge tasks.

why wasn't this done yester-year?

Mozharness now handles > 90% of our build and test jobs. Its internal components: config, script, and log logic, are starting to mature. However, this wasn't always the case.

When it was being developed and its uses were unknown, it made sense to develop on the side and tie itself close to buildbot deployments.

okay. I'm sold. can we just simply hg add mozharness?

Integrating Mozharness in-tree comes with a few challenges:

  1. chicken and egg issue

    • currently, for build jobs, Mozharness is in charge of managing version control of the tree itself. How can Mozharness checkout a repo if it itself lives within that repo?
  2. test jobs don't require the src tree

    • test jobs only need a binary and a tests package. It doesn't make sense to keep a copy of our branches on each machine that runs tests. In line with that, putting mozharness inside that package also leads us back to a similar 'chicken and egg' issue.
  3. which branch and revisions do our release engineering scripts use?

  4. how do we handle releases?

  5. how do we not cause extra load on hg.m.o?

  6. what about integrating into Buildbot without interruption?

it's easy!

This shouldn't be too hard to solve. Here is a basic outline of my plan of action and road map for this goal:

This is a loose outline of the integration strategy. What I like about this:

  1. no code change required within Mozharness' code
  2. there is very little code change within Buildbot
  3. allows Taskcluster to use Mozharness in whatever way it likes
  4. no chicken and egg problem as (in Buildbot world), Mozharness will exist before the tree exists on the slave
  5. no need to manage multiple repos and keep them in sync

I'm sure I am not taking into account many edge cases and I look forward to hitting those edges head on as I start this in Q2. Stay tuned for further developments.

One day, I'd like to see Mozharness (at least its internal parts) be made into isolated python packages installable by pip. However, that's another problem for another day.

Questions? Concerns? Ideas? Please comment here or in the tracking bug

March 28, 2015 11:10 PM