Planet Release Engineering

March 28, 2015

Jordan Lund (jlund)

Mozharness is moving into the forest

Since its beginnings, Mozharness has been living in its own world (repo). That's about to change. Next quarter we are going to be moving it in-tree.

what's Mozharness?

it's a configuration-driven script harness
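
A minimal, hypothetical sketch of what that means in practice is below; the class, option, and commands are invented for illustration (only the BaseScript pattern itself is real), but it shows the shape of a config-driven script: actions are declared up front, configuration comes from config files and command-line options, and logging and command running are built in.

# illustrative only: a made-up script following the mozharness BaseScript pattern
from mozharness.base.script import BaseScript

class ExampleBuild(BaseScript):
    def __init__(self):
        super(ExampleBuild, self).__init__(
            # each option ends up in self.config, overridable by config files
            config_options=[[
                ["--branch"],
                {"dest": "branch", "default": "mozilla-central",
                 "help": "branch to build"},
            ]],
            # actions map to methods; which actions run is also configuration
            all_actions=["clobber", "build"],
            default_actions=["build"],
        )

    def clobber(self):
        self.rmtree("obj-dir")

    def build(self):
        self.info("building %s" % self.config["branch"])
        self.run_command(["make", "-f", "client.mk"])

if __name__ == "__main__":
    ExampleBuild().run_and_exit()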

why in tree?
  1. First and foremost: transparency.
    • There is an overarching goal to provide developers the keys to manage and stand up their own builds & tests (AKA self-serve). Having the automation step logic side by side with the compile and test step logic gives developers transparency and a sense of determinism, which leads to reason number 2.
  2. deterministic builds & tests
    • This is somewhat already in place thanks to Armen's work on pinning specific Mozharness revisions to in-tree revisions. However, the pins can end up behind the latest Mozharness revisions, so we often end up landing multiple Mozharness changes at once against a single in-tree revision.
  3. Mozharness automated build & test jobs are not just managed by Buildbot anymore. Taskcluster is starting to take the weight off Buildbot's hands and, because of the way it operates, Mozharness is better suited in-tree.
  4. ateam is going to put effort this quarter into unifying how we run tests locally vs in automation. Having mozharness in-tree should make this easier.
this sounds great. why wouldn't we want to do this?

There are downsides. It arguably puts extra strain on Release Engineering for managing infra health. Though issues will be more isolated, it does become trickier to have a higher-level view of when and where Mozharness changes land.

In addition, there is going to be more friction for deployments. This is because a number of our Mozharness scripts are not directly related to continuous integration jobs: e.g. releases, vcs-sync, b2g bumper, and merge tasks.

why wasn't this done yester-year?

Mozharness now handles > 90% of our build and test jobs. Its internal components (config, script, and log logic) are starting to mature. However, this wasn't always the case.

When it was being developed and its uses were unknown, it made sense to develop it on the side and tie it closely to buildbot deployments.

okay. I'm sold. can we just simply hg add mozharness?

Integrating Mozharness in-tree comes with a few challenges:

  1. chicken and egg issue

    • currently, for build jobs, Mozharness is in charge of managing version control of the tree itself. How can Mozharness check out a repo if it itself lives within that repo?
  2. test jobs don't require the src tree

    • test jobs only need a binary and a tests.zip. It doesn't make sense to keep a copy of our branches on each machine that runs tests. In line with that, putting mozharness inside tests.zip also leads us back to a similar 'chicken and egg' issue.
  3. which branch and revisions do our release engineering scripts use?

  4. how do we handle releases?

  5. how do we not cause extra load on hg.m.o?

  6. what about integrating into Buildbot without interruption?

it's easy!

This shouldn't be too hard to solve. Here is a basic outline of my plan of action and roadmap for this goal:

This is a loose outline of the integration strategy. What I like about this approach:

  1. no code change required within Mozharness' code
  2. there is very little code change within Buildbot
  3. allows Taskcluster to use Mozharness in whatever way it likes
  4. no chicken-and-egg problem, as (in the Buildbot world) Mozharness will exist before the tree exists on the slave
  5. no need to manage multiple repos and keep them in sync

I'm sure I am not taking into account many edge cases and I look forward to hitting those edges head on as I start this in Q2. Stay tuned for further developments.

One day, I'd like to see Mozharness (at least its internal parts) be made into isolated python packages installable by pip. However, that's another problem for another day.

Questions? Concerns? Ideas? Please comment here or in the tracking bug

March 28, 2015 11:10 PM

March 26, 2015

Morgan Phillips (mrrrgn)

Whoop, Whoop: Pull Up!

Since December 1st, 1975, by FAA mandate, no plane has been allowed to fly without a "Ground Proximity Warning System" GPWS (or one of its successors).[1] For good reason too, as it has been estimated that 75% of the fatalities just one year prior (1974) could have been prevented using the system.[2]

In a slew of case studies, reviewers reckoned that a GPWS may have prevented crashes by giving pilots additional time to act before they smashed into the ground. Often, the GPWS's signature "Whoop, Whoop: Pull Up!" would have sounded a full fifteen seconds before any other alarms triggered.[3]

Instruments like this are indispensable to aviation because pilots operate in an environment outside of any realm where human intuition is useful. Lacking augmentation, our bodies and minds are simply not suited to the task of flying airliners.

For the same reason, thick layers of instrumentation and early warning systems are necessary for managing technical infrastructure. Like pilots, without proper tooling, system administrators often plow their vessels into the earth....

The St. Patrick's Day Massacre

Case in point: on Saint Patrick's Day we suffered two outages which likely could have been avoided via some additional alerts and a slightly modified deployment process.

The first outage was caused by the accidental removal of a variable from a config file which one of our utilities depends on. Our utilities are all managed by a dependency system called runner, and when any task fails the machine is prevented from doing work until it succeeds. This all-or-nothing behavior is correct, but should not lead to closed trees....

On our runner dashboards, the whole event looked like this (the smooth decline on the right is a fix being rolled out with ansible):


The second, and most severe, outage was caused by an insufficient wait time between retries upon failing to pull from our mercurial repositories.

There was a temporary disruption in service, and a large number of slaves failed to clone a repository. When this herd of machines began retrying the task it became the equivalent of a DDoS attack.

From the repository's point of view, the explosion looked like this:


Then, from runner's point of view, the retrying task:


In both of these cases, despite having the data (via runner logging), we missed the opportunity to catch the problem before it caused system downtime. Furthermore, especially in the first case, we could have avoided the issue even earlier by testing our updates and rolling them out gradually.

Avoiding Future Massacres

After these fires went out, I started working on a RelEng version of the Ground Proximity Warning System, to keep us from crashing in the future. Here's the plan:

1.) Bug 1146974 - Add automated alerting for abnormally high retries (in runner).

In both of the above cases, we realized that things had gone amiss based on job backlog alerts. The problem is, once we have a large enough backlog to trigger those alarms, we're already hosed.

The good news is, the backlog is preceded by a spike in runner retries. Setting up better alerting here should buy us as much as an extra hour to respond to trouble.

We're already logging all task results to influxdb, but alerting on that data requires a custom nagios script. Instead of stringing that together, I opted to write runner output to syslog where it's being aggregated by papertrail.
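
As a rough sketch of that approach (the logger name and message format here are placeholders, not the exact strings our hook emits), pushing task results into syslog from Python only takes a handful of lines:

# placeholder sketch: send runner task results to the local syslog daemon,
# which papertrail then aggregates and makes searchable
import logging
import logging.handlers

log = logging.getLogger("runner")
log.setLevel(logging.INFO)
log.addHandler(logging.handlers.SysLogHandler(address="/dev/log"))

def report(task, result):
    # e.g. "task=clobber result=RETRY" -- grep for "result=RETRY" in papertrail
    log.info("task=%s result=%s", task, result)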

Using papertrail, I can grep for runner retries and build alarms from the data. Below is a screenshot of our runner data in the papertrail dashboard:



2.) Add automated testing, and tiered roll-outs to golden ami generation

Finally, when we update our slave images, the new version is not rolled out in a precise fashion. Instead, as old images die (3 hours after the new image releases), new ones are launched on the latest version. Because of this, every deploy is an all-or-nothing affair.

By the time we notice a problem, almost all of our hosts are using the bad instance and rolling back becomes a huge pain. We also do rollbacks by hand. Nein, nein, nein.

My plan here is to launch new instances with a weighted chance of picking up the latest ami. As we become more confident that things aren't breaking -- by monitoring the runner logs in papertrail/influxdb -- we can increase the percentage.
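
The core of that idea fits in a few lines; this is just a sketch with placeholder AMI ids and a hand-tuned knob, not our actual deployment code:

# sketch of a weighted roll-out: each newly launched instance picks the
# candidate AMI with probability `weight`; ratchet the knob up as confidence
# grows, or drop it to zero to roll back
import random

def choose_ami(latest_ami, previous_ami, weight=0.10):
    return latest_ami if random.random() < weight else previous_ami

# e.g. start 10% of new slaves on the candidate image
print(choose_ami("ami-candidate", "ami-known-good", weight=0.10))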

The new process will work like this:

Lastly, if we want to roll back, we can just lower the percentage down to zero while we figure things out. This also means that we can create sanity checks which roll back bad amis without any human intervention whatsoever.

The intention being, any failure within the first 90 minutes will trigger a rollback and keep the doors open....

March 26, 2015 11:55 PM

Armen Zambrano G. (@armenzg)

mozci 0.4.0 released - Many bug fixes and improved performance

For the release notes with all the hyperlinks, go here.

NOTE: I did a 0.3.1 release but the right number should have been 0.4.0

This release does not add any major features; however, it fixes many issues and has much better performance.

Many thanks to @adusca, @jmaher and @vaibhavmagarwal for their contributions.

Features:


Fixes:



For all changes visit: 0.3.0...0.4.0


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

March 26, 2015 08:36 PM

March 20, 2015

Kim Moir (kmoir)

Scaling Yosemite

We migrated most of our Mac OS X 10.8 (Mountain Lion) test machines to 10.10.2 (Yosemite) this quarter.

This project had two major constraints:
1) Use the existing hardware pool (~100 r5 mac minis)
2) Keep wait times sane.¹  (The machines are constantly running tests most of the day due to the distributed nature of the Mozilla community and this had to continue during the migration.)

So basically upgrade all the machines without letting people notice what you're doing!

Yosemite Valley - Tunnel View Sunrise by ©jeffkrause, Creative Commons by-nc-sa 2.0

Why didn't we just buy more minis and add them to the existing pool of test machines?
  1. We run performance tests and thus need to have all the machines running the same hardware within a pool so performance comparisons are valid.  If we buy new hardware, we need to replace the entire pool at once.  Machines with different hardware specifications = useless performance test comparisons.
  2. We tried to purchase some used machines with the same hardware specs as our existing machines.  However, we couldn't find a source for them.  As Apple stops production of old mini hardware each time they announce a new one, they are difficult and expensive to source.
Apple Pi by ©apionid, Creative Commons by-nc-sa 2.0

Given that Yosemite was released last October, why are we only upgrading our test pool now?  We wait until the population of users running a new platform² surpasses that of the old one before switching.

Mountain Lion -> Yosemite is an easy upgrade on your laptop.  It's not as simple when you're updating production machines that run tests at scale.

The first step was to pull a few machines out of production and verify the Puppet configuration was working.  In Puppet, you can specify commands to only run on certain operating system versions, so we implemented several commands to accommodate changes for Yosemite. For instance, the default scrollbar behaviour had to be changed, new services that interfere with test runs needed to be disabled, debug tests required new Apple security permissions to be configured, etc.

Once the Puppet configuration was stable, I updated our configs so that people could run tests on Try and allocated a few machines to this pool. We opened bugs for tests that failed on Yosemite but passed on other platforms.  This was a very iterative process.  Run tests on try.  Look at failures, file bugs, fix test manifests. Once we had the opt (functional) tests in a green state on try, we could start the migration.

Migration strategy
We currently have 14 machines left on Mountain Lion for mozilla-beta and mozilla-release branches.

As I mentioned earlier, the two constraints with this project were to use the existing hardware pool that constantly runs tests in production and keep the existing wait times sane.  We encountered two major problems that impeded that goal:

It's a compliment when people say things like "I didn't realize that you updated a platform" because it means the upgrade did not cause large scale fires for all to see.  So it was nice to hear that from one of my colleagues this week.

Thanks to philor, RyanVM and jmaher for opening bugs with respect to failing tests and greening them up.  Thanks to coop for many code reviews. Thanks dividehex for reimaging all the machines in batches and to arr for her valiant attempts to source new-to-us minis!

References
¹ Wait times represent the time from when a job is added to the scheduler database until it actually starts running. We usually try to keep this to under 15 minutes but this really varies depending on how many machines we have in the pool.
² We run tests for our products on a matrix of operating systems and operating system versions. The terminology for operating system x version in many release engineering shops is a platform.  To add to this, the list of platforms we support varies across branches.  For instance, if we're going to deprecate a platform, we'll let this change ride the trains to release.

Further reading
Bug 1121175: [Tracking] Fix failing tests on Mac OSX 10.10 
Bug 1121199: Green up 10.10 tests currently failing on try 
Bug 1126493: rollout 10.10 tests in a way that doesn't impact wait times
Bug 1144206: investigate what is causing frequent talos failures on 10.10
Bug 1125998: Debug tests initially took 1.5-2x longer to complete on Yosemite


Why don't you just run these tests in the cloud?
  1. The Apple EULA severely restricts virtualization on Mac hardware. 
  2. I don't know of any major cloud vendors that offer the Mac as a platform.  Those that claim they do are actually renting racks of Macs on a dedicated per host basis.  This does not have the inherent scaling and associated cost saving of cloud computing.  In addition, the APIs to manage the machines at scale aren't there.
  3. We manage ~350 Mac minis.  We have more experience scaling Apple hardware than many vendors. Not many places run CI at Mozilla scale :-) Hopefully this will change and we'll be able to scale testing on Mac products like we do for Android and Linux in a cloud.

March 20, 2015 06:50 PM

March 17, 2015

Kim Moir (kmoir)

Mozilla pushes - February 2015

Here's February's 2015 monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.

Trends
Although February is a shorter month, the number of pushes was close to that recorded in the previous month.  We had a higher average number of daily pushes (358) than in January (348).

Highlights
10015 pushes
358 pushes/day (average)
Highest number of pushes/day: 574 pushes on Feb 25, 2015
23.18 pushes/hour (highest)

General Remarks
Try had around 46% of all the pushes
The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 22% of all the pushes

Records
August 2014 was the month with the most pushes (13,090 pushes)
August 2014 has the highest pushes/day average with 422 pushes/day
July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
October 8, 2014 had the highest number of pushes in one day with 715 pushes 





March 17, 2015 03:54 PM

March 11, 2015

Chris Cooper (coop)

Better releng patch contribution workflow

Releng systems are complex, with many moving parts. Making changes to these systems used to be fraught with peril. Even for people who had been working with buildbot for many years, it was hard to write a patch and be certain that it wouldn't break some downstream system.

Last year, releng introduced a suite of unittests for the repository that was the worst offender, buildbot-configs. These tests run under tox. We then took the next logical step, hooking those unittests up to travis. Now every time code lands in buildbot-configs — and more recently in buildbotcustom, mozharness, tools, and more — travis tests get run automatically against the github mirror once the commit is synced from hg.

However, this workflow is still not ideal. These tests only run on code that has already made it into the system. Patch authors needed to set up their own testing environments for working with hg. It would be better if contributors could get feedback *before* submitting a patch for review and without needing to set up the test environment themselves beforehand. If they could run the exact same tests, they could then include links to the test results for their patch along with their review request, making the reviewer's job much simpler. Some might simply call this a "modern workflow."

This modern workflow doesn’t exist with hg (we never wrote it), but we get it almost by accident if we switch to using github for development.

Today, if a user forks the buildbot-configs repo (or any of the other repos set up with travis) and has enabled travis testing on the fork, they get the same tests run against their fork automatically when they commit. This will be a boon for releng in terms of increasing the velocity of testing and reviewing patches. It also makes it dead-simple for people *outside* of releng (sheriffs, a-team, …) to contribute solid patches.

We are not set up for pull requests yet, but releng *is* planning to switch our repositories-of-record (RoRs) from hg to github. On the list of things we still need to figure out is how release tagging would work when github is our new RoR. We are also in the midst of adding even more useful informational output to travis, namely diffs of builder lists and master configurations caused by a given commit. This is one of the key elements we look at when evaluating the fitness of a particular patch.

If you’ve been holding off on getting your hands dirty with releng automation, there’s no better time. Get forking!

March 11, 2015 03:42 PM

March 10, 2015

Hal Wine (hwine)

Docker at Vungle

Docker at Vungle

Tonight I attended the San Francisco Dev Ops meetup at Vungle. The topic was one we often discuss at Mozilla - how to simplify a developer’s life. In this case, the solution they have migrated to is one based on Docker, although I guess the title already gave that away.

Long (but interesting - I'll update with a link to the video when it becomes available) story short, they are having much more success using DevOps-managed Docker containers for development than their previous setup of Virtualbox images built & maintained with Vagrant and Chef.

Vungle’s new hire setup:
  • install Boot2Docker (they are an all Mac dev shop)
  • clone the repository. [1]
  • run a docker.sh script which pulls all the base images from DockerHub. This one-time image pull gives the new hire time to fill out HR paperwork ;)
  • launch the app in the container and start coding.

Sigh. That’s nice. When you come back from PTO, just re-run the script to get the latest updates - it won’t take nearly as long as only the container deltas need to come down. Presto - back to work!

A couple of other highlights – I hope to do a more detailed post later.

  • They follow the ‘each container has a single purpose’ approach.
  • They use “helper containers” to hold recent (production) data.
  • Devs have a choice in front end development: inside the container (limited tooling) or in the local filesystem (dev’s choice of IDE, etc.). [2]
  • Currently, Docker containers are only being used in development. They are looking down the road to deploying containers in production, but it’s not a major focus at this time.

Footnotes

[1]Thanks to BFG for clarifying that docker-foo is kept in a separate repository from source code. The docker.sh script is in the main source code repository. [Updated 2015-03-11]
[2]More on this later. There are some definite tradeoffs.

March 10, 2015 07:00 AM

March 06, 2015

Armen Zambrano G. (@armenzg)

How to generate data potentially useful to a dynamically generated trychooser UI

If you're interested in generating an up-to-date trychooser, I would love to hear from you.
adusca has helped me generate data similar to what a dynamic trychooser UI could use.
If you would like to help, please visit bug 983802 and let us know.

In order to generate the data all you have to do is:
git clone https://github.com/armenzg/mozilla_ci_tools.git
cd mozilla_ci_tools
python setup.py develop
python scripts/misc/write_tests_per_platform_graph.py

That's it! You will then have a graphs.json dictionary with some of the pieces needed. Once we have an idea of how to generate the UI and what we're missing, we can modify this script.

Here's some of the output:
{
    "android": [
        "cppunit",
        "crashtest",
        "crashtest-1",
        "crashtest-2",
        "jsreftest-1",
        "jsreftest-2",
...

Here are the remaining keys:
[u'android', u'android-api-11', u'android-api-9', u'android-armv6', u'android-x86', u'emulator', u'emulator-jb', u'emulator-kk', u'linux', u'linux-pgo', u'linux32_gecko', u'linux64', u'linux64-asan', u'linux64-cc', u'linux64-mulet', u'linux64-pgo', u'linux64_gecko', u'macosx64', u'win32', u'win32-pgo', u'win64', u'win64-pgo']


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

March 06, 2015 04:41 PM

March 05, 2015

Armen Zambrano G. (@armenzg)

mozci 0.3.0 - Support for backfilling jobs on treeherder added

Sometimes on treeherder, jobs get coalesced (a.k.a. we only run the tests on the most recent revision) in order to handle load. This is good because it lets us catch up when many pushes are committed on a tree.

However, when a job run on the most recent code comes back failing, we need to find out which revision introduced the regression. This is when we need to backfill up to the last good run.

In this release of mozci we have added the ability to --backfill:
python scripts/trigger_range.py --buildername "b2g_ubuntu64_vm cedar debug test gaia-js-integration-5" --dry-run --revision 2dea8b3c6c91 --backfill
This should be especially useful for sheriffs.

You can start using mozci as long as you have LDAP credentials. Follow these steps to get started:
git clone https://github.com/armenzg/mozilla_ci_tools.git
python setup.py develop (or install)


Release notes

Thanks again to vaibhav1994 and adusca for their many contributions in this release.

Major changes
  • Issue #75 - Added the ability to backfill changes until last good is found
  • No need to use --repo-name anymore
  • Issue #83 - Look for request_ids from a better place
  • Add interface to get status information instead of scheduling info
Minor fixes:
  • Fixes to make livehtml documentation
  • Make determine_upstream_builder() case insensitive
      Release notes: https://github.com/armenzg/mozilla_ci_tools/releases/tag/0.3.0
      PyPi package: https://pypi.python.org/pypi/mozci/0.3.0
      Changes: https://github.com/armenzg/mozilla_ci_tools/compare/0.2.5...0.3.0




      Creative Commons License
      This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

      March 05, 2015 04:19 PM

      March 03, 2015

      Armen Zambrano G. (@armenzg)

      mozci 0.2.5 released - major bug fixes + many improvements

        Big thanks again to vaibhav1994, adusca and valeriat for their many contributions in this release.

      Release notes

      Major bug fixes:
      • Bug fix: Sort pushid_range numerically rather than alphabetically
      • Calculation of hours_ago would not take days into consideration
      Others:
      • Added coveralls/coverage support
      • Added "make livehtml" for live documentation changes
      • Improved FAQ
      • Updated roadmap
      • Large documentation refactoring
      • Automatically document scripts
      • Added partial testing of mozci.mozci
      • Streamed fetching of allthethings.json and verify integrity
      • Clickable treeherder links
      • Added support for zest.releaser
        Release notes: https://github.com/armenzg/mozilla_ci_tools/releases/tag/0.2.5
        PyPi package: https://pypi.python.org/pypi/mozci/0.2.5
        Changes: https://github.com/armenzg/mozilla_ci_tools/compare/0.2.4...0.2.5


        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        March 03, 2015 10:46 PM

        How to generate allthethings.json

        It's this easy!
            hg clone https://hg.mozilla.org/build/braindump
            cd braindump/community
            ./generate_allthethings_json.sh

        allthethings.json is generated based on data from buildbot-configs.
        It contains data about builders, schedulers, masters and slavepools.

        If you want to extract information from allthethings.json feel free to use mozci to help you!
        https://mozilla-ci-tools.readthedocs.org/en/latest/allthethings.html


        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        March 03, 2015 04:54 PM

        February 27, 2015

        Chris AtLee (catlee)

        Diving into python logging

        Python has a very rich logging system. It's very easy to add structured or unstructured log output to your python code, and have it written to a file, or output to the console, or sent to syslog, or to customize the output format.

        We're in the middle of re-examining how logging works in mozharness to make it easier to factor-out code and have fewer mixins.

        Here are a few tips and tricks that have really helped me with python logging:

        There can be only more than one

        Well, there can be only one logger with a given name. There is a special "root" logger with no name. Multiple getLogger(name) calls with the same name will return the same logger object. This is an important property because it means you don't need to explicitly pass logger objects around in your code. You can retrieve them by name if you wish. The logging module is maintaining a global registry of logging objects.

        You can have multiple loggers active, each specific to its own module or even class or instance.

        Each logger has a name, typically the name of the module it's being used from. A common pattern you see in python modules is this:

        # in module foo.py
        import logging
        log = logging.getLogger(__name__)
        

        This works because inside foo.py, __name__ is equal to "foo". So inside this module the log object is specific to this module.

        Loggers are hierarchical

        The names of the loggers form their own namespace, with "." separating levels. This means that if you have loggers called foo.bar and foo.baz, you can do things on logger foo that will impact both of the children. In particular, you can set the logging level of foo to show or ignore debug messages for both submodules.

        # Let's enable all the debug logging for all the foo modules
        import logging
        logging.getLogger('foo').setLevel(logging.DEBUG)
        

        Log messages are like events that flow up through the hierarchy

        Let's say we have a module foo.bar:

        import logging
        log = logging.getLogger(__name__)  # __name__ is "foo.bar" here
        
        def make_widget():
            log.debug("made a widget!")
        

        When we call make_widget(), the code generates a debug log message. Each logger in the hierarchy has a chance to output something for the message, ignore it, or pass the message along to its parent.

        The default configuration for loggers is to have their levels unset (or set to NOTSET). This means the logger will just pass the message on up to its parent. Rinse & repeat until you get up to the root logger.

        So if the foo.bar logger hasn't specified a level, the message will continue up to the foo logger. If the foo logger hasn't specified a level, the message will continue up to the root logger.

        This is why you typically configure the logging output on the root logger; it typically gets ALL THE MESSAGES!!! Because this is so common, there's a dedicated method for configuring the root logger: logging.basicConfig()

        This also allows us to use mixed levels of log output depending on where the messages are coming from:

        import logging
        
        # Enable debug logging for all the foo modules
        logging.getLogger("foo").setLevel(logging.DEBUG)
        
        # Configure the root logger to log only INFO calls, and output to the console
        # (the default)
        logging.basicConfig(level=logging.INFO)
        
        # This will output the debug message
        logging.getLogger("foo.bar").debug("ohai!")
        

        If you comment out the setLevel(logging.DEBUG) call, you won't see the message at all.

        exc_info is teh awesome

        All the built-in logging calls support a keyword called exc_info, which, if it isn't false, causes the current exception information to be logged in addition to the log message. e.g.:

        import logging
        logging.basicConfig(level=logging.INFO)
        
        log = logging.getLogger(__name__)
        
        try:
            assert False
        except AssertionError:
            log.info("surprise! got an exception!", exc_info=True)
        

        There's a special case for this, log.exception(), which is equivalent to log.error(..., exc_info=True).

        Python 3.2 introduced a new keyword, stack_info, which will output the current call stack along with the log message. Very handy to figure out how you got to a certain point in the code, even if no exceptions have occurred!
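
        For example (Python 3.2+ only), this logs the message plus the frames that led to the call:

        import logging
        logging.basicConfig(level=logging.INFO)

        def inner():
            # stack_info=True appends the current call stack to the record
            logging.info("how did I get here?", stack_info=True)

        def outer():
            inner()

        outer()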

        "No handlers found..."

        You've probably come across this message, especially when working with 3rd party modules. What this means is that you don't have any logging handlers configured, and something is trying to log a message. The message has gone all the way up the logging hierarchy and fallen off the...top of the chain (maybe I need a better metaphor).

        import logging
        log = logging.getLogger()
        log.error("no log for you!")
        

        outputs:

        No handlers could be found for logger "root"
        

        There are two things that can be done here:

        1. Configure logging in your module with basicConfig() or similar

        2. Library authors should add a NullHandler at the root of their module to prevent this. See the cookbook and this blog for more details.
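
        The library-author fix is a one-liner at the top of the package:

        # in your library's root module: swallow messages until the
        # application configures logging itself
        import logging
        logging.getLogger(__name__).addHandler(logging.NullHandler())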

        Want more?

        I really recommend that you read the logging documentation and cookbook which have a lot more great information (and are also very well written!) There's a lot more you can do, with custom log handlers, different output formats, outputting to many locations at once, etc. Have fun!

        February 27, 2015 09:09 PM

        February 25, 2015

        Massimo Gervasini (mgerva)

        on mixins

        We use mixins quite a lot in mozharness.

        Mixins are a powerful pattern that allow you to extend your objects, reusing your code (more here). Think about mixins as "plugins": you can create your custom class and import features just by inheriting from a Mixin class, for example:

        class B2GBuild(LocalesMixin, PurgeMixin, B2GBuildBaseScript,
                       GaiaLocalesMixin, SigningMixin, MapperMixin, BalrogMixin):
        

        B2GBuild manages FirefoxOS builds and it knows how to:
        * manage locales (LocalesMixin)
        * deal with repositories (PurgeMixin)
        * sign the code (SigningMixin)
        * and more…

        this is just from the class definition! At this point we haven't added a single method or property, but we already know how to do a lot of tasks, and it's almost for free!

        So should we use mixins everywhere? Short answer: No.
        Long answer: Mixins are powerful, but they can also lead to some unexpected behavior.

        Objects C and D below have exactly the same parents and the same methods, but their behavior is different: it depends on the order in which the parents are declared.
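
        Here is a tiny, generic example of that effect; the class names are made up purely to illustrate the point:

        class Greeter(object):
            def greet(self):
                return "hello"

        class Shouter(object):
            def greet(self):
                return "HELLO"

        # same parents, same methods -- only the declaration order differs
        class C(Greeter, Shouter):
            pass

        class D(Shouter, Greeter):
            pass

        print(C().greet())  # "hello"  -- Greeter comes first in C's MRO
        print(D().greet())  # "HELLO"  -- Shouter comes first in D's MRO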

        This is a side effect of the way python implements inheritance. Having an object inherit from too many Mixins can lead to unexpected failures (MRO – method resolution order) when the object is instantiated, or even worse, at runtime when a method is doing something that is not expected.
        When the inheritance becomes obscure, it also becomes difficult to write appropriate tests.

        How can we write a mozharness module without using mixins? Let's try to write a generic module that provides some disk information. For example, we could create the mozharness.base.diskutils module that provides useful information about the disk size. Our first approach would be to write something like:

        class DiskInfoMixin():
            def get_size(self, path):
                self.info('calculating disk size')
                <code here>
        
            def other_methods(self):
                <code here>
        

        and then use it in the final class

        from mozharness.base.diskutils import DiskInfoMixin
        ...
        
        class BuildRepackages(ScriptMixin, LogMixin, ..., DiskInfoMixin):
        ...
            disk_info = self.get_size(path)
        

        Easy! But why are we using a mixin here? Because we need to log some operations and to do so, we need to interact with the LogMixin. This mixin provides everything we need to log messages with mozharness: it provides an abstraction layer to make logging consistent among all the mozharness scripts, and it's very easy to use, just import the LogMixin and start logging!
        The same code without using the LogMixin would more or less be:

        import logging
        
        def get_size(path):
            logging.info('calculating disk size')
            ...
            return disk_size

        Just a function. Even easier.

        … and the final script becomes:

        from mozharness.base.diskutils import get_size
        class BuildRepackages(ScriptMixin, LogMixin, ...,):
        ...
             disk_info = get_size(path)

        One less mixin!
        There’s a problem though. Messages logged by get_size() will be inconsistent with the rest of the logging. How can we use the mozharness logging style in other modules?
        The LogMixin, it’s a complex class and it has many methods, but at the end of the day it’s a wrapper around the logging module, so behind the scenes, it must call the logger module. What if we can just ask our logger to use the python log facilities, already configured by mozharness?
        getLogger() method is what we need here!

        import logging
        mhlog = logging.getLogger('Multi')
        def get_size(path):
            mhlog.info('calculating disk size')
            ...
            return disk_size

        Mozharness by default uses this 'Multi' logger for its messages, so we have just hooked our logger up to the mozharness one. Now every logger call will follow the mozharness style!
        We are halfway through the logging issues for our brand new module: what if we want to log at an arbitrary level? A quite common pattern in mozharness is to let the caller of a function decide at what level to log, so let's add a log_level parameter…

        import logging
        mhlog = logging.getLogger('Multi')
        def get_size(path, log_level=logging.INFO):
            mhlog.log(lvl=log_level, msg='calculating disk size')
            ...
            return disk_size

        This will work fine for a generic module, but we want to use this module in mozharness, so there's only one more thing to change: mozharness log levels are strings, logging module levels are integers, so we need a function to convert between the two formats.
        For convenience, in mozharness.base.log we will explicitly expose the mozharness log levels and add a function that converts mozharness log levels to standard log levels.

        LOG_LEVELS = {
            DEBUG: logging.DEBUG,
            INFO: logging.INFO,
            WARNING: logging.WARNING,
            ERROR: logging.ERROR,
            CRITICAL: logging.CRITICAL,
            FATAL: FATAL_LEVEL
        }
        
        def numeric_log_level(level):
            """Converts a mozharness log level (string) to the corresponding logger
               level (number). This function makes possible to set the log level
               in functions that do not inherit from LogMixin
            """
            return LOG_LEVELS[level]
        

        our final module becomes:

        import logging
        from mozharness.base.log import INFO, numeric_log_level
        # use mozharness log
        mhlog = logging.getLogger('Multi')
        
        def get_size(path, unit, log_level=INFO):
            ...
            lvl = numeric_log_level(log_level)
            mhlog.log(lvl=lvl, msg="calculating disk size")

        This is just an example of how to use the standard python logging module.
        A real diskutils module is about to land in mozharness (bug 1130336), and it shouldn't be too difficult, following the same pattern, to create new modules with no dependencies on the LogMixin.

        This is a first step in the direction of removing some mixins from the mozharness code (see bug 1101183).
        Mixins are not absolute evil, but they must be used carefully. From now on, if I have to write or modify anything in a mozharness module I will try to enforce the following rules:


        February 25, 2015 05:00 PM

        Kim Moir (kmoir)

        Release Engineering special issue now available

        The release engineering special issue of IEEE software was published yesterday. (Download pdf here).  This issue focuses on the current state of release engineering, from both an industry and research perspective. Lots of exciting work happening in this field!

        I'm interviewed in the roundtable article on the future of release engineering, along with Chuck Rossi of Facebook and Boris Debic of Google.  Interesting discussions on the current state of release engineering at organizations that scale to large numbers of builds and tests, and release frequently.  As well, the challenges with mobile releases versus web deployments are discussed. And finally, a discussion of how to find good release engineers, and what the future may hold.

        Thanks to the other guest editors on this issue -  Stephany Bellomo, Tamara Marshall-Klein, Bram Adams, Foutse Khomh and Christian Bird - for all their hard work that made this happen!


        As an aside, when I opened the issue, the image on the front cover made me laugh.  It's reminiscent of the cover on a mid-century science fiction anthology.  I showed Mr. Releng and he said "Robot birds? That is EXACTLY how I pictured working in releng."  Maybe it's meant to represent that we let software fly free.  In any case, I must go back to tending the flock of robotic avian overlords.

        February 25, 2015 03:26 PM

        February 24, 2015

        Armen Zambrano G. (@armenzg)

        Listing builder differences for a buildbot-configs patch improved

        Up until now, we updated the buildbot-configs repository to the "default" branch instead of "production" since we normally write patches against that branch.

        However, there is a problem with this: buildbot-configs always has to be on the same branch as buildbotcustom. Otherwise, we can have changes land in one repository which require changes in the other one.

        The fix was to simply make sure that both repositories are either on default or their associated production branches.

        Besides this fix, I have landed two more changes:

        1. Use the production branches instead of 'default'
          • Use -p
        2. Clobber our whole set up (e.g. ~/.mozilla/releng)
          • Use -c

        Here are the two changes:
        https://hg.mozilla.org/build/braindump/rev/7b93c7b7c46a
        https://hg.mozilla.org/build/braindump/rev/bbb5c54a7d42


        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        February 24, 2015 09:45 PM

        February 23, 2015

        Nick Thomas (nthomas)

        FileMerge bug

        FileMerge is a nice diff and merge tool for OS X, and I use it a lot for larger code reviews where lots of context is helpful. It also supports intra-line diff, which comes in pretty handy.

        filemerge screenshot

        However in recent releases, at least in v2.8 which comes as part of XCode 6.1, it assumes you want to be merging and shows that bottom pane. Adjusting it away doesn’t persist to the next time you use it, *gnash gnash gnash*.

        The solution is to open a terminal and offer this incantation:

        defaults write com.apple.FileMerge MergeHeight 0

        Unfortunately, if you use the merge pane then you’ll have to do that again. Dear Apple, pls fix!

        February 23, 2015 09:23 AM

        February 15, 2015

        Rail Alliev (rail)

        Funsize hacking

        Prometheus

        The idea of using a service which can generate partial updates for Firefox has been around for years. We actually used to have a server called Prometheus that was responsible for generating updates for nightly builds and the generation was done as a separate process from actual builds.

        Scaling that solution wasn't easy and we switched to build-time update generation. Generating updates as a part of builds helped with load distribution, but lacked flexibility: there is no easy way to generate updates after the build, because the update generation process is directly tied to the build or repack process.

        Funsize will solve the problems listed above: it distributes load and is flexible.

        Last year Anhad started and Mihai continued working on this project. They have done a great job and created a solution that can easily be scaled.

        Funsize is split into several pieces:

        • REST API frontend powered by Flask. It's responsible for accepting partial generation requests, forwarding them to the queue and returning generated partials.
        • Celery-based workers to generate partial updates and upload them to S3.
        • SQS or RabbitMQ to coordinate Celery workers.

        One of the biggest gains of Funsize is that it uses a global cache to speed up partial generation. For example, after we build an en-US Windows build, we ask Funsize to generate a partial. Then a swarm of L10N repacks (almost a hundred of them per platform) tries to do a similar job. Every single one asks for a partial update. All L10N builds have something in common, and xul.dll is one of the biggest files. Since the files are identical there is no reason not to reuse the previously generated binary patch for that file. Repeat 100 times for multiple files. PROFIT!
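
        Conceptually, the cache lookup is keyed on the content of the "from" and "to" files, something like the sketch below (the real service keeps patches in S3 and shells out to the MAR/bsdiff tooling; the dict and function names here are only illustrative):

        import hashlib

        cache = {}  # stand-in for the real S3-backed cache

        def file_hash(path):
            with open(path, "rb") as f:
                return hashlib.sha1(f.read()).hexdigest()

        def get_patch(src, dst, generate):
            # identical (src, dst) pairs across 100 L10N repacks hit the cache
            # after the first, expensive, generation
            key = (file_hash(src), file_hash(dst))
            if key not in cache:
                cache[key] = generate(src, dst)
            return cache[key]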

        The first prototype of Funsize lives at github. If you are interested in hacking, read the docs on how to set up your developer environment. If you don't have an AWS account, it will use a local cache.

        Note: this prototype may be redesigned and switch to using TaskCluster. Taskcluster is going to simplify the initial design and reduce dependency on always online infrastructure.

        February 15, 2015 04:32 AM

        February 13, 2015

        Armen Zambrano G. (@armenzg)

        Mozilla CI tools 0.2.1 released - Trigger multiple jobs for a range of revisions

        Today I have released a major release of mozci which includes the following:


        PyPi:       https://pypi.python.org/pypi/mozci
        Source:   https://github.com/armenzg/mozilla_ci_tools


        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        February 13, 2015 04:14 PM

        Kim Moir (kmoir)

        Mozilla pushes - January 2015

        Here's January 2015's monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.

        Trends
        We're back to regular volume after the holidays. Also, it's really cold outside in some parts of the Mozilla world.  Maybe committing code > going outside.


        Highlights
        10798 pushes
        348 pushes/day (average)
        Highest number of pushes/day: 562 pushes on Jan 28, 2015
        18.65 pushes/hour (highest)

        General Remarks
        Try had around 42% of all the pushes
        The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 24% of all the pushes

        Records
        August 2014 was the month with most pushes (13,090  pushes)
        August 2014 has the highest pushes/day average with 422 pushes/day
        July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
        October 8, 2014 had the highest number of pushes in one day with 715 pushes 




        February 13, 2015 04:13 PM

        February 09, 2015

        Morgan Phillips (mrrrgn)

        Gödel, Docker, Bach: Containers Building Containers

        As Docker continues to mature, many organizations are striving to run as much of their infrastructure as possible within containers. Of course, this investment results in a lot of docker-centric tooling for deployment, development, etc...

        Given that, I think it makes a lot of sense for docker containers themselves to be built within other docker containers. Otherwise, you'll introduce a needless exception into your automation practices. Boo to that!

        There are a few ways to run docker from within a container, but here's a neat way that leaves you with access to your host's local images: just mount the docker socket and binary from your host system.

        ** note: in cases where arbitrary users can push code to your containers, this would be a dangerous thing to do **
        Et voila!



        February 09, 2015 10:11 PM

        Introducing RelEng Containers: Build Firefox Consistently (For A Better Tomorrow)

        From time to time, Firefox developers encounter errors which only appear on our build machines. Meaning -- after they've likely already failed numerous times to coax the failure from their own environment -- they must resort to requesting RelEng to pluck a system from our infrastructure so they can use it for debugging: we call this a slave loan, and they happen frequently.

        Case in point: bug #689291

        Firefox is a huge open source project: slave loans can never scale enough to serve our community. So, this weekend I took a whack at solving this problem with Docker. So far, five [of an eventual fourteen] containers have been published, which replicate the following aspects of our in-house build environments:

        As usual, you can find my scratch work on GitHub: mozilla/build-environments




        What Are These Environments Based On?

        For a long time, builds have taken place inside of chroots built with Mock. We have three bare-bones mock configs which I used to bake some base platform images. On top of our Mock configs, we further specialize build chroots via build scripts powered by Mozharness. The specifications of each environment are laid out in these mozharness configs. To make use of these, I wrote a simple script which converts a mozharness config into a Dockerfile.
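
        A stripped-down sketch of that conversion is below; the config keys ("mock_packages", "mock_files") are the flavour of thing the real configs contain, but treat the exact key names and the base image as assumptions rather than the literal script:

        # illustrative only: turn a mozharness-style config dict into a Dockerfile
        def config_to_dockerfile(config, base_image):
            lines = ["FROM %s" % base_image]
            packages = config.get("mock_packages", [])
            if packages:
                lines.append("RUN yum install -y %s" % " ".join(packages))
            for src, dst in config.get("mock_files", []):
                lines.append("ADD %s %s" % (src, dst))
            return "\n".join(lines) + "\n"

        print(config_to_dockerfile({"mock_packages": ["gcc", "ccache"]},
                                   "releng/centos6-build"))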

        The environments I've published so far:

        The next step, before I publish more containers, will be to write some documentation for developers so they can begin using them for builds with minimal hassle. Stay tuned!

        February 09, 2015 06:09 AM

        February 06, 2015

        Hal Wine (hwine)

        Kaizen the low tech way

        Kaizen the low tech way

        On Jan 29, I treated myself to a seminar on Successful Lean Teams, with an emphasis on Kanban & Kaizen techniques. I’d read about both, but found the presentation useful. Many of the other attendees were from the Health Care industry and their perspectives were very enlightening!

        Hearing how successful they were in such a high risk, multi-disciplinary, bureaucratic, and highly regulated environment is inspiring. I’m inclined to believe that it would also be achievable in a simple-by-comparison low risk environment of software development. ;)

        What these hospitals are using is a light weight, self managed process which:

        • ensures visibility of changes to all impacted folks
        • outlines the expected benefits
        • includes a “trial” to ensure the change has the desired impact
        • has a built in feedback system

        That sounds achievable. In several of the settings, the traditional paper and bulletin board approach was used, with 4 columns labeled “New Ideas”, “To Do”, “Doing”, and “Done”. (Not a true Kanban board for several reasons, but Trello would be a reasonable visual approximation; CAB uses spreadsheets.)

        Cards move left to right, and could cycle back to “New Ideas” if iteration is needed. “New Ideas” is where things start, and they transition from there (I paraphrase a lot in the following):

        1. Everyone can mark up cards in New Ideas & add alternatives, etc.
        2. A standup is held to select cards to move from “New Ideas” to “To Do”
        3. The card stays in “To Do” for a while to allow concerns to be expressed by other stake holders. Also a team needs to sign up to move the change through the remaining steps. Before the card can move to “Doing”, a “test” (pilot or checkpoints) is agreed on to ensure the change can be evaluated for success.
        4. The team moves the card into "Doing", and performs PDSA cycles (Plan, Do, Study, Adjust) as needed.
        5. Assuming the change yields the projected results, the change is implemented and the card is moved to “Done”. If the results aren’t as anticipated, the card gets annotated with the lessons learned, and either goes to “Done” (abandon) or back to “New Ideas” (try again) as appropriate.

        For me, I’m drawn to the 2nd and 3rd steps. That seems to be the change from current practice in teams I work on. We already have a gazillion bugs filed (1st step). We also can test changes in staging (4th step) and update production (5th step). Well, okay, sometimes we skip the staging run. Occasionally that *really* bites us. (Foot guns, foot guns – get your foot guns here!)

        The 2nd and 3rd steps help focus on changes. And make the set of changes happening “nowish” more visible. Other stakeholders then have a small set of items to comment upon. Net result - more changes “stick” with less overall friction.

        Painting with a broad brush, this Kaizen approach is essentially the CAB process that Mozilla IT implemented successfully. I have seen the CAB reduce the amount of stress, surprises, and self-inflicted damage both inside and outside of IT. Over time, the velocity of changes has increased and backlogs have been reduced. In short, it is a "Good Thing(tm)".

        So, I’m going to see if there is a way to “right size” this process for the smaller teams I’m on now. Stay tuned....

        February 06, 2015 08:00 AM

        February 04, 2015

        Rail Alliev (rail)

        Deploying your code from github to AWS Elastic Beanstalk using Travis

        I have been playing with Funsize a lot recently. One of the goals was iterating faster:

        I have hit some challenges with both Travis and Elastic Beanstalk.

        The first challenge was to run the integration (actually end-to-end) tests in the same environment. Funsize uses Docker for both hacking and production environments. Unfortunately it's not possible to create Docker images as a part of a Travis job (there is an option to run jobs inside Docker, but this is a different beast).

        A simple bash script works around this problem. It starts all services we need in background and runs the end-to-end tests. The end-to-end test asks Funsize to generate several partial MAR files, downloads identical files from Mozilla's FTP server and compares their content skipping the cryptographic signature (Funsize does not sign MAR files).

        The next challenge was deploying the code. We use Elastic Beanstalk as a convenient way to run simple services. There is a plan to use something else for Funsize, but at the moment it's Elastic Beanstalk.

        Travis has support for Elastic Beanstalk, but it's still experimental and at the moment of writing this post there was no documentation on the official website. The .travis.yml file looks straightforward and worked fine. The only minor issue I hit was a long commit message.

        # .travis.yml snippet
        deploy:
            - provider: elasticbeanstalk
              app: funsize # Elastic Beanstalk app name
              env: funsize-dev-rail # Elastic Beanstalk env name
              bucket_name: elasticbeanstalk-us-east-1-314336048151 # S3 bucket used by Elastic Beanstalk
              region: us-east-1
              access_key_id:
                secure: "encrypted key id"
              secret_access_key:
                secure: "encrypted key"
              on:
                  repo: rail/build-funsize # Deploy only using my user repo for now
                  all_branches: true
                  # deploy only if particular jobs in the job matrix passes, not any
                  condition: $FUNSIZE_S3_UPLOAD_BUCKET = mozilla-releng-funsize-travis
        

        Having the credentials in a public version control system, even if they are encrypted, makes me very nervous. To minimize possible harm in case something goes wrong I created a separate user in AWS IAM. I couldn't find any decent docs on what permissions a user should have to be able to deploy something to Elastic Beanstalk. It took a while to figure out this minimal set of permissions. Even with these permissions the user looks very powerful, despite access being limited to EB, S3, EC2, Auto Scaling and CloudFormation.

        Conclusion: using Travis for Elastic Beanstalk deployments is quite stable and easy to use (after the initial setup) unless you are paranoid about some encrypted credentials being available on github.

        February 04, 2015 02:09 AM

        February 03, 2015

        Armen Zambrano G. (@armenzg)

        What the current list of buildbot builders is

        This becomes very easy with mozilla_ci_tools (aka mozci):
        >>> from mozci import mozci
        >>> builders = mozci.list_builders()
        >>> len(builders)
        15736
        >>> builders[0]
        u'Linux x86-64 mozilla-inbound leak test build'
        This and many other ways to interact with our CI will be showing up in the repository.


        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        February 03, 2015 07:48 PM

        Morgan Phillips (mrrrgn)

        shutdown -r never part deux

        In my last post, I wrote about how runner and cleanslate were being leveraged by Mozilla RelEng to try at eliminating the need for rebooting after each test/build job -- thus reclaiming a good deal of wasted time. Since then, I've had the opportunity to outfit all of our hosts with better logging, and collect live data which highlights the progress that's been made. It's been bumpy, but the data suggests that we have reduced reboots (across all tiers) by around 40% -- freeing up over 72,000 minutes of compute time per day, with an estimated savings of $51,000 per year.

        Note: this figure excludes decreases in end-to-end times, which are still waiting to be accurately measured.

        Collecting Data

        With Runner managing all of our utilities, an awesome opportunity for logging was presented: the ability to create something like a distributed ps. To take advantage of this, I wrote a "task hook" feature which passes task state to an external script. From there, I wrote a hook script which logs all of our data to an influxdb instance. With the influxdb hook in place, we can query to find out which jobs are currently running on hosts and what the results were of any jobs that have previously finished. We can also use it to detect rebooting.
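
        The hook itself is tiny. Roughly this shape (the measurement and field names are illustrative rather than the exact schema behind our dashboards, and it assumes the influxdb python client package):

        # rough sketch: runner invokes the hook with a task name and result,
        # and we write one point per invocation to influxdb
        import socket
        import sys

        from influxdb import InfluxDBClient  # assumes the influxdb python package

        def main(task, result):
            client = InfluxDBClient("influxdb.example.com", 8086, database="runner")
            client.write_points([{
                "measurement": "runner_tasks",
                "tags": {"host": socket.gethostname(), "task": task},
                "fields": {"result": result},
            }])

        if __name__ == "__main__":
            main(sys.argv[1], sys.argv[2])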

        Having this state information has been a real game changer with regards to understanding the pain points of our infrastructure, and debugging issues which arise. Here are a few of the dashboards I've been able to create:

        * a started buildbot task generally indicates that a job is active on a machine *


        * a global ps! *


        * spikes in task retries almost always correspond to a new infra problem; seeing it here first allows us to fix it and cut down on job backlogs *


        * we reboot after certain kinds of tests and anytime a job fails, thus testers reboot a lot more often *


        Costs/Time Saved Calculations

        To calculate "time saved" I used influxdb data to figure the time between a reboot and the start of a new round of tasks. Once I had this figure, I subtracted the total number of completed buildbot tasks from the number of reboots over a given period, then multiplied by the average reboot gap period. This isn't an exact method; but gives a ballpark idea of how much time we're saving.

        The data I'm using here was taken from a single 24-hour period (01/22/15 - 01/23/15). Spot checks have confirmed that this is representative of a typical day.



        I used Mozilla's AWS billing statement from December 2014 to calculate the average cost of spot/non-spot instances per hour:

        (non-spot) cost: $6802.03 time: 38614hr avg: $0.18/hr

        (spot) cost: $14277.72 time: 875936hr avg: $0.02/hr


        Finding opex/capex figures is not easy; however, I did discover the price of adding 200 additional OSX machines in 2015. Based on that, each mac's capex would be just over $2200.

        To calculate the "dollars saved" I broke the time saved into AWS (spot/non-spot) and OSX then multiplied it by the appropriate dollar/hour ratio. The results being: $6621.10 per year for AWS and a bit over 20 macs worth of increased throughput, valued at just over $44,000.

        You can see all of my raw data, queries, and helper scripts at this github repo: https://github.com/mrrrgn/build-no-reboots-data

        Why Are We Only Saving 40%?

        The short answer: not rebooting still breaks most test jobs. Turning off reboots without cleanslate resulted in nearly every test failing (thanks to ports being held onto by utilities used in previous jobs, lack of free memory, etc...). However, even with processes being reset, some types of state persist between jobs in places which are proving more difficult to debug and clean. Namely, anything which interacts with a display server.

        To take advantage of the jobs which are already working, I added a task, "post_flight.py," which decides whether or not to reboot a system after each runner loop. The decision is based partly on some "blacklists" for job/host names which always require a reboot, and partly on whether or not the previous test/build completed successfully. For instance, if I want all linux64 systems to reboot, I just add ".*linux64.*" to the hostname blacklist; if I want all mochi tests to coerce a reboot I add ".*mochitest.*" to the job name blacklist.
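
        The decision logic is roughly along these lines. This is a simplified sketch rather than the real post_flight.py; the function name and the way runner passes in the job result are my assumptions, and only the example regexes come from the paragraph above.

        import re
        import socket

        HOSTNAME_BLACKLIST = [r".*linux64.*"]   # host patterns that must always reboot
        JOBNAME_BLACKLIST = [r".*mochitest.*"]  # job patterns that must always reboot

        def should_reboot(job_name, job_succeeded):
            """Return True if this host should reboot before taking another job."""
            if not job_succeeded:
                return True  # a failed job always triggers a reboot
            host = socket.gethostname()
            if any(re.match(p, host) for p in HOSTNAME_BLACKLIST):
                return True
            if any(re.match(p, job_name) for p in JOBNAME_BLACKLIST):
                return True
            return False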

        Via blacklisting, I've been able to whittle away at breaking jobs in a controlled manner. Over time, as I/we figure out how to properly clean up after more complicated jobs I should be able to remove them from the blacklist and increase our savings.

        Why Not Use Containers?

        First of all, we have to support OSX and Windows (10-XP), where modern containers are not really an option. Second, there is a lot of technical inertia behind our buildbot-centric model (nearly a decade's worth, to be precise). That said, a new container-centric approach to building and testing has been created: Taskcluster. Another big part of my work will be porting some of our current builds to that system.

        What About Windows

        If you look closely at the runner dashboard screenshots you'll notice a "WinX" legend entry, but no line. It's also not included in my cost savings estimates. The reason is that our Windows Puppet deployment is still in beta; while runner works on Windows, I can't tweak it there. For now, I've handed runner deployment off to another team so that we can at least use it for logging. For the state of that issue, see bug 1055794.

        Future Plans

        Of course, continuing effort will be put into removing test types from the "blacklists" to further decrease our reboot percentage. That said, I'm also exploring some easier wins which revolve around optimizing our current suite of runner tasks: using less frequent reboots to perform expensive cleanup operations in bulk (i.e. only before a reboot), decreasing end-to-end times, etc.

        Concurrent to runner/no reboots I'm also working on containerizing Linux build jobs. If this work can be ported to tests it will sidestep the rebooting problem altogether -- something I will push to take advantage of asap.

        Trying to reverse the entropy of a machine which runs dozens of different job types in random order is a bit frustrating; but worthwhile in the end. Every increase in throughput means more money for hiring software engineers instead of purchasing tractor trailers of Mac Minis.

        February 03, 2015 05:53 PM

        January 27, 2015

        Justin Wood (Callek)

        Release Engineering does a lot…

        Hey Everyone,

        I spent a few minutes a week over the last month or two working on compiling a list of Release Engineering work areas. Included in that list is identifying which repositories we "own" and work in, as well as where these repositories are mirrored. (We have copies in hg.m.o, git.m.o and GitHub; some live exclusively in their home location.)

        This comes as we transition to a more uniform and modern design style and philosophy.

        My major takeaway here is we have A LOT of things that we do. (this list is explicitly excluding repositories that are obsolete and unused)

        So without further ado, I present our page ReleaseEngineering/Repositories

        You'll notice a few things about this: we have columns for Mirrors and RoR (Repository of Record). "Committable Location" was requested by Hal and is explicitly for cases where "Where we consider our important location the RoR, it may not necessarily be where we allow commits to".

        The other interesting thing is we have automatic population of travis and coveralls urls/status icons. This comes for free using some magic wiki templates I wrote.

        The other piece of note here is that the table is generated from a list of pages using "SemanticMediaWiki", so the links to the repositories can be populated with things like "where are the docs", "what applications use this repo", "who are suitable reviewers", etc. (all of those are TODO on the releng side so far).

        I’m hoping to be putting together a blog post at some point about how I chose to do much of this with mediawiki, however in the meantime should any team at Mozilla find this enticing and wish to have one for themselves, much of the work I did here can be easily replicated for your team, even if you don’t need/like the multiple repo location magic of our table. I can help get you setup to add your own repos to the mix.

        Remember, the only fields that are necessary are a repo name, the repo location, and owner(s). The last field can even be automatically filled in by a form on your page (see the end of Release Engineering's page for an example of that form).

        Reach out to me on IRC or E-mail (information is on my mozillians profile) if you desire this for your team and we can talk. If you don’t have a need for your team, you can stare at all the stuff Releng is doing and remember to thank one of us next time you see us. (or inquire about what we do, point contributors our way, we’re a friendly group, I promise.)

        January 27, 2015 11:11 PM

        January 22, 2015

        Armen Zambrano G. (@armenzg)

        Backed out - Pinning for Mozharness is enabled for the fx-team integration tree

        EDIT=We had to back out this change since it caused issues for PGO talos jobs. We will try again after further testing.

        Pinning for Mozharness [1] has been enabled for the fx-team integration tree.
        Nothing should be changing. This is a no-op change.

        We're still using the default mozharness repository and the "production" branch is what is being checked out. This has been enabled on Try and Ash for almost two months and all issues have been ironed out. You can tell whether a job is using Mozharness pinning if you see "repository_manifest.py" in its log.

        If you notice anything odd please let me know in bug 1110286.

        If by Monday we don't see anything odd happening, I would like to enable it for mozilla-central for a few days before enabling it on all trunk trees.

        Again, this is a no-op change, however, I want people to be aware of it.


        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        January 22, 2015 08:57 PM

        January 21, 2015

        Kim Moir (kmoir)

        Reminder: Releng 2015 submissions due Friday, January 23

        Just a reminder that submissions for the Releng 2015 conference are due this Friday, January 23. 

        It will be held on May 19, 2015 in Florence Italy.

        If you've done recent work in this area, we'd love to hear from you. Please consider submitting a talk!

        In addition, if you have colleagues that work in this space that might have interesting topics to discuss at this workshop, please forward this information. I'm happy to talk to people about the submission process or possible topics if there are questions.

        Il Duomo di Firenze by ©eddi_07, Creative Commons by-nc-sa 2.0


        I am on the committee organizing the Releng 2015 conference, which will be held on May 19, 2015 in Florence. The deadline for submitting papers is January 23, 2015.

        http://releng.polymtl.ca/RELENG2015/html/index.html

        If you have expertise in this area and would like to discuss your experience, send us a talk proposal!

        Please forward this request to your colleagues and to anyone interested in these topics. If there are any questions about the submission process or the discussion topics, don't hesitate to contact me.

        (Thanks Massimo for helping with the Italian translation).

        More information
        Releng 2015 web page
        Releng 2015 CFP now open

        January 21, 2015 08:36 PM

        January 16, 2015

        Nick Thomas (nthomas)

        Plans for 2015 – Revamping the Release Automation

        Mozilla's Release Engineering team has been through several major iterations of our "release automation", which is how we produce the bits for Firefox betas and releases. With each incarnation, the automation has become more reliable, supported more functionality, and reduced end-to-end time. If you go back a few years to Firefox 2.0 it took several days to prepare 40 or so locales and three platforms for a release; now it's less than half a day for 90 locales and four platforms. The last major rewrite was some time ago, so it's time to embark on a big revamp -- this time we want to reduce the end-to-end time significantly.

        Currently, when a code change lands in the repository (eg mozilla-beta) a large set of compile and test jobs are started. It takes about 5 hours for the slowest platform to complete an optimized build and run the tests, in part because we’re using Profile-Guided Optimization (PGO) and need to link XUL twice. Assuming the tests have passed, or been recognized as an intermittent failure, a Release Manager will kick off the release automation. It will tag the gecko and localization repositories, and a second round of compilation will start, using the official branding and other release-specific settings. Accounting for all the other release work (localized builds, source tarballs, updates, and so on) the automation takes 10 or more hours to complete.

        The first goal of the revamp is to avoid the second round of compilation, with all the loss of time and test coverage it brings. Instead, we’re looking at ‘promoting’ the builds we’ve already done (in the sense of rank, not marketing). By making some other improvements along the way, eg fast generation of partial updates using funsize, we may be able to save as much as 50% from the current wall time. So we’ll be able to ship fixes to beta users more often than twice a week, get feedback earlier in the cycle, and be more confident about shipping a new release. It’ll help us to ship security fixes faster too.

        We’re calling this ‘Build Promotion’ for short, and you can follow progress in Bug 1118794 and dependencies.

        January 16, 2015 10:08 AM

        January 10, 2015

        Hal Wine (hwine)

        ChatOps Meetup

        ChatOps Meetup

        This last Wednesday, I went to a meetup on ChatOps organized by SF DevOps, hosted by Geekdom (who also made recordings available), and sponsored by TrueAbility.

        I had two primary goals in attending: I wanted to understand what made ChatOps special, and I wanted to see how much was applicable to my current work at Mozilla. The two presentations helped me accomplish the first. I’m still mulling over the second. (Ironically, I had to shift focus during the event to clean up a deployment-gone-wrong that was very close to one of the success stories mentioned by Dan Chuparkoff.)

        My takeaway on why chatops works is that it is less about the tooling (although modern web services make it a lot easier), and more about the process. Like a number of techniques, it appears to be more successful when teams fully embrace their vision of ChatOps, and make implementation a top priority. Success is enhanced when the tooling supports the vision, and that appears to be what all the recent buzz is about – lots of new tools, examples, and lessons learned make it easier to follow the pioneers.

        What are the key differentiators?

        Heck, many teams use irc for operational coordination. There are scripts which automate steps (some workflows can be invoked from the web even). We’ve got automated configuration, logging, dashboards, and wikis – are we doing ChatOps?

        Well, no, we aren’t.

        Here are the differences I noted:
        • ChatOps requires everyone both agreeing and committing to a single interface to all operations. (The opsbot, like hubot, lita or Err.) Technical debt (non-conforming legacy systems) will be reworked to fit into ChatOps.
        • ChatOps requires focus and discipline. There are a small number of channels (chat rooms, MUC) that have very specific uses - and folks follow that. High signal to noise ratio. (No animated gifs in the deploy channel - that’s what the lolcat channel is for.)
        • A commitment to explicitly documenting all business rules as executable code.

        What do you get for giving up all those options and flexibility? Here were the "ah ha!" concepts for me:

        1. Each ChatOps room is a “shared console” everyone can see and operate. No more screen sharing over video, or “refresh now” coordination!

        2. There is a bot which provides the “facts” about the world. One view accessible by all.

        3. The bot is also the primary way folks interact and modify the system. And it is consistent in usage across all commands. (The bot extensions perform the mapping to whatever the backend needs. The code adapts, not the human!)

        4. The bot knows all and does all:
          • Where’s the documentation?
          • How do I do X?
          • Do X!
          • What is the status of system Y?
        5. The bot is “fail safe” - you can’t bypass the rules. (If you code in a bypass, well, you loaded that foot gun!)

        Thus everything is consistent and familiar for users, which helps during those 03:00 forays into a system you aren’t as familiar with. Nirvana ensues (remember, everyone did agree to drink the koolaid above).

        Can you get there from here?

        The speaker selection was great – Dan was able to speak to the benefits of committing to ChatOps early in a startup’s life. James Fryman (from StackStorm) showed a path for migrating existing operations to a ChatOps model. That pretty much brackets the range, so yeah, it’s doable.

        The main hurdle, imo, would be getting everyone to agree to that total commitment! There are some tensions in deploying such a system at a highly open operation like Mozilla: ideally ChatOps is open to everyone, and business rules ensure you can't do or see anything improper. That means the bot has (somewhere) the credentials to do some very powerful operations. (Dan hopes to get their company to the "no one uses ssh, ever" point.)

        My next steps? Still thinking about it a bit – I may load Err onto my laptop and try doing all my local automation via that.
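
        To make that concrete, here is roughly what a trivial Err (Errbot) plugin looks like, based on its documented plugin style. The command and its canned output are hypothetical, not something I have actually deployed.

        from errbot import BotPlugin, botcmd

        class Deploys(BotPlugin):
            """A toy ChatOps plugin: one shared command everyone in the room can see."""

            @botcmd
            def deploy_status(self, msg, args):
                # A real plugin would query the deployment system here;
                # this sketch just returns a canned answer.
                return "funsize-dev: last deploy OK at 2015-01-09 14:02 UTC"

        Everyone in the channel would then invoke it (something like !deploy_status) and see the same answer, which is exactly the "shared console" point above.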

        January 10, 2015 08:00 AM

        January 09, 2015

        Chris AtLee (catlee)

        Upcoming hotness from RelEng

        To kick off the new year, I'd like to share some of the exciting projects we have underway in Release Engineering.

        Balrog

        First off we have Balrog, our next generation update server. Work on Balrog has been underway for quite some time. Last fall we switched beta users to use it. Shortly after, we did some additional load testing to see if we were ready to flip over release traffic. The load testing revealed some areas that needed optimization, which isn't surprising since almost no optimization work had been done up to that point!

        Ben and Nick added the required caching, and our subsequent load testing was a huge success. We're planning on flipping the switch to divert release users over on January 19th. \o/

        Funsize

        Next up we have Funsize. (Don't ask about the name; it's all Laura's fault). Funsize is a web service to generate partial updates between two versions of Firefox. There are a number of places where we want to generate these partial updates, so wrapping the logic up into a service makes a lot of sense, and also affords the possibility of faster generation due to caching.

        We're aiming to have nightly builds use funsize for partial update generation this quarter.

        I'd really like to see us get away from the model where the "nightly build" job is responsible for not only the builds, but generating and publishing the complete and partial updates. The problem with this is that the single job is responsible for too many deliverables, and touches too many systems. It's hard to make and test changes in isolation.

        The model we're trying to move to is where the build jobs are responsible only for generating the required binaries. It should be the responsibility of a separate system to generate partials and publish updates to users. I believe splitting up these functions into their own systems will allow us to be more flexible in how we work on changes to each piece independently.

        S3 uploads from automation

        This quarter we're also working on migrating build and test files off our aging file server infrastructure (aka "FTP", which is a bit of a misnomer...) and onto S3. All of our build and test binaries are currently uploaded and downloaded via a central file server in our data center. It doesn't make sense to do this when most of our builds and tests are being generated and consumed inside AWS now. In addition, we can get much better cost-per-GB by moving the storage to S3.

        No reboots

        Morgan has been doing awesome work with runner. One of the primary aims here is to stop rebooting build and test machines between every job. We're hoping that by not rebooting between builds, we can get a small speedup in build times since a lot of the build tree should be cached in memory already. Also, by not rebooting we can have shorter turnaround times between jobs on a single machine; we can effectively save 3-4 minutes of overhead per job by not rebooting. There's also the opportunity to move lots of machine maintenance work from inside the build/test jobs themselves to instead run before buildbot starts.

        Release build promotion

        Finally I'd like to share some ideas we have about how to radically change how we do release builds of Firefox.

        Our plan is to create a new release pipeline that works with already built binaries and "promotes" them to the release/beta channel. The release pipeline we have today creates a fresh new set of release builds that are distinct from the builds created as part of continuous integration.

        This new approach should cut the amount of time required to release nearly in half, since we only need to do one set of builds instead of two. It also has the benefit of aligning the release and continuous-integration pipelines, which should simplify a lot of our code.

        ... and much more!

        This is certainly not an exhaustive list of the things we have planned for this year. Expect to hear more from us over the coming weeks!

        January 09, 2015 06:35 PM

        Ben Hearsum (bhearsum)

        UPDATED: New update server is going live for release channel users on Tuesday, January **20th**

        (This post has been updated with the new go-live date.)

        Our new update server software (codenamed Balrog) has been in development for quite a while now. In October of 2013 we moved Nightly and Aurora to it. This past September we moved Beta users to it. Finally, we're ready to switch the vast majority of our users over. We'll be doing that on the morning of Tuesday, January 20th. Just like when we switched nightly/aurora/beta over, this change should be invisible, but please file a bug or swing by #releng if you notice any issues with updates.

        Stick around if you’re interested in some of the load testing we did.


        Shortly after switching all of the Beta users to Balrog we did a load test to see if Balrog could handle the amount of traffic that the release channel would throw at it. With just 10% of the release traffic being handled, it blew up:

        We were pulling more than 150MBit/sec per web head from the database server, and saturating the CPUs completely. This caused very slow requests, to the point where many were just timing out. While we were hoping that it would just work, this wasn’t a complete surprise given that we hadn’t implemented any form of caching yet. After implementing a simple LRU cache on Balrog’s largest objects, we did another load test. Here’s what the load looked like on one web head:

        Once caching was enabled the load was practically non-existent. As we ramped up release channel traffic the load grew, but in a more or less linear (and very gradual) fashion. At around 11:35 on this graph we were serving all of the release channel traffic, and each web head was using a meager 50% of its CPU:

        I’m not sure what to call that other than winning.

        January 09, 2015 04:39 PM

        January 08, 2015

        Kim Moir (kmoir)

        Mozilla pushes - December 2014


        Here's December 2014's monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.

        Trends
        There was a low number of pushes this month.  I expect this is due to the Mozilla all-hands in Portland in early December where we were encouraged to meet up with other teams instead of coding :-) and the holidays at the end of the month for many countries.
        As a side note, in 2014 we had a total of 124,423 pushes, compared to 79,233 in 2013, which represents a growth rate of 57% this year.

        Highlights
        7836 pushes
        253 pushes/day (average)
        Highest number of pushes/day: 706 pushes on Dec 17, 2014
        15.25 pushes/hour (highest)

        General Remarks
        Try had around 46% of all the pushes
        The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 23% of all of the pushes

        Records
        August 2014 was the month with most pushes (13,090  pushes)
        August 2014 has the highest pushes/day average with 422 pushes/day
        July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
        October 8, 2014 had the highest number of pushes in one day with 715 pushes 







        January 08, 2015 05:14 PM

        January 06, 2015

        Armen Zambrano G. (@armenzg)

        Tooltool fetching can now use LDAP credentials from a file

        You can now fetch tooltool files by using an authentication file.
        All you have to do is append "--authentication-file file" to your tooltool fetching command.

        This is important if you want to use automation to fetch files from tooltool on your behalf.
        This was needed to allow Android test jobs to run locally since we need to download tooltool files for it.
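
        A hypothetical invocation, assuming the usual fetch command and a manifest (the manifest name and token path here are placeholders), would look something like:
        python tooltool.py fetch -m releng.manifest --authentication-file ~/.tooltool-token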


        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        January 06, 2015 04:45 PM

        January 05, 2015

        Armen Zambrano G. (@armenzg)

        Run Android test jobs locally

        You can now run Android test jobs on your local machine with Mozharness.

        As with any other developer-capable Mozharness script, all you have to do is run it with the developer config (--cfg developer_config.py) and point it at the binaries you want to test.

        An example for this is:
        python scripts/android_emulator_unittest.py --cfg android/androidarm.py \
          --test-suite mochitest-gl-1 --blob-upload-branch try \
          --download-symbols ondemand --cfg developer_config.py \
          --installer-url http://ftp.mozilla.org/pub/mozilla.org/mobile/nightly/latest-mozilla-central-android-api-9/en-US/fennec-37.0a1.en-US.android-arm.apk \
          --test-url http://ftp.mozilla.org/pub/mozilla.org/mobile/nightly/latest-mozilla-central-android-api-9/en-US/fennec-37.0a1.en-US.android-arm.tests.zip


        Here's the bug where the work happened.
        Here's the documentation on how to run Mozharness as a developer.

        Please file a bug under Mozharness if you find any issues.

        Here are some other related blog posts:


        Disclaimers

        Bug 1117954 - I think a different SDK or emulator version is needed to run Android API 10 jobs.

        I wish we ran all of our jobs in proper isolation!


        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        January 05, 2015 08:47 PM

        December 22, 2014

        Armen Zambrano G. (@armenzg)

        Run mozharness talos as a developer (Community contribution)

        Thanks to our contributor Simarpreet Singh from Waterloo we can now run a talos job through mozharness on your local machine (bug 1078619).

        All you have to add is the following:
        --cfg developer_config.py 
        --installer-url http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-trunk/firefox-37.0a1.en-US.linux-x86_64.tar.bz2

        To read more about running Mozharness locally go here.


        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        December 22, 2014 08:10 PM

        December 11, 2014

        Kim Moir (kmoir)

        Releng 2015 CFP now open

        Florence, Italy.  Home of beautiful architecture.

        Il Duomo di Firenze by ©runner310, Creative Commons by-nc-sa 2.0


        Delicious food and drink.

        Panzanella by © Pete Carpenter, Creative Commons by-nc-sa 2.0

        Caffè ristretto by © Marcelo César Augusto Romeo, Creative Commons by-nc-sa 2.0


        And next May, release engineering :-)

        The CFP for Releng 2015 is now open.  The deadline for submissions is January 23, 2015.  It will be held on May 19, 2015 in Florence Italy and co-located with ICSE 2015.   We look forward to seeing your proposals about the exciting work you're doing in release engineering!

        If you have questions about the submission process or anything else, please contact any of the program committee members. My email is kmoir and I work at mozilla.com.

        December 11, 2014 09:00 PM

        December 09, 2014

        Armen Zambrano G. (@armenzg)

        Running Mozharness in developer mode will only prompt once for credentials

        Thanks to Mozilla's contributor kartikgupta0909 we now only have to enter LDAP credentials once when running the developer mode of Mozharness.

        He accomplished it in bug 1076172.

        Thank you Kartik!


        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        December 09, 2014 09:43 PM

        December 08, 2014

        Armen Zambrano G. (@armenzg)

        Test mozharness changes on Try

        You can now push to your own mozharness repository (even a specific branch) and have it be tested on Try.

        A few weeks ago we developed mozharness pinning (aka mozharness.json), and recently we enabled it for Try. Read the blog post to learn how to make use of it.

        NOTE: This currently only works for desktop, mobile and b2g test jobs. More to come.
        NOTE: We only support named branches, tags or specific revisions. Do not use bookmarks, as they don't work.


        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        December 08, 2014 06:59 PM

        December 04, 2014

        Morgan Phillips (mrrrgn)

        shutdown -r never

        For the past month I've worked on achieving the effects of a reboot without actually doing one -- sort of a "virtual" reboot. This isn't a usual optimization, but in Mozilla's case it's likely to have a huge impact on performance.

        Mozilla build/test infrastructure is complex. The jobs can be expensive and messy. So messy that, for a while now, machines have been rebooted after completing tasks to ensure that environments remain fresh.

        This strategy works marvelously at preventing unnecessary failures, but it wastes a lot of resources. With reboots taking something like two minutes to complete, and at around 100k jobs per day, that adds up to a whopping 200,000 minutes of machine time every day. That's nearly five months - yikes!1

        Yesterday I began rolling out these "virtual" reboots for all of our Linux hosts, and it seems to be working well [edit: after a few rollbacks]. By next month I should also have it turned on for OSX and Windows machines.



        What does a "virtual" reboot look like?

        For starters [pun intended], each job requires a good amount of setup and teardown, so a sort of init system is necessary. To achieve this, a utility called runner has been created. Runner is a project that manages starting tasks in a defined order; if a task fails, the chain can be retried or halted. Many tasks that once lived in /etc/init.d/ are now managed by runner, including buildbot itself.
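
        Conceptually, the main loop looks something like the toy sketch below. This is not runner's actual code, just an illustration of the "ordered tasks, retry or halt on failure" idea; the task names and retry count are invented for the example.

        import subprocess

        TASKS = ["cleanup_tmp.sh", "cleanslate.sh", "start_buildbot.sh"]  # hypothetical task scripts
        MAX_RETRIES = 3

        def run_chain():
            """Run every task in order; retry the whole chain a few times, then halt."""
            for attempt in range(MAX_RETRIES):
                for task in TASKS:
                    if subprocess.call(["/bin/sh", task]) != 0:
                        print("task %s failed, retrying chain (attempt %d)" % (task, attempt + 1))
                        break  # restart the chain from the beginning
                else:
                    return True  # every task succeeded
            return False         # halt: the chain kept failing

        if __name__ == "__main__":
            run_chain()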



        Among runner's tasks are various scripts for cleaning up temporary files, starting/restarting services, and also a utility called cleanslate. Cleanslate resets a user's running processes to a previously recorded state.

        At boot, cleanslate takes a snapshot of all running processes; then, before each job, it kills any processes (by name) which weren't running when the system was fresh. This particular utility is key to maintaining stability and may be extended in the future to enforce other kinds of system state as well.
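
        The snapshot-and-kill idea can be sketched in a few lines of Python with psutil. Again, this illustrates the concept rather than cleanslate's actual implementation:

        import psutil

        def snapshot():
            """Record the names of every process running on a fresh system."""
            return {p.name() for p in psutil.process_iter()}

        def clean(baseline):
            """Kill (by name) anything that wasn't running when the snapshot was taken."""
            for p in psutil.process_iter():
                try:
                    if p.name() not in baseline:
                        p.kill()
                except psutil.NoSuchProcess:
                    pass  # the process exited on its own; nothing to do

        # At boot: baseline = snapshot(); before each job: clean(baseline)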



        The end result is this:

        old work flow

        Boot + init -> Take Job -> Reboot (2-5 min)

        new work flow

        Boot + Runner -> Take Job -> Shutdown Buildslave
        (runner loops and restarts slave)



        [1] What's more, this estimate does not take into account the fact that jobs run faster on a machine that's already "warmed up."

        December 04, 2014 06:54 PM

        December 03, 2014

        Kim Moir (kmoir)

        Mozilla pushes - November 2014

        Here's November's monthly analysis of the pushes to our Mozilla development trees.  You can load the data as an HTML page or as a json file.

        Trends
        Not a record-breaking month; in fact, we are down over 2,000 pushes from the previous month.

        Highlights
        10376 pushes
        346 pushes/day (average)
        Highest number of pushes/day: 539 pushes on November 12
        17.7 pushes/hour (average)

        General Remarks
        Try had around 38% of all the pushes, and gaia-try had about 30%. The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 23% of all the pushes.

        Records
        August 2014 was the month with most pushes (13,090  pushes)
        August 2014 has the highest pushes/day average with 422 pushes/day
        July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
        October 8, 2014 had the highest number of pushes in one day with 715 pushes    







        December 03, 2014 09:41 PM

        November 24, 2014

        Armen Zambrano G. (@armenzg)

        Pinning mozharness from in-tree (aka mozharness.json)

        Since mozharness came around 2-3 years ago, we have had the same issue: we test a mozharness change against the trunk trees, land it, and then get it backed out because it regresses one of the older release branches.

        This is due to the nature of the mozharness setup: once a change lands, all jobs start running the same code, no matter which branch a job is running on.

        I have recently landed some code that is now active on Ash (and soon on Try) that will read a manifest file that points your jobs to the right mozharness repository and revision. We call this process "pinning mozharness". In other words, we fix an external factor of our job execution.

        This will allow you to point your Try pushes to your own mozharness repository.

        In order to pin your jobs to a repository/revision of mozharness, you have to change a file called mozharness.json which indicates the following two values: the mozharness repository to use and the revision (or branch) to check out.
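
        As a rough illustration -- the actual key names in mozharness.json may differ from this guess -- the manifest is a small JSON file along these lines:

        {
            "repo": "https://hg.mozilla.org/build/mozharness",
            "revision": "production"
        }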


        This is a similar concept as talos.json introduced which locks every job to a specific revision of talos. The original version of it landed in 2011.

        Even though we have had a similar concept since 2011, that doesn't mean it was as easy to make it happen for mozharness. Let me explain a bit why:

        Coming up:
        • Enable on Try
        • Free up Ash and Cypress
          • They have been used to test custom mozharness patches and the default branch of Mozharness (pre-production)
        Long term:
        • Enable the feature on all remaining Gecko trees
          • We would like to see this run at scale for a bit before rolling it out
          • This will allow mozharness changes to ride the trains
        If you are curious, the patches are in bug 791924.

        Thanks to Rail for all his patch reviews and to Jordan for sparking me to tackle it.



        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        November 24, 2014 05:35 PM

        November 12, 2014

        Kim Moir (kmoir)

        Scaling capacity while saving cash

        There was a very interesting release engineering summit this Monday held in concert with LISA in Seattle. I was supposed to fly there this past weekend so I could give a talk on Monday, but late last week I became ill and was unable to go. That was very disappointing, because the summit looked really great and I was looking forward to meeting the other release engineers and learning about the challenges they face.

        Scale in the Market  ©Clint Mickel, Creative Commons by-nc-sa 2.0

        Although I didn't have the opportunity to give the talk in person, the slides for it are available on slideshare and my mozilla people account. The talk describes how we scaled our continuous integration infrastructure on AWS to handle double the number of pushes it handled in early 2013, all while reducing our monthly AWS bill by two thirds.

        Cost per push from Oct 2012 until Oct 2014. This does not include costs for on premise equipment. It reflects our monthly AWS bill divided by the number of monthly pushes (commits).  The chart reflects costs from October 2012-2014.

        Thank you to Dinah McNutt and the other program committee members for organizing this summit.  I look forward to watching the talks once they are online.

        November 12, 2014 07:34 PM

        Mozilla pushes - October 2014

        Here's the October 2014 monthly analysis of the pushes to our Mozilla development trees.  You can load the data as an HTML page or as a json file.

        Trends
        We didn't have a record-breaking month in terms of the number of pushes; however, we did have a daily record on October 8 with 715 pushes.

        Highlights
        12821 pushes, up slightly from the previous month
        414 pushes/day (average)
        Highest number of pushes/day: 715 pushes on October 8
        22.5 pushes/hour (average)

        General Remarks
        Try had around 39% of all the pushes, and gaia-try had about 31%. The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 21% of all the pushes

        Records
        August 2014 was the month with most pushes (13,090  pushes)
        August 2014 has the highest pushes/day average with 422 pushes/day
        July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
        October 8, 2014 had the highest number of pushes in one day with 715 pushes




        November 12, 2014 03:45 PM

        Morgan Phillips (mrrrgn)

        AirMozilla: Distrusting Our Own [Build] Infrastructure

        If you missed last week's AirMozilla broadcast: Why and How of Reproducible Builds: Distrusting Our Own Infrastructure For Safer Software Releases by the Tor Project, consider checking it out.

        The talk is an in depth look at how one can protect release pipelines from being owned by attacks which target build systems. Particularly, attacks where compromised compilers may be used to create unsafe binaries from safe source code.

        Meanwhile RelEng is underway, putting these ideas into practice.

        November 12, 2014 08:54 AM

        A Simple Trusting Trust Attack

        ....

        November 12, 2014 07:54 AM

        November 10, 2014

        Morgan Phillips (mrrrgn)

        A Note on Deterministic Builds

        Since I joined Mozilla's Release Engineering team I've had the opportunity to put my face into a firehose of interesting new knowledge and challenges. Maintaining a release pipeline for binary installers and updates used by a substantial portion of the Earth's population is a whole other kind of beast from ops roles where I've focused on serving some kind of SaaS or internal analytics infrastructure. It's really exciting!

        One of the most interesting problems I've seen getting attention lately is deterministic builds, that is, builds that produce the same sequence of bytes from source on a given platform at any time.

        What good are deterministic builds?

        For starters, they aid in detecting "Trusting Trust" attacks. That's where a compromised compiler produces malicious binaries from perfectly harmless source code by replacing certain patterns during compilation. It sort of defeats the whole security advantage of open source when you download binaries, right?

        Luckily for us users, a fellow named David A. Wheeler rigorously proved a method for circumventing this class of attacks altogether via a technique he coined "Diverse Double-Compiling" (DDC). The gist of it is, you compile a project's source code with a trusted tool chain then compare a hash of the result with some potentially malicious binary. If the hashes match you're safe.

        DDC also detects the less clever scenario where an adversary patches otherwise-open source code during the build process and serves up malware-ified packages. In either case, it's easy to see that this works if and only if builds are deterministic.

        Aside from security, they can also help projects that support many platforms take advantage of cross building with less stress. That is, one could compile arm packages on an x86_64 host then compare the results to a native build and make sure everything matches up. This can be a huge win for folks who want to cut back on infrastructure overhead.

        How can I make a project more deterministic?

        One bit of good news is, most compilers are already pretty deterministic (on a given platform). Take hello.c for example:

        #include <stdio.h>

        int main() {
            printf("Hello World!");
        }


        Compile that a million times and take the md5sum. Chances are you'll end up with a million identical md5sums. Scale that up to a million lines of code, and there's no reason why this won't hold true.
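
        One way to convince yourself of this is to build the same file twice and compare digests. A quick sketch, assuming gcc is on the PATH and hello.c sits in the working directory:

        import hashlib
        import subprocess

        def build_and_hash(source, out):
            subprocess.check_call(["gcc", "-o", out, source])
            with open(out, "rb") as f:
                return hashlib.md5(f.read()).hexdigest()

        # Two builds of the same source should yield identical digests
        # if the toolchain is deterministic on this platform.
        print(build_and_hash("hello.c", "hello1") == build_and_hash("hello.c", "hello2"))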

        However, take a look at this doozy:

        #include <stdio.h>

        int main() {
            printf("Hello from %s! @ %s", __FILE__, __TIME__);
        }


        Having timestamps and other platform specific metadata baked into source code is a huge no-no for creating deterministic builds. Compile that a million times, and you'll likely get a million different md5sums.

        In fact, in an attempt to make Linux more deterministic all __TIME__ macros were removed and the makefile specifies a compiler option (-Werror=date-time) that turns any use of it into an error.

        Unfortunately, removing all traces of such metadata in a mature code base could be all but impossible; however, a fantastic tool called gitian allows you to compile projects within a virtual environment where timestamps and other metadata are controlled.

        Definitely check gitian out and consider using it as a starting point.

        Another trouble spot to consider is static linking. Here, unless you're careful, determinism sits at the mercy of third parties. Be sure that your build system has access to identical libraries from anywhere it may be used. Containers and pre-baked vms seem like a good choice for fixing this issue, but remember that you could also be passing around a tainted compiler!

        Scripts that automate parts of the build process are also a potent breeding ground for non-deterministic behaviors. Take this python snippet for example:

        import os

        with open('manifest', 'w') as manifest:
            for dirpath, dirnames, filenames in os.walk("."):
                for filename in filenames:
                    manifest.write("{}\n".format(filename))


        The problem here is that os.walk will not always yield filenames in the same order. :(
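
        One possible fix, in the spirit of the advice below, is simply to sort everything before writing:

        import os

        with open('manifest', 'w') as manifest:
            for dirpath, dirnames, filenames in os.walk("."):
                dirnames.sort()  # os.walk honours in-place sorting, so the traversal order is stable
                for filename in sorted(filenames):
                    manifest.write("{}\n".format(filename))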

        One also has to keep in mind that certain data structures become very dangerous in such scripts. Consider this pseudo-python that auto generates some sort of source code in a compiled language:

        weird_mapping = dict(file_a=99, file_b=1)
        things_in_a_set = set([thing_a, thing_b, thing_c])
        for k, v in weird_mapping.items():
            ... generate some code ...
        for thing in things_in_a_set:
            ... generate some code ...


        A pattern like this would dash any hope that your project had of being deterministic because it makes use of unordered data structures.

        Beware of unordered data structures in build scripts and/or sort all the things before writing to files.

        Enforcing determinism from the beginning of a project's life cycle is the ideal situation, so, I would highly recommend incorporating it into CI flows. When a developer submits a patch it should include a hash of their latest build. If the CI system builds and the hashes don't match, reject that non-deterministic code! :)

        EOF

        Of course, this hardly scratches the surface on why deterministic builds are important; but I hope this is enough for a person to get started on. It's a very interesting topic with lots of fun challenges that need solving. :) If you'd like to do some further reading, I've listed a few useful sources below.

        https://blog.torproject.org/blog/deterministic-builds-part-two-technical-details

        https://wiki.debian.org/ReproducibleBuilds#Why_do_we_want_reproducible_builds.3F

        http://www.chromium.org/developers/testing/isolated-testing/deterministic-builds

        November 10, 2014 07:54 PM

        Justin Wood (Callek)

        Firefox Launches Developer Editon (Minor Papercut Issues)

        So, as you may have heard, Firefox is launching a dev edition.

        This post does not attempt to elaborate on that specifically too much, but it’s more to identify some issues I hit in early testing and the solutions to them.

        Theme

        While I do admire the changes of the Developer Edition Theme, I’m a guy who likes to stick with “what I know” more than a drastic change like that. What I didn’t realize was that this is possible out of the box in developer edition.

        After the Tour you get, you’ll want to open the Customize panel and then deselect “Use Firefox Developer Edition Theme” (see the following image — arrow added) and that will get you back to what you know.

        DevEditionTheme

        Sync

        As a longtime user, I had "Old Firefox Sync" enabled; this was the one that very few users enabled and even fewer used across devices.

        Firefox Developer Edition, however, creates a new profile (so you can use it alongside whatever Firefox version you want) and supports setting up only the "New" sync features. Because it creates a new profile, it also leaves you without your history or saved passwords.

        To sync my old profile with developer edition, I had to:

        1. Unlink my Desktop Firefox from old sync
        2. Unlink my Android Firefox from old sync
        3. Create a new sync account
        4. Link my old Firefox profile with new sync
        5. Link my Android with new sync
        6. Link Dev Edition with new sync
        7. Profit

        Now other than steps 6 and 7 (yea, how DO I profit?) this is all covered quite well in a SuMo article on the subject. I will happily help guide people through this process, especially in the near future, as I’ve just gone through it!

        (Special Thanks to Erik for helping to copy-edit this post)

        November 10, 2014 04:30 PM

        I’m a wordpress newbie

        If this is on planet.mozilla.org, and so is a “content is password protected” post below it, I’m sorry.

        The post is merely that way because it's unfinished, but I wanted to share it with a few others for early feedback.

        I’ll delete this post, and unhide that one once things are ready. (Sorry for any confusion)

        November 10, 2014 05:18 AM

        November 06, 2014

        Armen Zambrano G. (@armenzg)

        Setting buildbot up a-la-releng (Create your own local masters and slaves)

        buildbot is what Mozilla's Release Engineering uses to run the infrastructure behind tbpl.mozilla.org.
        buildbot assigns jobs to machines (aka slaves) through hosts called buildbot masters.

        All the different repositories and packages needed to set up buildbot are installed through Puppet, and I'm not aware of a way of setting up my local machine through Puppet (I doubt I would want to do that!).
        I managed to set this up a while ago by hand [1][2] (it was even more complicated in the past!); however, these one-off attempts were not easy to keep up-to-date and isolated.

        I recently landed a few scripts that make it trivial to set up as many buildbot environments as you want, all isolated from each other.

        All the scripts have been landed under the "community" directory under the "braindump" repository:
        https://hg.mozilla.org/build/braindump/file/default/community

        The main two scripts:

        If you call create_community_slaves_and_masters.sh with -w /path/to/your/own/workdir you will have everything set up for you. From there on, all you would have to do is this:
        • cd /path/to/your/own/workdir
        • source venv/bin/activate
        • buildbot start masters/test_master (for example)
        • buildslave start slaves/test_slave
        Each paired master and slave have been setup to talk to each other.

        I hope this is helpful for people out there. It's been great for me when I contribute patches for buildbot (bug 791924).

        As always in Mozilla, contributions are welcome!

        PS 1 = Only tested on Ubuntu. If you want to port this to other platforms, please let me know and I can give you a hand.

        PS 2 = I know that there is a repository with docker images called "tupperware"; however, I had been working on this set of scripts for a while. Perhaps someone wants to figure out how to set up a similar process through the docker images.



        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        November 06, 2014 02:02 PM

        November 05, 2014

        Massimo Gervasini (mgerva)

        Sign the hash of the bundle, not the full bundle!

        With Bug 1083683, we are stopping direct processing of .bundle and .source files by our signing servers. This means that in the near future we will not have new *.bundle.asc and *.source.tar.bz2.asc files on the ftp server.
        Bundles and source files have grown quite a bit, and getting them signed sometimes ends up in retries and failed jobs, disrupting and delaying the release process. There's also no benefit in having them signed directly; the source-package job already calculates the hashes of the bundle/source files, and their MD5/SHA1/SHA512 hashes get included in the .checksum file, which is signed with the release automation key.
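
        In other words, verification now goes through the signed checksums file rather than a detached signature per bundle. A rough sketch of the client side (the file name and the expected digest below are placeholders, not the exact layout on the ftp server):

        import hashlib

        def sha512_of(path):
            h = hashlib.sha512()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1024 * 1024), b""):
                    h.update(chunk)
            return h.hexdigest()

        # 1. Verify the GPG signature on the checksums file (not shown here).
        # 2. Confirm the bundle's digest matches the entry in that verified file.
        expected = "..."  # the SHA512 entry for the bundle, copied from the checksums file
        print(sha512_of("mozilla-release.bundle") == expected)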


        November 05, 2014 04:57 PM

        October 31, 2014

        Chris Cooper (coop)

        10.8 testing disabled by default on Try

        If you've recently submitted patches to the Mozilla Try server, you may have been dismayed by the turnaround time for your test results. Indeed, last week we had reports from some developers that they were waiting more than 24 hours to get results for a single Try push in the face of backlogs caused by tree closures.

        The chief culprit here was Mountain Lion, or OS X 10.8, which is our smallest pool (99) of test machines. It was not uncommon for there to be over 2,000 pending test jobs for Mountain Lion at any given time last week. Once we reach a pending count that high, we cannot make headway until the weekend when check-in volume drops substantially.

        In the face of these delays, developers started landing some patches on mozilla-inbound before the corresponding jobs had finished on Try and, worse still, not killing the obsolete pending jobs on Try. That's just bad hygiene and practice. Sheriffs had to actively look for the duplicate jobs and kill them to help decrease load.

        We cannot easily increase the size of the Mountain Lion pool. Apple does not allow you to install older OS X versions on new hardware, so our pool size here is capped at the number of machines we bought when 10.8 was released over 2 years ago or what we can scrounge from resellers.

        To improve the situation, we made the decision this week to disable 10.8 testing by default on Try. Developers must now select 10.8 explicitly from the “Restrict tests to platform(s)” list on TryChooser if they want to run Mountain Lion tests. If you have an existing Mac Try build that you’d like to back-fill with 10.8 results, please ping the sheriff on duty (sheriffduty) in #developers or #releng and they can help you out *without* incurring another full Try run.

        Please note that we do plan to stand up Yosemite (10.10) testing as a replacement for Mountain Lion early in 2015. This is a stop-gap measure until we’re able to do so.

        October 31, 2014 08:25 PM

        October 27, 2014

        Kim Moir (kmoir)

        Mozilla pushes - September 2014

        Here's September 2014's monthly analysis of the pushes to our Mozilla development trees.
        You can load the data as an HTML page or as a json file.


        Trends
        Surprise! No records were broken this month.

        Highlights
        12267 pushes
        409 pushes/day (average)
        Highest number of pushes/day: 646 pushes on September 10, 2014
        22.6 pushes/hour (average)

        General Remarks
        Try had around 36% of the pushes and Gaia-Try comprised about 32%. The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 22% of all the pushes.

        Records
        August 2014 was the month with most pushes (13,090  pushes)
        August 2014 has the highest pushes/day average with 422 pushes/day
        July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
        August 20, 2014 had the highest number of pushes in one day with 690 pushes





        October 27, 2014 09:11 PM

        Release Engineering in the classroom

        The second week of October, I had the pleasure of presenting lectures on release engineering to university students in Montreal as part of the PLOW lectures at École Polytechnique de Montréal.    Most of the students were MSc or PhD students in computer science, with a handful of postdocs and professors in the class as well. The students came from Montreal area universities and many were international students. The PLOW lectures consisted of several invited speakers from various universities and industry spread over three days.

        View looking down from the university

        Université de Montréal administration building

        École Polytechnique building. Each floor is painted a different colour to represent a different layer of the earth. So the ground floor is red, the next orange and finally green.

        The first day, Jack Jiang from York University gave a talk about software performance engineering.
        The second day, I gave a lecture on release engineering in the morning. The rest of the day we did a lot of labs to configure a Jenkins server to build and run tests on an open source project. Earlier that morning, I had set up m3.large instances for the students on Amazon that they could ssh into to conduct their labs. Along the way, I talked about some release engineering concepts. It was really interesting and I learned a lot from their feedback. Many of the students had not been exposed to release engineering concepts, so it was fun to share the information.

        Several students came up to me during the breaks and said "So, I'm doing my PhD in release engineering, and I have several questions for you", which was fun. Also, some of the students were making extensive use of code bases for Mozilla or other open source projects, so that was interesting to learn more about. For instance, one research project was looking at the evolution of multi-threading in the Mozilla code base, and another student was conducting Bugzilla comment sentiment analysis. Are angry bug comments correlated with fewer bug fixes? Looking forward to the results of this research!

        I ended the day by providing two challenge exercises to the students that they could submit answers to. One exercise was to set up a build pipeline in Jenkins for another open source project. The other challenge was to use the Jenkins REST API to query the Apache projects' Jenkins server and present some statistics on their build history. The results were pretty impressive!
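
        For anyone curious what that challenge involves, something along these lines is enough to pull build history out of Jenkins. The server URL and job name are placeholders, and the tree filter is just one convenient way to trim the response:

        import requests

        JENKINS = "https://builds.apache.org"  # placeholder URL for the Apache projects' Jenkins server
        JOB = "some-job-name"                  # placeholder job name

        url = "{}/job/{}/api/json".format(JENKINS, JOB)
        data = requests.get(url, params={"tree": "builds[number,result,duration]"}).json()

        results = [b.get("result") for b in data["builds"]]
        print("total builds:", len(results))
        print("successes:", results.count("SUCCESS"))
        print("failures:", results.count("FAILURE"))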

        My slides are on GitHub and the readme file describes how I setup the Amazon instances so Jenkins and some other required packages were installed before hand.  Please use them and distribute them if you are interested in teaching release engineering in your classroom.

        Lessons I learned from this experience:
        The third day there was a lecture by Michel Dagenais of Polytechnique Montréal on tracing heterogeneous cloud instances using a tracing framework for Linux. The Eclipse Trace Compass project also made an appearance in the talk. I always like to see Eclipse projects highlighted. One of his interesting points was that none of the companies that collaborate on this project wanted to sign a bunch of IP agreements so they could collaborate on it behind closed doors. They all wanted to collaborate via an open source community and source code repository. Another thing he emphasized was that students should make their work available on the web, via GitHub or other repositories, so they have a portfolio of work available. It was fantastic to see him promote the idea of students being involved in open source as a way to help their job prospects when they graduate!

        Thank you Foutse and  Bram  for the opportunity to lecture at your university!  It was a great experience!  Also, thanks Mozilla for the opportunity to do this sort of outreach to our larger community on company time!

        Also, I have a renewed respect for teachers and professors.  Writing these slides took so much time.  Many long nights for me, especially in the days leading up to the class.  Kudos to all of you who teach every day.

        References
        The slides are on GitHub and the readme file describes how I set up the Amazon instances for the labs.

        October 27, 2014 01:34 PM

        Beyond the Code 2014: a recap

        I started this blog post about a month ago and didn't finish it because well, life is busy.  

        I attended Beyond the Code last September 19.  I heard about it several months ago on Twitter.  A one-day conference about celebrating women in computing, in my home town, with a fantastic speaker line-up?  I signed up immediately.  In the opening remarks, we were asked for a show of hands to see how many of us were developers, in design, in product management, or students, and there was good representation from all those categories.  I was especially impressed by the number of students in the audience; it was nice to see so many of them taking time out of their busy schedules to attend.

        View of the Parliament Buildings and Chateau Laurier from the MacKenzie street bridge over the Rideau Canal
        Ottawa Conference Centre, location of Beyond the Code
         
        There were seven speakers, three workshop organizers, a lunch time activity, and a panel at the end. The speakers were all women, but not all white women or all heterosexual women.  There were many young women, not all industry veterans :-) like me.  To see this level of diversity at a tech conference filled me with joy.  Almost every conference I go to is very homogeneous in the make-up of the speakers and the audience.  To see ~200 tech women at a conference, and about 10% men (thank you for attending :-)), was quite a role reversal.

        I was completely impressed by the caliber of the speakers.  They were simply exceptional.

        The conference started out with Kronda Adair giving a talk on Expanding Your Empathy.  One of the things that struck me from this talk was how everyone lives in a bubble and doesn't see what others experience because of privilege.  She gave the example of how privilege is like a browser, and colours how we see the world.  For a straight white guy, a web page looks great when he's running the latest Chrome on OS X.  For a middle class black lesbian, the web page doesn't look as great because it's like she's running IE7; there is less inherent privilege.  For a "differently abled trans person of color", the world is like running IE6 in quirks mode.  This was a great example.  She also gave a shout out to the Ascend Project, which she and Lukas Blakk are running in the Mozilla Portland office.  Such an amazing initiative.

        The next speaker was Bridget Kromhout, who gave a talk about Platform Ops in the Public Cloud.
        I was really interested in this talk because we do a lot of scaling of our build infrastructure in AWS and wanted to see if she had faced similar challenges. She works at DramaFever, which she described as Netflix for Asian soap operas.  The most interesting point to me was that they used all AWS regions to host their instances, because they wanted users to be able to download from a region as geographically close to them as possible.  At Mozilla, we only use a couple of AWS regions, but more instances than DramaFever, so this was an interesting contrast in the services used. In addition, the monitoring infrastructure they use is quite complex.  Her slides are here.

        I was going to summarize the rest of the speakers but Melissa Jean Clark did an exceptional job on her blog.  You should read it!

        Thank you Shopify for organizing this conference.  It was great to meet so many brilliant women in the tech industry! I hope there is an event next year too!

        October 27, 2014 01:33 PM

        October 14, 2014

        Jordan Lund (jlund)

        This week in Releng - Oct 5th, 2014

        Major highlights:

        Completed work (resolution is 'FIXED'):


        In progress work (unresolved and not assigned to nobody):

        October 14, 2014 04:36 AM

        October 07, 2014

        Ben Hearsum (bhearsum)

        Redo 1.3 is released – now with more natural syntax!

        We’ve been using the functions packaged in Redo for a few years now at Mozilla. One of the things we’ve been striving for with it is the ability to write the most natural code possible. In its simplest form, everything is passed to retry as arguments: the callable that may raise, the exceptions to retry on, and a callable to run for cleanup before another attempt. As a result, we have a number of code blocks like this, which don’t feel very Pythonic:

        retry(self.session.request, sleeptime=5, max_sleeptime=15,
              retry_exceptions=(requests.HTTPError, 
                                requests.ConnectionError),
              attempts=self.retries,
              kwargs=dict(method=method, url=url, data=data,
                          config=self.config, timeout=self.timeout,
                          auth=self.auth, params=params)
        )
        

        It’s particularly unfortunate that you’re forced to let retry do your exception handling and cleanup – I find that it makes the code a lot less readable. It’s also not possible to do anything in a finally block, unless you wrap the retry in one.

        Recently, Chris AtLee discovered a new method of doing retries that results in much cleaner and more readable code. With it, the above block can be rewritten as:

        for attempt in retrier(attempts=self.retries):
            try:
                self.session.request(method=method, url=url, data=data,
                                     config=self.config,
                                     timeout=self.timeout, auth=self.auth,
                                     params=params)
                break
            except (requests.HTTPError, requests.ConnectionError), e:
                pass
        

        retrier simply handles the mechanics of tracking attempts and sleeping, leaving your code to do all of its own exception handling and cleanup – just as if you weren’t retrying at all. Note the break at the end of the try block: without it, self.session.request would run again even after it succeeded.
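
        To make those mechanics concrete, here is a minimal sketch of what a retrier-style generator could look like. This is an illustration rather than Redo’s actual implementation; only the attempts argument appears in the example above, and the other parameter names are assumptions:

        import time

        def retrier(attempts=5, sleeptime=10, max_sleeptime=300, sleepscale=1.5):
            """Yield once per attempt, sleeping with back-off between attempts.

            Illustrative sketch only: apart from `attempts` (used above), the
            parameter names are assumptions rather than Redo's documented API.
            """
            for attempt in range(attempts):
                yield attempt
                if attempt < attempts - 1:
                    time.sleep(sleeptime)
                    sleeptime = min(sleeptime * sleepscale, max_sleeptime)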

        I released Redo 1.3 with this new functionality this morning – enjoy!

        October 07, 2014 12:48 PM

        October 02, 2014

        Hal Wine (hwine)

        bz Quick Search

        October 02, 2014 07:00 AM

        September 29, 2014

        Jordan Lund (jlund)

        This Week In Releng - Sept 21st, 2014

        Major Highlights:

        Completed work (resolution is 'FIXED'):


        In progress work (unresolved and not assigned to nobody):

        September 29, 2014 06:08 PM

        This Week In Releng - Sept 7th, 2014

        Major Highlights

        Completed work (resolution is 'FIXED'):


        In progress work (unresolved and not assigned to nobody):

        September 29, 2014 05:44 PM

        September 25, 2014

        Armen Zambrano G. (@armenzg)

        Making mozharness easier to hack on and try support

        Yesterday, we presented a series of proposed changes to Mozharness at the bi-weekly meeting.

        We're mainly focused on making Mozharness easier for developers to hack on and allowing for further flexibility.
        We will initially focus on the testing side of the automation and lay the groundwork for further improvements down the line.

        The set of changes discussed for this quarter are:

        1. Move remaining set of configs to the tree - bug 1067535
          • This makes it easier to test harness changes on try
        2. Read more information from the in-tree configs - bug 1070041
          • This increases the number of harness parameters we can control from the tree
        3. Use structured output parsing instead of regular where it applies - bug 1068153
          • This is part of a larger goal where we make test reporting more reliable, easy to consume and less burdening on infrastructure
          • It establishes a uniform criterion for setting a job status based on a test result, one that depends on structured log data (json) rather than regex-based output parsing
          • "How does a test turn a job red or orange?"
          • We will then have a simple answer that is the same for all test harnesses (a minimal sketch of the idea follows after this list)
        4. Mozharness try support - bug 791924
          • This will allow us to lock which repo and revision of mozharness is checked out
          • This isolates mozharness changes to a single commit in the tree
          • This gives us try support for user repos (freedom to experiment with mozharness on try)
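
        To make #3 more concrete, here is the minimal sketch mentioned above of how a job status could be derived from structured (json-lines) log data instead of regex-based output parsing. The field names and status values are assumptions loosely modelled on mozlog-style logs, not the actual implementation:

        import json

        # Hypothetical job statuses, for illustration only.
        SUCCESS, TESTFAILED, EXCEPTION = "success", "testfailed", "exception"

        def job_status_from_structured_log(log_path):
            """Derive a job status from a json-lines structured log."""
            failures = errors = 0
            with open(log_path) as log:
                for line in log:
                    try:
                        record = json.loads(line)
                    except ValueError:
                        continue  # skip any non-json noise in the stream
                    action = record.get("action")
                    if action == "test_status" and \
                            record.get("status") != record.get("expected", record.get("status")):
                        failures += 1
                    elif action == "log" and record.get("level") in ("ERROR", "CRITICAL"):
                        errors += 1
            if errors:
                return EXCEPTION   # the "red" case
            if failures:
                return TESTFAILED  # the "orange" case
            return SUCCESS         # the "green" case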


        Even though we feel the pain of #4, we decided that #1 & #2 give developers immediate value, while for #4 we at least know our painful workarounds.
        I don't know if we'll complete #4 this quarter; however, we are committed to the first three.

        If you want to contribute to the longer-term vision of that proposal, please let me know.


        In the following weeks we will have more updates with regards to implementation details.


        Stay tuned!



        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        September 25, 2014 07:42 PM

        September 23, 2014

        Ben Hearsum (bhearsum)

        Stop stripping (OS X builds), it leaves you vulnerable

        While investigating some strange update requests on our new update server, I discovered that we have thousands of update requests from Beta users on OS X that aren’t getting an update, but should. After some digging I realized that most, if not all, of these are coming from users who have installed one of our official Beta builds and subsequently stripped out the architecture they do not need from it. In turn, this causes our builds to report in such a way that we don’t know how to serve updates for them.

        We’ll look at ways of addressing this, but the bottom line is that if you want to be secure: Stop stripping Firefox binaries!

        September 23, 2014 05:38 PM

        September 19, 2014

        Ben Hearsum (bhearsum)

        New update server has been rolled out to Firefox/Thunderbird Beta users

        Yesterday marked a big milestone for the Balrog project when we made it live for Firefox and Thunderbird Beta users. Those with a good long-term memory may recall that we switched Nightly and Aurora users over almost a year ago. Since then, we’ve been working on and off to get Balrog ready to serve Beta updates, which are quite a bit more complex than our Nightly ones. Earlier this week we finally got the last blocker closed, and we flipped it live yesterday morning, Pacific time. We have significantly (~10x) more Beta users than Nightly+Aurora, so it’s no surprise that we immediately saw a spike in traffic and load, but our systems stood up to it well. If you’re into this sort of thing, here are some graphs with spikey lines:
        The load average on 1 (of 4) backend nodes:

        The rate of requests to 1 backend node (requests/second):

        Database operations (operations/second):

        And network traffic to the database (MB/sec):

        Despite hitting a few new edge cases (mostly around better error handling), the deployment went very smoothly – it took less than 15 minutes to be confident that everything was working fine.

        While Nick and I are the primary developers of Balrog, we couldn’t have gotten to this point without the help of many others. Big thanks to Chris and Sheeri for making the IT infrastructure so solid, to Anthony, Tracy, and Henrik for all the testing they did, and to Rail, Massimo, Chris, and Aki for the patches and reviews they contributed to Balrog itself. With this big milestone accomplished we’re significantly closer to Balrog being ready for Release and ESR users, and retiring the old AUS2/3 servers.

        September 19, 2014 02:31 PM

        September 17, 2014

        Kim Moir (kmoir)

        Mozilla Releng: The ice cream

        A week or so ago, I was commenting in IRC that I was really impressed that our interns had such amazing communication and presentation skills.  One of the interns, John Zeller, said something like "The cream rises to the top", to which I replied "Releng: the ice cream of CS".  From there, the conversation went on to discuss what would be the best ice cream flavour to make to capture the spirit of Mozilla releng.  The consensus at the end was that Irish Coffee (coffee with whisky) with cookie dough chunks was the favourite.  A lot of people on the team like coffee, whisky makes it better, and who doesn't like cookie dough?

        I made this recipe over the weekend with some modifications.  I used the coffee recipe from the Perfect Scoop.  After it was done churning in the ice cream maker,  instead of whisky, which I didn't have on hand, I added Kahlua for more coffee flavour.  I don't really like cookie dough in ice cream but cooked chocolate chip cookies cut up with a liberal sprinkling of Kahlua are tasty.

        Diced cookies sprinkled with Kahlua

        Ice cream ready to put in freezer

        Finished product
        I have to say, it's quite delicious :-) If open source ever stops being fun, I'm going to start a dairy empire.  Not really. Now back to Bugzilla...

        September 17, 2014 01:43 PM

        September 16, 2014

        Armen Zambrano G. (@armenzg)

        Which builders get added to buildbot?

        To add/remove jobs on tbpl.mozilla.org, we have to modify buildbot-configs.

        You can learn how to make changes by looking at previous patches; however, there's a bit of an art to getting it right.

        I just landed a script that sets up buildbot for you inside a virtualenv; you can pass it a buildbot-configs patch and it will determine which builders get added or removed.

        You can run this by checking out braindump and running something like this:
        buildbot-related/list_builder_differences.sh -j path_to_patch.diff

        NOTE: This script does not check that the job has all the right parameters once live (e.g. you forgot to specify the mozharness config for it).
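
        The underlying idea is simple: generate the list of builder names the configs produce with and without the patch applied, and diff the two sets. A minimal sketch of that comparison step (the builder names below are made up, purely for illustration):

        def diff_builders(before_names, after_names):
            """Return (added, removed) builder names between two config states."""
            before, after = set(before_names), set(after_names)
            return sorted(after - before), sorted(before - after)

        if __name__ == "__main__":
            # Made-up builder names, purely for illustration.
            before = ["Linux mozilla-central build", "Linux mozilla-central leak test build"]
            after = ["Linux mozilla-central build", "Linux mozilla-central pgo-build"]
            added, removed = diff_builders(before, after)
            print("Added: {0}".format(added))
            print("Removed: {0}".format(removed))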

        Happy hacking!


        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        September 16, 2014 03:26 PM

        September 11, 2014

        Armen Zambrano G. (@armenzg)

        Run tbpl jobs locally with Http authentication (developer_config.py) - take 2

        Back in July, we deployed the first version of Http authentication for mozharness; however, under some circumstances the initial version could fail and affect production jobs.

        This time around we have:

        If you read How to run Mozharness as a developer you should see the new changes.

        As a quick reminder, it only takes 3 steps:

        1. Find the command from the log. Copy/paste it.
        2. Append --cfg developer_config.py
        3. Append --installer-url/--test-url with the right values
        To see a real example visit this


        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        September 11, 2014 12:45 PM

        September 10, 2014

        Kim Moir (kmoir)

        Mozilla pushes - August 2014

        Here's August 2014's monthly analysis of the pushes to our Mozilla development trees.  You can load the data as an HTML page or as a json file.



        Trends
        It was another record breaking month.  No surprise here!

        Highlights

        General Remarks
        Both Try and Gaia-Try have about 36% of the pushes each.  The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 21% of all the pushes.


        Records
        August 2014 was the month with the most pushes (13,090 pushes)
        August 2014 had the highest pushes/day average with 620 pushes/day
        July 2014 had the highest "pushes-per-hour" average with 23.51 pushes/hour
        August 20, 2014 had the highest number of pushes in one day with 690 pushes






        September 10, 2014 02:09 PM

        September 09, 2014

        Nick Thomas (nthomas)

        ZNC and Mozilla IRC

        ZNC is great for having a persistent IRC connection, but it’s not so great when the IRC server or network has a blip. Then you can end up failing to rejoin with

        nthomas (…) has joined #releng
        nthomas has left … (Max SendQ exceeded)

        over and over again.

        The way to fix this is to limit the number of channels ZNC can connect to simultaneously. In the Web UI, change the ‘Max Joins’ preference to something like 5. In the config file, use ‘MaxJoins = 5’ in a <User foo> block.

        September 09, 2014 10:19 AM

        September 08, 2014

        Jordan Lund (jlund)

        This Week In Releng - Sept 1st, 2014

        Major Highlights:

        Completed work (resolution is 'FIXED'):


        In progress work (unresolved and not assigned to nobody):

        September 08, 2014 04:58 AM

        September 06, 2014

        Hal Wine (hwine)

        New Hg Server Status Page

        New Hg Server Status Page

        Just a quick note to let folks know that the Developer Services team continues to make improvements on Mozilla’s Mercurial server. We’ve set up a status page to make it easier to check on current status.

        As we continue to improve monitoring and status displays, you’ll always find the “latest and greatest” on this page. And we’ll keep the page updated with recent improvements to the system. We hope this page will become your first stop whenever you have questions about our Mercurial server.

        September 06, 2014 07:00 AM

        September 01, 2014

        Nick Thomas (nthomas)

        Deprecating our old rsync modules

        We’ve removed the rsync modules mozilla-current and mozilla-releases today, after calling for comment a few months ago and hearing no objections. Those modules were previously used to deliver Firefox and other Mozilla products to end users via a network of volunteer mirrors but we now use content delivery networks (CDN). If there’s a use case we haven’t considered then please get in touch in the comments or on the bug.

        September 01, 2014 10:09 PM

        August 26, 2014

        Chris AtLee (catlee)

        Gotta Cache 'Em All

        TOO MUCH TRAFFIC!!!!

        Waaaaaaay back in February we identified overall network bandwidth as a cause of job failures on TBPL. We were pushing too much traffic over our VPN link between Mozilla's datacentre and AWS. Since then we've been working on a few approaches to cope with the increased traffic while at the same time reducing our overall network load. Most recently we've deployed HTTP caches inside each AWS region.

        Network traffic from January to August 2014

        The answer - cache all the things!

        Obligatory XKCD

        Caching build artifacts

        The primary target for caching was downloads of build/test/symbol packages by test machines from file servers. These packages are generated by the build machines and uploaded to various file servers. The same packages are then downloaded many times by different machines running tests. This was a perfect candidate for caching, since the same files were being requested by many different hosts in a relatively short timespan.

        Caching tooltool downloads

        Tooltool is a simple system RelEng uses to distribute static assets to build/test machines. While the machines do maintain a local cache of files, the caches are often empty because the machines are newly created in AWS. Having the files in local HTTP caches speeds up transfer times and decreases network load.
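
        In both cases the client-side pattern is the same: for known file hosts, try a region-local cache first, and fall back to the original URL if the cache is unavailable. Here is a minimal sketch of that idea; the cache host name and mapping are hypothetical, and the real client and deployment details live in the proxxy repo linked below:

        import requests

        try:
            from urllib.parse import urlparse   # Python 3
        except ImportError:
            from urlparse import urlparse       # Python 2

        # Hypothetical mapping from an origin host to a region-local cache host.
        CACHE_HOSTS = {
            "ftp.mozilla.org": "ftp.mozilla.org.cache.use1.example.internal",
        }

        def fetch(url, timeout=60):
            """Download url, preferring a region-local HTTP cache when one exists."""
            host = urlparse(url).netloc
            cache_host = CACHE_HOSTS.get(host)
            if cache_host:
                try:
                    response = requests.get(url.replace(host, cache_host, 1), timeout=timeout)
                    response.raise_for_status()
                    return response.content
                except requests.RequestException:
                    pass  # cache unavailable: fall back to the origin below
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response.content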

        Results so far - 50% decrease in bandwidth

        Initial deployment was completed on August 8th (end of week 32 of 2014). You can see by the graph above that we've cut our bandwidth by about 50%!

        What's next?

        There is some more low-hanging fruit for caching. We have internal PyPI repositories that could benefit from caches. There's a long tail of other miscellaneous downloads that could be cached as well.

        There are other improvements we can make to reduce bandwidth as well, such as moving uploads from build machines to be outside the VPN tunnel, or perhaps to S3 directly. Additionally, a big source of network traffic is doing signing of various packages (gpg signatures, MAR files, etc.). We're looking at ways to do that more efficiently. I'd love to investigate more efficient ways of compressing or transferring build artifacts overall; there is a ton of duplication between the build and test packages between different platforms and even between different pushes.

        I want to know MOAR!

        Great! As always, all our work has been tracked in a bug, and worked out in the open. The bug for this project is 1017759. The source code lives in https://github.com/mozilla/build-proxxy/, and we have some basic documentation available on our wiki. If this kind of work excites you, we're hiring!

        Big thanks to George Miroshnykov for his work on developing proxxy.

        August 26, 2014 02:21 PM

        August 18, 2014

        Jordan Lund (jlund)

        This week in Releng - Aug 11th 2014

        Completed work (resolution is 'FIXED'):


        In progress work (unresolved and not assigned to nobody):

        August 18, 2014 06:38 AM

        August 12, 2014

        Ben Hearsum (bhearsum)

        Upcoming changes to Mac package layout, signing

        Apple recently announced changes to how OS X applications must be packaged and signed in order for them to function correctly on OS X 10.9.5 and 10.10. The tl;dr version of this is “only mach-O binaries may live in .app/Contents/MacOS, and signing must be done on 10.9 or later”. Without any changes, future versions of Firefox will cease to function out-of-the-box on OS X 10.9.5 and 10.10. We do not have a release date for either of these OS X versions yet.

        Changes required:
        * Move all non-mach-O files out of .app/Contents/MacOS. Most of these will move to .app/Contents/Resources, but files that could legitimately change at runtime (eg: everything in defaults/) will move to .app/MozResources (which can be modified without breaking the signature): https://bugzilla.mozilla.org/showdependencytree.cgi?id=1046906&hide_resolved=1. This work is in progress, but no patches are ready yet.
        * Add new features to the client side update code to allow partner repacks to continue to work. (https://bugzilla.mozilla.org/show_bug.cgi?id=1048921)
        * Create and use 10.9 signing servers for these new-style apps. We still need to use our existing 10.6 signing servers for any builds without these changes. (https://bugzilla.mozilla.org/show_bug.cgi?id=1046749 and https://bugzilla.mozilla.org/show_bug.cgi?id=1049595)
        * Update signing server code to support new v2 signatures.

        Timeline:
        We are intending to ship the required changes with Gecko 34, which ships on November 25th, 2014. The changes required are very invasive, and we don’t feel that they can be safely backported to any earlier version quickly enough without major risk of regressions. We are still looking at whether or not we’ll backport to ESR 31. To this end, we’ve asked that Apple whitelist Firefox and Thunderbird versions that will not have the necessary changes in them. We’re still working with them to confirm whether or not this can happen.

        This has been cross posted a few places – please send all follow-ups to the mozilla.dev.platform newsgroup.

        August 12, 2014 05:05 PM

        August 11, 2014

        Jordan Lund (jlund)

        This Week In Releng - Aug 4th, 2014

        Major Highlights:

        Completed work (resolution is 'FIXED'):

        In progress work (unresolved and not assigned to nobody):

        August 11, 2014 01:09 AM