Planet Release Engineering

June 27, 2015

Chris Cooper (coop)

Releng & Relops weekly highlights - June 26, 2015

Friday, foxyeah!

It’s been a very busy and successful work week here in beautiful Whistler, BC. People are taking advantage of being in the same location to meet, plan, hack, and socialize. A special thanks to Jordan for inviting us to his place in beautiful Squamish for a BBQ!

(Note: No release engineering folks were harmed by bears in the making of this work week.)

tl;dr

Whistler: Keynotes were given by our exec team and we learned we’re focusing on quality, dating our users to get to know them better, and that WE’RE GOING TO SPACE!! We also discovered that at LEGO, Everything is Awesome now that they’re thinking around the box instead of inside or outside of it. Laura’s GoFaster project sounds really exciting, and we got a shoutout from her on the way we manage the complexity of our systems. There should be internal videos of the keynotes up next week if you missed them.

Internally, we talked about Q3 planning and goals, met with our new VP, David, met with our CEO, Chris, presented some lightning talks, and did a bunch of cross-group planning/hacking. Dustin, Kim, and Morgan talked to folks at our booth at the Science Fair. We had a cool banner and some cards (printed by Dustin) that we could hand out to tell people about try. SHIP IT!

Taskcluster: Great news: the TaskCluster team is joining us in Platform! There was lots of evangelism about TaskCluster and interest from a number of groups. There were some good discussions about operationalizing TaskCluster as we move towards using it for Firefox automation in production. Pete also demoed the Generic Worker!

Puppetized Windows in AWS: Rob got the nxlog puppet module done. Mark is working on hg and NSIS puppet modules in lieu of upgrading to MozillaBuild 2.0. Jake is working on the metric-collective module. The windows folks met to discuss the future of windows package management. Q is finishing up the performance comparison testing in AWS. Morgan, Mark, and Q deployed runner to all of the try Windows hosts and one of the build hosts.

Operational: Amy has been working on some additional nagios checks. Ben, Rail, and Nick met and came up with a solid plan for release promotion. Rail and Nick worked on releasing Firefox 39 and two versions of Firefox ESR. Hal spent much of the week working with IT. Dustin and catlee got some work done on migrating treestatus to relengapi. Hal, Nick, Chris, and folks from IT, sheriffs, and dev-services debugged problems with b2g jobs. Callek deployed a new version of slaveapi. Kim, Jordan, Chris, and Ryan worked on a plan for addons. Kim worked with some new buildduty folks to bring them up to speed on operational procedures.

Thank you all, and have a safe trip home!

And here are all the details:

Taskcluster

Puppetized Windows in AWS

Operational

See you next week!

June 27, 2015 03:19 PM

June 25, 2015

Aki Sasaki (aki)

on configuration

A few people have suggested I look at other packages for config solutions. I thought I'd record some of my thoughts on the matter. Let's look at requirements first.

Requirements

  1. Commandline argument support. When running scripts, it's much faster to specify some config via the commandline than always requiring a new config file for each config change.

  2. Default config value support. If a script assumes a value works for most cases, let's make it default, and allow for overriding those values in some way.

  3. Config file support. We need to be able to read in config from a file, and in some cases, several files. Some config values are either too long and unwieldy to pass via the commandline, and some config values contain characters that would be interpreted by the shell. Plus, the ability to use diff and version control on these files is invaluable.

  4. Multiple config file type support. json, yaml, etc.

  5. Adding the above three solutions together. The order should be: default config value -> config file -> commandline arguments. (The rightmost value of a configuration item wins; a minimal sketch follows this list.)

  6. Config definition and validation. Commandline options are constrained by the options that are defined, but config files can contain any number of arbitrary key/value pairs.

  7. The ability to add groups of commandline arguments together. Sometimes families of scripts need a common set of commandline options, but also need the ability to add script-specific options. Sharing the common set allows for consistency.

  8. The ability to add config definitions together. Sometimes families of scripts need a common set of config items, but also need the ability to add script-specific config items.

  9. Locking and/or logging any changes to the config. Changing config during runtime can wreak havoc on the debuggability of a script; locking or logging the config helps avoid or mitigate this.

  10. Python 3 support, and python 2.7 unicode support, preferably unicode-by-default.

  11. Standardized solution, preferably non-company and non-language specific.

  12. All-in-one solution, rather than having to use multiple solutions.
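As a minimal sketch of requirement 5 (my own illustration, not any particular package's API), layering the three sources in precedence order can be as simple as updating a dict left to right, rightmost wins:

import argparse
import json

def build_config(defaults, config_files, argv=None):
    """Layer config sources: defaults -> config files -> commandline."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--work-dir")
    parser.add_argument("--config-file", action="append", default=[])
    args = parser.parse_args(argv)

    config = dict(defaults)                       # lowest precedence
    for path in config_files + args.config_file:  # then config files, in order
        with open(path) as f:
            config.update(json.load(f))
    # Explicitly set commandline options win over everything else.
    cli = {k: v for k, v in vars(args).items()
           if v is not None and k != "config_file"}
    config.update(cli)
    return config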

Packages and standards

argparse

Argparse is the standardized python commandline argument parser, which is why configman and scriptharness have wrapped it to add further functionality. Its main drawbacks are lack of config file support and limited validation.

  1. Commandline argument support: yes. That's what it's written for.

  2. Default config value support: yes, for commandline options.

  3. Config file support: no.

  4. multiple config file type support: no.

  5. Adding the above three solutions together: no. The default config value and the commandline arguments are placed in the same Namespace, and you have to use the parser.get_default() method to determine whether it's a default value or an explicitly set commandline option.

  6. Config definition and validation: limited. It only covers commandline option definition+validation, and while there's the required flag, there's no "if foo is set, bar is required" type of validation. It's possible to roll your own, but that would be script-specific rather than part of the standard.

  7. Adding groups of commandline arguments together: yes. You can take multiple parsers and make them parent parsers of a child parser, as long as the parent parsers have specified add_help=False. (See the example after this list.)

  8. Adding config definitions together: limited, as above.

  9. The ability to lock/log changes to the config: no. argparse.Namespace will take changes silently.

  10. Python 3 + python 2.7 unicode support: yes.

  11. Standardized solution: yes, for python. No for other languages.

  12. All-in-one solution: no, for the above limitations.
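For example, requirement 7 works out of the box with argparse's parents mechanism (the option names here are just for illustration):

import argparse

# Common options shared by a family of scripts; add_help=False so the
# child parser can define its own -h/--help.
common = argparse.ArgumentParser(add_help=False)
common.add_argument("--work-dir", default=".")
common.add_argument("--config-file", action="append", default=[])

# A script-specific parser that inherits the common options.
parser = argparse.ArgumentParser(parents=[common])
parser.add_argument("--locale", action="append", default=[])

args = parser.parse_args(["--work-dir", "build", "--locale", "de"])
print(args.work_dir, args.locale)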

configman

Configman is a tool written to deal with configuration in various forms, and adds the ability to transform configs from one type to another (e.g., commandline to ini file). It also adds the ability to block certain keys from being saved or output. Its argparse implementation is deeper than scriptharness' ConfigTemplate argparse abstraction.

Its main drawbacks for scriptharness usage appear to be the lack of python 3 + py2-unicode-by-default support, and the fact that it's another non-standardized solution. I've given python3 porting two serious attempts so far, and I've hit a wall on the dotdict __getattr__ hack working differently on python 3. My wip is here if someone else wants a stab at it.

  1. Commandline argument support: yes.

  2. Default config value support: yes.

  3. Config file support: yes.

  4. Multiple config file type support: yes.

  5. Adding the above three solutions together: not as far as I can tell, but since you're left with the ArgumentParser object, I imagine it'll be the same solution to wrap configman as argparse.

  6. Config definition and validation: yes.

  7. Adding groups of commandline arguments together: yes.

  8. Adding config definitions together: not sure, but seems plausible.

  9. The ability to lock/log changes to the config: no. configman.namespace.Namespace will take changes silently.

  10. Python 3 support: no. Python 2.7 unicode support: there are enough str() calls that it looks like unicode is a second class citizen at best.

  11. Standardized solution: no.

  12. All-in-one solution: no, for the above limitations.

docopt

Docopt simplifies the commandline argument definition and prettifies the help output. However, it's purely a commandline solution, and it doesn't support adding groups of commandline options together, so it appears to be oriented towards relatively simple script configuration. It could potentially be combined with json-schema definition and validation, as could the argparse-based commandline solutions, for an all-in-two solution. More on that below.

json-schema

This looks very promising for an overall config definition + validation schema. The main drawback, as far as I can see so far, is the lack of commandline argument support.

A commandline parser could generate a config object to validate against the schema. (Bonus points for writing a function to validate a parser against the schema before runtime.) However, this would require at least two definitions: one for the schema, one for the hopefully-compliant parser. Alternately, the schema could potentially be extended to support argparse settings for various items, at the expense of full standards compatibility.

There's already a python jsonschema package.
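As a rough illustration of that glue (the schema and keys below are made up), a config dict built by any parser can be validated with the python jsonschema package:

import jsonschema

schema = {
    "type": "object",
    "properties": {
        "work_dir": {"type": "string"},
        "locales": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["work_dir"],
}

# e.g. vars(args) merged with parsed config files
config = {"work_dir": "build", "locales": ["de", "fr"]}
jsonschema.validate(config, schema)  # raises ValidationError on bad config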

  1. Commandline argument support: no.

  2. Default config value support: yes.

  3. Config file support: I don't think directly, but anything that can be converted to a dict can be validated.

  4. Multiple config file type support: no.

  5. Adding the above three solutions together: no.

  6. Config definition and validation: yes.

  7. Adding groups of commandline arguments together: no.

  8. Adding config definitions together: sure, you can add dicts together via update().

  9. The ability to lock/log changes to the config: no.

  10. Python 3 support: yes. Python 2.7 unicode support: I'd guess yes since it has python3 support.

  11. Standardized solution: yes, even cross-language.

  12. All-in-one solution: no, for the above limitations.

scriptharness 0.2.0 ConfigTemplate + LoggingDict or ReadOnlyDict

Scriptharness currently extends argparse and dict for its config. It checks off the most boxes in the requirements list currently. My biggest worry with the ConfigTemplate is that it isn't fully standardized, so people may be hesitant to port all of their configs to it.

An argparse/json-schema solution with enough glue code in between might be a good solution. I think ConfigTemplate is sufficiently close to that that adding jsonschema support shouldn't be too difficult, so I'm leaning in that direction right now. Configman has some nice behind the scenes and cross-file-type support, but the python3 and __getattr__ issues are currently blockers, and it seems like a lateral move in terms of standards.

An alternate solution may be BYOC. If the scriptharness Script takes a config object that you built from somewhere, and gives you tools that you can choose to use to build that config, that may allow for enough flexibility that people can use their preferred style of configuration in their scripts. The cost of that flexibility is familiarity between scriptharness scripts.

  1. Commandline argument support: yes.

  2. Default config value support: yes, both through argparse parsers and script initial_config.

  3. Config file support: yes. You can define multiple required config files, and multiple optional config files.

  4. Multiple config file type support: no. Mozharness had .py and .json. Scriptharness currently only supports json because I was a bit iffy about execfileing python again, and PyYAML doesn't always install cleanly everywhere. It's on the list to add more formats, though. We probably need at least one dynamic type of config file (e.g. python or yaml) or a config-file builder tool.

  5. Adding the above three solutions together: yes.

  6. Config definition and validation: yes.

  7. Adding groups of commandline arguments together: yes.

  8. Adding config definitions together: yes.

  9. The ability to lock/log changes to the config: yes. By default Scripts use a LoggingDict that logs runtime changes; StrictScript uses a ReadOnlyDict (same as mozharness) that prevents any changes after locking. (A minimal illustration follows this list.)

  10. Python 3 and python 2.7 unicode support: yes.

  11. Standardized solution: no. Extended/abstracted argparse + extended python dict.

  12. All-in-one solution: yes.
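As a minimal illustration of the locking idea in item 9 (not the actual scriptharness or mozharness implementation, which also guards update(), __delitem__(), and friends), a read-only dict just refuses writes once a lock flag is set:

class ReadOnlyDict(dict):
    """Dict that can be locked against further changes."""
    def __init__(self, *args, **kwargs):
        super(ReadOnlyDict, self).__init__(*args, **kwargs)
        self._locked = False

    def lock(self):
        self._locked = True

    def __setitem__(self, key, value):
        if self._locked:
            raise TypeError("config is locked; refusing to set %r" % key)
        super(ReadOnlyDict, self).__setitem__(key, value)

config = ReadOnlyDict(work_dir="build")
config.lock()
# config["work_dir"] = "elsewhere"  # would now raise TypeError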

Corrections, additions, feedback?

As far as I can tell there is no perfect solution here. Thoughts?




June 25, 2015 10:14 PM

June 21, 2015

Aki Sasaki (aki)

scriptharness 0.2.0

I've been getting some good feedback about scriptharness 0.1.0; thank you. I listed the 0.2.0 highlights and changes in the 0.2.0 Release Notes, but wanted to mention a few things here.

First, mozharness' config had the flexibility of accepting any arbitrary key/value pairs from various sources (initial_config, commandline options, config files...). However, it wasn't always clear what each config variable was for, or if it was required, or if the config was valid. I filed bug 699343 back in 2011, but didn't know how to tackle it then. I believe I have the solution now, with ConfigTemplates.

Second, 0.1.0 was definitely lacking run_command() and get_output_from_command() analogs. 0.2.0 has Command for just running+logging a command, ParsedCommand for parsing the output of a command, and Output for getting the output from a command, as well as run(), parse(), get_output(), and get_text_output() shortcut functions to instantiate the objects and run them for you. (Docs are here.) Each of these supports cross-platform output_timeouts and max_timeouts in both python 2.7 and python 3, thanks to the multiprocessing module. As a bonus, I was able to add context line support to the ErrorLists for ParsedCommand. This has also been a want since 2011.
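A rough usage sketch of those helpers (the import path and keyword names below are my guesses from the description above; the linked docs have the real signatures):

# Assumed module path; see the scriptharness 0.2.0 docs for the real one.
from scriptharness.commands import run, get_text_output

# Run and log a command, killing it if it goes silent for too long
# (output_timeout) or runs too long overall (max_timeout).
run(["make", "-j4"], output_timeout=300, max_timeout=3600)

# Capture a command's text output.
revision = get_text_output(["hg", "id", "-i"])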

I fleshed out some more documentation and populated the scriptharness issues with my todo list.

I think I know what I have in mind for 0.3.0, but feedback is definitely welcome!




June 21, 2015 08:40 AM

June 19, 2015

Chris Cooper (coop)

Releng & Relops weekly highlights - June 19, 2015

Happy Friday once again, releng enthusiasts!

The release engineering and operations teams are heads-down this week trying to get quarterly deliverables done *before* heading off to Whistler for a Mozilla-wide work week. There’s lots of work in-flight, although getting updates has occasionally been like pulling teeth.

Because almost everyone will be in Whistler next week, next week’s update will focus less on completed or in-progress work and more on what releng team members took away from their time together in Whistler.

tl;dr

Taskcluster: Morgan got 32-bit Linux builds working! Rail reports that funsize update generation is ready to go, pending an AWS whitelist update by IT. Ted reproduced mshal’s previous work to get OS X builds cross-compiling in one of Morgan’s existing desktop build containers.

Puppetized Windows in AWS: Jake and Rob are working on additional puppet modules for Windows. Q is running performance tests on jobs in AWS after the networking modifications mentioned last week.

Operational: MozillaBuild 2.0 is out! Mark deployed NSIS 3.0b1 to our windows build/try pools. Kim and Netops have finished up the SCL3 switch upgrades. Jake rolled out changes to enable statsd on our POSIX systems. Dustin’s talk on fwunit was accepted to LISA 15. Dustin merged all of the relengapi blueprints into a single repository and released relengapi 3.0.0.

Whistler: There’s been a bunch of planning around Whistler, and props to catlee, naveed, and davidb for getting our stuff into the sched site (the tag for releng/relops/ateam/relman is platform-ops). Be sure to take a look and pick some planning, presentation, and hacking sessions to go attend! http://juneworkweekwhistler2015.sched.org/type/platform-ops

Thank you all!

And here are all the details:

Taskcluster

Puppetized Windows in AWS

Operational

See you next week!

June 19, 2015 06:39 PM

June 16, 2015

Kim Moir (kmoir)

Test job reduction by the numbers

In an earlier post,  I wrote how we had reduced the amount of test jobs that run on two branches to allow us to scale our infrastructure more effectively.  We run the tests that historically identify regressions more often.  The ones that don't, we skip on every Nth push.  We now have data on how this reduced the number of jobs we run since we began implementation in April.

We run SETA on two branches (mozilla-inbound and fx-team) and on 18 types of builds. Collectively, these two branches represent about 20% of pushes each month. Implementing SETA allowed us to move from ~400 to ~240 jobs per push on these two branches.[1] We run the tests identified as not reporting regressions on every 10th commit, or after 90 minutes since the last time they were scheduled. We run the critical tests on every commit.[2]

Reduction in number of jobs per push on mozilla-inbound as SETA scheduling is rolled out

A graph for the fx-team branch shows a similar trend. It was a staged rollout starting in early April, as I enabled platforms and as the SETA data became available. The dip in early June reflects where I enabled SETA for Android 4.3.

This data will continue to be updated in our scheduling configuration as it evolves and is updated by the code that Joel and Vaibhav wrote to analyze regressions. The analysis identifies that there were

Jobs to ignore: 440
Jobs to run: 114
Total number of jobs: 554

which is significant. Our buildbot configurations are updated with the latest SETA data with every reconfig, which usually occurs every couple of days.

The platforms configured to run fewer tests for both opt and debug are

        MacOSX (10.6, 10.10)
        Windows (XP, 7, 8)
        Ubuntu 12.04 for linux32, linux64 and ASAN x64
        Android 2.3 armv7 API 9
        Android 4.3 armv7 API 11+

Additional info
[1] Tests may have been disabled or added at the same time; this is not taken into account.
[2] There are still some scheduling issues to be fixed; see bug 1174870 and bug 1174746 for further details.

June 16, 2015 08:52 PM

Armen Zambrano G. (@armenzg)

mozci 0.8.0 - New feature -- Trigger coalesced jobs + improved performance

Beware! This release is full of awesome!! However, at the same time new bugs might pop up, so please let us know :)
You can now trigger all coalesced jobs on a revision:
mozci-trigger --coalesced --repo-name REPO -r REVISION

Contributions

Thanks to @adusca @glandium @vaibhavmagarwal @chmanchester for their contributions on this release.

How to update

Run "pip install -U mozci" to update

Major highlights

  • #259 - New feature - Allow triggering coalesced jobs
  • #227 - Cache files as gzip files instead of uncompressed
    • Smaller disk footprint
  • #227 - Allow using multiple builders
  • 1e591bf - Make sure that we do not fetch files if they are not newer
    • We were failing to double-check that the last modification date of a file was the same as the one on the server
    • Hence, we were downloading files more often than needed
  • Caching builds-4hr on memory for improved performance

Minor improvements

  • f72135d - Backfilling did not accept dry_run or allow triggering more than once
  • More tests and documentation
  • Support for split tests (test packages json file)
  • Some OOP refactoring

All changes

You can see all changes in here:
0.7.3...0.8.0


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

June 16, 2015 05:38 PM

Chris AtLee (catlee)

RelEng 2015 (Part 2)

This is the second part of my report on the RelEng 2015 conference that I attended a few weeks ago. Don't forget to read part 1!

I've put up links to the talks whenever I've been able to find them.

Defect Prediction

Davide Giacomo Cavezza gave a presentation about his research into defect prediction in rapidly evolving software. The question he was trying to answer was whether it's possible to predict if any given commit is defective. He compared static models against dynamic models that adapt over the lifetime of a project. The models look at various metrics about the commit, such as:

  • Number of files the commit is changing
  • Entropy of the changes
  • The amount of time since the files being changed were last modified
  • Experience of the developer

His simulations showed that a dynamic model yields superior results to a static model, but that even a dynamic model doesn't give sufficiently accurate results.

I wonder about adopting something like this at Mozilla. Would a warning that a given commit looks particularly risky be a useful signal from automation?

Continuous Deployment and Schema Evolution

Michael de Jong spoke about a technique for coping with SQL schema evolution in continuously deployed applications. There are a few major issues with changing SQL schemas in production, including:

  • schema changes typically block other reads/writes from occurring
  • application logic needs to be synchronized with the schema change. If the schema change takes non-trivial time, which version of the application should be running in the meanwhile?

Michael's solution was to essentially create forks of all tables whenever the schema changes, and to identify each by a unique version. Applications need to specify which version of the schema they're working with. There's a special DB layer that manages the double writes to both versions of the schema that are live at the same time.

Securing a Deployment Pipeline

Paul Rimba spoke about securing deployment pipelines.

One of the areas of focus was on the "build server". He started with the observation that a naive implementation of a Jenkins infrastructure has the master and worker on the same machine, and so the workers have the opportunity to corrupt the state of the master, or of other job types' workspaces. His approach to hardening this part of the pipeline was to isolate each job type into its own docker worker. He also mentioned using a microservices architecture to isolate each task into its own discrete component - easier to secure and reason about.

We've been moving this direction at Mozilla for our build infrastructure as well, so it's good to see our decision corroborated!

Makefile analysis

Shurui Zhou presented some data about extracting configuration data from build systems. Her goal was to determine if all possible configurations of a system were valid.

The major problem here is make (of course!). Expressions such as $(shell uname -r) make static analysis impossible.

We had a very good discussion about build systems following this, and how various organizations have moved themselves away from using make. Somebody described these projects as requiring a large amount of "activation energy". I thought this was a very apt description: lots of upfront cost for little visible change.

However, people unanimously agreed that make was not optimal, and that the benefits of moving to a more modern system were worth the effort.

June 16, 2015 03:39 PM

June 15, 2015

Chris Cooper (coop)

Releng & Relops weekly highlights - June 12, 2015

Happy Monday!

Release Engineering has a lot going on. To help spread good news and keep everyone informed, we’re trying an experiment in communication.

Managers put together a list of what we all have been working on, highlighting wins in the last week or so. That list gets sent out to the public release-engineering mailing list and then gets reblogged here.

Please send feedback - did you learn anything, what else should we have included, and what topics might need some additional explanation or context.

tl;dr

Taskcluster: Morgan now has working opt, debug, PGO and ASan 64-bit Linux builds for TaskCluster. This work enables developers to experiment with linux try jobs on their local systems! Dustin debugged and confirmed that inter-region S3 transfers are capped at 1 MB/sec; he also stood up a relengapi proxy for accessing private files in tooltool. Rail just finished work on deploying signing workers for TaskCluster, important for deploying funsize.

Puppetized Windows in AWS: Mark debugged and worked around some annoying ACL and netsh bugs in puppet that were blocking forward progress on Windows puppetization. Amy and Rob are generating 2008R2 puppetized AMIs in AWS via cloud-tools.

Operational

Amy, Hal, Nick, Ben, and Kim responded to the many issues caused by the NetApp outage, including unexpected buildbot database corruption. Hal, Ben, Nick and Rail have been working hard to enable 38.0.6, in support of the spring release which included a bunch of features we needed to get out the door to users.

Q worked on making S3 uploads from Windows complete successfully and worked with Sheriffs to debug and fix a Start Screen problem on Windows 8 that was causing test failures. Jake is saving us money by retiring diamond (shout out to shutting things off!). Rob & Q ensured that all systems now send logs to Papertrail, EVEN XP!

Mike removed one of the last blockers to getting off FTP and over to S3: porting Android builds to mozharness. In addition to freeing us from the buildbot factories, this also uses TaskCluster’s index service for uploading artifacts to S3. Hal and Anhad (our amazing intern!) now have vcs-sync conversions all running in parallel from AWS. As part of moving mozharness in-tree, Jordan is doing the work to allow consumers to create and fetch mozharness bundles automatically from relengapi.

Thank you all!

And here are all the details:

Taskcluster

Puppetized Windows in AWS

Operational

June 15, 2015 05:04 PM

Brace yourselves

Memes.com

June 15, 2015 04:34 PM

June 12, 2015

Kim Moir (kmoir)

Mozilla pushes - May 2015

Here's May 2015's monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.


Trends

The number of pushes decreased from those recorded in the previous month (8894) with a total of 8363. 

 
Highlights

General Remarks

Records



June 12, 2015 07:28 PM

June 10, 2015

Morgan Phillips (mrrrgn)

service "Dockerized Firefox Builds" status

After 160 [work] days, 155 patches, and approximately 8,977 cups of coffee, I'm now wrapping up my third quarter with Mozilla -- the time has flown by! In these first eight months I've written some fun code, shaved enough yaks to knit a new wool sweater, and become acquainted with the ghosts of Mozilla RelEng past, present, and future, to great effect.


"Behold child, 'tis buildbot"
Quarter one was spent getting familiar with Mozilla RelEng's legacy: re-writing/taking ownership of existing services like clobberer. The second was all about optimizing our existing infrastructure, principally by rolling out runner. This one has been dedicated to new beginnings: porting jobs from our old buildbot-based CI infra to the shiny new TaskCluster-based one.

In moving [Linux] jobs from buildbot to TaskCluster, I've worked on docker containers which will build Firefox with all of the special options that RelEng needs. This is really cool because it means developers can download our images and work within them as well, thus creating parity between our CI infrastructure and their local environments (making it easier to debug certain bugs). So, what's my status update?

The good news: the container for Linux64 jobs is in tree, and working for both Desktop and Android builds!

The better news: these new jobs are already working in the Try tree! They're hidden in treeherder, but you can reveal them with the little checkbox in the upper right hand corner of the screen. You can also just use this link: https://treeherder.mozilla.org/#/jobs?repo=try&exclusion_profile=false

# note: These are running alongside the old buildbot jobs for now, and hidden. The container is still changing a few times a week (sometimes breaking jobs), so the training wheels will stay on like this for a little while.

The best news: You can run the same job that the try server runs, in the same environment simply by installing docker and running the bash script below.

Bonus: A sister 32 bit container will be coming along shortly.



#!/bin/bash -e
# WARNING: this is experimental, mileage may vary!

# Fetch docker image
docker pull mrrrgn/desktop-build:16

# Find a unique container name
export NAME='task-CCJHSxbxSouwLZE_mZBddA-container';

# Run docker command
docker run -ti \
  --name $NAME \
  -e TOOLTOOL_CACHE='/home/worker/tooltool-cache' \
  -e RELENGAPI_TOKEN='ce-n-est-pas-necessaire' \
  -e MH_BUILD_POOL='taskcluster' \
  -e MOZHARNESS_SCRIPT='mozharness/scripts/fx_desktop_build.py' \
  -e MOZHARNESS_CONFIG='builds/releng_base_linux_64_builds.py' \
  -e NEED_XVFB='true' \
  mrrrgn/desktop-build:16 \
  /bin/bash -c /home/worker/bin/build.sh

# Delete docker container
docker rm -v $NAME;

June 10, 2015 02:10 PM

June 03, 2015

Armen Zambrano G. (@armenzg)

mozci 0.7.2 - Support b2g jobs that still run on Buildbot

There are a lot of b2g (aka Firefox OS) jobs that still run on Buildbot.
Interestingly enough we had not tried before to trigger one with mozci.
This release adds support for it.
This should have been a minor release (0.8.0) rather than a security release (0.7.2). My apologies!
All jobs that start with "b2g_" in all_builders.txt are b2g jobs that still run on Buildbot instead of TaskCluster (docs - TC jobs on treeherder).


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

June 03, 2015 07:23 PM

mozci 0.7.1 - regression fix - do not look for files for running jobs

This release mainly fixes a regression we introduced in the release 0.7.0.
The change (#220) we introduced checked completed and running jobs for files that have been uploaded in order to trigger tests.
The problem is that running jobs do not have any metadata until they actually complete.
We fixed this on #234.

Contributions

Thanks to @adusca and @glandium for their contributions on this release.

How to update

Run "pip install -U mozci" to update

Major highlights

  • #234 - (bug fix) - Do not try to find files for running jobs
  • #228 - For try, only trigger talos jobs on existing build jobs, rather than triggering builds for platforms that were not requested
  • #238 - Read credentials through environment variables

Minor improvements

  • #226 - (bug fix) Properly cache downloaded files
  • #228 - (refactor) Move SCHEDULING_MANAGER
  • #231 - Doc fixes

All changes

You can see all changes in here:
0.7.0...0.7.1


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

June 03, 2015 07:14 PM

May 28, 2015

Armen Zambrano G. (@armenzg)

mozci 0.7.0 - Less network fetches - great speed improvements!

This release is not large in scope but it has many performance improvements.
The main improvement is that we have reduced the number of times that we fetch information, and use a cache where possible; the network cost was very high.
You can read more about it here: http://explique.me/cProfile

Contributions

Thanks to @adusca @parkouss @vaibhavmagarwal for their contributions on this release.

How to update

Run "pip install -U mozci" to update

Major highlights

  • Reduce drastically the number of requests by caching where possible
  • If a failed build has uploaded good files let's use them
  • Added support for retriggering and cancelling jobs
  • Retrigger a job once with a count of N instead of triggering individually N times

Minor improvements

  • Documentation updates
  • Add waffle.io badge

All changes

You can see all changes in here:
0.6.0...0.7.0


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

May 28, 2015 02:01 PM

May 27, 2015

Armen Zambrano G. (@armenzg)

Welcome adusca!

It is my privilege to announce that adusca (blog) joined Mozilla (since Monday) as an Outreachy intern for the next 4 months.

adusca has an outstanding number of contributions over the last few months including Mozilla CI Tools (which we're working on together).

Here's a bit about herself from her blog:
Hi! I’m Alice. I studied Mathematics in college. I was doing a Master’s degree in Mathematical Economics before getting serious about programming.
She is also a graduate of Hacker School.

Even though Alice has not been a programmer for many years, she has already shown lots of potential. For instance, she wrote a script to generate scheduling relations for buildbot; for this and many other reasons I tip my hat.

adusca will initially help me out with creating a generic pulse listener to handle job cancellations and retriggers for Treeherder. The intent is to create a way for Mozilla CI tools to manage scheduling on behalf of TH, pave the way for more sophisticated Mozilla CI actions, and allow other people to piggy-back on this pulse service and trigger their own actions.

If you have not yet had a chance to welcome her and get to know her, I highly encourage you to do so.

Welcome Alice!


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

May 27, 2015 05:10 PM

May 26, 2015

Chris AtLee (catlee)

RelEng 2015 (part 1)

Last week, I had the opportunity to attend RelEng 2015 - the 3rd International Workshop of Release Engineering. This was a fantastic conference, and I came away with lots of new ideas for things to try here at Mozilla.

I'd like to share some of my thoughts and notes I took about some of the sessions. As of yet, the speakers' slides aren't collected or linked to from the conference website. Hopefully they'll get them up soon! The program and abstracts are available here.

For your sake (and mine!) I've split up my notes into a few separate posts. This post covers the introduction and keynote.

tl;dr

"Continuous deployment" of web applications is basically a solved problem today. What remains is for organizations to adopt best practices. Mobile/desktop applications remain a challenge.

Cisco relies heavily on collecting and analyzing metrics to inform their approach to software development. Statistically speaking, quality is the best driver of customer satisfaction. There are many aspects to product quality, but new lines of code introduced per release gives a good predictor of how many new bugs will be introduced. It's always challenging to find enough resources to focus on software quality; being able to correlate quality to customer satisfaction (and therefore market share, $$$) is one technique for getting organizational support for shipping high quality software. Common release criteria such as bugs found during testing, or bug fix rate, are used to inform stakeholders as to the quality of the release.

Introductory Session

Bram Adams and Foutse Khomh kicked things off with an overview of "continuous deployment" over the last 5 years. Back in 2009 we were already talking about systems where pushing to version control would trigger tens of thousands of tests, and do canary deployments up to 50 times a day.

Today we see companies like Facebook demonstrating that continuous deployment of web applications is basically a solved problem. Many organizations are still trying to implement these techniques. Mobile [and desktop!] applications still present a challenge.

Keynote

Pete Rotella from Cisco discussed how he and his team measured and predicted release quality for various projects at Cisco. His team is quite focused on data and analytics.

Cisco has relatively long release cycles compared to what we at Mozilla are used to now. They release 2-3 times per year, with each release representing approximately 500kloc of new code. Their customers really like predictable release cycles, and also don't like releases that are too frequent. Many of their customers have their own testing / validation cycles for releases, and so are only willing to update for something they deem critical.

Pete described how he thought software projects had four degrees of freedom in which to operate, and how quality ends up being the one sacrificed most often in order to compensate for constraints in the others:

  • resources (people / money): It's generally hard to hire more people or find room in the budget to meet the increasing demands of customers. You also run into the mythical man month problem by trying to throw more people at a problem.

  • schedule (time): Having standard release cycles means organizations don't usually have a lot of room to push out the schedule so that features can be completed properly.

    I feel that at Mozilla, the rapid release cycle has helped us out to some extent here. The theory is that if your feature isn't ready for the current version, it can wait for the next release which is only 6 weeks behind. However, I do worry that we have too many features trying to get finished off in aurora or even in beta.

  • content (features): Another way to get more room to operate is to cut features. However, it's generally hard to cut content or features, because those are what customers are most interested in.

  • quality: Pete believes this is where most organizations steal resources from to make up for people/schedule/content constraints. It's a poor long-term play, and despite "quality is our top priority" being the Official Party Line, most organizations don't invest enough here. What's working against quality?

    • plethora of releases: lots of projects / products / special requests for releases. Attempts to reduce the # of releases have failed on most occasions.
    • monetization of quality is difficult. Pete suggests tying the cost of a poor quality release to lost customers: how many customers will we lose with a buggy release?
    • having RelEng and QA embedded in Engineering teams is a problem; they should be independent organizations so that their recommendations can have more weight.
    • "control point exceptions" are common. e.g. VP overrides recommendations of QA / RelEng and ships the release.

Why should we focus on quality? Pete's metrics show that it's the strongest driver of customer satisfaction. Your product's customer satisfaction needs to be more than 4.3/5 to get more than marginal market share.

How can RelEng improve metrics?

  • simple dashboards
  • actionable metrics - people need to know how to move the needle
  • passive - use existing data. everybody's stretched thin, so requiring other teams to add more metadata for your metrics isn't going to work.
  • standardized quality metrics across the company
  • informing engineering teams about risk
  • correlation with customer experience.

Interestingly, handling the backlog of bugs has minimal impact on customer satisfaction. In addition, there's substantial risk introduced whenever bugs are fixed late in a release cycle. There's an exponential relationship between new lines of code added and # of defects introduced, and therefore customer satisfaction.

Another good indicator of customer satisfaction is the number of "Customer found defects" - i.e. the number of bugs found and reported by their customers vs. bugs found internally.

Pete's data shows that if they can find more than 80% of the bugs in a release prior to it being shipped, then the remaining bugs are very unlikely to impact customers. He uses lines of code added for previous releases, and historical bug counts per version to estimate number of bugs introduced in the current version given the new lines of code added. This 80% figure represents one of their "Release Criteria". If less than 80% of predicted bugs have been found, then the release is considered risky.
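A toy version of that criterion (a simple proportional bugs-per-kloc model for illustration only; Pete's actual model is more involved, and the talk notes the real relationship is exponential):

def predicted_bugs(new_kloc, hist_kloc, hist_bugs):
    """Naive estimate: assume new bugs scale with new lines of code,
    at the historical bugs-per-kloc rate."""
    return new_kloc * (float(hist_bugs) / hist_kloc)

def release_is_risky(bugs_found, new_kloc, hist_kloc, hist_bugs, threshold=0.8):
    """Risky if we haven't yet found at least 80% of the bugs we
    expect this release to contain."""
    return bugs_found < threshold * predicted_bugs(new_kloc, hist_kloc, hist_bugs)

# e.g. 500 kloc of new code, a history of ~2 bugs/kloc, 700 bugs found in testing
print(release_is_risky(700, 500, 1500, 3000))  # True: expect ~1000, want >= 800 found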

Another "Release Criteria" Pete discussed was the weekly rate of fixing bugs. Data shows that good quality releases have the weekly bug fix rate drop to 43% of the maximum rate at the end of the testing cycle. This data demonstrates that changes late in the cycle have a negative impact on software quality. You really want to be fixing fewer and fewer bugs as you get closer to release.

I really enjoyed Pete's talk! There are definitely a lot of things to think about, and how we might apply them at Mozilla.

May 26, 2015 06:51 PM

May 25, 2015

Aki Sasaki (aki)

introducing scriptharness

I found myself missing mozharness at various points over the past 10 months. Several things kept me from using it at my then-new job:

I had wanted to address these issues for years, but never had time to devote fully to harness-specific development.

Now I do.

Introducing scriptharness 0.1.0:

I'm proud of this. I'm also aware it's not mature [yet], and it's currently missing some functionality.

There are some ideas I'd love to explore before 1.0.0:

I already have 0.2.0 on the brain. I'd love any feedback or patches.




May 25, 2015 08:32 PM

mozharness turns 5

Five years ago today, I landed the first mozharness commit in my user repo. (github)

starting something, or wasting my time. Log.py + a scratch trunk_nightly.json

The project had three initial goals:

Multi-locale Fennec became a reality, and then we started adding projects to mozharness, one by one.

As of last July, mozharness was the client-side engine for the majority of Mozilla's CI and release infrastructure. I still see plenty of activity in bugmail and IRC these days. I'll be the first to point out its shortcomings, but I think overall it has been a success.

Happy birthday, mozharness!




May 25, 2015 08:21 PM

May 15, 2015

Armen Zambrano G. (@armenzg)

mozci 0.6.0 - Trigger based on Treeherder filters, Windows support, flexible and encrypted password management

In this release of mozci we have a lot of developer-facing improvements, like Windows support and more flexible password management.
We also have our latest experimental script mozci-triggerbyfilters (http://mozilla-ci-tools.readthedocs.org/en/latest/scripts.html#triggerbyfilters-py).

How to update

Run "pip install -U mozci" to update.

Notice

We have moved all scripts from scripts/ to mozci/scripts/.
Note that you can now use "pip install" and have all scripts available as mozci-name_of_script_here in your PATH.

Contributions

We want to welcome @KWierso as our latest contributor!
Our gratitude to @Gijs for reporting the Windows issues and for all his feedback.
Congratulations to @parkouss for making https://github.com/parkouss/mozbattue the first project using mozci as its dependency.
In this release we had @adusca and @vaibhavmagarwal as our main and very active contributors.

Major highlights

  • Added script to trigger jobs based on Treeherder filters
    • This allows using filters like --include "web-platform-tests" and that will trigger all matching builders
    • You can also use --exclude to exclude builders you don't want
  • With the new trigger by filters script you can preview what will be triggered:
233 jobs will be triggered, do you wish to continue? y/n/d (d=show details) d
05/15/2015 02:58:17 INFO: The following jobs will be triggered:
Android 4.0 armv7 API 11+ try opt test mochitest-1
Android 4.0 armv7 API 11+ try opt test mochitest-2
  • Remove storing passwords in plain-text (Sorry!)
    • We now prompt the user to choose whether to store their password encrypted
  • When you use "pip install" we will also install the main scripts as mozci-name_of_script_here binaries
    • This makes it easier to use the binaries in any location
  • Windows issues
    • The python module gzip.py is incapable of decompressing large binaries
    • Do not store buildjson in a temp file and then move it

Minor improvements

  • Updated docs
  • Improve wording when triggering a build instead of a test job
  • Loosened up the python requirements from == to >=
  • Added filters to alltalos.py

All changes

You can see all changes in here:
0.5.0...0.6.0

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

May 15, 2015 08:13 PM

May 08, 2015

Armen Zambrano G. (@armenzg)

mozci 0.5.0 released - Store password in keyring, prevent corrupted data, progress bar and many small improvements

In this release we have many small improvements that help with issues we have found.

The main improvement is that we no longer store credentials in plain-text (sorry!) but use keyring to store them encrypted.

We also prevent partially downloading any data (corrupted data) and added progress bar to downloads.

Congrats to @chmanchester as our latest contributor!
Our usual and very appreciated contributions are by @adusca @jmaher and @vaibhavmagarwal

Minor improvements:
  • Lots of test changes and increased coverage
  • Do not use the root logger but a mozci logger
  • Allow passing custom files to a triggered job
  • Work around buildbot status corruptions (Issue 167)
  • Allow passing buildernames with lower case and removing trailing spaces (since we sometimes copy/paste from TH)
  • Added support to build a buildername based on trychooser syntax
  • Allow passing extra properties when scheduling a job on Buildbot
You can see all changes in here:
0.4.0...0.5.0

Link to official release notes.


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

May 08, 2015 10:15 PM

May 05, 2015

Chris Cooper (coop)

Automated reconfigs and you

In an effort to offload yet more work from buildduty, today I deployed scripts to automatically reconfig masters when relevant repos are updated. The process works as follows:

Pretty simple, right?

So how does this change affect you?

In practice, it doesn’t.

Unless you already have an environment set up to run the end_to_end_reconfig.sh script, you’ll probably still want to ask buildduty to perform the merge for you. The end_to_end_reconfig.sh script has been changed to *only* merge repos by default, but the script also updates the wiki and bugzilla, which is important for maintaining an audit trail.

However, if you *are* willing to perform these extra steps, you can have your changes automatically deployed within the hour.

The eventual goal is to become comfortable enough with our travis coverage that we can move the production tags automatically when the tests pass.

Our tests seem pretty solid now to me, but maybe others have thoughts about how aggressive or cautious we should be here.

https://bugzilla.mozilla.org/show_bug.cgi?id=978928

May 05, 2015 10:23 PM

May 02, 2015

Morgan Phillips (mrrrgn)

To Serve Developers

The neatest thing about release engineering is the fact that our pipeline forms the primary bridge between users and developers. On one end, we maintain the CI infrastructure that engineers rely on for thorough testing of their code, and, on the other end, we build stable releases and expose them for the public to download. Being in this position means that we have the opportunity to impact the experiences of both contributors and users by improving our systems (it also makes working on them a lot of fun).

Lately, I've become very interested in improving the developer experience by bringing our CI infrastructure closer to contributors. In short, I would like developers to have access to the same environments that we use to test/build their code. This will make it:
[The release pipeline from 50,000ft]

How?

The first part of my plan revolves around integrating release engineering's CI system with a tool that developers are already using: mach, starting with a utility called mozbootstrap -- a system that detects its host operating system and invokes a package manager to install all of the libraries needed to build firefox desktop or firefox android.

The first step here was to make it possible to automate the bootstrapping process (see bug 1151834, "allow users to bootstrap without any interactive prompts"), and then integrate it into the standing up of our own systems. Luckily, at the moment I'm also porting some of our Linux builds from buildbot to TaskCluster (see bug 1135206), which necessitates scrapping our old chroot-based build environments in favor of docker containers. This fresh start has given me the opportunity to begin this transition painlessly.

This simple change alone strengthens the interface between RelEng and developers, because now we'll be using the same packages (on a given platform). It also means that our team will be actively maintaining a tool used by contributors. I think it's a huge step in the right direction!

What platforms/distributions are you supporting?

Right now, I'm only focusing on Linux, though in the future I expect to support OSX as well. The bootstrap utility supports several distributions (Debian/Ubuntu/CentOS/Arch), though, I've been trying to base all of release engineering's new docker containers on Ubuntu 14.04 -- as such, I'd consider this our canonical distribution. Our old builders were based on CentOS, so it would have been slightly easier to go with that platform, but I'd rather support the platform that the majority of our contributors are using.

What about developers who don't use Ubuntu 14.04, and/or have a bizarre environment?

One fabulous side effect of using TaskCluster is that we're forced to create docker containers for running our jobs; in fact, they even live in mozilla-central. That being the case, I've started a conversation around integrating our docker containers into mozbootstrap, giving it the option to pull down a releng docker container in lieu of bootstrapping a host system.

On my own machine, I've been mounting my src directory inside of a builder and running ./mach build, then ./mach run within it. All of the source, object files, and executables live on my host machine, but the actual building takes place in a black box. This is a very tidy development workflow that's easy to replicate and automate with a few bash functions [which releng should also write/support].


[A simulation of how I'd like to see developers interacting with our docker containers.]

Lastly, as the final nail in the coffin of hard to reproduce CI bugs, I'd like to make it possible for developers to run our TaskCluster based test/build jobs on their local machines. Either from mach, or a new utility that lives in /testing.

If you'd like to follow my progress toward creating this brave new world -- or heckle me in bugzilla comments -- check out these tickets:

May 02, 2015 06:54 AM

May 01, 2015

Kim Moir (kmoir)

Mozilla pushes - April 2015

Here's April 2015's  monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.  


Trends
The number of pushes decreased from those recorded in the previous month with a total of 8894.  This is due to the fact that gaia-try is managed by taskcluster and thus these jobs don't appear in the buildbot scheduling databases anymore which this report tracks.


Highlights


General Remarks


Records




Note
I've changed the graphs to only track 2015 data.  Last month they were tracking 2014 data as well but it looked crowded so I updated them.  Here's a graph showing the number of pushes over the last few years for comparison.



May 01, 2015 04:44 PM

April 28, 2015

Kim Moir (kmoir)

Releng 2015 program now available

Releng 2015 will take place in concert with ICSE in Florence, Italy on May 19, 2015. The program is now available. Register here!

via romana in firenze by ©pinomoscato, Creative Commons by-nc-sa 2.0



April 28, 2015 07:54 PM

Less testing, same great Firefox taste!


Running a large continuous integration farm forces you to deal with many dynamic inputs coupled with capacity constraints. The number of pushes increases. People add more tests. We build and test on new platforms. If the number of machines available remains static, the computing time associated with a single push will increase. You can scale this for platforms that you build and test in the cloud (for us - Linux and Android on emulators), but this costs more money. Adding hardware for other platforms such as Mac and Windows in data centres is also costly and time consuming.

Do we really need to run every test on every commit? If not, which tests should be run? How often do they need to be run in order to catch regressions in a timely manner (i.e. still be able to bisect where the regression occurred)?


Several months ago, jmaher and vaibhav1994, wrote code to analyze the test data and determine the minimum number of tests required to run to identify regressions.  They named their software SETA (search for extraneous test automation). They used historical data to determine the minimum set of tests that needed to be run to catch historical regressions.  Previously, we coalesced tests on a number of platforms to mitigate too many jobs being queued for too few machines.  However, this was not the best way to proceed because it reduced the number of times we ran all tests, not just less useful ones.  SETA allows us to run a subset of tests on every commit that historically have caught regressions.  We still run all the test suites, but at a specified interval. 

SETI – The Search for Extraterrestrial Intelligence by ©encouragement, Creative Commons by-nc-sa 2.0
In the last few weeks, I've implemented SETA scheduling in our buildbot configs to use the data from the analysis that Vaibhav and Joel implemented. Currently, it's implemented on the mozilla-inbound and fx-team branches, which in aggregate represent around 19.6% (March 2015 data) of total pushes to the trees. The platforms configured to run fewer tests for both opt and debug are

As we gather more SETA data for newer platforms, such as Android 4.3, we can implement SETA scheduling for them as well and reduce our test load. We continue to run the full suite of tests on all platforms for branches other than m-i and fx-team, such as mozilla-central, try, and the beta and release branches. If we did miss a regression by reducing the tests, it would appear on other branches such as mozilla-central. We will continue to update our configs to incorporate SETA data as it changes.

How does SETA scheduling work?
We specify the tests that we would like to run on a reduced schedule in our buildbot configs.  For instance, this specifies that we would like to run these debug tests on every 10th commit or if we reach a timeout of 5400 seconds between tests.

http://hg.mozilla.org/build/buildbot-configs/file/2d9e77a87dfa/mozilla-tests/config_seta.py#l692
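Illustratively (the real format lives in config_seta.py linked above; the keys here are just a guess at the shape), each affected suite boils down to a skip count plus a timeout:

# Hypothetical shape of a SETA entry: run this suite only on every 10th
# push, or anyway once 5400 seconds have passed since it last ran.
SETA_CONFIG = {
    "mochitest-3 debug": {
        "skip_count": 10,      # run on every Nth push
        "skip_timeout": 5400,  # seconds before forcing a run regardless
    },
}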


Previously, catlee had implemented a scheduler in buildbot that allowed us to coalesce jobs on a certain branch and platform using EveryNthScheduler. However, as it was originally implemented, it didn't allow us to specify tests to skip, such as mochitest-3 debug on MacOSX 10.10 on mozilla-inbound. It would only allow us to skip all the debug or opt tests for a certain platform and branch.

I modified misc.py to parse the configs and create a dictionary for each test specifying the interval at which the test should be skipped and the timeout interval. If a test has these parameters specified, it is scheduled using the EveryNthScheduler instead of the default scheduler.

http://hg.mozilla.org/build/buildbotcustom/file/728dc76b5ad0/misc.py#l2727
There are still some quirks to work out, but I think it is working well so far. I'll have some graphs in a future post showing how this reduced our test load.

Further reading
Joel Maher: SETA – Search for Extraneous Test Automation



April 28, 2015 06:47 PM

April 27, 2015

Armen Zambrano G. (@armenzg)

mozci hackday - Friday May 1st, 2015

I recently blogged about mozci and I was pleasantly surprised that people are curious about it.

I want to spend Friday fixing some issues on the tool and I wonder if you would like to join me to learn more about it and help me fix some of them.

I will be available as armenzg_mozci from 9 to 5pm EDT on IRC (#ateam channel).
I'm happy to jump on Vidyo to give you a hand understanding mozci.

I hand picked some issues that I could get a hand with.
Documentation and definition of the project in readthedocs.


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

April 27, 2015 05:30 PM

April 24, 2015

Armen Zambrano G. (@armenzg)

What Mozilla CI tools is and what it can do for you (aka mozci)

Mozci (Mozilla CI tools) is a Python library, set of scripts, and package that allows you to trigger jobs on treeherder.mozilla.org.
Not all jobs can be triggered, only those that run on Release Engineering's Buildbot setup. Most (if not all) Firefox desktop and Firefox for Android jobs can be triggered. I believe some B2G jobs can still be triggered.

NOTE: Most B2G jobs are not supported yet since they run on TaskCluster. Support for them will be added this quarter.

Using it

Once you check out the code:
git clone https://github.com/armenzg/mozilla_ci_tools.git
python setup.py develop
you can run scripts like this one (click here for other scripts):
python scripts/trigger.py \
  --buildername "Rev5 MacOSX Yosemite 10.10 fx-team talos dromaeojs" \
  --rev e16054134e12 --times 10
which would trigger a specific job 10 times.

NOTE: This works regardless of whether a build job exists to trigger the test job; mozci will trigger everything required to get you what you need.

Another of the many options: if you want to trigger the same job for the last X revisions, use --back-revisions X.

There are many use cases and options listed here.


A use case for developers

One use case which could be useful to developers (thanks @mike_conley!) is if you pushed to try and used this try syntax: "try: -b o -p win32 -u mochitests -t none". Unfortunately, you later determine that you really need this one: "try: -b o -p linux64,macosx64,win32 -u reftest,mochitests -t none".

In normal circumstances you would push again to the try server; however, with mozci (once someone implements this), we could simply pass the new syntax to a script (or to ./mach) and trigger everything that you need, rather than having to push again and waste resources and your time!

If you have other use cases, please file an issue here.

If you want to read about the definition of the project, vision, use cases or FAQ please visit the documentation.


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

April 24, 2015 08:23 PM

Firefox UI update testing

We currently trigger UI update tests manually for Firefox releases. There are automated headless update verification tests, but they don't test the UI of Firefox.

The goal is to integrate this UI update testing as part of the Firefox releases.
This will require changes to firefox-ui-tests, buildbot scheduling, Marionette, and other Mozbase packages. The ultimate goal is to speed up our turnaround on releases.

The update testing code was recently ported from Mozmill to use Marionette to drive the testing.

I've already written some documentation on how to run the update verification using Release Engineering configuration files. You can use my tools repository until the code lands (update_testing is the branch to be used).

My deliverable is to ensure that the update testing works reliably on Release Engineering infrastructure and that scheduling code exists for it.

You can read more about this project in bug 1148546.


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

April 24, 2015 02:42 PM

April 21, 2015

Nick Thomas (nthomas)

Changes coming to ftp.mozilla.org

ftp.mozilla.org has been around for a long time in the world of Mozilla, dating back to the original source release in 1998. Originally it was a single server, but it’s grown into a cluster storing more than 60TB of data, and serving more than a gigabit/s in traffic. Many projects store their files there, and there must be a wide range of ways that people use the cluster.

This quarter there is a project in the Cloud Services team to move ftp.mozilla.org (and related systems) to the cloud, which Release Engineering is helping with. It would be very helpful to know what functionality people are relying on, so please complete this survey to let us know. Thanks!

April 21, 2015 02:47 AM

April 20, 2015

Chris AtLee (catlee)

RelEng Retrospective - Q1 2015

RelEng had a great start to 2015. We hit some major milestones on projects like Balrog and were able to turn off some old legacy systems, which is always an extremely satisfying thing to do!

We also made some exciting new changes to the underlying infrastructure, got some projects off the drawing board and into production, and drastically reduced our test load!

Firefox updates

Balrog


All Firefox update queries are now being served by Balrog! Earlier this year, we switched all Firefox update queries off of the old update server, aus3.mozilla.org, to the new update server, codenamed Balrog.

Already, Balrog has enabled us to be much more flexible in handling updates than the previous system. As an example, in bug 1150021, the About Firefox dialog was broken in the Beta version of Firefox 38 for users with RTL locales. Once the problem was discovered, we were able to quickly disable updates just for those users until a fix was ready. With the previous system it would have taken many hours of specialized manual work to disable the updates for just these locales, and to make sure they didn't get updates for subsequent Betas.

Once we were confident that Balrog was able to handle all previous traffic, we shut down the old update server (aus3). aus3 was also one of the last systems relying on CVS (!! I know, rite?). It's a great feeling to be one step closer to axing one more old system!

Funsize

When we started the quarter, we had an exciting new plan for generating partial updates for Firefox in a scalable way.

Then we threw out that plan and came up with an EVEN MOAR BETTER plan!

The new architecture for funsize relies on Pulse for notifications about new nightly builds that need partial updates, and uses TaskCluster for doing the generation of the partials and publishing to Balrog.

The current status of funsize is that we're using it to generate partial updates for nightly builds, but they're not being published to the regular nightly update channel yet.

There's lots more to say here...stay tuned!

FTP & S3

Brace yourselves... ftp.mozilla.org is going away...


...in its current incarnation at least.

Expect to hear MUCH more about this in the coming months.

tl;dr is that we're migrating as much of the Firefox build/test/release automation to S3 as possible.

The existing machinery behind ftp.mozilla.org will be going away near the end of Q3. We have some ideas of how we're going to handle migrating existing content, as well as handling new content. You should expect that you'll still be able to access nightly and CI Firefox builds, but you may need to adjust your scripts or links to do so.

Currently we have most builds and tests doing their transfers to/from S3 via the TaskCluster index, in addition to doing parallel uploads to ftp.mozilla.org. We're aiming to shut off most uploads to ftp this quarter.

Please let us know if you have particular systems or use cases that rely on the current host or directory structure!

Release build promotion

Our new Firefox release pipeline got off the drawing board, and the initial proof-of-concept work is done.

The main idea here is to take an existing build based on a push to mozilla-beta, and to "promote" it to a release build. So we need to generate all the l10n repacks, partner repacks, generate partial updates, publish files to CDNs, etc.

The big win here is that it cuts our time-to-release nearly in half, and also simplifies our codebase quite a bit!

Again, expect to hear more about this in the coming months.

Infrastructure

In addition to all those projects in development, we also tackled quite a few important infrastructure projects.

OSX test platform

10.10 is now the most widely used Mac platform for Firefox, and it's important to test what our users are running. We performed a rolling upgrade of our OS X testing environment, migrating from 10.8 to 10.10 while spending nearly zero capital, and with no downtime. We worked jointly with the Sheriffs and A-Team to green up all the tests, and shut coverage off on the old platform as we brought it up on the new one. We have a few 10.8 machines left riding the trains that will join our 10.10 pool with the release of ESR 38.1.

Got Windows builds in AWS

We saw the first successful builds of Firefox for Windows in AWS this quarter as well! This paves the way for greater flexibility, on-demand burst capacity, faster developer prototyping, and disaster recovery and resiliency for Windows Firefox builds. We'll be working on making these virtualized instances more performant and being able to do large-scale automation before we roll them out into production.

Puppet on windows

RelEng uses puppet to manage our Linux and OS X infrastructure. Presently, we use a very different toolchain, Active Directory and Group Policy Objects, to manage our Windows infrastructure. This quarter we deployed a prototype Windows build machine which is managed with puppet instead. Our goal here is to increase visibility and hackability of our Windows infrastructure. A common deployment tool will also make it easier for RelEng and community to deploy new tools to our Windows machines.

New Tooltool Features

We've redesigned and deployed a new version of tooltool, the content-addressable store for large binary files used in build and test jobs. Tooltool is now integrated with RelengAPI and uses S3 as a backing store. This gives us scalability and a more flexible permissioning model that, in addition to serving public files, will allow the same access outside the releng network as inside. That means that developers as well as external automation like TaskCluster can use the service just like Buildbot jobs. The new implementation also boasts a much simpler HTTP-based upload mechanism that will enable easier use of the service.

Centralized POSIX System Logging

Using syslogd/rsyslogd and Papertrail, we've set up centralized system logging for all our POSIX infrastructure. Now that all our system logs are going to one location and we can see trends across multiple machines, we've been able to quickly identify and fix a number of previously hard-to-discover bugs. We're planning on adding additional logs (like Windows system logs) so we can do even greater correlation. We're also in the process of adding more automated detection and notification of some easily recognizable problems.

Security work

Q1 included some significant effort to avoid serious security exploits like GHOST, escalation of privilege bugs in the Linux kernel, etc. We manage 14 different operating systems, some of which are fairly esoteric and/or no longer supported by the vendor, and we worked to backport some code and patches to some platforms while upgrading others entirely. Because of the way our infrastructure is architected, we were able to do this with minimal downtime or impact to developers.

API to manage AWS workers

As part of our ongoing effort to automate the loaning of releng machines when required, we created an API layer to facilitate the creation and loan of AWS resources, which was previously, and perhaps ironically, one of the bigger time-sinks for buildduty when loaning machines.

Cross-platform worker for task cluster

Release engineering is in the process of migrating from our stalwart, buildbot-driven infrastructure, to a newer, more purpose-built solution in taskcluster. Many FirefoxOS jobs have already migrated, but those all conveniently run on Linux. In order to support the entire range of release engineering jobs, we need support for Mac and Windows as well. In Q1, we created what we call a "generic worker," essentially a base class that allows us to extend taskcluster job support to non-Linux operating systems.

Testing

Last, but not least, we deployed initial support for SETA, the search for extraneous test automation!

This means we've stopped running all tests on all builds. Instead, we use historical data to determine which tests to run that have been catching the most regressions. Other tests are run less frequently.

April 20, 2015 11:00 AM

April 15, 2015

Kim Moir (kmoir)

Mozilla pushes - March 2015

Here's March 2015's  monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.

Trends
The number of pushes increased from those recorded in the previous month, with a total of 10943. 

Highlights

General Remarks

Records





April 15, 2015 02:18 PM

April 12, 2015

Massimo Gervasini (mgerva)

Buildduty April 2015

This month I am on Buildduty with Callek. We cover the European and the East Coast time zones.

What is “buildduty”?

From the buildduty wiki page

“Every month, there is one person from the Release Engineering (releng) team dedicated to helping out developers with releng-related issues. This person will be available during his or her regular work hours for the whole month. This is similar to the sheriff role that rotates through the sheriffing team . To avoid confusion, the releng sheriff position is known as “buildduty.”

My impressions after two weeks

I haven’t been covering the buildduty role for the last year, and I have to admit the idea of taking this role for a month was giving me some headaches. Buildduty used to be a source of stress, and this is the reason why each person was assigned to the buildduty role for a single week.
My fears were hugely exaggerated: buildduty is not as painful as it was in the past. Covering this role for just two weeks made me appreciate the incredible work done to make buildduty a better experience. A huge “thank you” to all who contributed to this!

What to expect

Recent changes have created a virtuous cycle, where we can spend less time on manual work and more time pushing the automation further. In particular, we are working on two specific areas for the near future:

 


April 12, 2015 08:59 AM

April 07, 2015

Justin Wood (Callek)

Find our footing on python best practices, of yesteryear.

In the beginning there was fire buildbot. This was Wed, 13 Feb 2008 for the first commit in the repository buildbot-configs.

For context, at this time:

In picking buildbot as our tool, we were improving vastly on the decade-old technology we had at the time (tinderbox), which was also written in oft-confusing and not-as-shiny perl (we love to hate it now, but it was a good language) [see relevant: image of then-new cutting-edge technology strung together in clunky ways]

As such, we at Mozilla Release Engineering, while just starting to realize the benefits of CI for tests in our main products (like Firefox), were not accustomed to it.

We were writing our buildbot-related code in 3 main repositories at the time (buildbot-configs, buildbotcustom, and tools) all of which we still use today.

Fast forward 5 years and you would have seen some common antipatterns in a large codebase… (over 203k lines of code!) It was hard to even read most code, let alone hack on it. Each patch required lots of headspace, and we would consistently break things with patches that were not well tested (even when we tried).

It was at a workweek here in 2013 that catlee got our group's agreement to try to improve that situation by continually running autopep8 over the codebase until there were no (or few) changes with each pass.

Thus began our first attempt at bringing our processes to what we call our modern practices.

This reduced our pep8 error rate, in buildbotcustom and tools alone, from ~7,139 to ~1,999. (In contrast, our current rate for those two repos is ~1,485.)

(NOTE: This is a good contributor task: drive pep8 errors/warnings down to 0 for any of our repos, such as these. We can then make our current tests fail if pep8 fails. Though newer repos started with pep8 compliance, older ones did not. See the List of Repositories to pick some if you want to try. It's not glorious work, but it makes everyone more productive once it's done.)

The one place we decided pep8 wasn't for us was line length: we have had many cases where a single line (or even URL) barely fits in 80 characters for legit reasons, and we felt that arbitrarily limiting variable names or nesting depth just to satisfy that restriction would reduce readability. Therefore we generally use --max-line-length of ~159 when validating against pep8. (The above numbers do not account for --max-line-length.)
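If you want to see the effect locally, autopep8 can be driven from Python as well as from the command line. The snippet below is a sketch that assumes autopep8's documented fix_code helper and its options dictionary; the team's actual cleanup was done by repeatedly running the command-line tool over the repositories.

import autopep8

# A tiny example of the kind of cleanup autopep8 applies, using the relaxed
# line length described above.
messy = "def f( a,b ):\n    return { 'sum':a+b }\n"
print(autopep8.fix_code(messy, options={'max_line_length': 159}))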

Around this time we had also set up an internal-only jenkins instance as a test for validating at least pep8 and its trends; we have since found jenkins not suitable for what we wanted.

Stay tuned to this blog for more history and how we arrived at some best practices that most don’t take for granted these days.

April 07, 2015 01:52 AM

March 31, 2015

Rail Alliev (rail)

Taskcluster: First Impression

Good news. We decided to redesign Funsize a little and now it uses Taskcluster!

The nature of Funsize is that we may start hundreds of jobs at the same time, then stop sending new jobs and wait for hours. In other words, the service is very bursty. Elastic Beanstalk is not ideal for this use case. Scaling up and down very fast is hard to configure using EB-only tools. Also, running zero instances is not easy.

I tried using Terraform, Cloud Formation and Auto Scaling, but they were also not well suited. There were too many constraints (e.g. Terraform doesn't support all needed AWS features) and they required considerable bespoke setup/maintenance to auto-scale properly.

The next option was Taskcluster, and I was pleased that its design fitted our requirements very well! I was impressed by the simplicity and flexibility offered.

I have implemented a service which consumes Pulse messages for particular buildbot jobs. For nightly builds, it schedules a task graph with three tasks:

  • generate a partial MAR
  • sign it (at the moment a dummy task)
  • publish to Balrog

All tasks are run inside Docker containers which are published on the docker.com registry (other registries can also be used). The task definition essentially consists of the docker image name and a list of commands it should run (usually this is a single script inside a docker image). In the same task definition you can specify what artifacts should be published by Taskcluster. The artifacts can be public or private.
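As a rough illustration, a task definition is just a JSON-serializable document naming the docker image, the command to run, and the artifacts to publish. The field names and values below are approximate placeholders, not a verbatim Funsize task:

import json

# Illustrative task definition: the shape is "docker image + commands + artifacts".
partial_mar_task = {
    "provisionerId": "aws-provisioner",
    "workerType": "funsize",
    "payload": {
        "image": "example/funsize-update-generator:latest",  # image from a docker registry
        "command": ["/runme.sh"],                            # usually a single script
        "maxRunTime": 3600,
        "artifacts": {
            "public/target.partial.mar": {
                "type": "file",
                "path": "/home/worker/target.partial.mar",
            },
        },
    },
    "metadata": {
        "name": "partial MAR generation (sketch)",
        "description": "illustrative example only",
    },
}

print(json.dumps(partial_mar_task, indent=2))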

Things that I really liked

  • Predefined task IDs. This is a great idea! There is no need to talk to the Taskcluster APIs to get the ID (or multiple IDs for task graphs), nor to parse the response. Fire and forget! The task IDs can be used in different places, like artifact URLs, dependent tasks, etc.
  • Task graphs. This is basically a collection of tasks that can be run in parallel and can depend on each other. This is a nice way to declare your jobs and know them in advance. If needed, the task graphs can be extended by their tasks (decision tasks) dynamically.
  • Simplicity. All you need is to generate a valid JSON document and submit it using HTTP API to Taskcluster.
  • User defined docker images. One of the downsides of Buildbot is that you have a predefined list of slaves with predefined environment (OS, installed software, etc). Taskcluster leverages Docker by default to let you use your own images.

Things that could be improved

  • Encrypted variables. I spent 2-3 days fighting with the encrypted variables. My scheduler was written in Python, so I tried to use a half dozen different Python PGP libraries, but for some reason all of them were generating an incompatible OpenPGP format that Taskcluster could not understand. This forced me to rewrite the scheduling part in Node.js using openpgpjs. There is a bug to address this problem globally. Also, using ISO time stamps would have saved me hours of time. :)
  • It would be great to have a generic scheduler that doesn't require third party Taskcluster consumers writing their own daemons watching for changes (AMQP, VCS, etc) to generate tasks. This would lower the entry barrier for beginners.

Conclusion

There are many other things that can be improved (and I believe they will!) - Taskcluster is still a new project. Regardless of this, it is very flexible, easy to use and develop. I would recommend using it!

Many thanks to garndt, jonasfj and lightsofapollo for their support!

March 31, 2015 12:47 PM

March 28, 2015

Jordan Lund (jlund)

Mozharness is moving into the forest

Since its beginnings, Mozharness has been living in its own world (repo). That's about to change. Next quarter we are going to be moving it in-tree.

what's Mozharness?

it's a configuration driven script harness

why in tree?

  1. First and foremost: transparency.
    • There is an overarching goal to provide developers the keys to manage and stand up their own builds & tests (AKA self-serve). Having the automation step logic side by side with the compile and test step logic provides developers transparency and a sense of determinism, which leads to reason number 2.
  2. deterministic builds & tests
    • This is somewhat already in place thanks to Armen's work on pinning specific Mozharness revisions to in-tree revisions. However, the pins can end up behind the latest Mozharness revisions, so we often end up landing multiple Mozharness changes at once against one in-tree revision.
  3. Mozharness automated build & test jobs are not just managed by Buildbot anymore. Taskcluster is starting to take the weight off Buildbot's hands and, because of its own behaviour, Mozharness is better suited in-tree.
  4. ateam is going to put effort this quarter into unifying how we run tests locally vs in automation. Having mozharness in-tree should make this easier.

this sounds great. why wouldn't we want to do this?

There are downsides. It arguably puts extra strain on Release Engineering for managing infra health. Though issues will be more isolated, it does become trickier to have a high-level view of when and where Mozharness changes land.

In addition, there is going to be more friction for deployments. This is because a number of our Mozharness scripts are not directly related to continuous integration jobs: e.g. releases, vcs-sync, b2g bumper, and merge tasks.

why wasn't this done yester-year?

Mozharness now handles > 90% of our build and test jobs. Its internal components (config, script, and log logic) are starting to mature. However, this wasn't always the case.

When it was being developed and its uses were unknown, it made sense to develop it on the side and tie it closely to buildbot deployments.

okay. I'm sold. can we just simply hg add mozharness?

Integrating Mozharness in-tree comes with a few challenges:

  1. chicken and egg issue

    • currently, for build jobs, Mozharness is in charge of managing version control of the tree itself. How can Mozharness check out a repo if it itself lives within that repo?
  2. test jobs don't require the src tree

    • test jobs only need a binary and a tests.zip. It doesn't make sense to keep a copy of our branches on each machine that runs tests. In line with that, putting mozharness inside tests.zip also leads us back to a similar 'chicken and egg' issue.
  3. which branch and revisions do our release engineering scripts use?

  4. how do we handle releases?

  5. how do we not cause extra load on hg.m.o?

  6. what about integrating into Buildbot without interruption?

it's easy!

This shouldn't be too hard to solve. Here is a basic outline of my plan of action and road map for this goal:

This is a loose outline of the integration strategy. What I like about this approach:

  1. no code change required within Mozharness' code
  2. there is very little code change within Buildbot
  3. allows Taskcluster to use Mozharness in whatever way it likes
  4. no chicken-and-egg problem, as (in the Buildbot world) Mozharness will exist before the tree exists on the slave
  5. no need to manage multiple repos and keep them in sync

I'm sure I am not taking into account many edge cases and I look forward to hitting those edges head on as I start this in Q2. Stay tuned for further developments.

One day, I'd like to see Mozharness (at least its internal parts) be made into isolated python packages installable by pip. However, that's another problem for another day.

Questions? Concerns? Ideas? Please comment here or in the tracking bug

March 28, 2015 11:10 PM

March 26, 2015

Morgan Phillips (mrrrgn)

Whoop, Whoop: Pull Up!

Since December 1st 1975, by FAA mandate, no plane has been allowed to fly without a "Ground Proximity Warning System" GPWS (or one of its successors).[1] For good reason too, as it's been figured that 75% of the fatalities just one year prior (1974) could have been prevented using the system.[2]

In a slew of case studies, reviewers reckoned that a GPWS may have prevented crashes by giving pilots additional time to act before they smashed into the ground. Often, the GPWS's signature "Whoop, Whoop: Pull Up!" would have sounded a full fifteen seconds before any other alarms triggered.[3]

Instruments like this are indispensable to aviation because pilots operate in an environment outside of any realm where human intuition is useful. Lacking augmentation, our bodies and minds are simply not suited to the task of flying airliners.

For the same reason, thick layers of instrumentation and early warning systems are necessary for managing technical infrastructure. Like pilots, without proper tooling, system administrators often plow their vessels into the earth....

The St. Patrick's Day Massacre

Case in point, on Saint Patrick's Day we suffered two outages which could have likely been avoided via some additional alerts and a slightly modified deployment process.

The first outage was caused by the accidental removal of a variable from a config file which one of our utilities depends on. Our utilities are all managed by a dependency system called runner, and when any task fails the machine is prevented from doing work until it succeeds. This all-or-nothing behavior is correct, but should not lead to closed trees....

On our runner dashboards, the whole event looked like this (the smooth decline on the right is a fix being rolled out with ansible):


The second, and most severe, outage was caused by an insufficient wait time between retries upon failing to pull from our mercurial repositories.

There was a temporary disruption in service, and a large number of slaves failed to clone a repository. When this herd of machines began retrying the task it became the equivalent of a DDoS attack.

From the repository's point of view, the explosion looked like this:


Then, from runner's point of view, the retrying task:


In both of these cases, despite having the data (via runner logging), we missed the opportunity to catch the problem before it caused system downtime. Furthermore, especially in the first case, we could have avoided the issue even earlier by testing our updates and rolling them out gradually.

Avoiding Future Massacres

After these fires went out, I started working on a RelEng version of the Ground Proximity Warning System, to keep us from crashing in the future. Here's the plan:

1.) Bug 1146974 - Add automated alerting for abnormally high retries (in runner).

In both of the above cases, we realized that things had gone amiss based on job backlog alerts. The problem is, once we have a large enough backlog to trigger those alarms, we're already hosed.

The good news is, the backlog is preceded by a spike in runner retries. Setting up better alerting here should buy us as much as an extra hour to respond to trouble.
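A back-of-the-envelope sketch of the kind of retry alarm this points at is shown below; the log marker and threshold are made up, and in practice the alerting hangs off the aggregated runner logs rather than a standalone script.

import sys
from collections import Counter

RETRY_THRESHOLD = 50  # hypothetical "too many retries" cutoff

def check_retries(lines):
    # Count retry lines per task and flag anything retrying suspiciously often.
    retries = Counter()
    for line in lines:
        if "retrying task" in line:              # hypothetical log marker
            task = line.rsplit(" ", 1)[-1].strip()
            retries[task] += 1
    return [(task, n) for task, n in retries.items() if n >= RETRY_THRESHOLD]

if __name__ == "__main__":
    for task, count in check_retries(sys.stdin):
        print("ALERT: %s retried %d times" % (task, count))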

We're already logging all task results to influxdb, but alerting via that data requires a custom nagios script. Instead of stringing that together, I opted to write runner output to syslog, where it's being aggregated by papertrail.

Using papertrail, I can grep for runner retries and build alarms from the data. Below is a screenshot of our runner data in the papertrail dashboard:



2.) Add automated testing, and tiered roll-outs to golden ami generation

Finally, when we update our slave images, the new version is not rolled out in a precise fashion. Instead, as old instances die (3 hours after the new image is released), new ones are launched on the latest version. Because of this, every deploy is an all-or-nothing affair.

By the time we notice a problem, almost all of our hosts are using the bad instance and rolling back becomes a huge pain. We also do rollbacks by hand. Nein, nein, nein.

My plan here is to launch new instances with a weighted chance of picking up the latest ami. As we become more confident that things aren't breaking -- by monitoring the runner logs in papertrail/influxdb -- we can increase the percentage.
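A minimal sketch of that weighted choice, with the percentage knob and AMI names as placeholders:

import random

def pick_ami(latest_ami, previous_ami, latest_percentage):
    """Return the AMI a new instance should launch with.

    latest_percentage is the rollout knob: 0 means always use the previous
    (known-good) image, 100 means the new image is fully rolled out.
    """
    if random.uniform(0, 100) < latest_percentage:
        return latest_ami
    return previous_ami

# Example: a cautious 10% rollout of a new image.
print(pick_ami("ami-new", "ami-known-good", latest_percentage=10))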

The new process will work like this: new instances pick up the latest AMI with a weighted chance, and we ramp that percentage up as confidence grows (see the sketch above). Lastly, if we want to roll back, we can just lower the percentage down to zero while we figure things out. This also means that we can create sanity checks which roll back bad AMIs without any human intervention whatsoever.

The intention being that any failure within the first 90 minutes will trigger a rollback and keep the doors open....

March 26, 2015 11:55 PM

Armen Zambrano G. (@armenzg)

mozci 0.4.0 released - Many bug fixes and improved performance

For the release notes with all their hyperlinks, go here.

NOTE: I did a 0.3.1 release but the right number should have been 0.4.0

This release does not add any major features, however, it fixes many issues and has much better performance.

Many thanks to @adusca, @jmaher and @vaibhavmagarwal for their contributions.

Features:


Fixes:



For all changes visit: 0.3.0...0.4.0


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

March 26, 2015 08:36 PM

March 20, 2015

Kim Moir (kmoir)

Scaling Yosemite

We migrated most of our Mac OS X 10.8 (Mountain Lion) test machines to 10.10.2 (Yosemite) this quarter.

This project had two major constraints:
1) Use the existing hardware pool (~100 r5 mac minis)
2) Keep wait times sane [1].  (The machines are constantly running tests most of the day due to the distributed nature of the Mozilla community, and this had to continue during the migration.)

So basically upgrade all the machines without letting people notice what you're doing!

Yosemite Valley - Tunnel View Sunrise by ©jeffkrause, Creative Commons by-nc-sa 2.0

Why didn't we just buy more minis and add them to the existing pool of test machines?
  1. We run performance tests and thus need to have all the machines running the same hardware within a pool so performance comparisons are valid.  If we buy new hardware, we need to replace the entire pool at once.  Machines with different hardware specifications = useless performance test comparisons.
  2. We tried to purchase some used machines with the same hardware specs as our existing machines.  However, we couldn't find a source for them.  As Apple stops production of old mini hardware each time they announce a new one, they are difficult and expensive to source.
Apple Pi by ©apionid, Creative Commons by-nc-sa 2.0

Given that Yosemite was released last October, why are we only upgrading our test pool now?  We wait until the population of users running a new platform [2] surpasses those on the old one before switching.

Mountain Lion -> Yosemite is an easy upgrade on your laptop.  It's not as simple when you're updating production machines that run tests at scale.

The first step was to pull a few machines out of production and verify that the Puppet configuration was working.  In Puppet, you can specify commands to run only on certain operating system versions, so we implemented several commands to accommodate changes for Yosemite. For instance, we changed the default scrollbar behaviour, disabled new services that interfere with test runs, configured the new Apple security permissions that debug tests require, etc.

Once the Puppet configuration was stable, I updated our configs so that people could run tests on Try, and allocated a few machines to this pool. We opened bugs for tests that failed on Yosemite but passed on other platforms.  This was a very iterative process.  Run tests on try.  Look at failures, file bugs, fix test manifests. Once we had the opt (functional) tests in a green state on try, we could start the migration.

Migration strategy
We currently have 14 machines left on Mountain Lion for mozilla-beta and mozilla-release branches.

As I mentioned earlier, the two constraints with this project were to use the existing hardware pool that constantly runs tests in production and to keep the existing wait times sane.  We encountered two major problems that impeded that goal:

It's a compliment when people say things like "I didn't realize that you updated a platform" because it means the upgrade did not cause large scale fires for all to see.  So it was a nice to hear that from one of my colleagues this week.

Thanks to philor, RyanVM and jmaher for opening bugs with respect to failing tests and greening them up.  Thanks to coop for many code reviews. Thanks to dividehex for reimaging all the machines in batches, and to arr for her valiant attempts to source new-to-us minis!

References
[1] Wait times represent the time from when a job is added to the scheduler database until it actually starts running. We usually try to keep this under 15 minutes, but this really varies with how many machines we have in the pool.
[2] We run tests for our products on a matrix of operating systems and operating system versions. The terminology for operating system x version in many release engineering shops is a platform.  To add to this, the list of platforms we support varies across branches.  For instance, if we're going to deprecate a platform, we'll let this change ride the trains to release.

Further reading
Bug 1121175: [Tracking] Fix failing tests on Mac OSX 10.10 
Bug 1121199: Green up 10.10 tests currently failing on try 
Bug 1126493: rollout 10.10 tests in a way that doesn't impact wait times
Bug 1144206: investigate what is causing frequent talos failures on 10.10
Bug 1125998: Debug tests initially took 1.5-2x longer to complete on Yosemite


Why don't you just run these tests in the cloud?
  1. The Apple EULA severely restricts virtualization on Mac hardware. 
  2. I don't know of any major cloud vendors that offer the Mac as a platform.  Those that claim they do are actually renting racks of Macs on a dedicated per host basis.  This does not have the inherent scaling and associated cost saving of cloud computing.  In addition, the APIs to manage the machines at scale aren't there.
  3. We manage ~350 Mac minis.  We have more experience scaling Apple hardware than many vendors. Not many places run CI at Mozilla scale :-) Hopefully this will change and we'll be able to scale testing on Mac products like we do for Android and Linux in a cloud.

March 20, 2015 06:50 PM

March 17, 2015

Kim Moir (kmoir)

Mozilla pushes - February 2015

Here's February 2015's monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.

Trends
Although February is a shorter month, the number of pushes was close to that recorded in the previous month.  We had a higher average number of daily pushes (358) than in January (348).

Highlights
10015 pushes
358 pushes/day (average)
Highest number of pushes/day: 574 pushes on Feb 25, 2015
23.18 pushes/hour (highest)

General Remarks
Try had around 46% of all the pushes
The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 22% of all the pushes

Records
August 2014 was the month with the most pushes (13090 pushes)
August 2014 has the highest pushes/day average with 422 pushes/day
July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
October 8, 2014 had the highest number of pushes in one day with 715 pushes 





March 17, 2015 03:54 PM

March 10, 2015

Hal Wine (hwine)

Docker at Vungle

Docker at Vungle

Tonight I attended the San Francisco Dev Ops meetup at Vungle. The topic was one we often discuss at Mozilla - how to simplify a developer’s life. In this case, the solution they have migrated to is one based on Docker, although I guess the title already gave that away.

Long (but interesting - I’ll update with a link to the video when it becomes available) story short, they are having much more success using DevOps managed Docker containers for development than their previous setup of Virtualbox images built & maintained with Vagrant and Chef.

Vungle’s new hire setup:
  • install Boot2Docker (they are an all Mac dev shop)
  • clone the repository. [1]
  • run the docker.sh script, which pulls all the base images from DockerHub. This one-time image pull gives the new hire time to fill out HR paperwork ;)
  • launch the app in the container and start coding.

Sigh. That’s nice. When you come back from PTO, just re-run the script to get the latest updates - it won’t take nearly as long as only the container deltas need to come down. Presto - back to work!

A couple of other highlights – I hope to do a more detailed post later.

  • They follow the ‘each container has a single purpose’ approach.
  • They use “helper containers” to hold recent (production) data.
  • Devs have a choice in front end development: inside the container (limited tooling) or in the local filesystem (dev’s choice of IDE, etc.). [2]
  • Currently, Docker containers are only being used in development. They are looking down the road to deploying containers in production, but it’s not a major focus at this time.

Footnotes

[1]Thanks to BFG for clarifying that docker-foo is kept in a separate repository from source code. The docker.sh script is in the main source code repository. [Updated 2015-03-11]
[2]More on this later. There are some definite tradeoffs.

March 10, 2015 07:00 AM

March 06, 2015

Armen Zambrano G. (@armenzg)

How to generate data potentially useful to a dynamically generated trychooser UI

If you're interested in generating an up-to-date trychooser, I would love to hear from you.
adusca has helped me generate data similar to what a dynamic trychooser UI could use.
If you would like to help, please visit bug 983802 and let us know.

In order to generate the data all you have to do is:
git clone https://github.com/armenzg/mozilla_ci_tools.git
cd mozilla_ci_tools
python setup.py develop
python scripts/misc/write_tests_per_platform_graph.py

That's it! You will then have a graphs.json dictionary with some of the pieces needed. Once we have an idea of how to generate the UI and what we're missing, we can modify this script.

Here's some of the output:
{
    "android": [
        "cppunit",
        "crashtest",
        "crashtest-1",
        "crashtest-2",
        "jsreftest-1",
        "jsreftest-2",
...

Here are the remaining keys:
[u'android', u'android-api-11', u'android-api-9', u'android-armv6', u'android-x86', u'emulator', u'emulator-jb', u'emulator-kk', u'linux', u'linux-pgo', u'linux32_gecko', u'linux64', u'linux64-asan', u'linux64-cc', u'linux64-mulet', u'linux64-pgo', u'linux64_gecko', u'macosx64', u'win32', u'win32-pgo', u'win64', u'win64-pgo']
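If you want to poke at the result yourself, the file is plain JSON; assuming the structure shown above (platform name mapped to a list of suites), something like this will summarize it:

import json

# Load the dictionary produced by write_tests_per_platform_graph.py and show
# how many suites each platform carries.
with open("graphs.json") as f:
    graphs = json.load(f)

for platform, tests in sorted(graphs.items()):
    print("%s: %d suites" % (platform, len(tests)))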


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

March 06, 2015 04:41 PM

March 05, 2015

Armen Zambrano G. (@armenzg)

mozci 0.3.0 - Support for backfilling jobs on treeherder added

Sometimes on treeherder, jobs get coalesced (i.e. we only run the tests on the most recent revision) in order to handle load. This is good: it lets us catch up when many pushes are committed on a tree.

However, when a job run on the most recent code comes back failing, we need to find out which revision introduced the regression. This is when we need to backfill up to the last good run.

In this release of mozci we have added the ability to --backfill:
python scripts/trigger_range.py --buildername "b2g_ubuntu64_vm cedar debug test gaia-js-integration-5" --dry-run --revision 2dea8b3c6c91 --backfill
This should be especially useful for sheriffs.

You can start using mozci as long as you have LDAP credentials. Follow these steps to get started:
git clone https://github.com/armenzg/mozilla_ci_tools.git
python setup.py develop (or install)


Release notes

Thanks again to vaibhav1994 and adusca for their many contributions in this release.

Major changes
  • Issue #75 - Added the ability to backfill changes until last good is found
  • No need to use --repo-name anymore
  • Issue #83 - Look for request_ids from a better place
  • Add interface to get status information instead of scheduling info
Minor fixes:
  • Fixes to make livehtml documentation
  • Make determine_upstream_builder() case insensitive
      Release notes: https://github.com/armenzg/mozilla_ci_tools/releases/tag/0.3.0
      PyPi package: https://pypi.python.org/pypi/mozci/0.3.0
      Changes: https://github.com/armenzg/mozilla_ci_tools/compare/0.2.5...0.3.0




      Creative Commons License
      This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

      March 05, 2015 04:19 PM

      March 03, 2015

      Armen Zambrano G. (@armenzg)

      mozci 0.2.5 released - major bug fixes + many improvements

        Big thanks again to vaibhav1994, adusca and valeriat for their many contributions to this release.

      Release notes

      Major bug fixes:
      • Bug fix: Sort pushid_range numerically rather than alphabetically
      • Calculation of hours_ago would not take days into consideration
      Others:
      • Added coveralls/coverage support
      • Added "make livehtml" for live documentation changes
      • Improved FAQ
      • Updated roadmap
      • Large documentation refactoring
      • Automatically document scripts
      • Added partial testing of mozci.mozci
      • Streamed fetching of allthethings.json and verify integrity
      • Clickable treeherder links
      • Added support for zest.releaser
        Release notes: https://github.com/armenzg/mozilla_ci_tools/releases/tag/0.2.5
        PyPi package: https://pypi.python.org/pypi/mozci/0.2.5
        Changes: https://github.com/armenzg/mozilla_ci_tools/compare/0.2.4...0.2.5


        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        March 03, 2015 10:46 PM

        How to generate allthethings.json

        It's this easy!
            hg clone https://hg.mozilla.org/build/braindump
            cd braindump/community
            ./generate_allthethings_json.sh

        allthethings.json is generated based on data from buildbot-configs.
        It contains data about builders, schedulers, masters and slavepools.

        If you want to extract information from allthethings.json feel free to use mozci to help you!
        https://mozilla-ci-tools.readthedocs.org/en/latest/allthethings.html


        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        March 03, 2015 04:54 PM

        February 27, 2015

        Chris AtLee (catlee)

        Diving into python logging

        Python has a very rich logging system. It's very easy to add structured or unstructured log output to your python code, and have it written to a file, or output to the console, or sent to syslog, or to customize the output format.

        We're in the middle of re-examining how logging works in mozharness to make it easier to factor-out code and have fewer mixins.

        Here are a few tips and tricks that have really helped me with python logging:

        There can be only more than one

        Well, there can be only one logger with a given name. There is a special "root" logger with no name. Multiple getLogger(name) calls with the same name will return the same logger object. This is an important property because it means you don't need to explicitly pass logger objects around in your code. You can retrieve them by name if you wish. The logging module maintains a global registry of logger objects.

        You can have multiple loggers active, each specific to its own module or even class or instance.

        Each logger has a name, typically the name of the module it's being used from. A common pattern you see in python modules is this:

        # in module foo.py
        import logging
        log = logging.getLogger(__name__)
        

        This works because inside foo.py, __name__ is equal to "foo". So inside this module the log object is specific to this module.

        Loggers are hierarchical

        The names of the loggers form their own namespace, with "." separating levels. This means that if you have loggers called foo.bar and foo.baz, you can do things on logger foo that will impact both of the children. In particular, you can set the logging level of foo to show or ignore debug messages for both submodules.

        # Let's enable all the debug logging for all the foo modules
        import logging
        logging.getLogger('foo').setLevel(logging.DEBUG)
        

        Log messages are like events that flow up through the hierarchy

        Let's say we have a module foo.bar:

        import logging
        log = logging.getLogger(__name__)  # __name__ is "foo.bar" here
        
        def make_widget():
            log.debug("made a widget!")
        

        When we call make_widget(), the code generates a debug log message. Each logger in the hierarchy has a chance to output something for the message, ignore it, or pass the message along to its parent.

        The default configuration for loggers is to have their levels unset (or set to NOTSET). This means the logger will just pass the message on up to its parent. Rinse & repeat until you get up to the root logger.

        So if the foo.bar logger hasn't specified a level, the message will continue up to the foo logger. If the foo logger hasn't specified a level, the message will continue up to the root logger.

        This is why you typically configure the logging output on the root logger; it typically gets ALL THE MESSAGES!!! Because this is so common, there's a dedicated method for configuring the root logger: logging.basicConfig()

        This also allows us to use mixed levels of log output depending on where the message are coming from:

        import logging
        
        # Enable debug logging for all the foo modules
        logging.getLogger("foo").setLevel(logging.DEBUG)
        
        # Configure the root logger to log only INFO calls, and output to the console
        # (the default)
        logging.basicConfig(level=logging.INFO)
        
        # This will output the debug message
        logging.getLogger("foo.bar").debug("ohai!")
        

        If you comment out the setLevel(logging.DEBUG) call, you won't see the message at all.

        exc_info is teh awesome

        All the built-in logging calls support a keyword called exc_info which, if it isn't false, causes the current exception information to be logged in addition to the log message. e.g.:

        import logging
        logging.basicConfig(level=logging.INFO)
        
        log = logging.getLogger(__name__)
        
        try:
            assert False
        except AssertionError:
            log.info("surprise! got an exception!", exc_info=True)
        

        There's a special case for this, log.exception(), which is equivalent to log.error(..., exc_info=True)

        Python 3.2 introduced a new keyword, stack_info, which will output the stack leading up to the current point in the code. Very handy to figure out how you got to a certain point in the code, even if no exceptions have occurred!

        "No handlers found..."

        You've probably come across this message, especially when working with 3rd party modules. What this means is that you don't have any logging handlers configured, and something is trying to log a message. The message has gone all the way up the logging hierarchy and fallen off the...top of the chain (maybe I need a better metaphor).

        import logging
        log = logging.getLogger()
        log.error("no log for you!")
        

        outputs:

        No handlers could be found for logger "root"
        

        There are two things that can be done here:

        1. Configure logging in your module with basicConfig() or similar

        2. Library authors should add a NullHandler at the root of their module to prevent this. See the cookbook and this blog for more details here.
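        A minimal version of that library-side pattern looks like this:

        # In your library's top-level __init__.py: attach a do-nothing handler so
        # that "No handlers could be found" never appears for applications that
        # haven't configured logging themselves.
        import logging

        logging.getLogger(__name__).addHandler(logging.NullHandler())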

        Want more?

        I really recommend that you read the logging documentation and cookbook which have a lot more great information (and are also very well written!) There's a lot more you can do, with custom log handlers, different output formats, outputting to many locations at once, etc. Have fun!

        February 27, 2015 09:09 PM

        February 25, 2015

        Massimo Gervasini (mgerva)

        on mixins

        We use mixins quite a lot in mozharness.

        Mixins are a powerful pattern that allows you to extend your objects, reusing your code (more here). Think about mixins as “plugins”: you can create your custom class and import features just by inheriting from a Mixin class, for example:

        class B2GBuild(LocalesMixin, PurgeMixin, B2GBuildBaseScript,
                       GaiaLocalesMixin, SigningMixin, MapperMixin, BalrogMixin):
        

        B2GBuild manages FirefoxOS builds and it knows how to:
        * manage locales (LocalesMixin)
        * how to deal with repositories (PurgeMixin)
        * sign the code (SigningMixin)
        * and more…

        this is just from the class definition! At this point we haven’t added a single method or property, but we already know how to do a lot of tasks, and it’s almost for free!

        So should we use mixins everywhere? Short answer: No.
        Long answer: Mixins are powerful, but they can also lead to some unexpected behavior.

        Objects C and D have exactly the same parents and the same methods, but their behavior is different: it depends on the order in which the parents are declared.
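        An illustrative sketch of that kind of surprise (not the original snippet from this post, just a minimal reconstruction of the idea):

        class Base1(object):
            def hello(self):
                return "Base1"

        class Base2(object):
            def hello(self):
                return "Base2"

        # C and D mix in the same two parents, but in a different order...
        class C(Base1, Base2):
            pass

        class D(Base2, Base1):
            pass

        # ...so the method resolution order picks a different hello() for each.
        print(C().hello())  # prints Base1
        print(D().hello())  # prints Base2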

        This is a side effect of the way python implements inheritance. Having an object inherit from too many Mixins can lead to unexpected failures (through the MRO, the method resolution order) when the object is instantiated, or even worse, at runtime when a method is doing something that is not expected.
        When the inheritance becomes obscure, it also becomes difficult to write appropriate tests.

        How can we write a mozharness module without using mixins? Let's try to write a generic module that provides some disk information; for example, we could create the mozharness.base.diskutils module that provides useful information about the disk size. Our first approach would be to write something like:

        class DiskInfoMixin():
            def get_size(self, path):
                self.info('calculating disk size')
                <code here>
        
            def other_methods(self):
                <code here>
        

        and then use it in the final class

        from mozharness.base.diskutils import DiskInfoMixin
        ...
        
        class BuildRepackages(ScriptMixin, LogMixin, ..., DiskInfoMixin):
        ...
            disk_info = self.get_size(path)
        

        Easy! But why are we using a mixin here? Because we need to log some operations, and to do so we need to interact with the LogMixin. This mixin provides everything we need to log messages with mozharness: it provides an abstraction layer to make logging consistent among all the mozharness scripts, and it's very easy to use; just import the LogMixin and start logging!
        The same code, without using the LogMixin, would more or less be:

        import logging
        
        def get_size(path):
            logging.info('calculating disk size')
            ...
            return disk_size

        Just a function. Even easier.

        … and the final script becomes:

        from mozharness.base.diskutils import get_size
        class BuildRepackages(ScriptMixin, LogMixin, ...,):
        ...
             disk_info = get_size(path)

        One less mixin!
        There’s a problem though. Messages logged by get_size() will be inconsistent with the rest of the logging. How can we use the mozharness logging style in other modules?
        The LogMixin, it’s a complex class and it has many methods, but at the end of the day it’s a wrapper around the logging module, so behind the scenes, it must call the logger module. What if we can just ask our logger to use the python log facilities, already configured by mozharness?
        getLogger() method is what we need here!

        import logging
        mhlog = logging.getLogger('Multi')
        def get_size(path):
            mhlog.info('calculating disk size')
            ...
            return disk_size

        Mozharness by default uses this ‘Multi‘ logger for its messages, so we have just hooked our logger into the mozharness one. Now every logger call will follow the mozharness style!
        We are halfway through the logging issues for our brand new module: what if we want to log at an arbitrary log level? A quite common pattern in mozharness is to let the caller of a function decide at what level to log, so let's add a log_level parameter…

        import logging
        mhlog = logging.getLogger('Multi')
        def get_size(path, log_level=logging.INFO):
            mhlog.log(log_level, 'calculating disk size')
            ...
            return disk_size

        This will work fine for a generic module, but we want to use this module in mozharness, so there's one more thing to change: mozharness log levels are strings, while logging module levels are integers, so we need a function to convert between the two formats.
        For convenience, in mozharness.base.log we will explicitly expose the mozharness log levels and add a function that converts mozharness log levels to standard log levels.

        LOG_LEVELS = {
            DEBUG: logging.DEBUG,
            INFO: logging.INFO,
            WARNING: logging.WARNING,
            ERROR: logging.ERROR,
            CRITICAL: logging.CRITICAL,
            FATAL: FATAL_LEVEL
        }
        
        def numeric_log_level(level):
            """Converts a mozharness log level (string) to the corresponding logger
               level (number). This function makes it possible to set the log level
               in functions that do not inherit from LogMixin
            """
            return LOG_LEVELS[level]
        

        our final module becomes:

        import logging
        from mozharness.base.log import INFO, numeric_log_level
        # use mozharness log
        mhlog = logging.getLogger('Multi')
        
        def get_size(path, unit, log_level=INFO):
            ...
            lvl = numeric_log_level(log_level)
            mhlog.log(lvl, "calculating disk size")

        This is just an example of how to use the standard python logging module.
        A real diskutils module is about to land in mozharness (bug 1130336), and it shouldn't be too difficult, following the same pattern, to create new modules with no dependencies on LogMixin.

        This is a first step in the direction of removing some mixins from the mozharness code (see bug 1101183).
        Mixins are not absolute evil, but they must be used carefully. From now on, if I have to write or modify anything in a mozharness module I will try to enforce the following rules:


        February 25, 2015 05:00 PM

        Kim Moir (kmoir)

        Release Engineering special issue now available

        The release engineering special issue of IEEE software was published yesterday. (Download pdf here).  This issue focuses on the current state of release engineering, from both an industry and research perspective. Lots of exciting work happening in this field!

        I'm interviewed in the roundtable article on the future of release engineering, along with Chuck Rossi of Facebook and Boris Debic of Google.  Interesting discussions on the current state of release engineering at organizations that scale large number of builds and tests, and release frequently.  As well,  the challenges with mobile releases versus web deployments are discussed. And finally, a discussion of how to find good release engineers, and what the future may hold.

        Thanks to the other guest editors on this issue -  Stephany Bellomo, Tamara Marshall-Klein, Bram Adams, Foutse Khomh and Christian Bird - for all their hard work to make this happen!


        As an aside, when I opened the issue, the image on the front cover made me laugh.  It's reminiscent of the cover on a mid-century science fiction anthology.  I showed Mr. Releng and he said "Robot birds? That is EXACTLY how I pictured working in releng."  Maybe it's meant to represent that we let software fly free.  In any case, I must go back to tending the flock of robotic avian overlords.

        February 25, 2015 03:26 PM

        February 24, 2015

        Armen Zambrano G. (@armenzg)

        Listing builder differences for a buildbot-configs patch improved

        Up until now, we updated the buildbot-configs repository to the "default" branch instead of "production" since we normally write patches against that branch.

        However, there is a problem with this: buildbot-configs must always be on the same branch as buildbotcustom. Otherwise, we can have changes land in one repository which require changes in the other.

        The fix was to simply make sure that both repositories are either on default or their associated production branches.

        Besides this fix, I have landed two more changes:

        1. Use the production branches instead of 'default'
          • Use -p
        2. Clobber our whole set up (e.g. ~/.mozilla/releng)
          • Use -c

        Here are the two changes:
        https://hg.mozilla.org/build/braindump/rev/7b93c7b7c46a
        https://hg.mozilla.org/build/braindump/rev/bbb5c54a7d42


        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        February 24, 2015 09:45 PM

        February 23, 2015

        Nick Thomas (nthomas)

        FileMerge bug

        FileMerge is a nice diff and merge tool for OS X, and I use it a lot for larger code reviews where lots of context is helpful. It also supports intra-line diff, which comes in pretty handy.

        filemerge screenshot

        However in recent releases, at least in v2.8 which comes as part of XCode 6.1, it assumes you want to be merging and shows that bottom pane. Adjusting it away doesn’t persist to the next time you use it, *gnash gnash gnash*.

        The solution is to open a terminal and offer this incantation:

        defaults write com.apple.FileMerge MergeHeight 0

        Unfortunately, if you use the merge pane then you’ll have to do that again. Dear Apple, pls fix!

        February 23, 2015 09:23 AM

        February 15, 2015

        Rail Alliev (rail)

        Funsize hacking

        Prometheus

        The idea of using a service which can generate partial updates for Firefox has been around for years. We actually used to have a server called Prometheus that was responsible for generating updates for nightly builds and the generation was done as a separate process from actual builds.

        Scaling that solution wasn't easy and we switched to build-time update generation. Generating updates as a part of builds helped with load distribution, but lacked flexibility: there is no easy way to generate updates after the build, because the update generation process is directly tied to the build or repack process.

        Funsize will solve the problems listed above: it distributes the load and it is flexible.

        Last year Anhad started and Mihai continued working on this project. They have done a great job and created a solution that can easily be scaled.

        Funsize is split into several pieces:

        • REST API frontend powered by Flask. It's responsible for accepting partial generation requests, forwarding them to the queue and returning generated partials.
        • Celery-based workers to generate partial updates and upload them to S3.
        • SQS or RabbitMQ to coordinate Celery workers.

        One of the biggest gains of Funsize is that it uses a global cache to speed up partial generation. For example, after we build an en-US Windows build, we ask Funsize to generate a partial. Then a swarm of L10N repacks (almost a hundred of them per platform) tries to do a similar job. Every single one asks for a partial update. All L10N builds have something in common, and xul.dll is one of the biggest files. Since the files are identical there is no reason to not reuse the previously generated binary patch for that file. Repeat 100 times for multiple files. PROFIT!
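        To make the caching idea concrete, here is a rough sketch of how a cache keyed on the hashes of the "from" and "to" files can avoid regenerating identical patches. The names here (PATCH_CACHE, get_patch) are made up for illustration and are not Funsize's actual API:

        import hashlib
        import subprocess

        # In-memory stand-in for Funsize's global (shared) cache.
        PATCH_CACHE = {}

        def file_sha512(path):
            """Return the SHA-512 hex digest of a file's contents."""
            digest = hashlib.sha512()
            with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(1024 * 1024), b''):
                    digest.update(chunk)
            return digest.hexdigest()

        def get_patch(from_file, to_file, patch_file):
            """Reuse a previously generated binary patch when the inputs match."""
            key = (file_sha512(from_file), file_sha512(to_file))
            if key in PATCH_CACHE:
                with open(patch_file, 'wb') as f:
                    f.write(PATCH_CACHE[key])
            else:
                # mbsdiff is the bsdiff variant used for MAR-internal patches;
                # any bsdiff-style tool illustrates the same point.
                subprocess.check_call(['mbsdiff', from_file, to_file, patch_file])
                with open(patch_file, 'rb') as f:
                    PATCH_CACHE[key] = f.read()
            return patch_file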

        The first prototype of Funsize lives at github. If you are interested in hacking, read the docs on how to set up your developer environment. If you don't have an AWS account, it will use a local cache.

        Note: this prototype may be redesigned and switch to using TaskCluster. Taskcluster is going to simplify the initial design and reduce dependency on always online infrastructure.

        February 15, 2015 04:32 AM

        February 13, 2015

        Armen Zambrano G. (@armenzg)

        Mozilla CI tools 0.2.1 released - Trigger multiple jobs for a range of revisions

        Today I have released a major new version of mozci, which (among other things) adds the ability to trigger multiple jobs for a range of revisions.


        PyPi:       https://pypi.python.org/pypi/mozci
        Source:   https://github.com/armenzg/mozilla_ci_tools


        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        February 13, 2015 04:14 PM

        Kim Moir (kmoir)

        Mozilla pushes - January 2015

        Here's January 2015's monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.

        Trends
        We're back to regular volume after the holidays. Also, it's really cold outside in some parts of the Mozilla world.  Maybe committing code > going outside.


        Highlights
        10798 pushes
        348 pushes/day (average)
        Highest number of pushes/day: 562 pushes on Jan 28, 2015
        18.65 pushes/hour (highest)

        General Remarks
        Try had around 42% of all the pushes
        The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 24% of all of the pushes

        Records
        August 2014 was the month with most pushes (13,090  pushes)
        August 2014 has the highest pushes/day average with 422 pushes/day
        July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
        October 8, 2014 had the highest number of pushes in one day with 715 pushes 




        February 13, 2015 04:13 PM

        February 09, 2015

        Morgan Phillips (mrrrgn)

        Gödel, Docker, Bach: Containers Building Containers

        As Docker continues to mature, many organizations are striving to run as much of their infrastructure as possible within containers. Of course, this investment results in a lot of docker-centric tooling for deployment, development, etc...

        Given that, I think it makes a lot of sense for docker containers themselves to be built within other docker containers. Otherwise, you'll introduce a needless exception into your automation practices. Boo to that!

        There are a few ways to run docker from within a container, but here's a neat way that leaves you with access to your host's local images: just mount the Docker socket (/var/run/docker.sock) and the docker binary from your host system into the container.

        ** note: in cases where arbitrary users can push code to your containers, this would be a dangerous thing to do **
        Et voila!



        February 09, 2015 10:11 PM

        Introducing RelEng Containers: Build Firefox Consistently (For A Better Tomorrow)

        From time to time, Firefox developers encounter errors which only appear on our build machines. Meaning -- after they've likely already failed numerous times to coax the failure from their own environment -- they must resort to requesting RelEng to pluck a system from our infrastructure so they can use it for debugging: we call this a slave loan, and they happen frequently.

        Case in point: bug #689291

        Firefox is a huge open source project: slave loans can never scale enough to serve our community. So, this weekend I took a whack at solving this problem with Docker. So far, five [of an eventual fourteen] containers have been published, which replicate aspects of our in-house build environments. As usual, you can find my scratch work on GitHub: mozilla/build-environments




        What Are These Environments Based On?

        For a long time, builds have taken place inside of chroots built with Mock. We have three bare-bones mock configs, which I used to bake some base platform images. On top of our Mock configs, we further specialize build chroots via build scripts powered by Mozharness. The specifications of each environment are laid out in these mozharness configs. To make use of these, I wrote a simple script which converts a mozharness config into a Dockerfile.

        The environments I've published so far are based on those configs. The next step, before I publish more containers, will be to write some documentation for developers so they can begin using them for builds with minimal hassle. Stay tuned!

        February 09, 2015 06:09 AM

        February 06, 2015

        Hal Wine (hwine)

        Kaizen the low tech way

        Kaizen the low tech way

        On Jan 29, I treated myself to a seminar on Successful Lean Teams, with an emphasis on Kanban & Kaizen techniques. I’d read about both, but found the presentation useful. Many of the other attendees were from the Health Care industry and their perspectives were very enlightening!

        Hearing how successful they were in such a high risk, multi-disciplinary, bureaucratic, and highly regulated environment is inspiring. I’m inclined to believe that it would also be achievable in a simple-by-comparison low risk environment of software development. ;)

        What these hospitals are using is a light weight, self managed process which:

        • ensures visibility of changes to all impacted folks
        • outlines the expected benefits
        • includes a “trial” to ensure the change has the desired impact
        • has a built in feedback system

        That sounds achievable. In several of the settings, the traditional paper and bulletin board approach was used, with 4 columns labeled “New Ideas”, “To Do”, “Doing”, and “Done”. (Not a true Kanban board for several reasons, but Trello would be a reasonable visual approximation; CAB uses spreadsheets.)

        Cards move left to right, and could cycle back to “New Ideas” if iteration is needed. “New Ideas” is where things start, and they transition from there (I paraphrase a lot in the following):

        1. Everyone can mark up cards in New Ideas & add alternatives, etc.
        2. A standup is held to select cards to move from “New Ideas” to “To Do”
        3. The card stays in “To Do” for a while to allow concerns to be expressed by other stake holders. Also a team needs to sign up to move the change through the remaining steps. Before the card can move to “Doing”, a “test” (pilot or checkpoints) is agreed on to ensure the change can be evaluated for success.
        4. The team moves the card into “Doing”, and performs PDSA cycles (Plan, Do, Study, Adjust) as needed.
        5. Assuming the change yields the projected results, the change is implemented and the card is moved to “Done”. If the results aren’t as anticipated, the card gets annotated with the lessons learned, and either goes to “Done” (abandon) or back to “New Ideas” (try again) as appropriate.

        For me, I’m drawn to the 2nd and 3rd steps. That seems to be the change from current practice in teams I work on. We already have a gazillion bugs filed (1st step). We also can test changes in staging (4th step) and update production (5th step). Well, okay, sometimes we skip the staging run. Occasionally that *really* bites us. (Foot guns, foot guns – get your foot guns here!)

        The 2nd and 3rd steps help focus on changes. And make the set of changes happening “nowish” more visible. Other stakeholders then have a small set of items to comment upon. Net result - more changes “stick” with less overall friction.

        Painting with a broad brush, this Kaizen approach is essentially the CAB process that Mozilla IT implemented successfully. I have seen the CAB reduce the amount of stress, surprises, and self-inflicted damage both inside and outside of IT. Over time, the velocity of changes has increased and backlogs have been reduced. In short, it is a “Good Thing(tm)”.

        So, I’m going to see if there is a way to “right size” this process for the smaller teams I’m on now. Stay tuned....

        February 06, 2015 08:00 AM

        February 04, 2015

        Rail Alliev (rail)

        Deploying your code from github to AWS Elastic Beanstalk using Travis

        I have been playing with Funsize a lot recently. One of the goals was iterating faster:

        I have hit some challenges with both Travis and Elastic Beanstalk.

        The first challenge was to run the integration (actually end-to-end) tests in the same environment. Funsize uses Docker for both hacking and production environments. Unfortunately it's not possible to create Docker images as part of a Travis job (there is an option to run jobs inside Docker, but this is a different beast).

        A simple bash script works around this problem. It starts all services we need in background and runs the end-to-end tests. The end-to-end test asks Funsize to generate several partial MAR files, downloads identical files from Mozilla's FTP server and compares their content skipping the cryptographic signature (Funsize does not sign MAR files).

        The next challenge was deploying the code. We use Elastic Beanstalk as a convenient way to run simple services. There is a plan to use something else for Funsize, but at the moment it's Elastic Beanstalk.

        Travis has support for Elastic Beanstalk, but it's still experimental and at the time of writing this post there was no documentation on the official website. The .travis.yml file looks straightforward and worked fine. The only minor issue I hit was a long commit message.

        # .travis.yml snippet
        deploy:
            - provider: elasticbeanstalk
              app: funsize # Elastic Beanstalk app name
              env: funsize-dev-rail # Elastic Beanstalk env name
              bucket_name: elasticbeanstalk-us-east-1-314336048151 # S3 bucket used by Elastic Beanstalk
              region: us-east-1
              access_key_id:
                secure: "encrypted key id"
              secret_access_key:
                secure: "encrypted key"
              on:
                  repo: rail/build-funsize # Deploy only using my user repo for now
                  all_branches: true
                  # deploy only if particular jobs in the job matrix passes, not any
                  condition: $FUNSIZE_S3_UPLOAD_BUCKET = mozilla-releng-funsize-travis
        

        Having the credentials in a public version control system, even if they are encrypted, makes me very nervous. To minimize possible harm in case something goes wrong I created a separate user in AWS IAM. I couldn't find any decent docs on what permissions a user should have to be able to deploy something to Elastic Beanstalk. It took a while to figure out this minimal set of permissions. Even with these permissions the user looks very powerful, albeit with access limited to EB, S3, EC2, Auto Scaling and CloudFormation.

        Conclusion: using Travis for Elastic Beanstalk deployments is quite stable and easy to use (after the initial setup) unless you are paranoid about some encrypted credentials being available on github.

        February 04, 2015 02:09 AM

        February 03, 2015

        Armen Zambrano G. (@armenzg)

        What the current list of buildbot builders is

        This becomes very easy with mozilla_ci_tools (aka mozci):
        >>> from mozci import mozci
        >>> builders = mozci.list_builders()
        >>> len(builders)
        15736
        >>> builders[0]
        u'Linux x86-64 mozilla-inbound leak test build'
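        Since the builder names come back as plain strings, ordinary filtering works on them too. For example (a quick sketch, using nothing mozci-specific beyond list_builders):

        from mozci import mozci

        # Builders are plain strings, so standard filtering applies; e.g. count
        # the builders that target mozilla-inbound (the number will vary).
        builders = mozci.list_builders()
        inbound = [b for b in builders if 'mozilla-inbound' in b]
        print(len(inbound))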
        This and many other ways to interact with our CI will be showing up in the repository.


        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        February 03, 2015 07:48 PM

        Morgan Phillips (mrrrgn)

        shutdown -r never part deux

        In my last post, I wrote about how runner and cleanslate were being leveraged by Mozilla RelEng to try at eliminating the need for rebooting after each test/build job -- thus reclaiming a good deal of wasted time. Since then, I've had the opportunity to outfit all of our hosts with better logging, and collect live data which highlights the progress that's been made. It's been bumpy, but the data suggests that we have reduced reboots (across all tiers) by around 40% -- freeing up over 72,000 minutes of compute time per day, with an estimated savings of $51,000 per year.

        Note: this figure excludes decreases in end-to-end times, which are still waiting to be accurately measured.

        Collecting Data

        With Runner managing all of our utilities, an awesome opportunity for logging was presented: the ability to create something like a distributed ps. To take advantage of this, I wrote a "task hook" feature which passes task state to an external script. From there, I wrote a hook script which logs all of our data to an influxdb instance. With the influxdb hook in place, we can query to find out which jobs are currently running on hosts and what the results were of any jobs that have previously finished. We can also use it to detect rebooting.
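        For a sense of the shape of such a hook (this is a sketch, not the actual script; the endpoint URL, series name and argument order are all made up), a minimal version could receive the task name and state as arguments and post a datapoint over HTTP:

        import json
        import socket
        import sys
        import time

        import requests

        # Hypothetical InfluxDB endpoint; the real hook's URL, credentials and
        # series schema live in the actual deployment's configuration.
        INFLUXDB_URL = 'http://influxdb.example.com:8086/db/runner/series?u=runner&p=secret'

        def main():
            task, state = sys.argv[1], sys.argv[2]  # e.g. "buildbot", "started"
            payload = [{
                'name': 'runner_tasks',
                'columns': ['time', 'host', 'task', 'state'],
                'points': [[int(time.time()), socket.gethostname(), task, state]],
            }]
            requests.post(INFLUXDB_URL, data=json.dumps(payload), timeout=10)

        if __name__ == '__main__':
            main()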

        Having this state information has been a real game changer with regards to understanding the pain points of our infrastructure, and debugging issues which arise. Here are a few of the dashboards I've been able to create:

        * a started buildbot task generally indicates that a job is active on a machine *


        * a global ps! *


        * spikes in task retries almost always correspond to a new infra problem; seeing it here first allows us to fix it and cut down on job backlogs *


        * we reboot after certain kinds of tests and anytime a job fails, thus testers reboot a lot more often *


        Costs/Time Saved Calculations

        To calculate "time saved" I used influxdb data to figure out the time between a reboot and the start of a new round of tasks. Once I had this figure, I subtracted the number of reboots from the total number of completed buildbot tasks over a given period, then multiplied by the average reboot gap period. This isn't an exact method, but it gives a ballpark idea of how much time we're saving.
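        In code, the estimate boils down to something like this (the input numbers are placeholders, not the measured values):

        # Placeholder inputs; the real values come from the influxdb queries
        # in the repo linked below.
        completed_tasks = 100000      # buildbot tasks finished in the 24h window
        reboots = 60000               # reboots observed in the same window
        avg_reboot_gap_minutes = 2.5  # average gap between a reboot and the next task

        # Every completed task that did not trigger a reboot saved roughly one gap.
        minutes_saved = (completed_tasks - reboots) * avg_reboot_gap_minutes
        print(minutes_saved, minutes_saved / 60.0)  # minutes and hours saved per day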

        The data I'm using here was taken from a single 24 hour period (01/22/15 - 01/23/15). Spot checks have confirmed that this is representative of a typical day.



        I used Mozilla's AWS billing statement from December 2014 to calculate the average cost of spot/non-spot instances per hour:

        (non-spot) cost: $6802.03 time: 38614hr avg: $0.18/hr

        (spot) cost: $14277.72 time: 875936hr avg: $0.02/hr


        Finding opex/capex figures is not easy; however, I did discover the price of adding 200 additional OSX machines in 2015. Based on that, each mac's capex would be just over $2200.

        To calculate the "dollars saved" I broke the time saved into AWS (spot/non-spot) and OSX then multiplied it by the appropriate dollar/hour ratio. The results being: $6621.10 per year for AWS and a bit over 20 macs worth of increased throughput, valued at just over $44,000.

        You can see all of my raw data, queries, and helper scripts at this github repo: https://github.com/mrrrgn/build-no-reboots-data

        Why Are We Only Saving 40%?

        The short answer: not rebooting still breaks most test jobs. Turning off reboots without cleanslate resulted in nearly every test failing (thanks to ports being held onto by utilities used in previous jobs, lack of free memory, etc...). However, even with processes being reset, some types of state persist between jobs in places which are proving more difficult to debug and clean. Namely, anything which interacts with a display server.

        To take advantage of the jobs which are already working, I added a task, "post_flight.py," which decides whether or not to reboot a system after each runner loop. The decision is based partly on some "blacklists" for job/host names which always require a reboot, and partly on whether or not the previous test/build completed successfully. For instance, if I want all linux64 systems to reboot, I just add ".*linux64.*" to the hostname blacklist; if I want all mochi tests to coerce a reboot I add ".*mochitest.*" to the job name blacklist.
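        The decision logic described above reads roughly like this sketch (a paraphrase, not the actual post_flight.py; the blacklist contents are just the examples from the paragraph):

        import re

        # Hosts/jobs matching these patterns always get a reboot.
        HOSTNAME_BLACKLIST = [r'.*linux64.*']
        JOBNAME_BLACKLIST = [r'.*mochitest.*']

        def needs_reboot(hostname, job_name, previous_job_succeeded):
            """Decide whether to reboot after a runner loop."""
            if not previous_job_succeeded:
                return True
            if any(re.match(p, hostname) for p in HOSTNAME_BLACKLIST):
                return True
            if any(re.match(p, job_name) for p in JOBNAME_BLACKLIST):
                return True
            return False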

        Via blacklisting, I've been able to whittle away at breaking jobs in a controlled manner. Over time, as I/we figure out how to properly clean up after more complicated jobs I should be able to remove them from the blacklist and increase our savings.

        Why Not Use Containers?

        First of all, we have to support OSX and Windows (10-XP), where modern containers are not really an option. Second, there is a lot of technical inertia behind our buildbot-centric model (nearly a decade's worth to be precise). That said, a new container-centric approach to building and testing has been created: TaskCluster. Another big part of my work will be porting some of our current builds to that system.

        What About Windows

        If you look closely at the runner dashboard screenshots you'll notice a "WinX" legend entry, but no line. It's also not included in my cost savings estimates. The reason for this is that our Windows puppet deployment is still in beta; while runner works on Windows, I can't tweak it. For now, I've handed runner deployment off to another team so that we can at least use it for logging. For the state of that issue see: bug 1055794

        Future Plans

        Of course, continuing effort will be put into removing test types from the "blacklists," to further decrease our reboot percentage. Though, I'm also exploring some easier wins which revolve around optimizing our current suite of runner tasks: using less frequent reboots to perform expensive cleanup operations in bulk (i.e. only before a reboot), decreasing end-to-end times, etc...

        Concurrent to runner/no reboots I'm also working on containerizing Linux build jobs. If this work can be ported to tests it will sidestep the rebooting problem altogether -- something I will push to take advantage of asap.

        Trying to reverse the entropy of a machine which runs dozens of different job types in random order is a bit frustrating; but worthwhile in the end. Every increase in throughput means more money for hiring software engineers instead of purchasing tractor trailers of Mac Minis.

        February 03, 2015 05:53 PM

        January 27, 2015

        Justin Wood (Callek)

        Release Engineering does a lot…

        Hey Everyone,

        I spent a few minutes a week over the last month or two working on compiling a list of Release Engineering work areas. Included in that list is identifying which repositories we “own” and work in, as well as where these repositories are mirrored. (We have copies in hg.m.o, git.m.o and github; some live exclusively in their home.)

        All of this is happening while we transition to a more uniform and modern design style and philosophy.

        My major takeaway here is that we have A LOT of things that we do. (This list explicitly excludes repositories that are obsolete and unused.)

        So without further ado, I present our page ReleaseEngineering/Repositories

        You’ll notice a few things about this: we have a column for Mirrors and one for RoR (Repository of Record). “Committable Location” was requested by Hal and is explicitly for cases where the location we consider the RoR is not necessarily the one we allow commits to.

        The other interesting thing is we have automatic population of travis and coveralls urls/status icons. This is for free using some magic wiki templates I did.

        The other piece of note here is that the table is generated from a list of pages using “SemanticMediaWiki”, so the links to the repositories can be populated with things like “where are the docs”, “what applications use this repo”, “who are suitable reviewers”, etc. (All of those are still TODO on the releng side.)

        I’m hoping to put together a blog post at some point about how I chose to do much of this with MediaWiki. In the meantime, should any team at Mozilla find this enticing and wish to have one for themselves, much of the work I did here can be easily replicated for your team, even if you don’t need/like the multiple-repo-location magic of our table. I can help get you set up to add your own repos to the mix.

        Remember, the only fields that are necessary are a repo name, the repo location, and owner(s). The last field can even be automatically filled in by a form on your page (see the end of Release Engineering’s page for an example of that form).

        Reach out to me on IRC or E-mail (information is on my mozillians profile) if you desire this for your team and we can talk. If you don’t have a need for your team, you can stare at all the stuff Releng is doing and remember to thank one of us next time you see us. (or inquire about what we do, point contributors our way, we’re a friendly group, I promise.)

        January 27, 2015 11:11 PM

        January 22, 2015

        Armen Zambrano G. (@armenzg)

        Backed out - Pinning for Mozharness is enabled for the fx-team integration tree

        EDIT=We had to back out this change since it caused issues for PGO talos jobs. We will try again after further testing.

        Pinning for Mozharness [1] has been enabled for the fx-team integration tree.
        Nothing should be changing. This is a no-op change.

        We're still using the default mozharness repository and the "production" branch is what is being checked out. This has been enabled on Try and Ash for almost two months and all issues have been ironed out. You can tell whether a job is using pinning of Mozharness if you see "repostory_manifest.py" in its log.

        If you notice anything odd please let me know in bug 1110286.

        If by Monday we don't see anything odd happening, I would like to enable it for mozilla-central for a few days before enabling it on all trunk trees.

        Again, this is a no-op change, however, I want people to be aware of it.


        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        January 22, 2015 08:57 PM

        January 21, 2015

        Kim Moir (kmoir)

        Reminder: Releng 2015 submissions due Friday, January 23

        Just a reminder that submissions for the Releng 2015 conference are due this Friday, January 23. 

        It will be held on May 19, 2015 in Florence Italy.

        If you've done recent work in this area, we'd love to hear from you.  Please consider submitting a talk!

        In addition, if you have colleagues that work in this space that might have interesting topics to discuss at this workshop, please forward this information. I'm happy to talk to people about the submission process or possible topics if there are questions.

        Il Duomo di Firenze by ©eddi_07, Creative Commons by-nc-sa 2.0


        I'm on the committee organizing the Releng 2015 conference, which will be held on May 19, 2015 in Florence. The deadline for submitting papers is January 23, 2015.

        http://releng.polymtl.ca/RELENG2015/html/index.html

        If you have expertise in this area and want to discuss your experience, send us a talk proposal!

        Please forward this request to your colleagues and to anyone interested in these topics. If you have questions about the submission process or the discussion topics, don't hesitate to contact me.

        (Thanks Massimo for helping with the Italian translation).

        More information
        Releng 2015 web page
        Releng 2015 CFP now open

        January 21, 2015 08:36 PM

        January 16, 2015

        Nick Thomas (nthomas)

        Plans for 2015 – Revamping the Release Automation

        Mozilla’s Release Engineering team has been through several major iterations of our “release automation”, which is how we produce the bits for Firefox betas and releases. With each incarnation, the automation has become more reliable, supported more functionality, and end-to-end time has reduced. If you go back a few years to Firefox 2.0 it took several days to prepare 40 or so locales and three platforms for a release; now it’s less than half a day for 90 locales and four platforms. The last major rewrite was some time ago so it’s time to embark on a big revamp – this time we want to reduce the end-to-end time significantly.

        Currently, when a code change lands in the repository (eg mozilla-beta) a large set of compile and test jobs are started. It takes about 5 hours for the slowest platform to complete an optimized build and run the tests, in part because we’re using Profile-Guided Optimization (PGO) and need to link XUL twice. Assuming the tests have passed, or been recognized as an intermittent failure, a Release Manager will kick off the release automation. It will tag the gecko and localization repositories, and a second round of compilation will start, using the official branding and other release-specific settings. Accounting for all the other release work (localized builds, source tarballs, updates, and so on) the automation takes 10 or more hours to complete.

        The first goal of the revamp is to avoid the second round of compilation, with all the loss of time and test coverage it brings. Instead, we’re looking at ‘promoting’ the builds we’ve already done (in the sense of rank, not marketing). By making some other improvements along the way, eg fast generation of partial updates using funsize, we may be able to save as much as 50% from the current wall time. So we’ll be able to ship fixes to beta users more often than twice a week, get feedback earlier in the cycle, and be more confident about shipping a new release. It’ll help us to ship security fixes faster too.

        We’re calling this ‘Build Promotion’ for short, and you can follow progress in Bug 1118794 and dependencies.

        January 16, 2015 10:08 AM

        January 10, 2015

        Hal Wine (hwine)

        ChatOps Meetup

        ChatOps Meetup

        This last Wednesday, I went to a meetup on ChatOps organized by SF DevOps, hosted by Geekdom (who also made recordings available), and sponsored by TrueAbility.

        I had two primary goals in attending: I wanted to understand what made ChatOps special, and I wanted to see how much was applicable to my current work at Mozilla. The two presentations helped me accomplish the first. I’m still mulling over the second. (Ironically, I had to shift focus during the event to clean up a deployment-gone-wrong that was very close to one of the success stories mentioned by Dan Chuparkoff.)

        My takeaway on why chatops works is that it is less about the tooling (although modern web services make it a lot easier), and more about the process. Like a number of techniques, it appears to be more successful when teams fully embrace their vision of ChatOps, and make implementation a top priority. Success is enhanced when the tooling supports the vision, and that appears to be what all the recent buzz is about – lots of new tools, examples, and lessons learned make it easier to follow the pioneers.

        What are the key differentiators?

        Heck, many teams use irc for operational coordination. There are scripts which automate steps (some workflows can be invoked from the web even). We’ve got automated configuration, logging, dashboards, and wikis – are we doing ChatOps?

        Well, no, we aren’t.

        Here are the differences I noted:
        • ChatOps requires everyone both agreeing and committing to a single interface to all operations. (The opsbot, like hubot, lita or Err.) Technical debt (non-conforming legacy systems) will be reworked to fit into ChatOps.
        • ChatOps requires focus and discipline. There are a small number of channels (chat rooms, MUC) that have very specific uses - and folks follow that. High signal to noise ratio. (No animated gifs in the deploy channel - that’s what the lolcat channel is for.)
        • A commitment to explicitly documenting all business rules as executable code.

        What do you get for giving up all those options and flexibility? Here were the “ah ha!” concepts for me:

        1. Each ChatOps room is a “shared console” everyone can see and operate. No more screen sharing over video, or “refresh now” coordination!

        2. There is a bot which provides the “facts” about the world. One view accessible by all.

        3. The bot is also the primary way folks interact and modify the system. And it is consistent in usage across all commands. (The bot extensions perform the mapping to whatever the backend needs. The code adapts, not the human!)

        4. The bot knows all and does all:
          • Where’s the documentation?
          • How do I do X?
          • Do X!
          • What is the status of system Y?
        5. The bot is “fail safe” - you can’t bypass the rules. (If you code in a bypass, well, you loaded that foot gun!)

        Thus everything is consistent and familiar for users, which helps during those 03:00 forays into a system you aren’t as familiar with. Nirvana ensues (remember, everyone did agree to drink the koolaid above).

        Can you get there from here?

        The speaker selection was great – Dan was able to speak to the benefits of committing to ChatOps early in a startup’s life. James Fryman (from StackStorm) showed a path for migrating existing operations to a ChatOps model. That pretty much brackets the range, so yeah, it’s doable.

        The main hurdle, imo, would be getting the agreement to a total commitment! There are some tensions in deploying such a system at a highly open operation like Mozilla: ideally chat ops is open to everyone, and business rules ensure you can’t do or see anything improper. That means the bot has (somewhere) the credentials to do some very powerful operations. (Dan hopes to get their company to the “no one uses ssh, ever” point.)

        My next steps? Still thinking about it a bit – I may load Err onto my laptop and try doing all my local automation via that.

        January 10, 2015 08:00 AM

        January 09, 2015

        Chris AtLee (catlee)

        Upcoming hotness from RelEng

        To kick off the new year, I'd like to share some of the exciting projects we have underway in Release Engineering.

        Balrog

        First off we have Balrog, our next generation update server. Work on Balrog has been underway for quite some time. Last fall we switched beta users to use it. Shortly after, we did some additional load testing to see if we were ready to flip over release traffic. The load testing revealed some areas that needed optimization, which isn't surprising since almost no optimization work had been done up to that point!

        Ben and Nick added the required caching, and our subsequent load testing was a huge success. We're planning on flipping the switch to divert release users over on January 19th. \o/

        Funsize

        Next up we have Funsize. (Don't ask about the name; it's all Laura's fault). Funsize is a web service to generate partial updates between two versions of Firefox. There are a number of places where we want to generate these partial updates, so wrapping the logic up into a service makes a lot of sense, and also affords the possibility of faster generation due to caching.

        We're aiming to have nightly builds use funsize for partial update generation this quarter.

        I'd really like to see us get away from the model where the "nightly build" job is responsible for not only the builds, but generating and publishing the complete and partial updates. The problem with this is that the single job is responsible for too many deliverables, and touches too many systems. It's hard to make and test changes in isolation.

        The model we're trying to move to is where the build jobs are responsible only for generating the required binaries. It should be the responsibility of a separate system to generate partials and publish updates to users. I believe splitting up these functions into their own systems will allow us to be more flexible in how we work on changes to each piece independently.

        S3 uploads from automation

        This quarter we're also working on migrating build and test files off our aging file server infrastructure (aka "FTP", which is a bit of a misnomer...) and onto S3. All of our build and test binaries are currently uploaded and downloaded via a central file server in our data center. It doesn't make sense to do this when most of our builds and tests are being generated and consumed inside AWS now. In addition, we can get much better cost-per-GB by moving the storage to S3.

        No reboots

        Morgan has been doing awesome work with runner. One of the primary aims here is to stop rebooting build and test machines between every job. We're hoping that by not rebooting between builds, we can get a small speedup in build times since a lot of the build tree should be cached in memory already. Also, by not rebooting we can have shorter turnaround times between jobs on a single machine; we can effectively save 3-4 minutes of overhead per job by not rebooting. There's also the opportunity to move lots of machine maintenance work from inside the build/test jobs themselves to instead run before buildbot starts.

        Release build promotion

        Finally I'd like to share some ideas we have about how to radically change how we do release builds of Firefox.

        Our plan is to create a new release pipeline that works with already built binaries and "promotes" them to the release/beta channel. The release pipeline we have today creates a fresh new set of release builds that are distinct from the builds created as part of continuous integration.

        This new approach should cut the amount of time required to release nearly in half, since we only need to do one set of builds instead of two. It also has the benefit of aligning the release and continuous-integration pipelines, which should simplify a lot of our code.

        ... and much more!

        This is certainly not an exhaustive list of the things we have planned for this year. Expect to hear more from us over the coming weeks!

        January 09, 2015 06:35 PM

        Ben Hearsum (bhearsum)

        UPDATED: New update server is going live for release channel users on Tuesday, January **20th**

        (This post has been updated with the new go-live date.)

        Our new update server software (codenamed Balrog) has been in development for quite awhile now. In October of 2013 we moved Nightly and Aurora to it. This past September we moved Beta users to it. Finally, we’re ready to switch the vast majority of our users over. We’ll be doing that on the morning of Tuesday, January 20th. Just like when we switched nightly/aurora/beta over, this change should be invisible, but please file a bug or swing by #releng if you notice any issues with updates.

        Stick around if you’re interested in some of the load testing we did.


        Shortly after switching all of the Beta users to Balrog we did a load test to see if Balrog could handle the amount of traffic that the release channel would throw at it. With just 10% of the release traffic being handled, it blew up:

        We were pulling more than 150MBit/sec per web head from the database server, and saturating the CPUs completely. This caused very slow requests, to the point where many were just timing out. While we were hoping that it would just work, this wasn’t a complete surprise given that we hadn’t implemented any form of caching yet. After implementing a simple LRU cache on Balrog’s largest objects, we did another load test. Here’s what the load looked like on one web head:

        Once caching was enabled the load was practically non-existent. As we ramped up release channel traffic the load grew, but in a more or less linear (and very gradual) fashion. At around 11:35 on this graph we were serving all of the release channel traffic, and each web head was using a meager 50% of its CPU:

        I’m not sure what to call that other than winning.
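        If you're curious what "a simple LRU cache" amounts to conceptually, here is a toy illustration (not Balrog's actual code, which has its own cache layer) of memoizing an expensive blob lookup:

        from functools import lru_cache  # Python 3; shown purely for illustration

        def expensive_database_query(name):
            # Placeholder for the real query that pulls a large release blob.
            return {'name': name, 'platforms': {}}

        @lru_cache(maxsize=500)
        def get_release_blob(name):
            """Cached lookup: repeated update requests for the same release hit
            memory instead of re-reading megabytes from the database each time."""
            return expensive_database_query(name)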

        January 09, 2015 04:39 PM

        January 08, 2015

        Kim Moir (kmoir)

        Mozilla pushes - December 2014


        Here's December 2014's monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.

        Trends
        There was a low number of pushes this month.  I expect this is due to the Mozilla all-hands in Portland in early December, where we were encouraged to meet up with other teams instead of coding :-), and the holidays at the end of the month for many countries.
        As a side note, in 2014 we had a total of 124423 pushes, compared to 79233 in 2013, which represents a growth rate of 57% this year.

        Highlights
        7836 pushes
        253 pushes/day (average)
        Highest number of pushes/day: 706 pushes on Dec 17, 2014
        15.25 pushes/hour (highest)

        General Remarks
        Try had around 46% of all the pushes
        The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 23% of all of the pushes

        Records
        August 2014 was the month with most pushes (13,090  pushes)
        August 2014 has the highest pushes/day average with 422 pushes/day
        July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
        October 8, 2014 had the highest number of pushes in one day with 715 pushes 







        January 08, 2015 05:14 PM

        January 06, 2015

        Armen Zambrano G. (@armenzg)

        Tooltool fetching can now use LDAP credentials from a file

        You can now fetch tooltool files by using an authentication file.
        All you have to do is append "--authentication-file file" to your tooltool fetching command.

        This is important if you want to use automation to fetch files from tooltool on your behalf.
        This was needed to allow Android test jobs to run locally since we need to download tooltool files for it.


        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        January 06, 2015 04:45 PM

        January 05, 2015

        Armen Zambrano G. (@armenzg)

        Run Android test jobs locally

        You can now run Android test jobs on your local machine with Mozharness.

        As with any other developer-capable Mozharness script, all you have to do is add the developer config (--cfg developer_config.py) and point the script at the installer and tests you want to use.

        An example for this is:
        python scripts/android_emulator_unittest.py --cfg android/androidarm.py
        --test-suite mochitest-gl-1 --blob-upload-branch try
        --download-symbols ondemand --cfg developer_config.py
        --installer-url http://ftp.mozilla.org/pub/mozilla.org/mobile/nightly/latest-mozilla-central-android-api-9/en-US/fennec-37.0a1.en-US.android-arm.apk
        --test-url http://ftp.mozilla.org/pub/mozilla.org/mobile/nightly/latest-mozilla-central-android-api-9/en-US/fennec-37.0a1.en-US.android-arm.tests.zip


        Here's the bug where the work happened.
        Here's the documentation on how to run Mozharness as a developer.

        Please file a bug under Mozharness if you find any issues.

        Here are some other related blog posts:


        Disclaimers

        Bug 1117954 - I think that a different SDK or emulator version is needed to run Android API 10 jobs.

        I wish we ran all of our jobs in proper isolation!


        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        January 05, 2015 08:47 PM

        December 22, 2014

        Armen Zambrano G. (@armenzg)

        Run mozharness talos as a developer (Community contribution)

        Thanks to our contributor Simarpreet Singh from Waterloo, you can now run a talos job through mozharness on your local machine (bug 1078619).

        All you have to add is the following:
        --cfg developer_config.py 
        --installer-url http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-trunk/firefox-37.0a1.en-US.linux-x86_64.tar.bz2

        To read more about running Mozharness locally go here.


        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        December 22, 2014 08:10 PM

        December 11, 2014

        Kim Moir (kmoir)

        Releng 2015 CFP now open

        Florence, Italy.  Home of beautiful architecture.

        Il Duomo di Firenze by ©runner310, Creative Commons by-nc-sa 2.0


        Delicious food and drink.

        Panzanella by © Pete Carpenter, Creative Commons by-nc-sa 2.0

        Caffè ristretto by © Marcelo César Augusto Romeo, Creative Commons by-nc-sa 2.0


        And next May, release engineering :-)

        The CFP for Releng 2015 is now open.  The deadline for submissions is January 23, 2015.  It will be held on May 19, 2015 in Florence Italy and co-located with ICSE 2015.   We look forward to seeing your proposals about the exciting work you're doing in release engineering!

        If you have questions about the submission process or anything else, please contact any of the program committee members. My email is kmoir and I work at mozilla.com.

        December 11, 2014 09:00 PM

        December 09, 2014

        Armen Zambrano G. (@armenzg)

        Running Mozharness in developer mode will only prompt once for credentials

        Thanks to Mozilla's contributor kartikgupta0909 we now only have to enter LDAP credentials once when running the developer mode of Mozharness.

        He accomplished it in bug 1076172.

        Thank you Kartik!


        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        December 09, 2014 09:43 PM

        December 08, 2014

        Armen Zambrano G. (@armenzg)

        Test mozharness changes on Try

        You can now push to your own mozharness repository (even a specific branch) and have it be tested on Try.

        A few weeks ago we developed mozharness pinning (aka mozharness.json) and recently we have enabled it for Try. Read the blog post to learn how to make use of it.

        NOTE: This currently only works for desktop, mobile and b2g test jobs. More to come.
        NOTE: We only support named branches, tags or specific revisions. Do not use bookmarks, as they don't work.


        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        December 08, 2014 06:59 PM

        December 04, 2014

        Morgan Phillips (mrrrgn)

        shutdown -r never

        For the past month I've worked on achieving the effects of a reboot without actually doing one. Sort of a "virtual" reboot. This isn't a usual optimization; but in Mozilla's case it's likely to create a huge impact on performance.

        Mozilla build/test infrastructure is complex. The jobs can be expensive and messy. So messy that, for a while now, machines have been rebooted after completing tasks to ensure that environments remain fresh.

        This strategy works marvelously at preventing unnecessary failures, but it wastes a lot of resources. In particular, with reboots taking something like two minutes to complete, and at around 100k jobs per day, we burn a whopping 200,000 minutes of machine time every day. That's nearly five months - yikes!1

        Yesterday I began rolling out these "virtual" reboots for all of our Linux hosts, and it seems to be working well [edit: after a few rollbacks]. By next month I should also have it turned on for OSX and Windows machines.



        What does a "virtual" reboot look like?

        For starters [pun intended], each job requires a good amount of setup and teardown, so a sort of init system is necessary. To achieve this, a utility called runner has been created. Runner is a project that manages starting tasks in a defined order. If tasks fail, the chain can be retried, or halted. Many tasks that once lived in /etc/init.d/ are now managed by runner, including buildbot itself.



        Among runner's tasks are various scripts for cleaning up temporary files, starting/restarting services, and also a utility called cleanslate. Cleanslate resets a user's running processes to a previously recorded state.

        At boot, cleanslate takes a snapshot of all running processes; then, before each job, it kills any processes (by name) which weren't running when the system was fresh. This particular utility is key to maintaining stability and may be extended in the future to enforce other kinds of system state as well.
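        Conceptually, that behaviour looks something like the following sketch (using psutil for illustration; cleanslate's real implementation and file locations may differ):

        import json
        import psutil  # third-party; cleanslate itself may do this differently

        SNAPSHOT_FILE = '/var/tmp/cleanslate-snapshot.json'  # hypothetical path

        def take_snapshot():
            """Record the names of everything running on a freshly booted system."""
            names = sorted({p.name() for p in psutil.process_iter()})
            with open(SNAPSHOT_FILE, 'w') as f:
                json.dump(names, f)

        def reset_to_snapshot():
            """Before each job, kill (by name) anything not in the boot snapshot."""
            with open(SNAPSHOT_FILE) as f:
                allowed = set(json.load(f))
            for proc in psutil.process_iter():
                try:
                    if proc.name() not in allowed:
                        proc.kill()
                except (psutil.NoSuchProcess, psutil.AccessDenied):
                    pass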



        The end result is this:

        old work flow

        Boot + init -> Take Job -> Reboot (2-5 min)

        new work flow

        Boot + Runner -> Take Job -> Shutdown Buildslave
        (runner loops and restarts slave)



        [1] What's more, this estimate does not take into account the fact that jobs run faster on a machine that's already "warmed up."

        December 04, 2014 06:54 PM

        December 03, 2014

        Kim Moir (kmoir)

        Mozilla pushes - November 2014

        Here's November's monthly analysis of the pushes to our Mozilla development trees.  You can load the data as an HTML page or as a json file.

        Trends
        Not a record-breaking month; in fact, we are down over 2000 pushes since last month.

        Highlights
        10376 pushes
        346 pushes/day (average)
        Highest number of pushes/day: 539 pushes on November 12
        17.7 pushes/hour (average)

        General Remarks
        Try had around 38% of all the pushes, and gaia-try has about 30%. The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 23% of all the pushes.

        Records
        August 2014 was the month with most pushes (13,090  pushes)
        August 2014 has the highest pushes/day average with 422 pushes/day
        July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
        October 8, 2014 had the highest number of pushes in one day with 715 pushes    







        December 03, 2014 09:41 PM

        November 24, 2014

        Armen Zambrano G. (@armenzg)

        Pinning mozharness from in-tree (aka mozharness.json)

        Since mozharness came around 2-3 years ago, we have had the same issue: we test a mozharness change against the trunk trees, land it, and get it backed out because we regress one of the older release branches.

        This is due to the nature of the mozharness setup: once a change is landed, all jobs start running the same code, and it does not matter which branch a job is running on.

        I have recently landed some code, now active on Ash (and soon on Try), that will read a manifest file pointing your jobs to the right mozharness repository and revision. We call this process "pinning mozharness". In other words, we fix an external factor of our job execution.

        This will allow you to point your Try pushes to your own mozharness repository.

        In order to pin your jobs to a repository/revision of mozharness, you have to change a file called mozharness.json, which indicates two values: which mozharness repository to use and which revision (or branch) to check out.


        This is a similar concept to the one talos.json introduced, which locks every job to a specific revision of talos. The original version of it landed in 2011.

        Even though we have had a similar concept since 2011, that doesn't mean it was as easy to make it happen for mozharness. Let me explain a bit why.

        Coming up:
        • Enable on Try
        • Free up Ash and Cypress
          • They have been used to test custom mozharness patches and the default branch of Mozharness (pre-production)
        Long term:
        • Enable the feature on all remaining Gecko trees
          • We would like to see this run at scale for a bit before rolling it out
          • This will allow mozharness changes to ride the trains
        If you are curious, the patches are in bug 791924.

        Thanks to Rail for all his patch reviews and to Jordan for sparking me to tackle this.



        Creative Commons License
        This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

        November 24, 2014 05:35 PM