Planet Release Engineering

August 26, 2014

Chris AtLee (catlee)

Gotta Cache 'Em All

TOO MUCH TRAFFIC!!!!

Waaaaaaay back in February we identified overall network bandwidth as a cause of job failures on TBPL. We were pushing too much traffic over our VPN link between Mozilla's datacentre and AWS. Since then we've been working on a few approaches to cope with the increased traffic while at the same time reducing our overall network load. Most recently we've deployed HTTP caches inside each AWS region.

Network traffic from January to August 2014

The answer - cache all the things!

Obligatory XKCD

Caching build artifacts

The primary target for caching was downloads of build/test/symbol packages by test machines from file servers. These packages are generated by the build machines and uploaded to various file servers. The same packages are then downloaded many times by different machines running tests. This was a perfect candidate for caching, since the same files were being requested by many different hosts in a relatively short timespan.

Caching tooltool downloads

Tooltool is a simple system RelEng uses to distribute static assets to build/test machines. While the machines do maintain a local cache of files, the caches are often empty because the machines are newly created in AWS. Having the files in local HTTP caches speeds up transfer times and decreases network load.
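The caches work by prefixing download URLs with an in-region proxy host. A rough sketch of that idea (the cache hostname below is made up; the real client logic lives in the build-proxxy repo):

```shell
# Sketch of the URL-prefixing idea behind the regional caches.
# "proxxy1.use1.example.com" is a placeholder for a per-region cache host.
cache_host="proxxy1.use1.example.com"

with_cache_prefix() {
  # Rewrite http://ftp.mozilla.org/path -> http://$cache_host/ftp.mozilla.org/path
  url="$1"
  echo "http://${cache_host}/${url#http://}"
}

url="http://ftp.mozilla.org/pub/firefox/try-builds/build.tests.zip"
with_cache_prefix "$url"
# A downloader would then try the cache first and fall back to the origin:
#   curl -fsSL "$(with_cache_prefix "$url")" || curl -fsSL "$url"
```

Because many test machines request the same package within a short window, the first request warms the cache and the rest are served in-region instead of crossing the VPN link.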

Results so far - 50% decrease in bandwidth

Initial deployment was completed on August 8th (end of week 32 of 2014). You can see by the graph above that we've cut our bandwidth by about 50%!

What's next?

There are a few more low hanging fruit for caching. We have internal pypi repositories that could benefit from caches. There's a long tail of other miscellaneous downloads that could be cached as well.

There are other improvements we can make to reduce bandwidth as well, such as moving uploads from build machines outside the VPN tunnel, or perhaps directly to S3. Additionally, a big source of network traffic is the signing of various packages (gpg signatures, MAR files, etc.); we're looking at ways to do that more efficiently. I'd love to investigate more efficient ways of compressing or transferring build artifacts overall; there is a ton of duplication between the build and test packages across different platforms and even between different pushes.

I want to know MOAR!

Great! As always, all our work has been tracked in a bug, and worked out in the open. The bug for this project is 1017759. The source code lives in https://github.com/mozilla/build-proxxy/, and we have some basic documentation available on our wiki. If this kind of work excites you, we're hiring!

Big thanks to George Miroshnykov for his work on developing proxxy.

August 26, 2014 02:21 PM

August 18, 2014

Jordan Lund (jlund)

This week in Releng - Aug 11th 2014

Completed work (resolution is 'FIXED'):


In progress work (unresolved and not assigned to nobody):

August 18, 2014 06:38 AM

August 12, 2014

Ben Hearsum (bhearsum)

Upcoming changes to Mac package layout, signing

Apple recently announced changes to how OS X applications must be packaged and signed in order for them to function correctly on OS X 10.9.5 and 10.10. The tl;dr version of this is “only mach-O binaries may live in .app/Contents/MacOS, and signing must be done on 10.9 or later”. Without any changes, future versions of Firefox will cease to function out-of-the-box on OS X 10.9.5 and 10.10. We do not have a release date for either of these OS X versions yet.

Changes required:
* Move all non-mach-O files out of .app/Contents/MacOS. Most of these will move to .app/Contents/Resources, but files that could legitimately change at runtime (eg: everything in defaults/) will move to .app/MozResources (which can be modified without breaking the signature): https://bugzilla.mozilla.org/showdependencytree.cgi?id=1046906&hide_resolved=1. This work is in progress, but no patches are ready yet.
* Add new features to the client side update code to allow partner repacks to continue to work. (https://bugzilla.mozilla.org/show_bug.cgi?id=1048921)
* Create and use 10.9 signing servers for these new-style apps. We still need to use our existing 10.6 signing servers for any builds without these changes. (https://bugzilla.mozilla.org/show_bug.cgi?id=1046749 and https://bugzilla.mozilla.org/show_bug.cgi?id=1049595)
* Update signing server code to support new v2 signatures.

Timeline:
We are intending to ship the required changes with Gecko 34, which ships on November 25th, 2014. The changes required are very invasive, and we don’t feel that they can be safely backported to any earlier version quickly enough without major risk of regressions. We are still looking at whether or not we’ll backport to ESR 31. To this end, we’ve asked that Apple whitelist Firefox and Thunderbird versions that will not have the necessary changes in them. We’re still working with them to confirm whether or not this can happen.

This has been cross posted a few places – please send all follow-ups to the mozilla.dev.platform newsgroup.

August 12, 2014 05:05 PM

August 11, 2014

Jordan Lund (jlund)

This Week In Releng - Aug 4th, 2014

Major Highlights:

Completed work (resolution is 'FIXED'):

In progress work (unresolved and not assigned to nobody):

August 11, 2014 01:09 AM

August 08, 2014

Kim Moir (kmoir)

Mozilla pushes - July 2014

Here's the July 2014 monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.
 
Trends
As with every month for a while now, we had a new record number of pushes. In reality, given that July is one day longer than June, the numbers are quite similar.

Highlights


General remarks
Try keeps on having around 38% of all the pushes. Gaia-Try is in second place with around 31% of pushes. The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 22% of all the pushes.

Records 
July 2014 was the month with the most pushes (12,755 pushes)
June 2014 has the highest pushes/day average with 418 pushes/day
July 2014 has the highest "pushes-per-hour" average with 23.51 pushes/hour
June 4th, 2014 had the highest number of pushes in one day with 662 pushes
 

August 08, 2014 06:16 PM

August 07, 2014

Kim Moir (kmoir)

Scaling mobile testing on AWS

Running tests for Android at Mozilla has typically meant running on reference devices: physical devices that run jobs on our continuous integration farm via test harnesses. However, this leads to the same problem we have for other tests that run on bare metal: we can't scale up our capacity without buying new devices, racking them, configuring them for the network and updating our configurations. In addition, reference cards, rack mounted or not, are rather delicate creatures and have higher retry rates (tests fail due to infrastructure issues and need to be rerun) than tests running on emulators (an Android emulator in a VM on bare metal or in the cloud).

Do Android's Dream of Electric Sheep?  ©Bill McIntyre, Creative Commons by-nc-sa 2.0
Recently, we started running Android 2.3 tests on emulators in AWS. This works well for unit tests (correctness tests). It's not really appropriate for performance tests, but that's another story. The impetus behind this change was to let us decommission Tegras, the reference devices we used for running Android 2.2 tests.

We run many Linux-based tests, including Android emulators, on AWS spot instances. Spot instances are AWS excess capacity that you can bid on. If someone outbids the price you have set for your spot instance, your instance can be terminated. But that's okay, because we retry jobs if they fail for infrastructure reasons. The overall percentage of spot instances that are terminated is quite small. The huge advantage of spot instances is price: they are much cheaper than on-demand instances, which has allowed us to increase our capacity while continuing to reduce our AWS bill.

We have a wide variety of unit tests that run on emulators for mobile on AWS. We encountered an issue where some of the tests wouldn't run on the default instance type (m1.medium) that we use for our spot instances. Given the number of jobs we run, we want to run on the cheapest AWS instance type where the tests will complete successfully. At the time we first tested it, we couldn't find an instance type where certain CPU/memory-intensive tests would run. So when I first enabled Android 2.3 tests on emulators, I separated the tests so that some would run on AWS spot instances and the ones that needed a more powerful machine would run on our in-house Linux capacity. But this change consumed all of the capacity of that pool and we had a very high number of pending jobs in it. This meant that people had to wait a long time for their test results. Not good.

To reduce the pending counts, we needed to either buy more in-house Linux capacity, run only a selected subset of the tests that need more resources, or find a new AWS instance type where they would complete successfully. Geoff from the A-Team re-ran the tests on the c3.xlarge instance type he had tried before, and now they seemed to work. In his earlier work the tests did not complete successfully on this instance type, and we are unsure why. One of the things about working with AWS is that we don't have a window into the bugs they fix on their end. So this particular instance type didn't work before, but it does now.

The next step for me was to create a new AMI (Amazon Machine Image) to serve as the "golden" version for instances created in this pool. Previously, we used Puppet to configure our AWS test machines, but now we just regenerate the AMI every night via cron, and this is the version that's instantiated. The AMI was a copy of our existing Ubuntu64 image, but configured to run on the c3.xlarge instance type instead of m1.medium. This was a bit tricky because I had to exclude regions where the c3.xlarge instance type was not available. For redundancy (to still have capacity if an entire region goes down) and cost (some regions are cheaper than others), we run instances in multiple AWS regions.

Once I had the new AMI up to serve as the template for our new slave class, I created a slave from the AMI and verified the tests we planned to migrate on my staging server. I also enabled two new Linux64 buildbot masters in AWS to service these new slaves, one in us-east-1 and one in us-west-2. When enabling a new pool of test machines, it's always good to look at the load on the current buildbot masters and see if additional masters are needed, so the current masters aren't overwhelmed with too many slaves attached.

After the tests were all green, I modified our configs to run this subset of tests on a branch (ash), enabled the slave platform in Puppet and added a pool of devices to this slave platform in our production configs. After the reconfig deployed these changes into production, I landed a regular expression in watch_pending.cfg so that the new tst-emulator64-spot pool of machines would be allocated to the subset of tests and branch I had enabled them on. The watch_pending.py script watches the number of pending jobs on AWS and creates instances as required. We also have scripts to terminate or stop idle instances when we don't need them anymore. Why pay for machines when you don't need them now? After the tests ran successfully on ash, I enabled running them on the other relevant branches.

Royal Border Bridge.  Also, release engineers love to see green builds and tests.  ©Jonathan Combe, Creative Commons by-nc-sa 2.0
The end result is that some Android 2.3 tests, such as mochitests, run on m1.medium (tst-linux64-spot) instances.



And some Android 2.3 tests, such as crashtests, run on c3.xlarge (tst-emulator64-spot) instances.

 

In enabling this slave class within our configs, we were also able to reuse it for some b2g tests that faced the same problem: they needed a more powerful instance type for the tests to complete.

Lessons learned:
Use the minimum (cheapest) instance type required to complete your tests
As usual, test on a branch before full deployment
Scaling mobile tests doesn't mean more racks of reference cards

Future work:
Bug 1047467 c3.xlarge instance types are expensive, let's test running those tests on a range of instance types that are cheaper

Further reading:
AWS instance types 
Chris Atlee wrote about how we Now Use AWS Spot Instances for Tests
Taras Glek wrote How Mozilla Amazon EC2 Usage Got 15X Cheaper in 8 months
Rail Aliiev http://rail.merail.ca/posts/firefox-builds-are-way-cheaper-now.html 
Bug 980519 Experiment with other instance types for Android 2.3 jobs 
Bug 1024091 Address high pending count in in-house Linux64 test pool 
Bug 1028293 Increase Android 2.3 mochitest chunks, for aws 
Bug 1032268 Experiment with c3.xlarge for Android 2.3 jobs
Bug 1035863 Add two new Linux64 masters to accommodate new emulator slaves
Bug 1034055 Implement c3.xlarge slave class for Linux64 test spot instances
Bug 1031083 Buildbot changes to run selected b2g tests on c3.xlarge
Bug 1047467 c3.xlarge instance types are expensive, let's try running those tests on a range of instance types that are cheaper

August 07, 2014 06:24 PM

August 04, 2014

Jordan Lund (jlund)

This Week In Releng - July 28th, 2014

Major Highlights:

Completed Work (marked as resolved):

In progress work (unresolved and not assigned to nobody):

August 04, 2014 04:22 PM

July 28, 2014

Kim Moir (kmoir)

2014 USENIX Release Engineering Summit CFP now open

The CFP for the 2014 Release Engineering summit (Western edition) is now open.  The deadline for submissions is September 5, 2014 and speakers will be notified by September 19, 2014.  The program will be announced in late September.  This one day summit on all things release engineering will be held in concert with LISA, in Seattle on November 10, 2014. 

Seattle skyline © Howard Ignatius, https://flic.kr/p/6tQ3H Creative Commons by-nc-sa 2.0


From the CFP


"Suggestions for topics include (but are not limited to):
URES '14 West is looking for relevant and engaging speakers and workshop facilitators for our event on November 10, 2014, in Seattle, WA. URES brings together people from all areas of release engineering—release engineers, developers, managers, site reliability engineers, and others—to identify and help propose solutions for the most difficult problems in release engineering today."

War and horror stories. I like to see that in a CFP. Stories about how you overcame problems with infrastructure and tooling to ship software are the best kind. They make people laugh. Maybe cry, as they realize they are currently living in that situation. Good times. Also, I think talks around scaling high-volume continuous integration farms will be interesting. Scaling issues are a lot of fun and expose many issues you don't see when you're only running a few builds a day.

If you have any questions surrounding the CFP, I'm happy to help as I'm on the program committee.   (my irc nick is kmoir (#releng) as is my email id at mozilla.com)

July 28, 2014 09:28 PM

July 25, 2014

Aki Sasaki (aki)

on leaving mozilla

Today's my last day at Mozilla. It wasn't an easy decision to move on; this is the best team I've been a part of in my career. And working at a company with such idealistic principles and the capacity to make a difference has been a privilege.

Looking back at the past five-and-three-quarter years:

I will stay a Mozillian, and I'm looking forward to see where we can go from here!




July 25, 2014 07:26 PM

July 18, 2014

Kim Moir (kmoir)

Reminder: Release Engineering Special Issue submission deadline is August 1, 2014

Just a friendly reminder that the deadline for the Release Engineering Special Issue is August 1, 2014. If you have any questions about the submission process or a topic you'd like to write about, the guest editors, including myself, are happy to help you!

July 18, 2014 10:03 PM

Mozilla pushes - June 2014

Here's June 2014's analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.

Trends
This was another record-breaking month, with a total of 12,534 pushes. As a note of interest, this is over double the number of pushes we had in June 2013. So big kudos to everyone who helped us scale our infrastructure and tooling. (Actually, we had 6,433 pushes in April 2013, which would make this slightly less than double our previous peak, because June 2013 was a bit of a dip. But still impressive :-)

Highlights

General Remarks
The introduction of Gaia-try in April has been very popular and comprised around 30% of pushes in June compared to 29% last month.
The Try branch itself consisted of around 38% of pushes.
The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 21% of all the pushes, compared to 22% in the previous month.

Records
June 2014 was the month with the most pushes (12,534 pushes)
June 2014 has the highest pushes/day average with 418 pushes/day
June 2014 has the highest "pushes-per-hour" average with 23.17 pushes/hour
June 4th, 2014 had the highest number of pushes in one day with 662 pushes





July 18, 2014 09:46 PM

Massimo Gerva (mgerva)

apache rewrite rules

problem:

always serve the content from an external web server unless the content is available locally:

RewriteEngine on
# -U tests whether the URI would resolve via an internal subrequest;
# if it would not (i.e. the file is not available locally)...
RewriteCond %{REQUEST_URI} !-U
# ...send the request to the external server instead
RewriteRule ^(.+) http://example.com/$1

thanks mod_rewrite!


July 18, 2014 06:53 PM

July 15, 2014

Armen Zambrano G. (@armenzg)

Developing with GitHub and remote branches

I have recently started contributing using Git by using GitHub for the Firefox OS certification suite.

It has been interesting switching from Mercurial to Git. I honestly believed it would be more straightforward, but I have had to re-read things again and again until the new ways sank in.

jgraham shared some notes with me (thanks!) about what his workflow looks like, and I want to document it for my own sake and perhaps yours:
git clone git@github.com:mozilla-b2g/fxos-certsuite.git

# Time passes

# To develop something on master
# Pull in all the new commits from master

git fetch origin

# Create a new branch (this will track master from origin,
# which we don't really want, but that will be fixed later)

git checkout -b my_new_thing origin/master

# Edit some stuff

# Stage it and then commit the work

git add -p
git commit -m "New awesomeness"

# Push the work to a remote branch
git push --set-upstream origin HEAD:jgraham/my_new_thing

# Go to the GH UI and start a pull request

# Fix some review issues
git add -p
git commit -m "Fix review issues" # or use --fixup

# Push the new commits
git push

# Finally, the review is accepted
# We could rebase at this point, however,
# we tend to use the Merge button in the GH UI
# Working off a different branch is basically the same,
# but you replace "master" with the name of the branch you are working off.


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

July 15, 2014 09:04 PM

July 14, 2014

Massimo Gerva (mgerva)

bash magic

I found this snippet in one of our startup scripts:

<command>
ret=$?
return $?

Our scripts worked fine for months, but then some random errors appeared. The problem with the above code is that it will always return 0.

ret (an unused variable) stores the exit code of <command>, but then the script returns $?.

That second $? refers to the exit status of the variable assignment, which is always 0, not the exit code of <command>.
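A minimal reproduction of the failure mode (function names are made up for illustration):

```shell
# broken() always returns 0: the final $? is the exit status of the
# assignment `ret=$?`, and an assignment always succeeds.
broken() {
  false        # exits with status 1
  ret=$?       # ret=1, but the assignment itself exits with 0
  return $?    # so this returns 0, not 1
}

# fixed() returns $? before anything else can clobber it.
fixed() {
  false
  return $?    # $? still holds false's exit code: 1
}

broken && echo "broken returned 0 despite the failure"
fixed || echo "fixed correctly returned non-zero"
```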

Here is an updated (and working) version of the code:

<command>
return $?

note to self: remember to remove all unused bash variables.


July 14, 2014 11:33 PM

July 11, 2014

Armen Zambrano G. (@armenzg)

Introducing Http authentication for Mozharness.

A while ago, I asked a colleague (you know who you are! :P) of mine how to run a specific type of test job on tbpl on my local machine and he told me with a smirk, "With mozharness!"

I wanted to punch him (HR: nothing to see here! This is not a literal punch, a figurative one), however he was right. He had good reason to say that, and I knew why he was smiling. I had to close my mouth and take it.

Here's the explanation of why he said that: most jobs running inside of tbpl are driven by Mozharness, however they're optimized to run within the protected network of Release Engineering. This is good. This is safe. This is sound. However, when we try to reproduce a job outside of the Releng network, it becomes problematic for various reasons.

Many times we have had to guide people who are unfamiliar with mozharness as they try to run it locally with success. (Docs: How to run Mozharness as a developer). However, on other occasions when it comes to binaries stored on private web hosts, it becomes necessary to loan a machine. A loaned machine can reach those files through internal domains since it is hosted within the Releng network.

Today, I have landed a piece of code that does two things:
This change, plus the recently-introduced developer configs for Mozharness, makes it much easier to run mozharness outside of continuous integration infrastructure.

I hope this will help developers have a better experience reproducing the environments used in the tbpl infrastructure. One less reason to loan a machine!

This makes me *very* happy (see below) since I don't have VPN access anymore.




Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

July 11, 2014 07:42 PM

Using developer configs for Mozharness

To help developers run mozharness, I have landed some configs that can be appended to the command appearing on tbpl.

All you have to do is:
  • Find the mozharness script line in a log from tbpl (search for "script/scripts")
  • Look for the --cfg parameter and add it a second time, with a config ending in "_dev.py"
    • e.g. --cfg android/androidarm.py --cfg android/androidarm_dev.py
  • Also add the --installer-url and --test-url parameters as explained in the docs
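Putting those steps together, an invocation might look like the following (the script path and URLs here are illustrative placeholders, not copied from a real log; use the values from your own tbpl log):

```shell
# Illustrative only: replace the script path and URLs with the ones
# from the "scripts/scripts" line of your tbpl log.
python scripts/scripts/android_emulator_unittest.py \
  --cfg android/androidarm.py \
  --cfg android/androidarm_dev.py \
  --installer-url https://ftp.mozilla.org/pub/mobile/tinderbox-builds/example/fennec.apk \
  --test-url https://ftp.mozilla.org/pub/mobile/tinderbox-builds/example/fennec.tests.zip
```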
Developer configs have these things in common:
  • They have the same name as the production one but instead end in "_dev.py"
  • They overwrite the "exes" dict with an empty dict
    • This allows you to use the binaries in your personal $PATH
  • They overwrite the "default_actions" list
    • The main reason is to remove the action called read-buildbot-configs
  • They fix URLs to point to the right public reachable domains 
Here are the currently available developer configs:
You can help by adding more of them!















Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

July 11, 2014 07:15 PM

July 04, 2014

Kim Moir (kmoir)

This week in Mozilla Releng - July 4, 2014

This is a special double issue of this week in releng. I was so busy in the last week that I didn't get a chance to post this last week.  Despite the fireworks for Canada Day and Independence Day,  Mozilla release engineering managed to close some bugs. 

Major highlights:
 Completed work (resolution is 'FIXED'):
In progress work (unresolved and not assigned to nobody):

July 04, 2014 09:39 PM

July 03, 2014

Armen Zambrano G. (@armenzg)

Tbpl's blobber uploads are now discoverable

What is blobber? Blobber is a server- and client-side set of tools that allow Releng's test infrastructure to upload files without needing ssh keys deployed on the test machines.

This is useful since it allows uploads of screenshots, crashdumps and any other file needed to debug what failed on a test job.

Up until now, if you wanted your scripts to determine which files were uploaded in a job, you would have to download the log and parse it to find the TinderboxPrint lines for Blobber uploads, e.g.
15:21:18 INFO - (blobuploader) - INFO - TinderboxPrint: Uploaded 70485077-b08a-4530-8d4b-c85b0d6f9bc7.dmp to http://mozilla-releng-blobs.s3.amazonaws.com/blobs/mozilla-inbound/sha512/5778e0be8288fe8c91ab69dd9c2b4fbcc00d0ccad4d3a8bd78d3abe681af13c664bd7c57705822a5585655e96ebd999b0649d7b5049fee1bd75a410ae6ee55af
Now, you can look for the set of files uploaded by looking at the uploaded_files.json that we upload at the end of all uploads. This can be discovered by inspecting the buildjson files or by listening to the pulse events. The key used is called "blobber_manifest_url" e.g.
"blobber_manifest_url": "http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/39e400b6b94ac838b4e271ed61a893426371990f1d0cc45a7a5312d495cfdb485a1866d7b8012266621b4ee4df0cf9aa7d0f6d0e947ff63785543d80962aaf9b",
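As a sketch of what consuming this looks like (the JSON fragment is the example above; the sed-based extraction is illustrative, and a real consumer would use a proper JSON parser):

```shell
# Extract the blobber manifest URL from a buildjson properties fragment.
# Real buildjson files contain many more properties than this sample.
props='{"blobber_manifest_url": "http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/39e400b6b94ac838b4e271ed61a893426371990f1d0cc45a7a5312d495cfdb485a1866d7b8012266621b4ee4df0cf9aa7d0f6d0e947ff63785543d80962aaf9b"}'

manifest_url=$(printf '%s' "$props" \
  | sed -n 's/.*"blobber_manifest_url": *"\([^"]*\)".*/\1/p')

echo "$manifest_url"
# A real script would then fetch the manifest:
#   curl -s "$manifest_url"   # returns uploaded_files.json
```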
In the future, this feature will be useful when we start uploading structured logs. It will help us not to download logs to extract meta-data about the jobs!

No, your uploads are not this ugly
This work was completed in bug 986112. Thanks to aki, catlee, mtabara and rail for helping me get this out the door. You can read more about Blobber by visiting: "Blobber is live - upload ALL the things!" and "Blobber - local environment setup".


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

July 03, 2014 12:02 PM

July 02, 2014

Hal Wine (hwine)

2014-06 try server update

2014-06 try server update

Chatting with Aki the other day, I realized that word of all the wonderful improvements to the try server has not been publicized. A lot of folks have done a lot of work to make things better - here's a brief summary of the good news.

Before:
Try server pushes could appear to take up to 4 hours, during which time others would be locked out.
Now:
The major time taker has been found and eliminated: ancestor processing. And we understand the remaining occasional slowdowns are related to caching. Fortunately, there are some steps that developers can take now to minimize delays.

What folks can do to help

The biggest remaining slowdown is caused by rebuilding the cache. The cache is only invalidated if the push is interrupted. If you can avoid causing a disconnect until your push is complete, that helps everyone! So, please, no Ctrl-C during the push! The other changes should address the long wait times you used to see.

What has been done to infrastructure

There has long been a belief that many of our hg problems, especially on try, came from the fact that we had r/w NFS mounts of the repositories across multiple machines (both hgssh servers & hgweb servers). For various historical reasons, a large part of this was due to the way pushlog was implemented.

Ben did a lot of work to get sqlite off NFS, and much of the work to synchronize the repositories without NFS has been completed.

What has been done to our hooks

All along, folks have been discussing our try server performance issues with the hg developers. A key confusing issue was that we saw processes "hang" for VERY long times (45 min or more) without making a system call. Kendall managed to observe an hg process in such an infinite-looking-loop-that-eventually-terminated a few times. A stack trace would show it was looking up an hg ancestor without making system calls or library accesses. In discussions, this confused the hg team, as they did not know of any reason that ancestor code should be invoked during a push.

Thanks to lots of debugging help from glandium one evening, we found and disabled a local hook that invoked the ancestor function on every commit to try. \o/ team work!

Caching – the remaining problem

With the ancestor-invoking-hook disabled, we still saw some longish periods of time where we couldn’t explain why pushes to try appeared hung. Granted it was a much shorter time, and always self corrected, but it was still puzzling.

A number of our old theories, such as “too many heads” were discounted by hg developers as both (a) we didn’t have that many heads, and (b) lots of heads shouldn’t be a significant issue – hg wants to support even more heads than we have on try.

Greg did a wonderful bit of sleuthing to find the impact of ^C during push. Our current belief is once the caching is fixed upstream, we’ll be in a pretty good spot. (Especially with the inclusion of some performance optimizations also possible with the new cache-fixed version.)

What is coming next

To take advantage of all the good stuff upstream Hg versions have, including the bug fixes we want, we’re going to be moving towards removing roadblocks to staying closer to the tip. Historically, we had some issues due to http header sizes and load balancers; ancient python or hg client versions; and similar. The client issues have been addressed, and a proper testing/staging environment is on the horizon.

There are a few competing priorities, so I’m not going to predict a completion date. But I’m positive the future is coming. I hope you have a glimpse into that as well.

July 02, 2014 07:00 AM

July 01, 2014

Armen Zambrano G. (@armenzg)

Down Memory Lane

It was cool to find an article from "The Senecan" which talks about how, through Seneca, Lukas and I got involved with and hired by Mozilla. Here's the article.



Here's an excerpt:
From Mozilla volunteers to software developers 
It pays to volunteer for Mozilla, at least it did for a pair of Seneca Software Development students. 
Armen Zambrano and Lukas Sebastian Blakk are still months away from graduating, but that hasn't stopped the creators behind the popular web browser Firefox from hiring them. 
When they are not in class learning, the Senecans will be doing a wide range of software work on the company’s browser including quality testing and writing code. “Being able to work on real code, with real developers has been invaluable,” says Lukas. “I came here to start a new career as soon as school is done, and thanks to the College’s partnership with Mozilla I've actually started it while still in school. I feel like I have a head start on the path I've chosen.”  
Firefox is a free open source web browser that can...



Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

July 01, 2014 05:58 PM

June 30, 2014

Nick Thomas (nthomas)

Keeping track of buildbot usage

Mozilla Release Engineering provides some simple trending of the Buildbot continuous integration system, which can be useful to check how many jobs are currently running versus pending. There are graphs of the last 24 hours broken out in various ways – for example compilation separate from tests, compilation on try and everything else. This data also feeds into the pending queue on trychooser.

Until recently the mapping of job name to machine pool was out of date, due to our rapid growth for b2g and into Amazon's AWS, so the graphs were more misleading than useful. This has now been corrected and I'm working on making sure it stays up to date automatically.

June 30, 2014 04:31 AM

June 24, 2014

Chris AtLee (catlee)

B2G now building using unified sources

Last week, with the help of Ehsan and John, we finally enabled unified source builds for B2G devices.

As a result we're building device builds approximately 40% faster than before.

Between June 12th and June 17th, 50% of our successful JB emulator builds on mozilla-inbound finished in 97 minutes or less. Using unified sources for these builds reduced the 50th percentile of build times down to 60 minutes (from June 19th to June 24th).

To mitigate the risks of changes landing that break non-unified builds, we're also doing periodic non-unified builds for these devices.

As usual, all our work here was done in the open. If you're interested, read along in bug 950676, and bug 942167.

Do you enjoy building, debugging and optimizing build, test & release pipelines? Great, because we're hiring!

June 24, 2014 07:23 PM

June 20, 2014

Kim Moir (kmoir)

Introducing Mozilla Releng's summer interns

The Mozilla Release Engineering team recently welcomed three interns to our team for the summer.

Ian Connolly is a student at Trinity College in Dublin. This is his first term with Mozilla and he's working on preflight slave tasks and an example project for Releng API.
Andhad Jai Singh is a student at Indian Institute of Technology Hyderabad.  This is his second term working at Mozilla; he was a Google Summer of Code student with the A-team last year.  This term he's working on generating partial updates on request.
John Zeller is also a returning student and studies at Oregon State University.  He previously had a work term with Mozilla releng and also worked during the past school term as a student worker implementing Mozilla Releng apps in Docker. This term he'll work on updating our ship-it application so that release automation updates it more frequently, letting us see the state of a release as it progresses, as well as integrating post-release tasks.

 
View from Mozilla San Francisco Office

Please drop by and say hello to them if you're in our San Francisco office.  Or say hello to them in #releng - their irc nicknames are ianconnolly, ffledgling and zeller respectively.

Welcome!

June 20, 2014 09:24 PM

This week in Mozilla Releng - June 20, 2014

Ben is away for the next few Fridays, so I'll be covering this blog post for the next couple of weeks.

Major highlights:


Completed work (resolution is 'FIXED'):
In progress work (unresolved and not assigned to nobody):

June 20, 2014 09:23 PM

Armen Zambrano G. (@armenzg)

My first A-team project: install all the tests!


As a welcoming bug to the A-team I had to deal with changing what tests get packaged.
The goal was to include all tests in the tests.zip regardless of whether they are marked as disabled in the test manifests.

Changing the packaging was not too difficult, as I already had pointers from jgriffin; the problem came with the runners.
The B2G emulator and desktop mochitest runners did not read the manifests; instead they ran all the tests that came inside the tests.zip (even disabled ones).

Unfortunately for me, the mochitest runner code is very old and it was hard to figure out how to make it work as cleanly as possible. I made a lot of mistakes and landed it incorrectly twice (an improper try landing, and I lost my good patch somewhere) - sorry Ryan!

After a lot of tweaking, reviews from jmaher, and help from ted & ahal, it landed last week.

For more details you can read bug 989583.

PS: Using trigger_arbitrary_builds.py was priceless for speeding up my development.


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

June 20, 2014 08:06 PM

June 19, 2014

John Zeller (zeller)

Tupperware: Mozilla apps in Docker!

Announcing Tupperware, a setup for Mozilla apps in Docker! Tupperware is portable, reusable, and containerized. But unlike typical tupperware, please do not put it in the Microwave.

Screen Shot 2014-06-18 at 2.20.04 PM

Why?

This is a project born out of a need to lower the barriers to entry for new contributors to Release Engineering (RelEng) maintained apps and services. Historically, RelEng has had greater difficulty attracting community contributors than other parts of Mozilla, due in large part to how much knowledge is needed to get going in the first place. For a new contributor, it can be quite overwhelming to jump into any number of the code bases that RelEng maintains, which often leads to quickly losing that new contributor out of exasperation. Beyond new contributors, Tupperware is also great for experienced contributors, helping keep an unpolluted development environment and making it easy to test patches.

What?

Currently Tupperware includes the following Mozilla apps:

 - BuildAPI – a Pylons project used by RelEng to surface information collected from two databases that our buildbot masters update as they run jobs.
 - BuildBot – a job (read: builds and tests) scheduling system that queues/executes jobs when the required resources are available and reports the results.

Dependency apps currently included:
 - RabbitMQ – a messaging queue used by RelEng apps and services
 - MySQL – Forked from orchardup/mysql

How?

Vagrant is used as a quick and easy way to provision the docker apps and make the setup truly plug n' play. The current setup only has a single Vagrantfile which launches BuildAPI and BuildBot, with their dependency apps RabbitMQ and MySQL.

How to run:
 - Install Vagrant 1.6.3
 - hg clone https://hg.mozilla.org/build/tupperware/ && cd tupperware && vagrant up (takes >10 minutes the first time)

Where to see apps:
 - BuildAPI: http://127.0.0.1:8888/
 - BuildBot: http://127.0.0.1:8000/
 - RabbitMQ Management: http://127.0.0.1:15672/

Troubleshooting tips are available in the Tupperware README.

What's Next?

Now that Tupperware is out there, it's open to contributors! The setup does not need to stay solely usable for RelEng apps and services. So please submit bugs to add new ones! There are a few ideas for adding functionality to Tupperware already:

Have ideas? Submit a bug!

June 19, 2014 12:00 AM

June 16, 2014

Ben Hearsum (bhearsum)

June 17th Nightly/Aurora updates of Firefox, Fennec, and Thunderbird will be slightly delayed

As part of the ongoing work to move our Betas and Release builds to our new update server, I’ll be landing a fairly invasive change to it today. Because it requires a new schema for its data, updates will be slightly delayed while the data repopulates in the new format as the nightlies stream in. While that’s happening, updates will continue to point at the builds from today (June 16th).

Once bug 1026070 is fixed, we will be able to do these sorts of upgrades without any delay to users.

June 16, 2014 07:05 PM

How to not get spammed by Bugzilla

Bugmail is a running joke at Mozilla. Nearly everyone I know that works with Bugzilla (especially engineers) complains about the amount of bugmail they get. I too suffered from this problem for years, but with some tweaks to preferences and workflow, this problem can be solved. Here’s how I do it:

E-mail preferences

Here’s what my full e-mail settings look like:

And here’s my Zimbra filter for changes made by me (I think the “from” header part is probably unnecessary, though):

Workflow

This section is mostly just an advertisement for the “My Dashboard” feature on Mozilla’s Bugzilla. By default, it shows you your assigned bugs, requested flags, and flags requested of you. Look at it at regular intervals (I try to restrict myself to once in the morning, and once before my EOD), particularly the “flags requested of you” section.

The other important thing is to generally stop caring about a bug unless it’s either assigned to you, or there’s a flag requested of you specifically. This ties in to some of the e-mail pref changes above. Changing my default state from “I must keep track of all bugs I might care about” to “I will keep track of my bugs & my requests, and opt in to tracking anything else” is a shift in mindset, but a game changer when it comes to the amount of e-mail (and cognitive load) that Bugzilla generates.

With these changes it takes me less than 15 minutes to go through my bugmail every morning (even on Mondays). I can even ignore it at times, because “My Dashboard” will make sure I don’t miss anything critical. Big thanks to the Bugzilla devs who made some of these new things possible, particularly glob and dkl. Glob also mentioned that even more filtering possibilities are being made possible by bug 990980. The preview he sent me looks infinitely customizable:

June 16, 2014 01:11 PM

June 13, 2014

Ben Hearsum (bhearsum)

This week in Mozilla RelEng – June 13th, 2014 – *double edition*

I spaced and forgot to post this last week, so here’s a double edition covering everything so far this month. I’ll also be away for the next 3 Fridays, and Kim volunteered to take the reins in my stead. Now, on with it!

Major highlights:

Completed work (resolution is ‘FIXED’):

In progress work (unresolved and not assigned to nobody):

June 13, 2014 04:45 PM

June 11, 2014

Armen Zambrano G. (@armenzg)

Who doesn't like cheating on the Try server?

Have you ever forgotten about adding a platform to your Try push and had to push again?
Have you ever wished to *just* make changes to a tests.zip file without having to build it first?
Well, this is your lucky day!

In this wiki page, I describe how to trigger arbitrary jobs on your try push.
As always be gentle with how you use it as we all share the resources.

Go crazy!

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

June 11, 2014 04:23 PM

June 10, 2014

Kim Moir (kmoir)

Talking about speaking up

We all interpret life through the lens of our previous experiences.  It's difficult to understand what each day is like for someone who has had a life fundamentally different from your own because you simply haven't had those experiences.  I don't understand what it's like to transition from male to female while involved in an open source community.  I don't know the steps taken to become an astrophysicist.  To embark to a new country as an immigrant.  I haven't struggled to survive on the streets as a homeless person, or been battered by domestic abuse.  To understand the experiences of others, all we can do is listen and learn, with empathy.

There have been many news stories recently about women or other underrepresented groups in technology.   I won't repeat them because frankly, they're quite depressing.  They go something like this:
1.  Incident of harassment/sexism either online/at a company/in a community/at a conference
2.  People call out this behaviour online and ask the organization to apologize and take steps to prevent this in the future.
3.  People from underrepresented groups who speak up about behaviour are told that their feelings are not valid or they are overreacting.  Even worse, they are harassed online with hateful statements telling them they don't belong in tech or are threatened with sexual assault or other acts of violence.
4.  Company/community/conference apologizes and issues a written statement. Or not.
5. Goto 1


I watched an extraordinary talk the other day that really provided a vivid perspective about the challenges that women in technology face and what people can do to help. Brianna Wu is head of development at Giant Spacekat, a game development company.  She gave the talk "Nine ways to stop hurting and start helping women in tech" at AltConf last week.  She is brutally honest with the problems that exist in our companies and communities, and the steps forward to make it better. 

She talks about how she is threatened and harassed online. She also discusses how random people threatening you on the internet is not just theoretical, but really frightening, because she knows it could result in actual physical violence.  The same thing applies to street harassment.

Here's the thing about being a woman.  I'm a physically strong person. I can run.  But I'm keenly aware that men are almost always bigger than me, and by basic tenets of physiology, stronger than me. So if a man tried to physically attack me, chances are I'd lose that fight.  So when someone threatens you, online or not, it is profoundly frightening because you fear for your physical safety. And to have that happen over and over again, like many women in our industry experience, apart from being terrifying, is exhausting and has a huge emotional toll.

I was going to summarize the points she brings up in her talk but she speaks so powerfully that all I can do is encourage you to watch the talk.

One of her final points really drives home the need for change in our industry when she says to the audience "This is not a problem that women can solve on their own....If you talk to your male friends out there, you guys have a tremendous amount of power as peers.  To talk to them and say, look dude this isn't okay.  You can't do this, you can't talk this way.  You need to think about this behaviour. You guys need to make a difference in a way that I can't."  Because when she talks about this behaviour to men, it often goes in one ear and out the other.  To be an ally in any sense of the word, you need to speak up.

THIS 1000x THIS.

Thank you Brianna for giving this talk.  I hope that when others see it they will gain some insight and feel some empathy on the challenges that women, and other underrepresented groups in the technology industry face.  And that you will all speak up too.

Further reading
Ashe Dryden's The 101-Level Reader: Books to Help You Better Understand Your Biases and the Lived Experiences of People                                                                                                           
Ashe Dryden Our most wicked problem

June 10, 2014 01:30 AM

June 04, 2014

Ben Hearsum (bhearsum)

More on “How far we’ve come”

After I posted “How far we’ve come” this morning a few people expressed interest in what our release process looked like before, and what it looks like now.

The earliest recorded release process I know of was called the “Unified Release Process”. (I presume “unified” comes from unifying the ways different release engineers did things.) As you can see, it’s a very lengthy document, with lots of shell commands to tweak/copy/paste. A lot of the things that get run are actually scripts that wrap some parts of the process – so it’s not as bad as it could’ve been.

I was around for much of the improvements to this process. Awhile back I wrote a series of blog posts detailing some of them. For those interested, you can find them here:

I haven’t gotten around to writing a new one for the most recent version of the release automation, but if you compare our current Checklist to the old Unified Release Process, I’m sure you can get a sense of how much more efficient it is. Basically, we have push-button releases now. Fill in some basic info, push a button, and a release pops out:

June 04, 2014 06:57 PM

How far we’ve come

When I joined Mozilla’s Release Engineering team (Build & Release at the time) back in 2007, the mechanics of shipping a release were a daunting task with zero automation. My earliest memories of doing releases are ones where I get up early, stay late, and spend my entire day on the release. I logged onto at least 8 different machines to run countless commands, sometimes forgetting to start “screen” and losing work due to a dropped network connection.

Last night I had a chat with Nick. When we ended the call I realized that the Firefox 30.0 release builds had started mid-call – completely without us. When I checked my e-mail this morning I found that the rest of the release build process had completed without issue or human intervention.

It’s easy to get bogged down thinking about current problems. Times like this make me realize that sometimes you just need to sit down and recognize how far you’ve come.

June 04, 2014 01:26 PM

June 02, 2014

Kim Moir (kmoir)

Mozilla pushes - May 2014


Here's May's monthly analysis of the pushes to our Mozilla development trees.  You can load the data as an HTML page or as a json file

Trends
This was a record-breaking month where we surpassed our previous record of 8100+ pushes with 11000+ pushes this month.  Gaia-try, just created in April, has become a popular branch with 29% of pushes.


Highlights
General Remarks
The introduction of Gaia-try in April has been very popular; it comprised around 29% of pushes in May.  The Try branch itself consisted of around 38% of pushes.
The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) accounted for around 22% of all the pushes, compared to 30% in the previous month.


Records
May 2014 was the month with the most pushes (11711 pushes)
May 2014 has the highest pushes/day average with 378 pushes/day
May 2014 has the highest "pushes-per-hour" average with 22 pushes/hour
May 29th, 2014 had the highest number of pushes in one day with 613 pushes

May 2014 is a record setting month, 11711 pushes!

Note that Gaia-try was added in April and has quickly become a high volume branch


I changed the format of this pie chart this month.  It seemed to be previously based on several months data, but not all data from the previous year.  So I changed it to be only based on the data from the current month which seemed more logical.

June 02, 2014 09:43 PM

May 30, 2014

Ben Hearsum (bhearsum)

This week in Mozilla RelEng – May 30th, 2014

Major highlights:

Completed work (resolution is ‘FIXED’):

In progress work (unresolved and not assigned to nobody):

May 30, 2014 08:20 PM

May 28, 2014

Armen Zambrano G. (@armenzg)

How to create local buildbot slaves


For the longest time I have wished for *some* documentation on how to set up a buildbot slave outside of the Release Engineering setup, without needing to go through the Puppet manifests.

In a previous post, I documented how to set up a production buildbot master.
In this post, I'm only covering the slaves side of the setup.

Install buildslave

virtualenv ~/venvs/buildbot-slave
source ~/venvs/buildbot-slave/bin/activate
pip install zope.interface==3.6.1
pip install buildbot-slave==0.8.4-pre-moz2 --find-links http://pypi.pub.build.mozilla.org/pub
pip install Twisted==10.2.0
pip install simplejson==2.1.3
NOTE: You can figure out what to install by looking in here: http://hg.mozilla.org/build/puppet/file/ad32888ce123/modules/buildslave/manifests/install/version.pp#l19

Create the slaves

NOTE: I already have build and test masters on my localhost with ports 9000 and 9001 respectively.
buildslave create-slave /builds/build_slave localhost:9000 bld-linux64-ix-060 pass
buildslave create-slave /builds/test_slave localhost:9001 tst-linux64-ec2-001 pass
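The create-slave invocations above follow a simple pattern; as an aside, here is a small hypothetical helper (not part of RelEng's tooling) that generates those command lines for a list of local slaves, should you want more than two:

```python
# Hypothetical helper (not RelEng tooling): build the
# "buildslave create-slave" command lines shown above for a set of
# local slaves, one tuple per slave.

def create_slave_commands(slaves, password="pass"):
    """slaves: list of (basedir, master_host_port, slavename) tuples."""
    return ["buildslave create-slave %s %s %s %s" % (basedir, master, name, password)
            for basedir, master, name in slaves]
```

For example, the two slaves above would be ("/builds/build_slave", "localhost:9000", "bld-linux64-ix-060") and ("/builds/test_slave", "localhost:9001", "tst-linux64-ec2-001").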

Start the slaves

On a normal day, you can do this to start your slaves up:
 source ~/venvs/buildbot-slave/bin/activate
 buildslave start /builds/build_slave
 buildslave start /builds/test_slave


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

May 28, 2014 07:05 PM

May 23, 2014

Ben Hearsum (bhearsum)

This week in Mozilla RelEng – May 23rd, 2014

Major highlights:

Completed work (resolution is ‘FIXED’):

In progress work (unresolved and not assigned to nobody):

May 23, 2014 08:40 PM

Armen Zambrano G. (@armenzg)

Technical debt and getting rid of the elephants

Recently, I had to deal with code where I knew there were elephants in it and I did not want to see them. Namely, adding a new build platform (mulet) and running a b2g desktop job through mozharness on my local machine.

As I passed by, I decided to spend some time to go and get some peanuts to get at least few of those elephants out of there:

I know I can't use "the elephant in the room" metaphor like that but I just did and you just know what I meant :)

Well, how do you deal with technical debt?
Do you take a chunk every time you pass by that code?
Do you wait for the storm to pass (you've shipped your awesome release) before throwing the elephants off the ship?
Or else?

Let me know; I'm eager to hear about your own de-elephantization stories.

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

May 23, 2014 03:35 AM

May 22, 2014

Peter Moore (pmoore)

Protected: Setting up a Mozilla vcs sync -> mapper development environment

This post is password protected. You must visit the website and enter the password to continue reading.


May 22, 2014 02:35 PM

May 16, 2014

Ben Hearsum (bhearsum)

This week in Mozilla RelEng – May 16th, 2014

Major highlights:

Completed work (resolution is ‘FIXED’):

In progress work (unresolved and not assigned to nobody):

May 16, 2014 07:50 PM

Kim Moir (kmoir)

20 years on the web

Note: I started writing this a long time ago as part of #mynerdstory but never got around to finishing it until recently.  So I changed it a bit when I noticed it had been over 20 years since I first used the internet.

I found this picture the other day.  It's me on graduation day at Acadia,  twenty years ago this month.  A lot has changed since then.



In the picture, I'm in Carnegie Hall, where the Computer Science department had their labs, classrooms and offices. I'm sitting in front of a Sun workstation, which ran an early version of Mosaic.  I recall that the first time I saw a web browser display a web page, I was awestruck.   I think it was NASA's web page.  My immediate reaction was that I wanted to work on that, to be on the web.

As I've mentioned before, my Dad was a manager at a software and services firm in Halifax.  He brought home our first computer when I was 9.  Dad was always upgrading the computers or fixing them and I'd watch him and ask lots of questions about how the components connected together.  In junior high, I taught myself BASIC from the manual, wrote a bunch of simple programs, and played so many computer games that my dreams at night became pixelated.  When I was 16, I started working at my Dad's office doing clerical work during the school break.  One of my tasks was to run a series of commands to connect to BITNET via an acoustic coupler using Kermit and download support questions from their university customers.  I thought it was so magical that these computers that were so physically distant could connect and communicate.

In high school, I took computer science in grade 12 and we wrote programs in Pascal on Apple IIs.  My computer science teacher was very enthusiastic and welcoming.  Since he had such an interest in the subject, he taught us a lot of extra material: sorting algorithms, binary trees, and other advanced topics that weren't on the curriculum.  Thanks Mr. B.

When it was time to apply to university, I didn't apply to computer science.  I don't know why; my grades were fine and I certainly had the background.  I really lacked the self-confidence to believe I could do it.  In retrospect, I would have been fine.  I enrolled at Acadia in their Bachelor of Business Administration program, probably because I liked reading the Globe and Mail.

I arrived on campus with a PC to write papers and do my accounting assignments.  The reason I had access to a computer was that the company my Dad worked for allowed their employees to borrow a computer for home use for a year at a time, then return it.  Otherwise, computers were prohibitively expensive at the time.  In my third year of university I decided that I was better suited to computer science than business, so I started taking all my elective courses from the computer science faculty.  I still wanted to graduate in four years so I didn't switch majors.  It was such a struggle to scrape together the money from part-time jobs and student loans to pay for four years of university, let alone six.

One of my part-time jobs was helping people in the university computer labs with questions and fixing problems.  Everything was very text based back then.  We used Archie to search for files, read books transcribed by Project Gutenberg, and used uudecode to assemble pictures posted to Usenet groups.  I applied for a Unix account on the Sun system that only the Computer Science students had access to.   It was called dragon and the head sysadmin had a sig that said "don't flame me, I'm on dragon".  I loved learning all the obscure yet useful Unix commands.

In my third year I had a 386 portable running Windows 3.1.  I carried this computer all over campus, plugging it in at the student union centre and working on finance projects with my business school colleagues.  By my fourth year, they had installed Sun workstations running Mosaic in the Computer Science labs.   This was my first view of the world wide web.   It was beautiful.  The web held such promise.

I applied for 40 different jobs before I graduated from Acadia and was offered a job in Ottawa working for the IT department of Revenue Canada.  A ticket out of rural Nova Scotia! I didn't like my first job there that much but they paid for networking and operating system courses that I took at night.  I was able to move to a new job in a year and started being a sysadmin for their email servers that served 30,000 users.  It was a lot of fun and I learned a tremendous amount about networking, mail related protocols and operating systems.  I also spent a lot of time in various server rooms across Canada installing servers.  Always bring a sweater.

I left after a few years to work at Nortel as technical support for a telephony switch that offloaded internet traffic from voice switches to a dedicated switch.  Most internet traffic back then was via modem, and those calls lasted longer than most voice calls, which caused traffic issues.  I took a lot of courses on telephony protocols, various Unix variants and networking. I traveled to several telco customers to help configure systems and demonstrate product features. More time in cold server rooms.

Shortly after Mr. Releng and I got married we moved to Athens, Georgia where he was completing his postdoc.  I found a great job as a sysadmin for UGA's computer systems division.  The group provided FTP, electronic courseware and email services to the campus.  We also secured a lot of hacked Linux servers set up by unknowing graduate students in various departments.  When I started, I didn't know Linux very well, so my manager just advised me to install Red Hat about 30 times, changing the options every time, and to learn how to compile custom kernels and so on.  So that's what I did.  At that time you also had to compile Apache from source to include any modules such as SSL support or different databases, so I also had fun doing that.

We used to do maintenance on the computer systems between 5 and 7am once a week.  Apparently not many students are awake at that hour.  I'd get up at 4am and drive in to the university in the early morning, the air heavy with the scent of Georgia pine and the ubiquitous humidity.  My manager M, always made a list the night before of what we had to do, how long it would take, and how long it would take to back the changes out.  His attention to detail and reluctance to ever go over the maintenance window has stayed with me over time. In fact, I'm still kind of a maintenance nerd, always figuring out how to conduct system maintenance in the least disruptive way to users.  The server room at UGA was huge and had been in operation since the 1960s.  The layers of cable under the tiles were an archeological record of the progress of cabling within the past forty years.  M typed on a DVORAK keyboard, and was one of the most knowledgeable people about all the Unix variants, and how they differed. If he found a bug in Emacs or any other open source software, he would just write a patch and submit it to their mailing list.  I thought that was very cool.

After Mr. Releng finished his postdoc, we moved back to Ottawa.  I got a job at a company called OTI as a sysadmin.  Shortly after joining, my colleague J said "We are going to release an open source project called Eclipse, are you interested in installing some servers for it?"  So I set up Bugzilla, CVS, mailman, nntp servers etc.  It was a lot of fun and the project became very popular and generated a lot of traffic.  A couple years later the Eclipse consortium became the Eclipse Foundation and all the infrastructure management moved there. 

I moved to the release engineering team at IBM and started working with S, who taught me the fundamentals of release engineering.  We would spend many hours testing and implementing new features in the build and test environment, and working with the development team to implement new functionality, since we used Eclipse bundles to build Eclipse.  I have written a lot about that before on my blog so I won't reiterate.  Needless to say, being paid to work full time in an open source community was a dream come true.

A couple of years ago, I moved to work at Mozilla.  And the 20 year old who looked at Mosaic for the first time and saw the beauty and promise of the web couldn't believe where she ended up almost 20 years later.


Many people didn't grow up with the privilege that I have, with access to computers at such a young age, and encouragement to pursue it as a career.  I thank all of you who I have worked with and learned so much from.  Lots still to learn and do!

May 16, 2014 07:40 PM

Release Engineering Special Issue


A different type of mobile farm  ©Suzie Tremmel, https://flic.kr/p/6tQ3H Creative Commons by-nc-sa 2.0

Are you a release engineer with a great story to share?  Perhaps the ingenious way that you optimized your build scripts to reduce end to end build time?  Or how you optimized your cloud infrastructure to reduce your IT costs significantly?  How you integrated mobile testing into your continuous integration farm?  Or are you a researcher who would like to publish their latest research in an area related to release engineering?

If so, please consider submitting a report or paper to the first IEEE Release Engineering special issue.   Deadline for submissions is August 1, 2014 and the special issue will be published in the Spring of 2015.

IEEE Release Engineering Special Issue


If you have any questions about the process or the special issue in general, please reach out to any of the guest editors.  We're happy to help!

We're also conducting a roundtable interview with several people from the release engineering community in the issue.  This should raise some interesting insights given the different perspectives that people from organizations with large scale release engineering efforts bring to the table.


May 16, 2014 07:23 PM

May 15, 2014

Massimo Gerva (mgerva)

local storage for builds

Using the instance storage space on aws

Bug 977611 enables the use of instance storage space for builds. Instance storage comes for free with your instance and it is faster than EBS, especially if the instance type comes with SSDs. Here is how we are managing the instances:

detect if any instance storage is available

instance storage can be a single disk or multiple volumes; to detect your instance storage, query this url:

$ curl http://169.254.169.254/latest/meta-data/block-device-mapping/ 
ami 
ephemeral0 
ephemeral1

in this case, the instance has two volumes: ephemeral0 and ephemeral1. ephemeral0 maps to

$ curl http://169.254.169.254/latest/meta-data/block-device-mapping/ephemeral0
sdb

(and ephemeral1 maps to /dev/sdc)
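The same detection can be scripted. Below is a sketch of that logic; the metadata URL is real EC2 behaviour, but the fetch function is injected (an assumption of this sketch, so the mapping logic can run off-instance; on a real instance you would pass a urllib-based fetcher):

```python
# Sketch: enumerate instance-store (ephemeral) volumes via the EC2
# instance metadata service. The fetch callable is injected so the
# mapping logic can be exercised outside AWS.

METADATA_BASE = "http://169.254.169.254/latest/meta-data/block-device-mapping/"

def ephemeral_devices(fetch):
    """Map each ephemeralN entry to a device node, e.g. {'ephemeral0': '/dev/xvdb'}."""
    devices = {}
    for name in fetch(METADATA_BASE).split():
        if name.startswith("ephemeral"):
            dev = fetch(METADATA_BASE + name).strip()  # e.g. 'sdb'
            # the metadata reports sdX, but modern kernels expose it as xvdX
            devices[name] = "/dev/xvd" + dev[-1]
    return devices
```

Feeding it the listing from the example above ("ami ephemeral0 ephemeral1") yields /dev/xvdb and /dev/xvdc.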

prepare the disk space

if the instance type has multiple disks, we need to use lvm

create a physical volume (man page) on each device (/dev/xvdb, /dev/xvdc):
dd if=/dev/zero of=/dev/xvdb bs=512 count=1
pvcreate -ff -v /dev/xvdb

create a volume group (man page):
vgcreate vg /dev/xvdb /dev/xvdc

create a logical volume (man page):
lvcreate -l 100%VG --name local vg

now format the new logical volume:
mkfs.ext4 /dev/mapper/vg-local
mount the new disk

add the following line to /etc/fstab

/dev/mapper/vg-local /builds/slave ext4 defaults,noatime 0 0

reboot, and have fun!
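The steps above can be tied together; here is a dry-run sketch that only generates the command list (the volume-group name "vg" and logical-volume name "local" follow the example above; this is an illustration, not our provisioning code, and it executes nothing):

```python
# Sketch: build the LVM/mkfs command sequence for a set of ephemeral
# devices, mirroring the manual steps above. Returns the commands as
# strings for inspection rather than executing them.

def lvm_commands(devices, vg="vg", lv="local"):
    cmds = []
    for dev in devices:
        # wipe the first sector, then create a physical volume
        cmds.append("dd if=/dev/zero of=%s bs=512 count=1" % dev)
        cmds.append("pvcreate -ff -v %s" % dev)
    cmds.append("vgcreate %s %s" % (vg, " ".join(devices)))
    cmds.append("lvcreate -l 100%%VG --name %s %s" % (lv, vg))
    cmds.append("mkfs.ext4 /dev/mapper/%s-%s" % (vg, lv))
    return cmds
```

For the two-volume example, lvm_commands(["/dev/xvdb", "/dev/xvdc"]) reproduces the pvcreate/vgcreate/lvcreate/mkfs sequence shown above.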

[1] http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html
[2] http://linux.die.net/man/8/pvcreate
[3] http://linux.die.net/man/8/vgcreate
[4] http://linux.die.net/man/8/lvcreate


May 15, 2014 03:09 PM

May 13, 2014

Armen Zambrano G. (@armenzg)

Do you need a used Mac Mini for your Mozilla team or your not-for-profit project?

If so, visit this form and fill it out by May 22nd (9 days from today).
There are a lot of disclaimers in the form. Please read them carefully.

These minis have been deprecated after 4 years of usage. Read more about it here.
From http://en.wikipedia.org/wiki/Mac_Mini




Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

May 13, 2014 05:38 PM

Chris AtLee (catlee)

Limiting coalescing on the build/test farm

tl;dr - as of yesterday we've limited coalescing on all builds/tests to merge at most 3 pending jobs together

Coalescing (aka queue collapsing aka merging) has been part of Mozilla's build/test CI for a long, long time. Back in the days of Tinderbox, a single machine would do a checkout/build/upload loop. If there were more checkins while the build was taking place, well, they would get built on the next iteration through the loop.

Fast forward a few years to our move to buildbot, and having pools of machines all able to do the same builds. Now we create a separate job in the queue for each build for each push. However, we didn't always have the capacity to run all these builds in a reasonable amount of time, so we left buildbot's default behaviour (merging all pending jobs together) enabled for the majority of jobs. This means that if there are pending jobs for a particular build type, the first free machine skips all but the most recent item on the queue. The skipped jobs are "merged" into the job that was actually run.

In the case that all builds and tests are green, coalescing is actually a good thing most of the time. It saves you from doing a bunch of extra useless work.

However, not all pushes are perfect (just see how often the tree is closed due to build/test failures), and coalescing makes bisecting the failure very painful and time consuming, especially in the case that we've coalesced away intermediate build jobs.

To try and find a balance between capacity and sane results, we've recently added a limit to how many jobs can be coalesced at once.

By rigorous statistical analysis:

@catlee     so it's easiest to pick a single upper bound for coalescing and go with that at first
@catlee     did you have any ideas for what that should be?
@catlee     I was thinking 3
edmorley|sheriffduty        catlee: that sounds good to me as a first go :-)
mshal       chosen by fair dice roll? :)
@catlee     1d4
bhearsum    Saving throw failed. You are dead.
philor      wfm

we've chosen 3 as the upper bound on the number of jobs we'll coalesce, and we can tweak this as necessary.
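To make the behaviour concrete, here is a sketch of bounded coalescing (illustrative only; the real change lives in the buildbot patch in the bug linked at the end of this post):

```python
def next_build(queue, limit=3):
    """Illustrative sketch of coalescing with an upper bound.

    `queue` is a list of pending build requests, oldest first. At most
    `limit` requests are taken together: the newest of that batch is the
    one actually built, and the older ones are coalesced into it.
    Returns (request_to_run, coalesced_requests, remaining_queue).
    """
    batch, rest = queue[:limit], queue[limit:]
    run = batch[-1]            # newest request in the batch actually runs
    coalesced = batch[:-1]     # older requests get merged into it
    return run, coalesced, rest
```

With a limit of 3, a build failure can always be bisected within a window of at most three pushes, instead of potentially dozens.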

I hope this makes the trees a bit more manageable! Please let us know what you think!

As always, all our work is done in the open. See the bug with the patch here: https://bugzilla.mozilla.org/show_bug.cgi?id=1008213

May 13, 2014 11:17 AM

May 12, 2014

Kim Moir (kmoir)

Mozilla pushes - April 2014

Here's April's monthly analysis of the pushes to our Mozilla development trees.  You can load the data as an HTML page or as a json file.

Trends

Highlights




General Remarks

Records

Disclaimer

The data collected prior to 2014 could be slightly off since different data collection methods were used.

May 12, 2014 01:33 PM

Nick Thomas (nthomas)

Rethinking rsync at Mozilla

Some time ago Mozilla moved away from volunteer servers for delivering installers and updates to our end users, but we still offer two rsync modules

Both of these have been unmaintained for some time, so they are out of date (by a year) and huge (500GB) respectively. We still serve quite a lot of traffic through these modules, although there were no complaints when the service was down for a few days recently.

I’m interested to hear opinions on whether we should maintain rsync access to release bits. From the logs it’s clear some of the former mirrors are still pulling data, which is fine if it’s intentional rather than legacy. There may be other use cases we’re not aware of, so please let us know in the comments or on bug 807543.

May 12, 2014 04:08 AM

May 09, 2014

Ben Hearsum (bhearsum)

This week in Mozilla RelEng – May 9th, 2014

This was a quieter week than most. With everybody flying home from Portland on Friday/Saturday, it took some time for most of us to get back into the swing of things.

Major highlights:

Completed work (resolution is ‘FIXED’):

In progress work (unresolved and not assigned to nobody):

May 09, 2014 08:10 PM

John Zeller (zeller)

MySQL databases are all setup in BuildAPI-app docker container!

As I stated in the previous post, the next step here was to set up databases. I spent time attempting to have SQLite work in this situation, but ran into issues with buildapi connecting to the SQLite databases. Rather than chase that rabbit hole, I double-checked the configuration of production buildapi and was reminded by the configs that production runs MySQL. So I switched to MySQL as well. This setup required adding the following to the Dockerfile:

RUN apt-get install -y mysql-server

RUN chown mysql.mysql /var/run/mysqld/

RUN mysql_install_db # Installs mysql database schemas

RUN /usr/bin/mysqld_safe &

After this, everything was peachy except for the SQL schemas available in the current buildapi repo. Those schemas are for SQLite, so I dumped my own MySQL schemas for use here, and loaded them with the following commands:

mysql < status_schema.mysql

mysql < scheduler_schema.mysql

I went ahead and submitted a patch to add the mysql specific schemas to the buildapi repo in Bug 1007994, but for now I added the schemas in with the files in the buildapi-app directory.

I uploaded the current contents of the buildapi-app docker container and it launches with schemas all loaded and running well.

I am still having some issues verifying that selfserve-agent can execute commands from data sent to it over AMQP by buildapi. Further testing is needed to fix this issue. I am currently getting a 404 error with my tests, but that might be a peripheral problem rather than selfserve-agent not getting data from AMQP.

Left to do on buildapi-app is to:

Links I found useful for this:

May 09, 2014 12:51 AM

May 08, 2014

Aki Sasaki (aki)

brainstorm: splitting mozharness

[stating the problem]

Mozharness currently handles a lot of complexity. (It was designed to be able to, but the ideal is still elegantly simple scripts and configs.)

Our production-oriented scripts take (and sometimes expect) config inputs from multiple locations, some of them dynamic; and they contain infrastructure-oriented behavior like clobberer, mock, and tooltool, which don't apply to standalone users.

We want mozharness to be able to handle the complexity of our infrastructure, but make it elegantly simple for the standalone user. These are currently conflicting goals, and automating jobs in infrastructure often wins out over making the scripts user friendly. We've brainstormed some ideas on how to fix this, but first, some more details:

[complex configs]

A lot of the current complexity involves config inputs from many places:

We want to lock the running config at the beginning of the script run, but we also don't want to have to clone a repo or make external calls to web resources during __init__(). Our current solution has been to populate runtime configs during one of our script actions, but then to support those runtime configs we have to check multiple config locations for our script logic. (self.buildbot_config, self.test_config, self.config, ...)

We're able to handle this complexity in mozharness, and we end up with a single config dict that we then dump to the log + to a json file on disk, which can then be reused to replicate that job's config. However, this has a negative effect on humans who need to either change something in the running configs, or who want to simplify the config to work locally.

[in-tree vs out-of-tree]

We also want some of mozharness' config and logic to ride the trains, but other portions need to be able to handle outside-of-tree processes and config, for various reasons:


[brainstorming solutions]

Part of the solution is to move logic out of mozharness. Desktop Firefox builds and repacks moving to mach makes sense, since they're

  1. configurable by separate mozconfigs,
  2. tasks completely shared by developers, and
  3. completely dependent on the tree, so tying them to the tree has no additional downside.

However, Andrew Halberstadt wanted to write the in-tree test harnesses in mozharness, and have mach call the mozharness scripts. This broke some of the above assumptions, until we started thinking along the lines of splitting mozharness: a portion in-tree running the test harnesses, and a portion out-of-tree doing the pre-test-run machine setup.

(I'm leaning towards both splitting mozharness and using helper objects, but am open to other brainstorms at this point...)

[splitting mozharness]

In effect, the wrapper, out-of-tree portion of mozharness would be taking all of the complex inputs, simplifying them for the in-tree portion, and setting up the environment (mock, tooltool, downloads+installs, etc.); the in-tree portion would take a relatively simple config and run the tests.

We could do this by having one mozharness script call another. We'd have to fix the logging bug that causes us to double-log lines when we instantiate a second BaseScript, but that's not an insurmountable problem. We could also try execing the second script, though I'd want to verify how that works on Windows. We could also modify our buildbot ScriptFactory to be able to call two scripts consecutively, after the first script dynamically generates the simplified config for the second script.

We could land the portions of mozharness needed to run test harnesses in-tree, and leave the others out-of-tree. There will be some duplication, especially in the mozharness.base code, but that's changing less than the scripts and mozharness.mozilla modules.

We would be able to present a user-friendly "inner" script with limited inputs that rides the trains, while also allowing for complex inputs and automation-oriented setup beforehand in the "outer" script. We'd most likely still have to allow for automation support in the inner script, if there's some reporting or error checking or other automation task that's needed after the handoff, but we'd still be able to limit the complexity of that inner script. And we could wrap that inner script in a mach command for easy developer use.

[helper objects]

Currently, most of mozharness' logic is encapsulated in self. We do have helper objects: the BaseConfig and the ReadOnlyDict self.config for config; the MultiFileLogger self.log_obj that handles all logging; MercurialVCS for cloning, ADBDeviceHandler and SUTDeviceHandler for mobile device wrangling. But a lot of what we do is handled by mixins inherited by self.

A while back I filed a bug to create a LocalLogger and BaseHelper to enable parallelization in mozharness scripts. Instead of cloning 90 locale repos serially, we could create 10 helper objects that each clone a repo in parallel, and launch new ones as the previous ones finish. This would have simplified Armen's parallel emulator testing code. But even if we're not planning on running parallel processes, creating a helper object allows us to simplify the config and logic in that object, similar to the "inner" script if we split mozharness into in-tree and out-of-tree instances, which could potentially also be instantiated by other non-mozharness scripts.

Essentially, as long as the object has a self.log_obj, it will use that for logging. The LocalLogger would log to memory or disk, outside of the main script log, to avoid parallel log interleaving; we would use this if we were going to run the helper objects in parallel. If we wanted the helper object to stream to the main log, we could set its log_obj to our self.log_obj. Similarly with its config. We could set its config to our self.config, or limit what config we pass to simplify.

(Mozharness' config locking is a feature that promotes easier debugging and predictability, but in practice we often find ourselves trying to get around it somehow. Other config dicts, self.variables, editing self.config in _pre_config_lock() ... Creating helper objects lets us create dynamic config at runtime without violating this central principle, as long as it's logged properly.)
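A minimal sketch of the helper-object idea might look like the following. The names LocalLogger and BaseHelper come from the bug mentioned above, but this implementation is purely illustrative, not the real mozharness API:

```python
class LocalLogger:
    """Logs to memory, outside the main script log, so helpers running
    in parallel don't interleave their output."""
    def __init__(self):
        self.lines = []

    def info(self, message):
        self.lines.append(message)


class BaseHelper:
    """Each helper carries its own (possibly minimized) config and logger.

    Pass the parent script's log_obj to stream into the main log instead
    of keeping a separate one.
    """
    def __init__(self, config, log_obj=None):
        self.config = dict(config)  # a simplified copy, not the full script config
        self.log_obj = log_obj or LocalLogger()

    def info(self, message):
        self.log_obj.info(message)
```

A parallel l10n clone, for instance, could create one BaseHelper per repo, each with just the config that repo needs, then merge the LocalLogger output into the main log as each helper finishes.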

Because this "helper object" solution overlaps considerably with the "splitting mozharness" solution, we could use a combination of the two to great efficacy.

[functions and globals]

This idea completely alters our implementation of mozharness, by moving self.config to a global config, directly calling logging methods (or wrapped logging methods). By making each method a standalone function that's only slightly different from a standard python function, it lowers the bar for contribution or re-use of mozharness code. It does away with both the downsides and benefits of objects.

The first, large downside I see is this solution appears incompatible with the "helper objects" solution. By relying on a global config and logging in our functions, it's difficult to create standalone helpers that use minimized configs or alternate logging configurations. I also think the global logging may make the double-logging bug more prevalent.

It's quite possible I'm downplaying the benefit of importing individual functions like a standard python script. There are decorators to transform functions into class methods and vice versa, which might allow for both standalone functions and object-based methods with the same code.

[related links]

  • Jordan Lund has some ideas + wip patches linked from bug 753547 comment 6.
  • Andrew Halberstadt's Sharing code not always a good thing and How to deal with IFFY requirements
  • My mozharness core principles example scripts+configs and video
  • Lars Lohn's Crouching Argparse Hidden Configman. Afaict configman appears to solve similar problems to mozharness' BaseConfig, but Argparse requires python 2.7 and mozharness locks the config.


  • comment count unavailable comments

    May 08, 2014 04:09 AM

    May 07, 2014

    Ben Hearsum (bhearsum)

    Redo – Utilities to retry Python callables

    We deal with a lot of flaky things in RelEng. The network can drop. Code can have race conditions. Servers can go offline temporarily. Freak errors can happen (more often than you’d think). One of the ways we’ve learned to cope with this is to add “retry” behaviour to damn near everything that could fail intermittently. We use it so much that we’ve got a Python library and command line tool that are used all over the place.
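The pattern is simple at heart; here is a minimal sketch of the idea (illustrative only, not Redo's actual implementation):

```python
import time

def retry_sketch(action, attempts=5, sleeptime=1,
                 retry_exceptions=(Exception,)):
    """Call `action`; on a retryable exception, sleep and try again,
    re-raising once the attempts are exhausted."""
    for attempt in range(attempts):
        try:
            return action()
        except retry_exceptions:
            if attempt == attempts - 1:
                raise
            time.sleep(sleeptime)
```

Redo wraps this basic loop with sleep scaling, jitter, cleanup hooks, logging, and the decorator and context-manager conveniences shown below.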

    Last week I finally got around to packaging and publishing ours, and I’m happy to present: Redo – Utilities to retry Python callables. Redo provides a decorator, context manager, plain old function, and even a command line tool to retry all sorts of things that may break. It’s very simple to use, here’s some examples from the docs:
    The plain old function:

    import shutil
    from redo import retry

    def maybe_raises(foo, bar=1):
        ...
        return 1

    def cleanup():
        shutil.rmtree("/tmp/dirtydir")

    ret = retry(maybe_raises, retry_exceptions=(HTTPError,),
                cleanup=cleanup, args=(1,), kwargs={"bar": 2})
    

    The decorator:

    from redo import retriable
    
    @retriable()
    def foo():
        ...
    
    @retriable(attempts=100, sleeptime=10)
    def bar():
        ...
    

    The context manager:

    def foo(a, b):
        ...
    
    with retrying(foo, retry_exceptions=(HTTPError,)) as retrying_foo:
        r = retrying_foo(1, 3)
    

    You can grab version 1.0 from PyPI, or find it on Github, where you can send issues or pull requests.

    May 07, 2014 07:07 PM

    May 06, 2014

    Kim Moir (kmoir)

    Releng 2014 invited talks available

    On April 11,  there was a Releng workshop held at Google in Mountain View. The two keynote talks and panel at the end of the day were recorded and made available on the Talks at Google channel on YouTube.  Thank you Google!


    Moving to mobile: The challenges of moving from web to mobile releases, Chuck Rossi, Facebook https://www.youtube.com/watch?v=Nffzkkdq7GM



    Some interesting notes from Chuck's talk:
     
    The 10 Commandments of Release Engineering Dinah McNutt, Google
    https://www.youtube.com/watch?v=RNMjYV_UsQ8


    Some notes from Dinah's talk. 
    • Release engineering is accelerating the path for development to operations
    • You want to be able to reproduce your build environment and source code management system if you have to recreate a very old build
    • Configuration management and release engineering as disciplines will probably merge over the next few years
    • Reproducibility is a virtue. Binaries don't belong in SCMs.  However, it's important to be able to reproduce binaries.  If you do need a repo with binaries, put them in a separate repo. 
    • Use the right tool for the job, you will have multiple tools.  Both commercial and open source.
    • View the job of a release engineer as making a developer's job easier, by setting up tooling and best practices.
    • Package management.  Provides auditing, upgrading, installation, removals. Tars and jars are not package managers.
    • You need to think about your upgrade process before you release 1.0.
    • Customers find problems we cannot find ourselves. Even if we're dogfooding.
    • As release engineers, step back and look at the big picture.  Look and see how we can make things better from a cost perspective so we have the resources we need to do our jobs.
    • It's a great year to be a release engineer. Dinah is on the organizing committee for the Release Engineering Summit, held June 20 in Philadelphia as part of USENIX. There is also one planned as part of LISA in Seattle in November. There was overwhelming interest, in terms of submissions, for a first-time summit!
    Closing discussion panel:
    https://www.youtube.com/watch?v=D4jLlWdsrWg



    Stephany Bellomo, one of my colleagues from the organizing committee for this workshop, moderated the panel.  Really interesting discussions, well worth a listen.  I like that the first question was "What is your worst operational nightmare and how did you recover from it?"  I love war stories.

    As an aside, we charged $50 per attendee for this workshop.  We talked to other people who had organized similar events and they suggested this would be an appropriate fee.  I've read that if you don't charge a fee for an event, you have more no-shows on the day of the event because psychologically,  they attach a lesser value to the event since they didn't pay for it.  However, we didn't have many expenses to pay for the workshop other than speaker gifts, eventbrite fees and badges.  Google provided the venue and lunch, again, thank you for the sponsorship. So we donated $1,531.00 USD to each of the following organizations from the remaining proceeds.

    YearUp is an organization that in their words "Year Up empowers low-income young adults to go from poverty to professional careers in a single year."  I know Mozilla has partnered with YearUp to provide mentoring opportunities within the IT group and it was an amazing experience for all involved. 
     
    The second organization we donated to is the Tanzania Education Fund, which Stephany mentioned since she has colleagues who have been involved with it for many years.  They provide pre-school, elementary, secondary and high school education for students in Tanzania.  Secondary education is not publicly funded in Tanzania.  In addition, 50% of their students are girls, in an area where education for girls is given low priority.  Education is so important to empower people.

    Thanks to all those that attended and spoke at the workshop!

    May 06, 2014 06:40 PM

    Aki Sasaki (aki)

    Gaia Try

    We're now running Gaia tests on TBPL against Gaia pull requests.

    John Ford wrote a commit hook in bug 989131 to push an update to the gaia-try repo that looks like this. The buildbot test scheduler master is polling this repo; when it sees a change, it triggers tests against the latest b2g desktop builds. The test jobs download the pre-built desktop binaries, and clone the appropriate pull-request Gaia repo and revision on top, then run the tests. The buildbot work was completed in bug 986209.

    This should allow us to verify our code doesn't break anything before it's merged to Gaia proper, saving human time and reducing code churn.

    Armen pointed out how gaia code changes currently show up in the push count via bumper processes, but that only reflects merged pull requests, not these un-reviewed pull requests. Now that we've turned on gaia-try, this is equivalent to another mozilla-inbound (an additional 10% of our push load, iirc). Our May pushes should see a significant bump.



    comment count unavailable comments

    May 06, 2014 06:15 PM

    May 05, 2014

    Armen Zambrano G. (@armenzg)

    Releng goodies from Portlandia!

    Last week, Mozilla's Release Engineering met at the Portland office for a team week.
    The week was packed with talks and several breakout sessions.
    We recorded a lot of our sessions and put them all here for your enjoyment, with associated slide decks where applicable!

    Here's a brief list of the talks you can find:
    Follow us at @MozReleng and Planet Releng.

    Many thanks to jlund for helping me record it all.

    UPDATE: added thanks to jlund.

    The Releng dreams are alive in Portland














    Creative Commons License
    This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

    May 05, 2014 08:03 PM

    Kim Moir (kmoir)

    Remote work in review

    I recently read two books on remote work.  Scott Berkun's Year without Pants and Remote: Office Not Required by Jason Fried and David Heinemeier Hansson.  The first book describes Scott Berkun's year as a remote manager at Automattic, which runs Wordpress.com.  The second book describes the authors' experiences at the fully distributed company 37signals, which has since renamed itself to Basecamp to reflect the importance of its flagship product.  Both books were interesting reflections on the nature of remote work in the context of their respective companies.  They were not books that addressed the nature of remote work in general, but described the approach they felt was successful within their companies. Of the two books, I would really recommend reading the Year Without Pants.  Remote isn't as compelling but it's a short read.




    Some notes from "The Year without Pants" 


    Communication
    "To work at a remote company demanded great communication skills and everyone had them"

    Culture
    Chapter 4, on how culture always wins, is a fantastic read; it's available for free on his website.

    "Trust is everything". 

    "1. Hire great people
     2. Set good priorities
     3. Remove distractions
     4. Stay out of the way"


    In other words, treat people like grown ups and they will do good work.
     
    "In every meeting in every organization around the world where bad behavior is happening, there is someone with the most power in the room who can do something about it.  What that person does shapes the culture.  If the most powerful person is silent, this signals passive acceptance of whatever is going on."

    Wow, this is very significant.  If you are the most powerful person in the room, speak up and call out bad behaviour.  The people with less power are often hesitant to speak up because there may be consequences for them and they feel they lack authority.

     Hiring
    "Hire self sufficient, passionate people"

    I often get questions from people who don't work at home how I don't get distracted and goof off all day since I work from home.  It's simple.  I love my job.  It's a lot of fun.  I want to be shipping, not slacking.

    Shipping
    Shipping every day gives people a sense of accomplishment.  Many bug fixes are deployed to Wordpress.com every day.  There are no gatekeepers to deployment, but the people deploying a change are expected to watch the site for a few hours afterward to ensure there aren't unexpected problems.


    Some notes from "Remote: Office Not Required"

    Communication
    The book suggests asking managers or employees to work from home a few days a week to level the playing field with respect to communications between employees who work in an office and those who are remote.   This will ensure they appreciate what hinders communication and take steps for improvement.  And it will reduce the tendency to treat remote workers as second class citizens and cut them out of essential conversations.

    Culture
    "...coming into the office just means that people have to put on pants.  There is no guarantee of productivity."

    "If you view those who work under you as capable adults who will push themselves to excel even when you're not breathing down their necks, they'll delight you in return."

    Again, trust people and have high expectations of them and you'll be rewarded with excellence. 

    Hiring
    The authors note that international exposure is good as a selling point with clients.  Hiring around the world increases the talent pool available, but is not without tax or legal complications.  Also, given the degree of written communication with remote work, it's best to hire people with the language skills that can thrive in this situation.  

    Productivity
    The book stresses that workflow tools need to be available to all team members at all times in order to be productive, e.g. recording the state of a project in a wiki or bug tracker, and recording meetings.  If a team member is working in a timezone offset from the majority of the team and doesn't have this in place, it can be a productivity drain.

     "Would-be remote workers and managers have a lot to learn from how the open source software movement has conquered the commercial giants over the past few decades. Open source is a triumph of asynchronous collaboration and communication like few the world has ever seen."

    Absolutely.  I learned so much working in open source for so many years.

    The authors also mention that you'll have to worry about your employees overworking, not underworking.  Because the office is physically in your home, it's easy to get sucked in at all hours to just work on one little thing that takes longer than you expect.


    My thoughts on remote work
    If you had asked me five years ago if I ever thought I'd work full time from my home, my answer would have been a definitive no.  But I wanted to work for Mozilla, and wasn't interested in moving to a city where they had a physical office.  Here are my personal suggestions for working successfully on a remote team, given my two years of experience being a part of one:

    As an aside, I was supposed to be at a Mozilla work week in Portland this past week, but I didn't fly there because I came down with a bad cold.  Despite this, I could connect to the same room they were in, see all the talks they were giving, and also give a presentation.  This was so excellent.  Since we are so used to being a distributed team, having one person remote when we were supposed to be all together wasn't a problem.  We already had the culture and tools in place to accommodate this.  Thank you Mozilla releng for being such an amazing team to work with.

    My idea for another book on remote work would be to have one with the format where there were about 10+ chapters, each written by an employee from a different company about how they approach it, what tools they use and so on. I think this could be a very interesting read.

    I'll close with some thoughts from Scott Berkun's book on whether remote work is for everyone 

    "For me,  I know that for any important relationship I'd want to be physically around that person as much as possible. If I started a rock band or a company, I'd want to share the same physical space often.  The upsides outweigh the downsides.  However, if the people I wanted to work with were only available remotely, I'm confident that we could do great work from thousands of miles away."

    What do you think are the keys to successfully working as a remote employee?

    Further reading
    John O'Duinn's We are all Remoties Talk


    May 05, 2014 02:23 PM

    May 01, 2014

    Chris Cooper (coop)

    Dispatches from the releng team week: Portland

    Releng has been much more diligent during our current team week about preparing presentations and, more importantly, recording sessions for posterity.

    Sessions are still ongoing, but the list of presentations is in the wiki. We will continue to add links there.

    Special thanks to Armen for helping remoties get dialed-in and for getting everything recorded.

    May 01, 2014 09:11 PM

    April 29, 2014

    Peter Moore (pmoore)

    How we do automated mobile device testing at Mozilla – Part 3

    Video of this presentation from Release Engineering work week in Portland, 29 April 2014

    Part 3: Keeping the devices running

    So in Parts 1 and 2, we saw how Buildbot tegra and panda masters assign jobs to Buildbot slaves, how these slaves run on foopies, and how these foopies connect to the SUT Agent on the device to deploy and perform the tests and pull back results.

    However, over time, since these devices can fail, how do we make sure they are running ok, and handle the case that they go awol?

    The answer has two parts:

    1. watch_devices.sh
    2. mozpool

    What is watch_devices.sh?

    You remember that in Part 2, we said you need to create a directory under /builds on the foopy for any device that foopy should be taking care of.

    Well there is a cron job installed under /etc/cron.d/foopy that takes care of running watch_devices.sh every 5 mins.

    This script looks for device directories under /builds to see which devices are associated with this foopy. For each of these, it checks that a buildbot slave is running for that device. It automatically starts buildbot slaves as necessary if they are not running, but also checks the health of the device using the SUT verification tools (discussed in Part 2). If it finds a problem with a device, it also shuts down the buildbot slave, so that it does not get new jobs. In short, it keeps the state of the buildbot slave consistent with what it believes the availability of the device to be. If the device is faulty, it brings down the buildbot slave for that device. If the device is healthy and passes the verification tests, it starts up the buildbot slave if it is not running.

    It also checks the “disabled” state of the device in slavealloc, and makes sure that if the device is “disabled” there, the buildbot slave is shut down.

    Therefore if you need to disable a device, mark it as disabled in slavealloc; watch_devices.sh, running from a cron job on the foopy, will bring down the device's buildbot slave.
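Condensed, the per-device decision watch_devices.sh makes looks something like this (a Python sketch for clarity; the real script is shell, and these condition names are illustrative):

```python
def device_action(verification_ok, disabled_in_slavealloc,
                  error_flg_present, slave_running):
    """Return what to do with the buildbot slave for one device:
    a slave should be running if and only if the device is healthy,
    not disabled in slavealloc, and has no error flag set."""
    healthy = (verification_ok
               and not disabled_in_slavealloc
               and not error_flg_present)
    if healthy:
        return "leave-running" if slave_running else "start-slave"
    return "stop-slave" if slave_running else "leave-stopped"
```

Running this check every 5 minutes from cron is what keeps the slave state converged with the device's actual availability.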

    Where are the log files of watch_devices.sh?

    They are on the foopy:

    If during a buildbot test we determine that a device is not behaving properly, how do we pull it out of use?

    If a serious problem is found with a device during a buildbot job, the buildbot job will create an error.flg file under the device directory on the foopy. This signals to watch_devices.sh that when that job has completed, it should kill the buildbot slave, since the device is faulty. It will not respawn a buildbot slave while that error.flg file remains. Once per hour, watch_devices.sh deletes the error.flg file, to force another verification test of the device.

    But wait, I heard that mozpool verifies devices and keeps them alive?

    Yes and no. Mozpool is a tool (written by Dustin) to take care of the life-cycle management of panda boards. It does not manage tegras. Remember: tegras cannot be automatically reimaged – you need fingers to press buttons on the devices, and physically connect a laptop to them. Pandas can. This is why mozpool only takes care of pandas.

    Mozpool is made up of three layered components. From the mozpool overview (http://mobile-imaging-001.p1.releng.scl1.mozilla.com/ui/):

    1. Mozpool is the highest-level interface, where users request a device in a certain condition, and Mozpool finds a suitable device.
    2. Lifeguard is the middle level. It manages the state of devices, and knows how to cajole and coddle them to achieve reliable behavior.
    3. Black Mobile Magic is the lowest level. It deals with devices directly, including controlling their power and PXE booting them. Be careful using this level!

    So the principle behind mozpool is that all the logic around getting a panda board (making sure it is clean and ready to use, contains the right OS image you want to run it with, etc.) can be handled outside of the buildbot jobs. You simply query mozpool, tell it you’d like a device, specify the operating system image you want, and it will get you one.

    In the background it is monitoring the devices and checking they are ok, only handing you a “good” device, and cleaning up when you finish with it.

    So watch_devices and mozpool are both routinely running verification tests against the pandas?

    No. This used to be the case, but now the verification test of watch_devices.sh for pandas simply queries mozpool for the status of the device. It no longer runs verification tests directly against the panda, to avoid having two systems doing the same work. It trusts mozpool to tell it the correct state.

    So if I dynamically get a device from mozpool when I ask for one, does that mean my buildbot slave might get different devices at different times, depending on which devices are currently available and working at the time of the request?

    No. Since the name of the buildbot slave is the same as the name of the device, the buildbot slave is bound to the one device only. This means it cannot take advantage of the “give me a panda with this image, I don’t care which one” model.

    Summary part 3

    So we’ve learned:

    < Part 2


    April 29, 2014 06:02 AM

    How we do automated mobile device testing at Mozilla – Part 2

    Video of this presentation from Release Engineering work week in Portland, 29 April 2014

    Part 2: The foopy, Buildbot slaves, and SUT tools

    So how does buildbot interact with a device, to perform testing?

    By design, Buildbot masters require a Buildbot slave to perform any job. For example, if we have a Windows slave for creating Windows builds, we would expect to run a Buildbot slave on the Windows machine, and this would then be assigned tasks from the Buildbot master, which it would perform, and feed results back to the Buildbot master.

    In the mobile device world, this is a problem:

    1. Running a slave process on the device would consume precious limited resources
    2. Buildbot does not run on phones, or mobile boards

    Thus was born …. the foopy.

    What the hell is a foopy?

    A foopy is a machine, running CentOS 6.2, that is devoted to the task of interfacing with pandas or tegras, and running buildbot slaves on their behalf.

    My first mistake was thinking that a “foopy” is a special piece of hardware. This is not the case. It is nothing more than a regular CentOS 6.2 machine – just a regular server with no special physical connection to the mobile device boards. It is simply a machine that has been set aside for this purpose, and that has network access to the devices, just like other machines on the same network.

    For each device that a foopy is responsible for, it runs a dedicated buildbot slave. Typically each foopy serves between 10 and 15 devices. That means it will have around 10-15 buildbot slaves running on it, in parallel (assuming all devices are running ok).

    When a Buildbot master assigns a job to a Buildbot slave running on the foopy, it will run the job inside its slave, but parts of the job will involve communicating with the device, pushing binaries onto it, running tests, and gathering results. As far as the Buildbot master is concerned, the slave is the foopy, and the foopy is doing all the work. It doesn’t need to know that the foopy is executing code on a tegra or panda. As far as the device is concerned, it is receiving tasks over the SUT Agent listener network interface, and performing those tasks.

    So does the foopy always connect to the same devices?

    Yes. Each foopy has a static list of devices for it to manage jobs for.

    How do you see which devices a foopy manages?

    If you ssh onto the foopy, you will see the devices it manages as subdirectories under /builds:

    pmoore@fred:~/git/tools/sut_tools master $ ssh foopy106
    Last login: Mon Apr 28 22:01:18 2014 from 10.22.248.82
    Unauthorized access prohibited
    [pmoore@foopy106.p10.releng.scl1.mozilla.com ~]$ find /builds -maxdepth 1 -type d \( -name 'tegra-*' -o -name 'panda-*' \)
    /builds/panda-0078
    /builds/panda-0066
    /builds/panda-0064
    /builds/panda-0071
    /builds/panda-0072
    /builds/panda-0080
    /builds/panda-0070
    /builds/panda-0074
    /builds/panda-0062
    /builds/panda-0063
    /builds/panda-0067
    /builds/panda-0073
    /builds/panda-0076
    /builds/panda-0075
    /builds/panda-0079
    /builds/panda-0077
    /builds/panda-0068
    /builds/panda-0061
    /builds/panda-0065
    [pmoore@foopy106.p10.releng.scl1.mozilla.com ~]$

    How did those directories get created?

    Manually. Each directory contains artefacts related to that panda or tegra, such as log files for verify checks, error flags if it is broken, disable flags if it has been disabled, etc. More about this later. Just know at this point that if you want that foopy to look after that device, you better create a directory for it.

    So the directory existence on the foopy is useful to know which devices the foopy is responsible for, but how do you know which foopy manages an arbitrary device, without logging on to all foopies?

    In the tools repository, the file buildfarm/mobile/devices.json also defines the mapping between foopy and device. Here is a sample:

    {
     "tegra-010": {
     "foopy": "foopy109",
     "pdu": "pdu1.r602-11.tegra.releng.scl3.mozilla.com",
     "pduid": ".AA1"
     },
     "tegra-011": {
     "foopy": "foopy109",
     "pdu": "pdu2.r602-11.tegra.releng.scl3.mozilla.com",
     "pduid": ".AA1"
     },
     "tegra-012": {
     "foopy": "foopy109",
     "pdu": "pdu3.r602-11.tegra.releng.scl3.mozilla.com",
     "pduid": ".AA1"
     },
    ......
     "panda-0168": {
     "foopy": "foopy45",
     "relayhost": "panda-relay-014.p1.releng.scl1.mozilla.com",
     "relayid": "2:6"
     },
     "panda-0169": {
     "foopy": "foopy45",
     "relayhost": "panda-relay-014.p1.releng.scl1.mozilla.com",
     "relayid": "2:7"
     },
     "panda-0170": {
     "foopy": "foopy46",
     "relayhost": "panda-relay-015.p2.releng.scl1.mozilla.com",
     "relayid": "1:1"
     },
    ......
    }
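    Given that file, answering “which foopy manages this device?” is a simple lookup. A minimal sketch (the helper name is hypothetical; it assumes a devices.json shaped like the sample above, where unassigned devices carry the string "None"):

```python
import json


def foopy_for(device, devices_path):
    """Return the foopy assigned to `device` in devices.json, or None.

    Devices with no assignment use the string "None" in the file, so treat
    that the same as a missing entry.
    """
    with open(devices_path) as f:
        devices = json.load(f)
    foopy = devices.get(device, {}).get('foopy')
    return None if foopy in (None, 'None') else foopy
```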

    So what if the devices.json lists different foopy -> devices mappings than the foopy filesystems list? Isn’t there a danger this data gets out of sync?

    Yes, there is nothing checking that these two data sources are equivalent. For example, if /builds/tegra-0123 was created on foopy39, but devices.json said tegra-0123 was assigned to foopy65, nothing would report this difference, and we would have non-deterministic behaviour.
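    Nothing currently performs such a check, but one would be straightforward to write. A sketch of what it might look like (the function name is hypothetical; the directory layout is as described in this post):

```python
import json
import os


def mapping_mismatches(devices_path, builds_dir, foopy_name):
    """Compare devices.json assignments against /builds on one foopy.

    Returns (missing, unexpected): devices assigned to this foopy that have
    no directory under builds_dir, and device directories present with no
    matching assignment in devices.json.
    """
    with open(devices_path) as f:
        devices = json.load(f)
    assigned = {name for name, cfg in devices.items()
                if cfg.get('foopy') == foopy_name}
    present = {d for d in os.listdir(builds_dir)
               if d.startswith(('tegra-', 'panda-'))}
    return sorted(assigned - present), sorted(present - assigned)
```

    A report like this, run periodically, would surface exactly the tegra-0123-on-the-wrong-foopy situation described above.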

    Why is the foopy data not in slavealloc?

    Currently the fields for the slaves are static across different slave types – so if we added a field for “foopy” for the foopies, it would also appear for all other slave types, which don’t have a foopy association.

    What is that funny other data in the devices.json file?

    The “pdu” and “pduid” are the coordinates required to determine the physical power supply of the tegra. These are the values that you call the PDU API with to enable/disable power for that particular tegra.

    The “relayhost” and “relayid” are the equivalent values for the panda power supplies.

    Where does this data come from?

    This data is maintained in IT’s inventory database. We duplicate this information in this file.

    Example: https://inventory.mozilla.org/en-US/systems/show/2706/

    So is a PDU and a relay board essentially the same thing, just one for tegras (the PDU) and the other for pandas (the relay board)?

    Yes.

    What about if we want to write comments in this file? json doesn’t support comments, right?

    For example, you may want to put a comment to explain why a tegra is not assigned to a PDU. Since JSON does not support comments, we add a _comment field, e.g.:

     "tegra-024": {
     "_comment": "Bug 727345: Assigned to WebQA",
     "foopy": "None"
     },

    Is there any sync process between inventory and devices.json to guarantee integrity of the relayboard and PDU data?

    No. We do not sync the data, so there is a risk our data can get out-of-sync. This could be solved by having an auto-sync to the devices.json file, or using inventory as the data source, rather than the devices.json file.

    So how do we interface with the PDUs / relay boards to hard reboot devices?

    This is done using the sut_tools reboot.py script.

    Is there anything else useful in this “sut tools” folder?

    Yes, lots. It provides scripts for doing all sorts of things, like deploying artefacts on tegras and pandas, rebooting, running smoke tests and verifying the devices, cleaning up devices, accessing device logs, etc.

    Summary part 2

    So we’ve learned:

    < Part 1    Part 3 >


    April 29, 2014 05:00 AM

    April 26, 2014

    Peter Moore (pmoore)

    How we do automated mobile device testing at Mozilla – Part 1

    Video of this presentation from Release Engineering work week in Portland, 29 April 2014

    Part 1: Back to basics

    What software do we produce for mobile phones?

    What environments do we use for building and testing this software?

                Building                            Testing
    Fennec      CentOS 6.2                          Tegra / Panda / Emulator
                ((bld-linux64-ix-*) in-house,
                (bld-linux64-ec2-*) AWS)
    B2G         CentOS 6.2                          Emulator

    So first key point unveiled:

    Second key point:

    So why do we test Fennec on tegras, pandas and emulators?

    To answer this, first remember the wide variety of builds and tests we perform:

    Screenshot from tbpl

    Screenshot from tbpl

    The answer is:

    Notice:

    What are the main differences between our tegras and pandas?

    Tegras:
    - [Photo: a Tegra 250 board plugged in]
    - Older, running Android 2.2
    - Hanging in shoe racks
    - Can only be reimaged by physically connecting them to a laptop, and pressing buttons in a magical sequence
    - Not very reliable
    - Connected to a “PDU”, which allows us to programmatically call an API to “pull the power”

    Pandas:
    - [Photo: pandas racked in a Faraday cage]
    - Newer, running Android 4.0
    - Racked professionally in Faraday cages
    - Can be remotely reimaged by mozpool (moar to come later)
    - Quite reliable
    - Connected to a “relay host”, which allows us to programmatically call an API to “pull the power”

    So as you see, a panda is a more serious piece of kit than a tegra. Think of a tegra as a toy.

    So what are tegras and pandas, actually?

    Both are mobile device boards, as you see above, like you would get in a phone, but not actually in a phone.

    So why don’t we just use real phones?

    1. Real phones use batteries
    2. Real phones have wireless network

    Basically, by using the boards directly, we can:

    1. control the power supply (by connecting them to power units – PDUs) which we have API access to (i.e. we have an API to pull the power to a device)
    2. use ethernet rather than wireless, which is more reliable (wireless signals don’t interfere with each other, less radiation, …)

    OK, so we have phones (or “phone circuit boards”) wired up to our network – but how do we communicate with them?

    Fennec historically ran on more platforms than just Android. It also ran on:

    For this reason, it was decided to create a generic interface, which would be implemented on all supported platforms. The SUT Agent was born.

    Please note: nowadays, Fennec is only available for Android 2.2+. It is not available for iOS (iPhone, iPad, iPod Touch), Windows Phone, Windows RT, Bada, Symbian, Blackberry OS, webOS or other mobile operating systems.

    Therefore, the original reason for creating a standard interface to all devices (the SUT Agent) no longer exists. It would also be possible to use a different mechanism (telnet, ssh, adb, …) to communicate with the device. However, this is not what we do.

    So what is the SUT Agent, and what can it do?

    The SUT Agent is a listener running on the tegra or panda, that can receive calls over its network interface, to tell it to perform tasks. You can think of it as something like an ssh daemon, in the sense that you can connect to it from a different machine, and issue commands.

    How do you connect to it?

    You simply telnet to the tegra or panda, on port 20700 or 20701.

    Why two ports? Are they different?

    Only marginally. The original idea was that users would connect on port 20701, and that automated systems would connect on port 20700. For this reason, if you connect on port 20700, you don’t get a prompt. If you connect on port 20701, you do. However, everything else is the same. You can issue commands to both listeners.

    What commands does it support?

    The most important command is “help”. It displays this output, showing all available commands:

    pmoore@fred:~/git/tools/sut_tools master $ telnet panda-0149 20701
    Trying 10.12.128.132...
    Connected to panda-0149.p1.releng.scl1.mozilla.com.
    Escape character is '^]'.
    $>help
    run [cmdline] - start program no wait
    exec [env pairs] [cmdline] - start program no wait optionally pass env
     key=value pairs (comma separated)
    execcwd <dir> [env pairs] [cmdline] - start program from specified directory
    execsu [env pairs] [cmdline] - start program as privileged user
    execcwdsu <dir> [env pairs] [cmdline] - start program from specified directory as privileged user
    execext [su] [cwd=<dir>] [t=<timeout>] [env pairs] [cmdline] - start program with extended options
    kill [program name] - kill program no path
    killall - kill all processes started
    ps - list of running processes
    info - list of device info
     [os] - os version for device
     [id] - unique identifier for device
     [uptime] - uptime for device
     [uptimemillis] - uptime for device in milliseconds
     [sutuptimemillis] - uptime for SUT in milliseconds
     [systime] - current system time
     [screen] - width, height and bits per pixel for device
     [memory] - physical, free, available, storage memory
     for device
     [processes] - list of running processes see 'ps'
    alrt [on/off] - start or stop sysalert behavior
    disk [arg] - prints disk space info
    cp file1 file2 - copy file1 to file2
    time file - timestamp for file
    hash file - generate hash for file
    cd directory - change cwd
    cat file - cat file
    cwd - display cwd
    mv file1 file2 - move file1 to file2
    push filename - push file to device
    rm file - delete file
    rmdr directory - delete directory even if not empty
    mkdr directory - create directory
    dirw directory - tests whether the directory is writable
    isdir directory - test whether the directory exists
    chmod directory|file - change permissions of directory and contents (or file) to 777
    stat processid - stat process
    dead processid - print whether the process is alive or hung
    mems - dump memory stats
    ls - print directory
    tmpd - print temp directory
    ping [hostname/ipaddr] - ping a network device
    unzp zipfile destdir - unzip the zipfile into the destination dir
    zip zipfile src - zip the source file/dir into zipfile
    rebt - reboot device
    inst /path/filename.apk - install the referenced apk file
    uninst packagename - uninstall the referenced package and reboot
    uninstall packagename - uninstall the referenced package without a reboot
    updt pkgname pkgfile - unpdate the referenced package
    clok - the current device time expressed as the number of millisecs since epoch
    settime date time - sets the device date and time
     (YYYY/MM/DD HH:MM:SS)
    tzset timezone - sets the device timezone format is
     GMTxhh:mm x = +/- or a recognized Olsen string
    tzget - returns the current timezone set on the device
    rebt - reboot device
    adb ip|usb - set adb to use tcp/ip on port 5555 or usb
    activity - print package name of top (foreground) activity
    quit - disconnect SUTAgent
    exit - close SUTAgent
    ver - SUTAgent version
    help - you're reading it
    $>quit
    quit
    $>Connection closed by foreign host.

    Typically we use the SUT Agent to query the device, push Fennec and tests onto it, run tests, perform file system commands, execute system calls, and retrieve results and data from the device.

    What is the difference between quit and exit commands?

    I’m glad you asked. “quit” will terminate the session. “exit” will shut down the SUT Agent. You really don’t want to do this. Be very careful.

    Is the SUT Agent a daemon? If it dies, will it respawn?

    No, it isn’t, but yes, it will!

    The SUT Agent can die, and sometimes does. However, it has a daddy, who watches over it. The Watcher is a daemon, also running on the pandas and tegras, that monitors the SUT Agent. If the SUT Agent dies, the Watcher will spawn a new SUT Agent.

    Probably it would be possible to have the SUT Agent as an auto-respawning daemon – I’m not sure why it isn’t this way.

    Who created the Watcher?

    Legend has it, that the Watcher was created by Bob Moss.

    Where is the source code for the SUT Agent and the Watcher?

    The SUT Agent codebase lives in the Firefox desktop source tree: http://hg.mozilla.org/mozilla-central/file/tip/build/mobile/sutagent

    The Watcher code lives there too: http://hg.mozilla.org/mozilla-central/file/tip/build/mobile/sutagent/android/watcher

    Do the Watcher and SUT Agent get automatically deployed when there are new changes?

    No. If there are changes, they need to be manually built (no continuous integration) and manually deployed to all tegras, and a new image needs to be created for pandas in mozpool (will be explained later).

    Fortunately, there are very rarely changes to either component.

    Summary part 1

    So we’ve learned:

    > Part 2


    April 26, 2014 11:27 PM

    April 25, 2014

    Ben Hearsum (bhearsum)

    This week in Mozilla RelEng – April 25th, 2014

    Major highlights:

    Completed work (resolution is ‘FIXED’):

    In progress work (unresolved and not assigned to nobody):

    April 25, 2014 08:31 PM

    April 23, 2014

    Armen Zambrano G. (@armenzg)

    Gaia code changes and how the trickle-down into Mozilla's RelEng CI

    Too long; did not read: In our pushes' monthly report we more or less count all Gaia commits through the B2G-Inbound repository.

    For the last few months, I've been creating reports about the pushes to the tbpl trees and I had to add some disclaimers about the code pushes to the Gaia repositories. I've decided to write the disclaimer in here and simply put a hyperlink to this post.

    Contributions to the Gaia repositories are done through GitHub and are run through the Travis CI (rather than through the Release Engineering infrastructure). However, independently from the Travis CI, we bring the Gaia merges into the Release Engineering systems this way:
      • We mirror the GitHub changes into our git setup (gaia.git)
      • The B2G bumper bot then lands these changes on the B2G-Inbound tree, where our own CI runs them

    Here's an example:

    Long-story-short: Even though we don't have a Gaia tree on tbpl.mozilla.org, we test the Gaia changes through the B2G-Inbound tree, hence, we take Gaia pushes into account for the monthly pushes report.

    For more information, the B2G bumper bot was designed in this bug.


    Creative Commons License
    This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

    April 23, 2014 07:32 PM

    April 17, 2014

    Armen Zambrano G. (@armenzg)

    Mozilla's pushes - March 2014

    Here's March's monthly analysis of the pushes to our Mozilla development trees (read about Gaia merges at the end of the blog post).
    You can load the data as an HTML page or as a json file.

    TRENDS

    March (as February did before it) set a new record for the highest number of pushes ever.
    We will soon have 8,000 pushes/month as our norm.
    The only noticeable change in the distribution of pushes is that non-integration trees had a higher share of the cake (17.80% in March vs. 14.60% in February).

    HIGHLIGHTS


    GENERAL REMARKS

    Try keeps on having around 50% of all the pushes.
    The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 30% of all the pushes.

    RECORDS

    • March 2014 was the month with most pushes (7,939 pushes)
    • March 2014 has the highest pushes/day average with 284 pushes/day
    • February 2014 still has the highest “pushes-per-hour” average, with 16.57 pushes/hour
    • March 4th, 2014 had the highest number of pushes in one day with 435 pushes



    DISCLAIMERS

    • The data collected prior to 2014 could be slightly off since different data collection methods were used
    • Gaia pushes are more or less counted. I will write a blog post about it in the near term.

    Creative Commons License
    This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

    April 17, 2014 09:18 PM

    Ben Hearsum (bhearsum)

    This week in Mozilla RelEng – April 17th, 2014

    Major Highlights:

    Completed work (resolution is ‘FIXED’):

    In progress work (unresolved and not assigned to nobody):

    April 17, 2014 06:29 PM

    April 16, 2014

    Armen Zambrano G. (@armenzg)

    Kiss our old Mac Mini test pool goodbye

    Today we have stopped running test jobs on our old Revision 3 Mac Mini test pool (see previous announcement).

    There's a very, very long list of people that have been involved in this project (see bug 864866).
    I want to thank ahal, fgomes, jgriffin, jmaher, jrmuizel and rail for their help on the last mile.

    We're very happy to have finally decommissioned this non-datacenter-friendly infrastructure.

    A bit of history

    These minis were purchased back in early 2010 and we bought more than 300 of them.
    At first, we ran Fedora 12, Fedora 12 x64, Windows XP, Windows 7 and Mac 10.5 on them. Later on we also added 10.6 to the mix (if my memory doesn't fail me).

    Somewhere in 2012, we moved the Mac 10.6 testing to the new revision 4 Mac server minis and deprecated the 10.5 rev3 testing pool. We then re-purposed those machines to increase the Windows and Fedora pools.

    By May of 2013, we stopped running Windows on them.
    During 2013, we moved a lot of the Fedora testing to EC2.
    Now, we managed to move the B2G reftests and Firefox debug mochitest-browser-chrome to EC2.

    NOTE: I hope my memory does not fail me

    Delivery of the Mac minis (photo credit to joduinn)
    Racked at the datacenter (photo credit to joduinn)



    Creative Commons License
    This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

    April 16, 2014 05:54 PM

    April 15, 2014

    Chris Cooper (coop)

    If I had a million dollars

    Armen has a blog post up about the cost savings Mozilla has been able to realize in its continuous integration infrastructure in Amazon over just the last 3 months. This has been a bit of a sea change for release engineering, who have historically been conservative with regards to changing core infrastructure and practices. We’re all coming to grips with the new world order, but I’m quite excited about the possibilities.

    Some quick back-of-the-envelope calculations based on other recent numbers from Armen:

    If history has taught us anything, continued growth will eat in to at least part of that savings, but think of what Mozilla could do with an extra million dollars. Depending on where we hire them, that money could easily buy 5-10 more engineers to continue driving the mission forward.

    April 15, 2014 07:16 PM

    April 11, 2014

    Ben Hearsum (bhearsum)

    This week in Mozilla RelEng – April 11th, 2014

    Major highlights:

    Completed work (resolution is ‘FIXED’):

    In progress work (unresolved and not assigned to nobody):

    April 11, 2014 08:03 PM


    April 04, 2014

    Ben Hearsum (bhearsum)

    This week in Mozilla RelEng – April 4th, 2014

    Major highlights:

    Completed work (resolution is ‘FIXED’):

    In progress work (unresolved and not assigned to nobody):

    April 04, 2014 07:53 PM


    Justin Wood (Callek)

    Keeping track of MQ patchsets…

    Hey Everyone!

    First some brief background: Mozilla RelEng has our code in a *lot* of repos, most of it in Mercurial (a few things are in git or svn, but those are relatively rare). I also do work for SeaMonkey, which has needs with m-c, m-i, m-*, c-c, c-*, etc., as well as with l10n.

    I personally manage all my patches with MQ. Which presents a problem for me: “keeping track of it all”. I used to try keeping open bugs, but that's hard with releng, because while a bug may be open, we tend to have a good handful of patches attached to it, for various repos, and they sometimes need to land in certain orders.

    Other ways I’ve tried to cope have been with landing as soon as the review comes in and avoiding writing patches for parts that need to land later until the first parts are landed/deployed. I found that method encompasses unneeded end-to-end times on bugs, and unnecessary context-switching.

    To curb that I wrote a mozilla-build (bash) script [in ~/.bash_profile ] that sets an alias `patchset` that I run, and it works!

    It especially works because I keep my code in /c/Sources/hg/* some repos are multi-levels deep, so this code could/should be improved or at least edited for your uses, but without further ado, this is how I manage my patchset (again note, all my work is in Mercurial, I do convert my stuff over to git/etc as needed though):

    EDIT: I forgot to give credit for my normalize_path() implementation, which I borrowed from http://www.linuxjournal.com/content/normalizing-path-names-bash

    Provided as-is, without alteration (again cleanups likely):

    function normalize_path()
    {
        # Remove all /./ sequences.
        local   path=${1//\/.\//\/}
    
        # Remove first dir/.. sequence.
        local   npath=$(echo $path | sed -e 's;[^/][^/]*/\.\./;;')
    
        # Remove remaining dir/.. sequence.
        while [[ $npath != $path ]]
        do
            path=$npath
            npath=$(echo $path | sed -e 's;[^/][^/]*/\.\./;;')
        done
        path=$npath
        npath=$(echo $path | sed -e 's;[^/][^/]*/\.\.$;;')
        echo $npath
    }
    
    function patchset() {
        pushd /c/Sources/hg >/dev/null
        for i in `find . -maxdepth 2 ! \( -name l10n -prune \) -a -name .hg`;
          do
            pushd $i/.. >/dev/null;
            if [ `hg --config color.mode=auto qseries | wc -l` != 0 ]; then
                echo -n "======= "; echo -n $(normalize_path $i/..); echo " =====";
                hg qseries;
            fi
            popd >/dev/null;
        done
        for i in `find ./users -maxdepth 3 -name .hg`;
          do
            pushd $i/.. >/dev/null;
            if [ `hg --config color.mode=auto qseries | wc -l` != 0 ]; then
                echo -n "======= "; echo -n $(normalize_path $i/..); echo " =====";
                hg qseries;
            fi
            popd >/dev/null;
        done
        for i in `find ./l10n -maxdepth 3 -name .hg`;
          do
            pushd $i/.. >/dev/null;
            if [ `hg --config color.mode=auto qseries | wc -l` != 0 ]; then
                echo -n "======= "; echo -n $(normalize_path $i/..); echo " =====";
                hg qseries;
            fi
            popd >/dev/null;
        done
        popd >/dev/null
    }

    And the output of that, as it stands for me _today_:

    Justin@AQUARIUS /c/Sources/hg/mozharness
    $ patchset
    ======= ./braindump/ =====
    seamonkey-bouncer
    ======= ./buildbot-configs/ =====
    ionmonkey
    ======= ./buildbotcustom/ =====
    ionmonkey
    ======= ./mozharness/ =====
    ionmonkey
    ======= ./slaveapi/ =====
    timestamp
    docs

    Lastly, my quantity of repos:

    $ pushd /c/Sources/hg
    /c/Sources/hg /c/Sources/hg/mozharness
    
    Justin@AQUARIUS /c/Sources/hg
    $ find . -maxdepth 2 ! \( -name l10n -prune \) -a -name .hg | wc -l
    17
    
    Justin@AQUARIUS /c/Sources/hg
    $ find ./users -maxdepth 3 -name .hg | wc -l
    19
    
    Justin@AQUARIUS /c/Sources/hg
    $ find ./l10n -maxdepth 3 -name .hg | wc -l
    52

    Hope this helps!

    April 04, 2014 12:11 AM

    April 02, 2014

    Kim Moir (kmoir)

    Enabling tests that ride the trains

    At Mozilla, we have a train schedule for releases. Changes are first landed on trunk branches, then are uplifted to other branches for stabilization. Johnathan Nightingale has a very good blog post that explains this concept. For instance, the usual route for a change landed on a trunk branch such as mozilla-central would be to uplift to mozilla-aurora, mozilla-beta, and then finally mozilla-release, where it would be available to users in the next release.

    Merge day, which occurs every six weeks, is when changes are uplifted to the next stable branch. Here's a picture from a talk John O'Duinn gave last year that shows an example of how changes move between branches [1].

    Picture from John O'Duinn's Release Engineering as a Force Multiplier talk at Releng 2013

     
    For the release engineering team, each merge day we would update code in our buildbot configs to reflect changes that needed to be made after uplift. For instance, we often deprecate platforms or add new tests that only apply to certain branches. We used to have to specify the branches these applied to in our configs and update them every merge day, and it was tricky to get right. Last fall, Steve Fink fixed a bug that allows us to base config versions on the version of gecko that rides the trains. So each merge day we update the version of gecko in our configs on a per-branch basis, and then have code like this so that the tests are only enabled for branches where the gecko version applies:




    ......
    Enable jittests for desktop where gecko is 31 or more
    Load jetpack where gecko is at least 21
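    A minimal sketch of what such a gecko-version gate might look like. This is not the actual RelEng config; the `BRANCHES` dict, branch versions, and function are all illustrative, but the two thresholds (31 for jittests, 21 for jetpack) come from the post:

```python
# Illustrative sketch (not the real buildbot config): gate tests on the
# gecko version each branch carries, so they ride the trains automatically.
BRANCHES = {
    "mozilla-central": {"gecko_version": 31},
    "mozilla-aurora": {"gecko_version": 30},
    "mozilla-beta": {"gecko_version": 29},
}

def enabled_tests(branch):
    gecko = BRANCHES[branch]["gecko_version"]
    tests = []
    if gecko >= 31:  # Enable jittests for desktop where gecko is 31 or more
        tests.append("jittest")
    if gecko >= 21:  # Load jetpack where gecko is at least 21
        tests.append("jetpack")
    return tests

print(enabled_tests("mozilla-central"))
print(enabled_tests("mozilla-beta"))
```

    On merge day, bumping each branch's `gecko_version` is the only change needed; the gates follow automatically.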



    To test these changes, you can set up your buildbot test master and run builder_list.py against it.  The builder_list.py script outputs the list of build/test jobs (builders) that are enabled on your master.  Then apply your patch to the master and diff the resulting builder files to ensure that the tests are enabled on the branches you want.  As a side note, if you are enabling tests on platforms that live on different test masters, you'll have to configure a master for each of mac, linux and windows, and diff the builders for each platform.  If you are enabling tests on trunk trees for the first time, your diff should not reveal any new builders on mozilla-aurora, mozilla-beta or mozilla-release, only on mozilla-central, mozilla-inbound and the associated twig branches.
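    That trunk-only check can be automated once you have the before/after builder lists. A minimal sketch, assuming builder names embed their branch name (the names and lists here are hypothetical, not real builders):

```python
# Hypothetical sketch: given builder lists captured before and after a patch,
# check that newly-enabled builders appear only on trunk branches.
TRUNK_BRANCHES = ("mozilla-central", "mozilla-inbound")

def new_builders(before, after):
    """Builders present after the patch but not before."""
    return sorted(set(after) - set(before))

def only_on_trunk(added, trunk=TRUNK_BRANCHES):
    # Each builder name is assumed to contain its branch name.
    return all(any(branch in name for branch in trunk) for name in added)

before = ["mozilla-central opt test mochitest", "mozilla-beta opt test mochitest"]
after = before + ["mozilla-central opt test jittest"]
added = new_builders(before, after)
print(added, only_on_trunk(added))
```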
     
    I recently fixed a few bugs where there was a request to enable tests on trunk branches and have them ride the trains, so I thought I'd write about it in case others have to implement a similar request.

    Train-related graffiti in Antwerp (Belgium), near Antwerpen-Centraal train station by  ©vitalyzator, https://flic.kr/p/6tQ3H Creative Commons by-nc-sa 2.0


    Further reading and notes
    1 This applies to Firefox Desktop and mobile only, Firefox OS is on a different cadence and there are different branches involved
    Release Management's rapid release calender
    Release Engineering Merge duty
    Release Engineering Testing Techniques

    April 02, 2014 07:21 PM

    Armen Zambrano G. (@armenzg)

    Mozilla's recent CI improvements save roughly 60-70% on our AWS bill

    bhearsum, catlee, glandium, taras and rail have been working hard for the last few months at cutting our AWS bills by improving Mozilla RelEng's CI.


    From looking at the numbers, I can say that with the changes they have made we're saving roughly 60-70% on our AWS bill.

    If you see them, give them a big pat on the back, this is huge for Mozilla.

    Here are some of the projects that helped with this:


    Creative Commons License
    This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

    April 02, 2014 03:53 PM

    March 31, 2014

    Kim Moir (kmoir)

    Schooling yourself in release engineering

    Traditionally, there haven't been many college or university courses that cover the fundamentals of release engineering.  This means that students aren't exposed to the potential that a career in release engineering has to offer.  It also means that students who go on to more traditional developer roles lack background on the complexity and challenges that arise within the scope of release engineering.  However, this is beginning to change, which is fantastic!  For example:

    Release Engineering as a Discipline,  Center of Computer Science, RWTH Aachen University in Aachen Germany

    Overview of the Build and Release Process, (updated link) Seneca College, Toronto


    Release Engineering -- Applications of Mining Software Repositories, École Polytechnique, Montréal

    Software Release Planning, University of Calgary

    Seneca College Library Image ©moqub https://flic.kr/p/9PyVVm Creative Commons by-nc-sa 2.0


    If anyone knows of other courses that are offered, I'd love to hear about them.  Maybe someday I won't have to explain to new people I meet what a release engineer does all day.  Just kidding, this will still happen :-)

    March 31, 2014 06:38 PM

    March 28, 2014

    Armen Zambrano G. (@armenzg)

    Mozilla's Pushes - February 2014

    Here's February's monthly analysis (a bit late) of the pushes to our Mozilla development trees (Gaia trees are excluded).

    You can load the data as an HTML page or as a json file.

    TRENDS

    • We are staying in the 7,000 pushes/month range
    • Last year we only had 4 months with more than 7,000 pushes

    HIGHLIGHTS

    • 7,275 pushes
    • 260 pushes/day (average)
      • NEW RECORD
    • Highest number of pushes/day: 421 pushes on 02/26
      • The record is 427 pushes, set on January 27th
    • Highest pushes/hour (average): 16.57 pushes/hour
      • NEW RECORD
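    As a sanity check, the pushes/day average follows directly from the monthly total (numbers taken from the post):

```python
# Verify the reported average: 7,275 pushes over February 2014 (28 days).
pushes = 7275
days = 28
avg_per_day = pushes / days
print(round(avg_per_day))  # rounds to the reported 260 pushes/day
```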

    GENERAL REMARKS

    • Try continues to account for around 50% of all the pushes
    • The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 30% of all the pushes

    RECORDS

    • August 2013 was the month with the most pushes (7,771 pushes)
    • February 2014 has the highest pushes/day average (260 pushes/day)
    • February 2014 has the highest pushes/hour average (16.57 pushes/hour)
    • January 27th, 2014 had the highest number of pushes in one day (427 pushes)

    DISCLAIMER



    Creative Commons License
    This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

    March 28, 2014 07:39 PM

    Ben Hearsum (bhearsum)

    This week in Mozilla RelEng – March 28th, 2014

    Major Highlights:

    Completed work (resolution is ‘FIXED’):

    In progress work (unresolved and not assigned to nobody):

    March 28, 2014 06:41 PM