Planet Release Engineering

April 27, 2016

Rail Alliev (rail)

Firefox 46.0 and SHA512SUMS

In my previous post I introduced the new release process we have been adopting in the 46.0 release cycle.

Release build promotion has been in production since Firefox 46.0 Beta 1. We have discovered some minor issues; some of them are already fixed, while others are still pending.

One of the visible bugs is Bug 1260892. We generate a big SHA512SUMS file, which should contain all important checksums. With the numerous changes to the process, the file no longer covers all required files: some files are missing, and some have different names.

We are working on fixing the bug, but in the meantime you can use the following workaround to verify the files.

For example, if you want to verify http://ftp.mozilla.org/pub/firefox/releases/46.0/win64/ach/Firefox%20Setup%2046.0.exe, you need to use the following two files:

http://ftp.mozilla.org/pub/firefox/candidates/46.0-candidates/build5/win64/ach/firefox-46.0.checksums

http://ftp.mozilla.org/pub/firefox/candidates/46.0-candidates/build5/win64/ach/firefox-46.0.checksums.asc

Example commands:

# download all required files
$ wget -q http://ftp.mozilla.org/pub/firefox/releases/46.0/win64/ach/Firefox%20Setup%2046.0.exe
$ wget -q http://ftp.mozilla.org/pub/firefox/candidates/46.0-candidates/build5/win64/ach/firefox-46.0.checksums
$ wget -q http://ftp.mozilla.org/pub/firefox/candidates/46.0-candidates/build5/win64/ach/firefox-46.0.checksums.asc
$ wget -q http://ftp.mozilla.org/pub/firefox/releases/46.0/KEY
# Import Mozilla Releng key into a temporary GPG directory
$ mkdir .tmp-gpg-home && chmod 700 .tmp-gpg-home
$ gpg --homedir .tmp-gpg-home --import KEY
# verify the signature of the checksums file
$ gpg --homedir .tmp-gpg-home --verify firefox-46.0.checksums.asc && echo "OK" || echo "Not OK"
# calculate the SHA512 checksum of the file
$ sha512sum "Firefox Setup 46.0.exe"
c2ed64298ac2140d8dbdaed28cabc90b38dd9444e9c0d6dd335a2a32cf043a35314945536a5c75124a88bf418a4e2ba77256be223425380e7fcc45a97da8f479  Firefox Setup 46.0.exe
# look up the checksum in the checksums file
$ grep c2ed64298ac2140d8dbdaed28cabc90b38dd9444e9c0d6dd335a2a32cf043a35314945536a5c75124a88bf418a4e2ba77256be223425380e7fcc45a97da8f479 firefox-46.0.checksums
c2ed64298ac2140d8dbdaed28cabc90b38dd9444e9c0d6dd335a2a32cf043a35314945536a5c75124a88bf418a4e2ba77256be223425380e7fcc45a97da8f479 sha512 46275456 install/sea/firefox-46.0.ach.win64.installer.exe

This is just a temporary workaround; the bug will be fixed ASAP.

April 27, 2016 04:47 PM

April 23, 2016

Hal Wine (hwine)

Enterprise Software Writers R US

Enterprise Software Writers R US

Someone just accused me of writing Enterprise Software!!!!!

Well, the “someone” is Mahmoud Hashemi from PayPal, and I heard him on the Talk Python To Me podcast (episode 54). That whole episode is quite interesting - go listen to it.

Mahmoud makes a good case, presenting nine “hallmarks” of enterprise software (the more that apply, the more “enterprisy” your software is). Most of the work RelEng does easily hits 7 of the points. You can watch Mahmoud define Enterprise Software for free by following the link from his blog entry (link is 2.1 in the table of contents). (It’s part of his “Enterprise Software with Python” course offered on O’Reilly’s Safari.) One advantage of watching his presentation is that PayPal’s “Mother of all Diagrams” makes ours look simple! (Although “blue spaghetti” is probably tastier.)

Do I care about “how enterprisy” my work is? Not at all. But I do like the way Mahmoud explains the landscape and challenges of enterprise software. He makes it clear, in the podcast, how acknowledging the existence of those challenges can inform various technical decisions. Such as choice of language. Or need to consider maintenance. Or – well, just go listen for yourself.

April 23, 2016 07:00 AM

April 22, 2016

Armen Zambrano G. (@armenzg)

The Joy of Automation

This post is to announce The Joy of Automation YouTube channel. On this channel you should be able to watch presentations about automation work by Mozilla's Platform Operations. I hope more folks than just me will share their videos here.

This follows the idea that mconley started with The Joy of Coding and his livehacks.
At the moment there are only "Unscripted" videos of me hacking away. I hope to do live hacks one day, but for now they're offline videos.

Mistakes I made that any Platform Ops member wanting to contribute may want to avoid:




Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

April 22, 2016 02:26 PM

April 18, 2016

Chris Cooper (coop)

RelEng & RelOps Weekly highlights - April 18, 2016

[Image: SF2 Balrog character select portrait] “My update requests have your blood on them.”
This is release candidate week, traditionally one of the busiest times for releng. Your patience is appreciated.

Improve Release Pipeline:

Varun began work on improving Balrog’s backend to make multifile responses (such as GMP) easier to understand and configure. Historically it has been hard for releng to enlist much help from the community due to the access restrictions inherent in our systems. Kudos to Ben for finding suitable community projects in the Balrog space, and then more importantly, finding the time to mentor Varun and others through the work.

Improve CI Pipeline:

With build promotion well underway for the upcoming Firefox 46 release, releng is switching gears and jumping into the TaskCluster migration with both feet. Kim and Mihai will be working full-time on migration efforts, and many others within releng have smaller roles. There is still a lot of work to do just to migrate all existing Linux workloads into TaskCluster, and that will be our focus for the next 3 months.

Release:

We started doing the uplifts for the Firefox 46 release cycle late last week. Release candidate builds should be starting soon. As mentioned above, this is the first non-beta release of Firefox to use the new build promotion process.

Last week, we shipped Firefox and Fennec 45.0.2 and 46.0b10, Firefox 45.0.2esr and Thunderbird 45.0. For further details, check out the release notes here:

See you next week!

April 18, 2016 06:28 PM

April 17, 2016

Armen Zambrano G. (@armenzg)

Project definition: Give Treeherder the ability to schedule TaskCluster jobs

This is a project definition that I put up for GSoC 2016. It helps students get started researching the project.

The main things I give in here are:


NOTE: This project has a few parts that carry risk and could change the implementation. It depends on close collaboration with dustin.


-----------------------------------
Mentor: armenzg 
IRC:   #ateam channel

Give Treeherder the ability to schedule TaskCluster jobs

This work will enable "adding new jobs" on Treeherder to work with pushes lacking TaskCluster jobs (our new continuous integration system).
Read this blog post to learn how the project was built for Buildbot jobs (our old continuous integration system).

The main work for this project is tracked in bug 1254325.

In order for this to work we need the following pieces:

A - Generate data source with all possible tasks

B - Teach Treeherder to use the artifact

C - Teach pulse_actions to listen for requests from Treeherder

  • pulse_actions is a pulse listener for Treeherder actions
  • You can see pulse_actions’ workflow here
  • Once part B is completed, we will be able to listen for messages requesting certain TaskCluster tasks to be scheduled, and we will schedule those tasks on behalf of the user (a rough sketch of this flow follows this list)
  • RISK: Depending on whether the TaskCluster actions project is completed on time, we might instead make POST requests to an API
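To make the flow in part C more concrete, here is a minimal sketch (not the actual pulse_actions code) of a listener that schedules a requested task via taskcluster-client.py. The exchange name, queue name, and the shape of the incoming message body are assumptions for illustration only; real Pulse credentials and TaskCluster scopes would be needed.

import taskcluster
from kombu import Connection, Exchange, Queue

def schedule_requested_task(body, message):
    # Assumption: the message body carries the full task definition that the
    # Treeherder user asked us to schedule on their behalf.
    tc_queue = taskcluster.Queue()      # requires TaskCluster credentials/scopes
    task_id = taskcluster.slugId()      # generate a fresh task id
    tc_queue.createTask(task_id, body["task"])
    message.ack()

# Hypothetical Treeherder actions exchange on Mozilla Pulse.
exchange = Exchange("exchange/treeherder/v1/job-actions", type="topic")
pulse_queue = Queue("queue/<pulse-user>/th-actions-sketch",
                    exchange=exchange, routing_key="#")

with Connection("amqps://<pulse-user>:<password>@pulse.mozilla.org") as conn:
    with conn.Consumer(pulse_queue, callbacks=[schedule_requested_task]):
        while True:
            conn.drain_events()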


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

April 17, 2016 04:01 PM

Project definition: SETA re-write

As an attempt to attract candidates to GSoC, I wanted to make sure that the possible projects were achievable rather than leading candidates down a path of pain and struggle. It also helps me picture the order in which it makes more sense to accomplish things.

It was also a good exercise for students: they have to read it and ask questions about anything that is not clear, and it gives them plenty to read about the project.

I want to share this and another project definition in case it is useful for others.

----------------------------------
We want to rewrite SETA to be easy to deploy through Heroku and to support TaskCluster (our new continuous integration system) [0].

Please read this document carefully before starting to ask questions. There is high interest in this project, and it is burdensome to have to re-explain it to every new prospective student.

Main mentor: armenzg (#ateam)
Co-mentor: jmaher (#ateam)

Please read jmaher’s blog post carefully [1] before reading anymore.

Now that you have read jmaher’s blog post, I will briefly go into some specifics.
SETA reduces the number of jobs that get scheduled on a developer’s push.
A job is every single letter you see on Treeherder. For every developer’s push there are a number of these jobs scheduled.
On every push, Buildbot [6] decides what to schedule depending on the data it fetched from SETA [7].

The purpose of this project is two-fold:
  1. Write SETA as an independent project that is:
    1. maintainable
    2. more reliable
    3. automatically deployed through Heroku app
  2. Support TaskCluster, our new CI (continuous integration system)

NOTE: The current code of SETA [2] lives within a repository called ouija.

Ouija does the following for SETA:
  1. It has a cronjob which kicks in every 12 hours to scrape information about jobs from every push
  2. It stores the information about jobs (which it grabs from Treeherder) in a database

SETA then queries the database to determine which jobs should be scheduled. SETA chooses jobs that are good at reporting issues introduced by developers. SETA has its own set of tables and adds the data there for quick reference.

Involved pieces for this project:
  1. Get familiar with deploying apps and using databases in Heroku
  2. Host SETA in Heroku instead of http://alertmanager.allizom.org/seta.html
  3. Teach SETA about TaskCluster
  4. Change the gecko decision task to reliably use SETA [5][6]
    1. If the SETA service is not available we should fall back to running all tasks/jobs (see the sketch after this list)
  5. Document how SETA works and auto-deployments of docs and Heroku
    1. Write automatically generated documentation
    2. Add auto-deployments to Heroku and readthedocs
  6. Add tests for SETA
    1. Add tox/travis support for tests and flake8
  7. Re-write SETA using ActiveData [3] instead of using data collected by Ouija
  8. Make the current CI (Buildbot) use the new SETA Heroku service
  9. Create SETA data for per-test information instead of per-job information (stretch goal)
    1. On Treeherder we have jobs that contain tests
    2. Tests get re-ordered between those different chunks
    3. We want to run jobs at a per-directory or per-manifest level
  10. Add priorities into SETA data (stretch goal)
    1. Priority 1 jobs get triggered on every push
    2. Priority 2 jobs get triggered on every Y pushes
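As a rough illustration of piece 4.1 above, here is a minimal Python sketch of how the decision task could consume SETA data and fall back to scheduling everything when the service is unreachable. The Heroku URL and the shape of the JSON response are assumptions for illustration, not the real SETA API.

import requests

# Hypothetical endpoint; the real one would be wherever SETA ends up deployed on Heroku.
SETA_ENDPOINT = "https://seta-example.herokuapp.com/data/setadetails/"

def low_value_jobs(branch):
    """Return the set of job names SETA considers safe to skip on this push.

    If SETA cannot be reached, return an empty set so that the decision task
    falls back to scheduling all tasks/jobs.
    """
    try:
        response = requests.get(SETA_ENDPOINT, params={"branch": branch}, timeout=10)
        response.raise_for_status()
        return set(response.json().get("jobtypes", []))
    except (requests.RequestException, ValueError):
        return set()

def filter_tasks(all_task_labels, branch):
    # Keep every task that SETA did not mark as low value.
    skip = low_value_jobs(branch)
    return [label for label in all_task_labels if label not in skip]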

[0] http://docs.taskcluster.net/
[1] https://elvis314.wordpress.com/tag/seta/
[2] https://github.com/dminor/ouija/blob/master/tools/seta.py
[3] http://activedata.allizom.org/tools/query.html
[4] https://bugzilla.mozilla.org/show_bug.cgi?id=1243123
[5] https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=gecko
[6] testing/taskcluster/mach_commands.py#l280
[7] http://hg.mozilla.org/build/buildbot-configs/file/default/mozilla-tests/config_seta.py
[8] http://alertmanager.allizom.org/seta.html


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

April 17, 2016 03:54 PM

April 13, 2016

Armen Zambrano G. (@armenzg)

Improving how I write Python tests

The main focus of this post is what I've learned about writing Python tests, using mocks, and patching functions properly. This is not an exhaustive post.

What I'm writing now is something I should have learned many years ago as a Python developer. It can be embarrassing to admit, but I've decided to share this with you since I know it would have helped me earlier in my career, and I hope it might help you as well.

Somebody has probably written about this topic already; if you're aware of a good blog post covering it, please let me know. I would like to see what else I've missed.

Also, if you want to start a Python project from scratch or to improve your current one, I suggest you read "Open Sourcing a Python Project the Right Way". Many of the things he mentions are what I follow for mozci.

This post might also be useful for new contributors trying to write tests for your project.

My takeaway

These are some of the things I've learned:

  1. Make running tests easy
    • We use tox to help us create a Python virtual environment, install the dependencies for the project and to execute the tests
    • Here's the tox.ini I use for mozci
  2. If you use py.test, learn how to not capture the output
    • Use the -s flag to avoid capturing the output
    • If your project does not print but instead uses logging, add the pytest-capturelog plugin to py.test and the log output will be shown immediately
  3. If you use py.test, learn how to jump into the debugger upon failures
    • Use --pdb to drop into the Python debugger upon failure
  4. Learn how to use @patch and Mock properly

How I write tests

This is what I do:


@patch properly and use Mocks

What I'm doing now to patch modules is the following:
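Below is a minimal sketch of the pattern, using a hypothetical module and function (not the real mozci code): patch the name where the code under test looks it up, and assert on how the mock was used.

from mock import Mock, patch   # `unittest.mock` on Python 3

from myproject import buildapi_client   # hypothetical module under test

@patch("myproject.buildapi_client.requests.get")
def test_fetch_jobs_returns_parsed_json(mock_get):
    # Arrange: the patched requests.get returns a canned response object.
    mock_response = Mock()
    mock_response.json.return_value = {"builds": []}
    mock_get.return_value = mock_response

    # Act: call the code under test; no real network request happens.
    result = buildapi_client.fetch_jobs("mozilla-inbound", revision="abcdef123456")

    # Assert: the HTTP layer was hit exactly once and the parsed payload came back.
    assert mock_get.call_count == 1
    assert result == {"builds": []}

Running such a test with py.test -s --pdb also gives you the uncaptured output and the debugger-on-failure behaviour mentioned in the takeaways above.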



The way Mozilla CI tools is designed begs for integration tests; however, I don't think it is worth going beyond unit testing + mocking. The reason is that mozci might not stick around once we have fully migrated away from Buildbot, which was the hard part to solve.


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

April 13, 2016 07:26 PM

April 12, 2016

Armen Zambrano G. (@armenzg)

mozci-trigger now installs with pip install mozci-scripts

If you use mozci from the command line this applies to you; otherwise, carry on! :)

In order to use mozci from the command line you now have to install with this:
pip install mozci-scripts
instead of:
pip install mozci

This helps to maintain the scripts separately from the core library since we can control which version of mozci the scripts use.

All scripts now live under the scripts/ directory instead of inside the library:
https://github.com/mozilla/mozilla_ci_tools/tree/master/scripts




Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

April 12, 2016 07:32 PM

April 11, 2016

Chris Cooper (coop)

RelEng & RelOps Weekly highlights - April 11, 2016

[Image: Wildstyle and Freedom Friday] Freedom Friday!
Welcome back to our weekly summary of releng-related activities at our new date & time (Monday mornings). Did you miss us? ;)

Modernize infrastructure:

Rail and Nick made taskcluster uploads more resilient to flaky network conditions. (https://bugzil.la/1250458)

Improve Release Pipeline:

Rail blogged about our ongoing efforts with using build promotion for releases. Firefox 46 is being targeted as the first release build to use build promotion. A steady stream of beta builds has already been released via promotion, so we’re pretty confident in the process now. Promoted builds for Fennec won’t make the first pass, but we plan to add them in the Firefox 47 cycle.

Improve CI Pipeline:

I want to call out the recent work being done by the build team to modernize the build system. As David reports in his firefox-dev post, the team has recently managed to realize a drastic reduction in Windows PGO build times. This reduction brings the build time in line with those for Linux PGO builds. Since Windows PGO builds are currently a long pole in both the CI and release process, this allows us to provide more timely feedback about build quality to developers and sheriffs. Pretty graphs are available.

Release:

Last week we shipped Firefox 46.0b9, and there are several other releases still in flight. See the weekly post-mortem notes for further details.

See you next week!

April 11, 2016 05:35 PM

April 05, 2016

Rail Alliev (rail)

Release Build Promotion Overview

Hello from Release Engineering! Once a month we highlight one of our projects to help the Mozilla community discover a useful tool or an interesting contribution opportunity. This month's project is Release Build Promotion.

What is Release Build Promotion?

Release build promotion (or "build promotion", or "release promotion" for short) is the latest release pipeline for Firefox, developed by Release Engineering at Mozilla.

Release build promotion starts with the builds produced and tested by CI (e.g. on mozilla-beta or mozilla-release). We take these builds, and use them as the basis to generate all our l10n repacks, partial updates, etc. that are required to release Firefox. We "promote" the CI builds to the release channel.

How is this different?

The previous release pipeline also started with builds produced and tested by CI. However, when it came time to do a release, we would create an entirely new set of builds with slightly different build configuration. These builds would not get the regular CI testing.

Release build promotion improves the process by removing the second set of builds. This drastically improves the total time to do a release, and also increases our confidence in our products since we now are shipping exactly what's been tested. We also improve visibility of the release process; all the tasks that make up the release are now reported to Treeherder along with the corresponding CI builds.

Current status

Release build promotion is in use for Firefox desktop starting with the 46 beta cycle. ESR and release branches have not yet been switched over.

Firefox for Android is also not yet handled. We plan to have this ready for Firefox 47.

Some figures

One of the major motivations for this project was improving our release end-to-end times. I pulled some data to compare:

  • One of the Firefox 45 betas took almost 12 hours
  • One of the Firefox 46 betas took less than 3 hours

What's next?

  • Support Firefox for Android
  • Support release and ESR branches
  • Extend this process back to the aurora and nightly channels

Can I contribute?

Yes! We still have a lot of things to do and welcome everyone to contribute.

  • Bug 1253369 - Notifications on release promotion events.
  • (No bug yet) Redesign and modernize Ship-it to reflect the new release workflow. This will include a new UI, multiple sign-offs, a new release-runner, etc.
  • Tracking bug

More information

For more information, please refer to these other resources about build promotion:

There will be multiple blog posts regarding this project. You have probably seen Jordan's blog on how to be productive when distributed teams get together. It covers some of the experience we had during the project sprint week in Vancouver.

April 05, 2016 01:34 PM

April 04, 2016

Chris Cooper (coop)

RelEng & RelOps Weekly highlights - April 4, 2016

We skipped a week of updates due to the Easter holiday. I’ve also moved the timing of these update emails/posts from Friday afternoon to Monday so that more people will see them. Look for your releng/relops highlights on Mondays going forward.

Improve CI Pipeline:

Aki submitted a pull request for generated async code+tests for taskcluster-client.py, with 100% async test coverage. (https://github.com/taskcluster/taskcluster-client.py/pull/49)

Callek got us running the mozharness tests (CI tests for mozharness) that used to run on Travis-CI and that we lost when we moved mozharness in-tree. They are now based on in-tree code and run in TaskCluster. These tests only run when someone touches mozharness code. (http://bugzil.la/1240184)

Operational:

Callek closed out a bunch of old bugs related to foopies, pandas, and a few lingering ones about tegras, since we have retired that infrastructure in favor of Android emulators.

Kendall, Jake, and Mark worked to patch much of our infrastructure against the git 0-day vulnerability. They’re finishing up the tail end of those machines that have significantly less exposure/risk.

Rob landed patches to increase the stability of our Windows AWS AMI generation, making that process more robust. There’s some additional work to be done around verifying certificate downloads to fix the remaining issues we know about.

Nick landed some patches to improve our AWS recovery time (by about an hour) when we terminate many instances at once.

Amy has initiated a purchase for another 192 mac minis to expand our existing OS X 10.10 test pool in support of e10s and other load.

Release:

We released Firefox 45.0.1esr and 38.7.1esr, as well as Firefox and Fennec 45.0.1, 46.0b2, and 46.0b4. Check out the post-mortem notes for more details: https://wiki.mozilla.org/Releases:Release_Post_Mortem:2016-03-23

See you next week!

April 04, 2016 04:22 PM

March 31, 2016

Jordan Lund (jlund)

being productive when distributed teams get together

I'm a Mozilla Release Engineer, which means I am a strong advocate for remote teams. Our team is very distributed and I believe we are successful at it too.

Funnily enough, I think a big part of the distributed model includes getting remote people physically together once in a while. When you do, you create a special energy. Unfortunately that energy can sometimes be wasted.

Have you ever had a work week that goes something like the following:

You stumble your way to the office in a strange environment on day one. You arrive, find some familiar faces, hug, and then pull out your laptop. You think, 'okay, first things first, I'll clear my email, bugmail, irc backscroll, and maybe even that task I left hanging from the week before.'

At some point, someone speaks up and suggests you come up with a schedule for the week. A google doc is opened and shared and after a few 'bikeshedding' moments, it's lunch time! Someone local to the area or an advanced yelper in the group advertises a list of places to eat, and after the highest-rated food truck wins your stomach's approval, you come back to the office and ... quickly check your email.

The above scenario plays out in a similar fashion for the remainder of the week. Granted, I exaggerate and some genuine ideas get discussed. Maybe even a successful side sprint happens. But I am willing to bet that you, too, have been to a team meet up like this.

So can it be done better? Well, I was at one recently in Vancouver, and this post will describe what I think made the difference.

forest firefighting

Prior to putting out burning trees at Mozilla, I put out burning trees as a Forest Firefighter in Canada. BC Forest Protection uses the Incident Command System (ICS). That framework enabled us to safely and effectively suppress wildfires. So how does it work and why am I bringing it up? Well, without this framework, firefighters would spend a lot of time on the fire line deciding where to camp, what to eat, what part of the fire to suppress first, and how to do it. But thanks to ICS, these decisions are already made and the firefighters can get back to doing work that feels productive!

You can imagine how team meet ups could benefit from such organization. With ICS, there are four high level branches: Logistics, Planning, Operations, and Finance & Administration. The last one doesn't really apply to our 'work week' scenario, as we use Egencia prior to arriving and Expensify after leaving, so it doesn't affect productivity during the week. However, let's dive into the other three and discover how they correlate to team meet ups.

For each of these branches, someone should be nominated or assigned and complete the branch responsibilities.

Logistics

Ideally the Logistics lead should be someone who is local to the area or has been there before. This person is required to create an etherpad/Google Doc that:

Now you might be saying, "wait a second, I can do all those things myself and don't need to be hand held." And while that is true, the benefit here is you reduce the head space required on each individual, the time spent debating, and you get everyone doing the same thing at the same time. This might not sound very flexible or distributed but remember, that's the point; you're centralized and together for the week! You might also be thinking "I really enjoy choosing a hotel and restaurant." That's fine too, but I propose you coordinate with the logistics assignee prior to the work week rather than spend quality work week time on these decisions.

Planning

Now that you have logistics sorted, it's time to do all the action planning. Traditionally we've had work weeks where we pre-plan high level goals we want to accomplish but we don't actually fill out the schedule until Monday as a group. The downside here is this can chew up a lot of time and you can easily get sidetracked before completing the schedule. So, like Logistics, assign someone to Planning.

This person is required to create a [insert issue tracker of choice] list and determine the bugs/issues that should be tackled during the week. The way this is done of course depends on the issue tracker, style of the group, and type of team meet up, but here is an example we used for finishing a deliverable-related goal.

write a list of issues for each of the following categories:

For the above, we used Trello which is nice as it's really just a board of sticky notes. I could write a whole separate blog post on how to be effective with it by incorporating bugzilla links, assignees to cards, tags, sub-lists, and card voting, but for now, here is a visual example:

Trello Work Week Board

The beauty here is that all of the tasks (sticky notes) are done upfront and each team member simply plucks them off the 'hard blockers' and 'nice to have' lists one by one, assigns them to themselves, and moves them into the completed section.

No debating or brainstorming what to do, just sprint!

Operations

The Operations assignee here should:

If you want to take advantage of a successful physical team meetup, forget about the communication tools that are designed for distributed environments.

During the work week I think it is best to ignore email, bug mail, and irc. Treat the week like you are on PTO: change your bugzilla name and create a vacation responder. Have the Operations assignee respond to urgent requests and act as a proxy to the outside world.

It is also nice to have the Operations assignee moderate internally by constantly iterating over the trello board state, grasping what needs to be done, where you are falling behind, and what new things have come up.

Vancouver by accident

This model wasn't planned or agreed upon prior to the Vancouver team meetup. It actually just happened by accident. I (jlund) took on Logistics, rail handled Planning, and catlee acted in that sort of moderator/proxy role in Operations. Everyone at the meet up finished the week satisfied and, I think, hungry to try it again.

I'm looking forward to using a framework like this in the future. What are your thoughts?

March 31, 2016 09:32 PM

March 18, 2016

Chris Cooper (coop)

RelEng & RelOps Weekly highlights - March 18, 2016

After the huge push last week to realize our first beta release using build promotion, there’s not a whole lot of new work to report on this week. We continue to polish the build promotion process in the face of a busier-than-normal week in terms of operational work.

Release:

Firefox 46.0 beta 1 was finally released to the world last week, and the ninth rebuild was, in fact, the charm. As the first release attempted using the new build promotion process, this is a huge milestone for Mozilla releases.

As proof we’re getting better, Firefox 46.0 beta 2 was released this week using the same build promotion process and only required three build iterations. Progress!

This was also a week for dot releases, with security releases in the pipe for Firefox 45 and our extended support release (ESR) versions.

Kim just stepped into the releaseduty rotation for the Firefox 46 cycle. Kudos to Mihai for fixing up the releaseduty docs during his rotation so the process is easy to step into! We released Firefox 45.0esr, Firefox 46.0b1, Thunderbird 38.7.0 and Firefox 38.7.0esr with several other releases in the pipeline. See the notes for details:

Operational:

There is new capacity in AWS for our Linux64 and Emulator 64 jobs thanks to Vlad and Kim’s work in bug 1252248.

Alin and Amy moved 10 WinXP machines to the Windows 8 pool to reduce pending counts on that platform. (bug 1255812)

Kim removed all the code used to run our panda infrastructure from our repos in bug 1186617. Amy is in the process of decommissioning the associated hardware in bug 1189476.

Speaking of Amy, she received a much-deserved promotion this week. To quote from Lawrence’s announcement email:

“I’m excited to promote Amy Rich into the newly created position of Head of Operations [for Platform Operations]. This new role, which reports directly to me, expands the purview of her existing systems ops team, and includes assisting me with more management leadership responsibility.

“Amy’s unique mix of skills make her a great fit for this role. She has a considerable systems engineering background, and she and her team have been responsible for greatly improving our release infrastructure over the past five years. As a people manager, her commitment to both individuals and the big picture engenders loyalty, respect, and admiration. She is inquisitive and reflective, bringing strategic perspective to decision-making processes such as setting the relative priority between TaskCluster migration and Windows 7 in AWS. As a leader, she has recently stepped up to shepherd projects aimed at creating a more cohesive Platform Operations team, and she is also assisting with Mozilla’s company-wide Diversity & Inclusion strategy.

“Amy’s team will focus on systems automation, lifecycle, and operations support. This involves taking on systems ops responsibilities for Engineering Productivity (VCS, MozReview, MXR/DXR, Bugzilla, and Treeherder) in addition to those of Release Engineering. The long-term vision for this team is that they will support the systems ops needs for all of Platform Operations.

Please join me in congratulating Amy on her new role!”

Indeed! Congratulations, Amy!

See you next week!

March 18, 2016 10:16 PM

March 11, 2016

Chris Cooper (coop)

RelEng & RelOps Weekly highlights - March 11, 2016

My apologies for a somewhat reduced set of highlights this week. I only returned from vacation on Wednesday, and am still trying to get back up-to-speed with what’s going on.

Thanks to Kim for taking care of these releng highlights while I was away. In case you missed them, those posts can be found on her blog:

Improve CI Pipeline:

With lots of hard work from numerous people, we have expanded the scope of TaskCluster linux builds to include all twig branches as well as Aurora, and are on-track to make these builds Tier-1, and move Buildbot builds to Tier 2, in the next week or two.

Rok is extending the clobberer tool to be able to purge the cache for taskcluster workers (https://bugzil.la/1174263). This should be landing soon.

Aki added buildtime-generated code to the python taskcluster client, for easier code inspection and better stack traces. The code is still pending merge.

Improve Release Pipeline:

Firefox 46.0b1 is the first release we’ve attempted using build promotion, a new release process that multiple team members have been working on since last year. As is typical with new systems, we encountered some issues on this first attempt, and have so far iterated 9 times trying to get it right as we continue to fix bugs. Among the issues we found this week was a discrepancy between how our manual update checks were attempting to invoke Ba'al, the Soul-eater, when compared with our automated tests. This is how the sausage is made, people.

Release:

We released Firefox and Fennec 45.0, as well as Fennec 46.0b1. As mentioned, Firefox 46.0b1 was still in progress as we went to press.

[Image: The Rock] We’re actually not that sad about it.

Operational:

Kim disabled the last of the Android 4.0 jobs running on pandas (rack-mounted Android reference boards). We are in the process of cleaning up the code that was associated with them, as well as decommissioning the remaining pandas and associated hardware. Thank you, pandas, for your service; enjoy your well-deserved retirement! Android performance tests will now run via autophone and results will be displayed via perfherder, thanks to the hard work of many people on the developer productivity team.

The “TaskCluster login v3” effort is drawing to a close, and everyone can now log in and create their own TaskCluster clients for whatever mad-science automation they want to do. This change makes the TaskCluster authentication system more maintainable and scalable, and will help us in encouraging other services such as RelengAPI, treeherder, and ship-it to use TaskCluster authentication. Dustin is in touch with owners of the old “permacreds” issued over the last few years to switch them over to the new system.

Two platform support discussions that I want to highlight this week:

See you next week!

March 11, 2016 11:06 PM

March 07, 2016

Kim Moir (kmoir)

RelEng & RelOps Weekly highlights - March 4, 2016

It was a busy week with many releases in flight, as well as preparation for running beta 1 with release promotion next week. We are also in the process of adding more capacity to certain test platform pools to lower wait times, given all the new e10s tests that have been enabled.

Improve Release Pipeline:
Everyone gets a release promotion!  Source: http://i.imgur.com/WMmqSDI.jpg

Improve CI Pipeline:

Release:

The release calendar is getting busier as we get closer to the end of the cycle. Many releases were shipped or are still in-flight:
As always, you can find more specific release details in our post-mortem minutes:
https://wiki.mozilla.org/Releases:Release_Post_Mortem:2016-03-02 
https://wiki.mozilla.org/Releases:Release_Post_Mortem:2016-03-09

Operational:

Until next time!

March 07, 2016 03:56 PM

February 29, 2016

Kim Moir (kmoir)

RelEng & RelOps Weekly highlights - February 26, 2016

It was a busy week for release engineering as several team members travelled to the Vancouver office to sprint on the release promotion project. The goal of the release promotion project is to promote continuous integration builds to release channels, allowing us to ship releases much more quickly.



Improve Release Pipeline:


Improve CI Pipeline:

Release:

Operational:

February 29, 2016 08:22 PM

February 19, 2016

Chris Cooper (coop)

RelEng & RelOps Weekly Highlights - February 19, 2016

A quieter week than last in terms of meetings and releases, but a glibc stack overflow exploit made things “fun” from an operational standpoint this week.

Improve CI Pipeline:

Dustin deployed a change to the TaskCluster login service to allow logins by people who have LDAP accounts (e.g,. for try pushes) but do not have access to the company’s single-sign-on provider. This closes a gap that excluded some of our most productive contributors from access to TaskCluster. With this change, anyone who has a Mozillians account or an LDAP account can connect to TaskCluster and have appropriate access based on group membership.

Ben wrote a blog post about using the Balrog agent to streamline the throttled rollout of Firefox releases. This is one of the few remaining interactive activities in the Firefox release process. Being able to automate it will eliminate some email hand-offs, leading to faster turnaround.

Release:

As opposed to last week’s congestion, this week had a rather normal pace. Various releases have been shipped or are still in-flight:

As always, you can find more specific release details in our post-mortem minutes: https://wiki.mozilla.org/Releases:Release_Post_Mortem:2016-02-17 and https://wiki.mozilla.org/Releases:Release_Post_Mortem:2016-02-24

Next week a handful of the people working on “Release Promotion” will be in Vancouver to try and sprint our way to the finish line. Among them are Jlund, Rail, Kmoir, and Mtabara. Callek won’t be able to make it in person, but will be joining them remotely.

Operational:

Over the course of the week, Jake, Hal, and Amy have worked to patch and reboot our infrastructure to make it safe against the glibc getaddrinfo exploit.

Many people from various different teams pitched in to diagnose a bug that was causing our Windows 7 test pool to shut down. Special thanks to philor who finally tracked it down to a Firefox graphics problem. The patch was backed out, and operations are back to normal. (https://bugzil.la/1248347)

Alin landed changes to make the pending count alerts more granular on a per-platform basis (https://bugzil.la/1204970)

Outreach:

Aki wrote a blog post this week about how releng should get better about providing generically packaged tools. Not only would this make our own internal testing story better, but it would also make it easier for contributors outside of releng to hack and help.

See you next week!

February 19, 2016 09:33 PM

February 12, 2016

Chris Cooper (coop)

RelEng & RelOps Weekly Highlights - February 12, 2016


This past week, the release engineering managers – Amy, catlee, and coop (hey, that’s me!) – were in Washington, D.C. meeting up with the other managers who make up the platform operations management team. Our goal was to try to improve the ways we plan, prioritize, and cooperate. I thought it was quite productive. I’ll have more to say about it next week once I gather my thoughts a little more.

Everyone else was *very* busy while we were away. Details are below.

Modernize infrastructure:

Dustin deployed a change to the TaskCluster authorization service to support “named temporary credentials”. With this change, credentials can come with a unique name, allowing better identification, logging, and auditing. This is a part of Dustin’s work to implement “TaskCluster Login v3” which should provide a smoother and more flexible way to connect to TaskCluster and create credentials for all of the other tasks you need to perform.

Windows 10 in the cloud is being tested. All the ground work is done to make golden AMIs, mirroring the first stages of work done for Windows 7 in the cloud. Being able to perform some subset of Windows 10 testing in the cloud should allow us to purchase less hardware than we had originally anticipated for this quarter.

Improve CI pipeline:

One of the subjects discussed at Mozlando was improving the overall integration of localization (l10n) builds with our continuous integration (CI) system. Mike fixed an l10n packaging bug this week that I first remember looking at over 4 years ago. This fix allows us to properly test l10n packaging of Mac builds in a release configuration on check-in, thereby avoiding possible headaches later in the release cycle. (https://bugzil.la/700997)

Armen, Joel, Dustin, and Greg worked together to green up even more Linux test jobs in TaskCluster. Among other things, this involved upgrading to the latest Docker (1.10.0) and diagnosing some test runner scripts which use 1.3GB of RAM – not counting the Firefox binaries they run! This project has already been a long slog, but we are constantly making progress and will soon have all jobs in-tree at Tier 2.

Release:

Ben and Nick started designing a new Balrog feature that will make it possible to change update rules in response to certain events. Ben is planning to blog about this in more detail next week.

It was a busy week for releases. Many were shipped or are still in-flight:

As always, you can find more specific release details in our post-mortem minutes: https://wiki.mozilla.org/Releases:Release_Post_Mortem:2016-02-10 and https://wiki.mozilla.org/Releases:Release_Post_Mortem:2016-02-17

Operational:

Kim landed a patch to enable Mac OS X 10.10.5 testing on try by default and disable 10.6 testing. This allowed us to disable some old r5 machines and install around 30 new 10.10.5 machines and enable them in production. Hooray for increased capacity! (https://bugzil.la/1239731)

See you next week!

February 12, 2016 09:29 PM

February 05, 2016

Chris Cooper (coop)

RelEng & RelOps Weekly Highlights - February 5, 2016

This week, we have two new people starting in Release Engineering: Aki Sasaki (:aki) and Rok Garbas (:garbas). Please stop by #releng and say hi!

Modernize infrastructure:

This week, Jake and Mark added check_ami.py support to runner for our Windows 2008 instances running in Amazon. This is an important step towards parity with our Linux instances in that it allows our Windows instances to check when a newer AMI is available and terminate themselves to be re-created with the new image. Until now, we’ve needed to manually refresh the whole pool to pick up changes, so this is a great step forward.

Also on the Windows virtualization front, Rob and Mark turned on puppetization of Windows 2008 golden AMIs this week. This particular change has taken a long time to make it to production, but it’s hard to overstate the importance of this development. Windows is definitely *not* designed to manage its configuration via puppet, but being able to use that same configuration system across both our POSIX and Windows systems will hopefully decrease the time required to update our reference platforms by substantially reducing the cognitive overhead required for configuration changes. Anyone who remembers our days using OPSI will hopefully agree.

Improve CI pipeline:

Ben landed a Balrog patch that implements JSONSchemas for Balrog Release objects. This will help ensure that data entering the system is more consistent and accurate, and allows humans and other systems that talk to Balrog to be more confident about the data they’ve constructed before they submit it.

Ben also enabled caching for the Balrog admin application. This dramatically reduces the database and network load it uses, which makes it faster, more efficient, and less prone to update races.

Release:

We’re currently on beta 3 for Firefox 45. After all the earlier work to unhork gtk3 (see last week’s update), it’s good to see the process humming along.

A small number of stability issues have precipitated a dot release for Firefox 44. A Firefox 44.0.1 release is currently in progress.

Operational:

Kim implemented changes to consume SETA information for Android API 15+ test jobs using data from API 11+ until we have sufficient data for API 15+. This reduced the high pending counts for the AWS instance types used by Android. (https://bugzil.la/1243877)

Coop (hey, that’s me!) did a long-overdue pass of platform support triage. Lots of bugs got closed out (30+), a handful actually got fixed, and a collection of Windows test failures got linked together under a root cause (thanks, philor!). Now all we need to do is find time to tackle the root cause!

See you next week!

February 05, 2016 09:31 PM

Welcome (back), Aki!

[Image: Aki in Slave Unit] This actually is Aki.

In addition to Rok, who also joined our team this week, I’m ecstatic to welcome Aki Sasaki back to Mozilla release engineering.

If you’ve been a Mozillian for a while, Aki’s name should be familiar. In his former tenure in releng, he helped bootstrap the build & release process for both Fennec *and* FirefoxOS, and was also the creator of mozharness, the python-based script harness that has allowed us to push so much of our configuration back into the development tree. Essentially he was devops before it was cool.

Aki’s first task in this return engagement will be to figure out a generic way to interact with Balrog, the Mozilla update server, from TaskCluster. You can follow along in bug 1244181.

Welcome back, Aki!

February 05, 2016 04:28 PM

Welcome, Rok!

[Image: The Rock] This is *not* our Rok.

I’m happy to announce a new addition to Mozilla release engineering. This week, we are lucky to welcome Rok Garbas to the team.

Rok is a huge proponent of Nix and NixOS. Whether we end up using those particular tools or not, we plan to leverage his experience with reproducible development/production environments to improve our service deployment story in releng. To that end, he’s already working with Dustin who has also been thinking about this for a while.

Rok’s first task is to figure out how the buildbot-era version of clobberer, a tool for clearing and resetting caches on build workers, can be rearchitected to work with TaskCluster. You can follow along in bug 1174263 if you’re interested.

Welcome, Rok!

February 05, 2016 03:53 PM

January 29, 2016

Chris Cooper (coop)

RelEng & RelOps Weekly Highlights - January 29, 2016

Well, that was a quick month! Time flies when you’re having fun…or something.

Modernize infrastructure:

In an effort to be more agile in creating and/or migrating webapps, Releng has a new domain name and SSL wildcard! The new domain (mozilla-releng.net) is set up for management under inventory, and an SSL endpoint has been established in Heroku. See https://wiki.mozilla.org/ReleaseEngineering/How_Tos/Heroku:Add_a_custom_domain

Improve CI pipeline:

Coop (hey, that’s me!) re-enabled partner repacks as part of release automation this week, and was happy to see the partner repacks for the Firefox 44 release get generated and published without any manual intervention. Back in August, we moved the partner repack process and configuration into github from mercurial. This made it trivially easy for Mozilla partners to issue a pull request (PR) when a configuration change was needed. This did require some re-tooling on the automation side, and we took the opportunity to fix and update a lot of partner-related cruft, including moving the repack hosting to S3. I should note that the EME-free repacks are also generated automatically now as part of this process, so those of you who prefer less DRM with your Firefox can now also get your builds on a regular basis.

Release:

One of the main reasons why build promotion is so important for releng and Mozilla is that it removes the current disconnect between the nightly/aurora and beta/release build processes, the builds for which are created in different ways. This is one of the reasons why uplift cycles are so frequently “interesting” - build process changes on nightly and aurora don’t often have an existing analog in beta/release. And so it was this past Tuesday when releng started the beta1 process for Firefox 45. We quickly hit a blocker issue related to gtk3 support that prevented us from even running the initial source builder, a prerequisite for the rest of the release process. Nick, Rail, Callek, and Jordan put their heads together and quickly came up with an elegant solution that unblocked progress on all the affected branches, including ESR. In the end, the solution involved running tooltool from within a mock environment, rather than running it outside the mock environment and trying to copy relevant pieces in. Thanks for the quick thinking and extra effort to get this unblocked. Maybe the next beta1 cycle won’t suck quite as much! The patch that Nick prepared (https://bugzil.la/886543) is now in production and being used to notify users on unsupported versions of GTK why they can’t update. In the past, they would’ve simply received no update with no information as to why.

Operational:

Dustin made security improvements to TaskCluster, ensuring that expired credentials are not honored.

We had a brief Balrog outage this morning [Fri Jan 29]. Balrog is the server side component of the update system used by Firefox and other Mozilla products. Ben quickly tracked the problem down to a change in the caching code. Big thanks to mlankford, Usul, and w0ts0n from the MOC for their quick communication and help in getting things back to a good state quickly.

Outreach:

On Wednesday, Dustin spoke at Siena College, holding an information session on Google Summer of Code and speaking to a Software Engineering class about Mozilla, open source, and software engineering in the real world.

See you next week!

January 29, 2016 10:10 PM

January 22, 2016

Chris Cooper (coop)

RelEng & RelOps Weekly Highlights - January 22, 2016

[Image: wine-and-pies] Releng: drinkin’ wine and makin’ pies.
It’s encouraging to see more progress this week on both the build/release promotion and TaskCluster migration fronts, our two major efforts for this quarter.

Modernize infrastructure:

In a continuing effort to enable faster, more reliable, and more easily-run tests for TaskCluster components, Dustin landed support for an in-memory, credential-free mock of Azure Table Storage in the azure-entities package. Together with the fake mock support he added to taskcluster-lib-testing, this allows tests for components like taskcluster-hooks to run without network access and without the need for any credentials, substantially decreasing the barrier to external contributions.

All release promotion tasks are now signed by default. Thanks to Rail for his work here to help improve verifiability and chain-of-custody in our upcoming release process. (https://bugzil.la/1239682)

Beetmover has been spotted in the wild! Jordan has been working on this new tool as part of our release promotion project. Beetmover helps move build artifacts from one place to another (generally between S3 buckets these days), but can also be extended to perform validation actions inline, e.g. checksums and anti-virus. (https://bugzil.la/1225899)

Dustin configured the “desktop-test” and “desktop-build” docker images to build automatically on push. That means that you can modify the Dockerfile under `testing/docker`, push to try, and have the try job run in the resulting image, all without pushing any images. This should enable much quicker iteration on tweaks to the docker images. Note, however, that updates to the base OS images (ubuntu1204-build and centos6-build) still require manual pushes.

Mark landed Puppet code for base Windows 10 support, including secrets and SSH key management.

Improve CI pipeline:

Vlad and Amy repurposed 10 Windows XP machines as Windows 7 to improve the wait times in that test pool. (https://bugzil.la/1239785)

Armen and Joel have been working on porting the Gecko tests to run under TaskCluster, and have narrowed the failures down to the single digits. This puts us on-track to enable Linux debug builds and tests in TaskCluster as the canonical build/test process.

Release:

Ben finished up work on enhanced Release Blob validation in Balrog (https://bugzil.la/703040), which makes it much more difficult to enter bad data into our update server.

You may recall Mihai, our former intern who we just hired back in November. Shortly after joining the team, he jumped into the releaseduty rotation to provide much-needed extra bandwidth. The learning curve here is steep, but over the course of the Firefox 44 release cycle, he’s taken on more and more responsibility. He’s even volunteered to do releaseduty for the Firefox 45 release cycle as well. Perhaps the most impressive thing is that he’s also taken the time to update (or write) the releaseduty docs so that the next person who joins the rotation will be that much further ahead of the game. Thanks for your hard work here, Mihai!

Operational:

Hal did some cleanup work to remove unused mozharness configs and directories from the build mercurial repos. These resources have long-since moved into the main mozilla-central tree. Hopefully this will make it easier for contributors to find the canonical copy! (https://bugzil.la/1239003)

Hiring:

We’re still hiring for a full-time Build & Release Engineer, and we are still accepting applications for interns for 2016. Come join us!

Well, I don’t know about you, but all that hard work makes me hungry for pie. See you next week!

January 22, 2016 08:49 PM

January 21, 2016

Rail Alliev (rail)

Rebooting productivity

Every new year gives you an opportunity to sit back, relax, have some scotch and re-think the past year. Holidays give you enough free time. Even if you decide not to take a vacation around the holidays, it's usually calm and peaceful.

This time, I found myself thinking mostly about productivity, being effective, feeling busy, overwhelmed with work and other related topics.

When I started at Mozilla (almost 6 years ago!), I tried to apply all my GTD and time management knowledge and techniques. Working remotely and in a different time zone was an advantage - I had close to zero interruptions. It worked perfectly.

Last year I realized that my productivity skills had somehow faded away. 40h+ workweeks, working on weekends, and delivering goals in the last week of the quarter don't sound like good signs. Instead of being productive I felt busy.

"Every crisis is an opportunity". Time to make a step back and reboot myself. Burning out at work is not a good idea. :)

Here are some ideas/tips that I wrote down for myself; you may find them useful.

Concentration

  • Task #1: make a daily plan. No plan - no work.
  • Don't start your day by reading emails. Get one (little) thing done first - THEN check your email.
  • Try to define outcomes, not tasks. "Ship XYZ" instead of "Work on XYZ".
  • Meetings are time consuming, so "Set a goal for each meeting". Consider skipping a meeting if you don't have any goal set, unless it's a beer-and-tell meeting! :)
  • Constantly ask yourself if what you're working on is important.
  • 3-4 times a day ask yourself whether you are doing something towards your goal or just finding something else to keep you busy. If you want to look busy, take your phone and walk around the office with some papers in your hand. Everybody will think that you are a busy person! This way you can take a break and look busy at the same time!
  • Take breaks! Pomodoro technique has this option built-in. Taking breaks helps not only to avoid RSI, but also keeps your brain sane and gives you time to ask yourself the questions mentioned above. I use Workrave on my laptop, but you can use a real kitchen timer instead.
  • Wear headphones, especially at office. Noise cancelling ones are even better. White noise, nature sounds, or instrumental music are your friends.

(Home) Office

  • Make sure you enjoy your work environment. Why on earth would you spend your valuable time working without joy?!
  • De-clutter and organize your desk. Fewer things around - fewer distractions.
  • Desk, chair, monitor, keyboard, mouse, etc - don't cheap out on them. Your health is more important and more expensive. Thanks to mhoye for this advice!

Other

  • Don't check email every 30 seconds. If there is an emergency, they will call you! :)
  • Reward yourself at a certain time. "I'm going to have a chocolate at 11am", or "MFBT at 4pm sharp!" are good examples. Don't forget, you are Pavlov's dog too!
  • Don't try to read everything NOW. Save it for later and read in a batch.
  • Capture all creative ideas. You can delete them later. ;)
  • Prepare for next task before break. Make sure you know what's next, so you can think about it during the break.

This is my list of things that I try to use every day. Looking forward to seeing improvements!

I would appreciate your thoughts on this topic. Feel free to comment or send a private email.

Happy Productive New Year!

January 21, 2016 02:06 AM

January 15, 2016

Chris Cooper (coop)

RelEng & RelOps Weekly Highlights - January 15, 2016

One of releng’s big goals for Q1 is to deliver a beta via build promotion. It was great to have some tangible progress there this week with bouncer submission.

Lots of other stuff in-flight, more details below!

Modernize infrastructure:

Dustin worked with Armen and Joel Maher to run Firefox tests in TaskCluster on an older EC2 instance type where the tests seem to fail less often, perhaps because they are single-CPU or slower.

Improve CI pipeline:

We turned off automation for b2g 2.2 builds this week, which allowed us to remove some code, reduce some complexity, and regain some small amount of capacity. Thanks to Vlad and Alin on buildduty for helping to land those patches. (https://bugzil.la/1236835 and https://bugzil.la/1237985)

In a similar vein, Callek landed code to disable all b2g desktop builds and tests on all trees. Another win for increased capacity and reduced complexity! (https://bugzil.la/1236835)

Release:

Kim finished integrating bouncer submission with our release promotion project. That’s one more blocker out of the way! (https://bugzil.la/1215204)

Ben landed several enhancements to our update server: adding aliases to update rules (https://bugzil.la/1067402), and allowing fallbacks for rules with whitelists (https://bugzil.la/1235073).

Operational:

There was some excitement last Sunday when all the trees were closed due to timeouts from connectivity issues between our SCL3 datacentre and AWS. (https://bugzil.la/238369)

Build config:

Mike released v0.7.4 of tup, and is working on generating the tup backend from moz.build. We hope to offer tup as an alternative build backend sometime soon.

See you all next week!

January 15, 2016 10:44 PM

January 08, 2016

Chris Cooper (coop)

RelEng & RelOps Weekly Highlights - January 8, 2016

Happy new year from all of us in releng! Here’s a quick rundown of what’s happened over the holidays.

Modernize infrastructure:

We are now running 100% of our Windows builds (including try) in AWS. This greatly improves the scalability of our Windows build infrastructure. It turns out these AWS instances are also much faster than the in-house hardware we were using previously. On AWS, we save over 45 minutes per try run on Windows, with more modest improvements on the integration branches. Thanks to Rob, Mark, and Q for making this happen!

Dustin added a UI frontend to the TaskCluster “secrets” service and landed numerous fixes to it and to the hooks service.

Rob implemented some adjustments to 2008 userdata in cloud-tools that allow us to re-enable puppetisation of 2008 golden AMIs.

Callek added buildbot master configuration that enables parallel t-w732 testing prototype instances in EC2. This is an important step as we try to virtualize more of our testing infrastructure to reduce our maintenance burden and improve burst capacity.

Q implemented a working mechanism for building Windows 7/10 cloud instance AMIs that behave as other EC2 Windows instances (EC2Config, Sysprep, etc) and can be configured for test duty.

Mark landed Puppet code for base Windows 7 support including secrets and ssh keys management.

Improve CI pipeline:

Dustin completed modifications to the docker worker to support “coalescing”, the successor to what is now known as queue collapsing or BuildRequest merging.

Release:

Ben modernized Balrog’s toolchain, including switching it from Vagrant to Docker, enabling us to start looking at a more modern deployment strategy.

Operational:

Hal introduced the team to Lean Coffee at Mozlando. The team has adopted it wholeheartedly, and is using it for both project and team meetings with success. We're using Trello for virtual post-it notes in Vidyo meetings.

Rob fixed a problem where our AWS instances in us-west-2 were pulling mercurial bundles from us-east-1. This saves us a little bit of money every month in transfer costs between AWS regions. (bug 1232501)

See you all next week!

January 08, 2016 09:35 PM

Kim Moir (kmoir)

Tips from a resume nerd

Before I begin this post, a few caveats:
I'm kind of a resume and interview nerd. I like helping friends fix their resumes and write amazing cover letters. In the past year I've helped a few (non-Mozilla) friends fix up their resumes, write cover letters, and prepare for interviews as they search for new jobs. This post will discuss some things I've found to be helpful in this process.

Picture by GotCredit - Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0)
https://www.flickr.com/photos/jakerust/16223669794/sizes/l

Preparation
Everyone tends to jump straight into looking at job descriptions and making their resume look pretty. Another scenario is that people have a sudden realization that they need to get out of their current position and find a new job NOW, and frantically start applying for anything that matches their qualifications. Before you do that, take a step back and make a list of things that are important to you. For example, when I applied at Mozilla, my list was something like this:

People spend a lot of time at work. Life is too short to be unhappy every day. Writing a list of what is important to you serves as a checklist when you are looking at job descriptions, letting you immediately weed out the ones that don't match your list.

Picture by Mufidah Kassalias - Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0)
https://www.flickr.com/photos/mufidahkassalias/10519774073/sizes/o/
 
People tend to focus a lot on the technical skills they want to use or the new ones they want to learn. You should also think about the kind of culture you want to work in. Do the goals and ethics of the organization align with your own? Who will you be working with? Will you enjoy working with this team? Are you interested in remote work, or do you want to work in an office? How will a long commute or relocation impact your quality of life? What is the typical career progression of someone in this role? Are there both management and technical tracks for advancement?


Picture by mugley - Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0) https://www.flickr.com/photos/mugley/4221455156/sizes/o/


To summarize, itemize the skills you'd like to use or learn, the culture of the company and the team and why you want to work there.

Cover letter

Your cover letter should succinctly map your existing skills to the role you are applying for and convey enthusiasm and interest. You don't need a long story about how you worked on a project at your current job that has no relevance to your potential new employer. Teams that are looking to hire have problems to solve. Your cover letter needs to paint a picture that you have the skills to solve them.

Picture by Jim Bauer - Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0) https://www.flickr.com/photos/lens-cap/10320891856/sizes/l


Refactoring your resume

Developers have a lot of opportunities these days, but if you intend to move from another industry into a tech company, it can be more tricky. The important thing is to convey the skills you have in a way that lets people see how they apply to the problems they want to hire you to fix.

Many people describe their skills and accomplishments in a way that is too company specific. They may have a list of acronyms and product names on their resume that are unlikely to be known by people outside the company. When describing the work you did in a particular role, describe it in a measurable way that highlights the skills you have. An excellent example of a resume that describes skills without going into company-specific detail is here. (Julie Pagano also has a terrific post about how she approached her new job search.)

Another tip is to leave out general skills that are very common. For instance, if you are a technical writer, omit the fact that you know how to use Windows and Word, and focus on highlighting your skills and accomplishments.


Non-technical interview preparation

Every job has different technical requirements and there are many books and blog posts on how to prepare for this aspect of the interview process. So I'm going to just cover the non-technical aspects.

When I interview someone, I like to hear lots of questions. Questions about the work we do and upcoming projects. This indicates that they have taken the time to research the team, the company and the work that we do. It also shows enthusiasm and interest.

Here is a list of suggestions to help you prepare for interviews:

1.  Research the company and make a list of relevant questions
Not every company is open about the work that they do, but most will have some public information that you can use to formulate questions during the interviews. Do you know anyone who works for the company that you can have coffee or a Skype call with to get some insight? What products/services does the company produce? Is the product nearing end of life? If so, what will it be replaced by? What is the company's market share, and is it declining, stable or growing? Who are their main competitors? What are some of the challenges they face going forward? How will this team help address these challenges?

2.  Prepare a list of questions for every person that interviews you ahead of time
Many companies will give you the list of names of people who will interview you.
Have they recently given talks? Watch the videos online or read the slides.
Does the team have GitHub or other open repositories? What recent projects are they working on? Do they have a blog or are they active on Twitter? If so, read them and formulate some questions to bring to the interview.
Do they use open bug tracking tools?  If so, look at the bugs that have recent activity and add them to the list of questions for your interview. 
A friend of mine read the book that one of his interviewers had written and asked questions about the book in the interview. That's serious interview preparation!

Photo by https://www.flickr.com/photos/wocintechchat/ https://www.flickr.com/photos/wocintechchat/22506109386/sizes/l


3. Team dynamics and tools
Is the team growing or are you hiring to replace somebody who left?
What's the onboarding process like? Will you have a mentor?
How is this group viewed by the rest of the company? You want to be in a role where you can make a valuable contribution.  Joining a team where their role is not valued by the company or not funded adequately is a recipe for disappointment.
What does a typical day look like?  What hours do people usually work?
What tools do people use? Are there prescribed tools or are you free to use what you'd like?

4.  Diversity and Inclusion
If you're a member of an underrepresented group in tech, you know the numbers in this industry are lousy, with some notable exceptions. And I say that while recognizing that I'm personally in the group that is the lowest common denominator for diversity in tech.

The entire thread on this tweet is excellent  https://twitter.com/radiomorillo/status/589158122108932096


I don't really have good advice for this area other than do your research to ensure you're not entering a toxic environment.  If you look around the office where you're being interviewed and nobody looks like you, it's time for further investigation.   Look at the company's website - is the management team page white guys all the way down?  Does the company support diverse conferences, scholarships or internships? Ask on a mailing list like devchix if others have experience working at this company and what it's like for underrepresented groups. If you ask in the interview why there aren't more diverse people in the office and they say something like "well, we only hire on merit" this is a giant red flag. If the answer is along the lines of "yes, we realize this and these are the steps we are taking to rectify this situation",  this is a more encouraging response.

A final piece of advice: make sure that you meet with the manager you're going to report to as part of the hiring process. You want to ensure that you have rapport with them and can envision a productive working relationship.

What advice do you have for people preparing to find a new job?

Further reading

Katherine Daniels gave a really great talk at Beyond the Code 2014 about how to effectively start a new job: Press start: Beginning a New Adventure Job
She is also the co-author of Effective DevOps, which has a fantastic chapter on hiring.
Erica Joy writes amazing articles about the tech industry and diversity.
Cate Huston has some beautiful posts on how to conduct technical interviews and how to be a better interviewer
Camille Fournier's blog is excellent reading on career progression and engineering management.
Mozilla is hiring!

January 08, 2016 08:37 PM

December 18, 2015

Chris Cooper (coop)

RelEng & RelOps Weekly Highlights - December 18, 2015 - Mozlando edition

Talk smooth like Lando Calrissian. (This is also *not* Mihai.)
All of Mozilla gathered in Orlando, Florida last week for one of our twice-yearly all-hands meetings. Affectionately called “Mozlando”, it was a chance for Mozilla contributors (paid and not) to come together to celebrate successes and plan for the future…in between riding roller coasters and drinking beer.

Even though I've been involved in the day-to-day process, manage a bunch of people working on the relevant projects, and indeed have been writing these quasi-weekly updates, it wasn't until Chris AtLee put together a slide deck retrospective of what we had accomplished in releng over the last 6 months that it really sank in (hint: it's a lot):

But enough about the “ancient” past, here’s what has been happening since Mozlando:

Modernize infrastructure: There was a succession of meetings at Mozlando to help bootstrap people on using TaskCluster (TC). These were well-attended, at least by people in my org, many of whom have a vested interest in using TC in the ongoing release promotion work.

Speaking of release promotion, the involved parties met in Orlando to map out the remaining work that stands between us and releasing a promoted CI build as a beta, even if just in parallel to an actual release. We hope to have all the build artifacts generated via release promotion by the end of 2015 (l10n repacks are the long pole here), with the remaining accessory service tasks like signing and updates coming online early in 2016.

Improve CI pipeline: Mozilla announced a change of strategy in Orlando with regards to FirefoxOS.

In theory, the switch from phones to connected devices should improve our CI throughput in the near-term, provided we are actually able to turn off any of the existing b2g build variants or related testing. This will depend on commitments we’ve already made to carriers, and what baseline b2g coverage Mozilla deems important.

During Mozlando, releng held a sprint to remove the jacuzzi code from our various repos. Jacuzzis were once an important way to prevent “bursty,” prolific jobs (namely l10n) from claiming all available capacity in our machine pools. With the recent move to AWS for Windows builds, this is really only an issue for our Mac build platform now, and even that *should* be fixed sometime soon if we’re able to repack Mac builds on Linux. In the interim, the added complexity of the jacuzzi code wasn’t deemed worth the extra maintenance hassle, so we ripped it out. You served your purpose, but good riddance.

Release: Sadly, we are never quite insulated from the ongoing needs of the release process during these all-hands events. Mozlando was no different. In fact, it’s become such a recurrent issue for release engineering, release management, and release QA that we’ve started discussing ways to be able to timeshift the release schedule either forward or backward in time. This would also help us deal with holidays like Thanksgiving and Christmas when many key players in the release process (and devs too) might normally be on vacation. No changes to announce yet, but stay tuned.

With the upcoming deprecation of SHA-1 support by Microsoft in 2016, we've been scrambling to make sure we have a support plan for Firefox users on older versions of Windows. We determined that we would need to offer multiple dot releases to our users: a first one to update the updater itself and the related maintenance service to recognize SHA-2, and then a second update where we begin signing Firefox itself with SHA-2. (https://bugzil.la/1079858)

Jordan was on the hook for the Firefox 43.0 final release that went out the door on Tuesday, December 15.

As with any final release, there is a related uplift cycle. These uplift cycles are also problematic, especially between the aurora and beta branches, where there continue to be discrepancies between the nightly- and release-model build processes. The initial beta build (b1) for Firefox 44 was delayed for several days while we resolved a suite of issues around GTK3, general crashes, and FHR submissions on mobile. Much of this work also happened at Mozlando.

Operational: We continue the dance of disabling older r5 Mac minis running 10.10.2 and replacing them with new, shiny r7 Mac minis running 10.10.5. As our r7 mini capacity increases, we are also able (and required) to retire some of the *really* old r4 Mac minis running OS X 10.6, mostly because we need the room in the datacenter. The gating factor here has been making sure that tests still work on the various release branches on the new r7 minis. Joel has been tackling this work, and this week was able to verify the tests on the mozilla-release branch. Only the esr38 branch is still running on the r5 minis. Big thanks to Kim and our stalwart buildduty contractors, Alin and Vlad, for slogging through the buildbot-configs with patches for this.

Speaking of our buildduty contractors, Alin and Vlad both received commit level 2 access to the Mozilla repos in Mozlando. This makes them much more autonomous, and is a result of many months of steady effort with patches and submissions. Good work, guys!

The Mozilla VR Team may soon want a Gecko branch for generating Windows builds with a dedicated update channel. The VR space at Mozilla is getting very exciting!

I can’t promise much content during the Christmas lull, but look for more releng updates in the new year.

December 18, 2015 11:46 PM

December 10, 2015

Nick Thomas (nthomas)

Updates for Nightly on Windows

You may have noticed that Windows has had no updates for Nightly for the last week or so. We’ve had a few issues with signing the binaries as part of moving from a SHA-1 certificate to SHA-2. This needs to be done because Windows won’t accept SHA-1 signed binaries from January 1 2016 (this is tracked in bug 1079858).

Updates are now re-enabled, and the update path looks like this

older builds  →  20151209095500  →  latest Nightly

Some people may have been seeing UAC prompts to run the updater, and there could be one more of those when updating to the 20151209095500 build (which is also the last SHA-1 signed build). Updates from that build should not cause any UAC prompts.

December 10, 2015 03:57 PM

December 03, 2015

Hal Wine (hwine)

Tuning Legacy vcs-sync for 2x profit!

Tuning Legacy vcs-sync for 2x profit!

One of the challenges of maintaining a legacy system is deciding how much effort should be invested in improvements. Since modern vcs-sync is “right around the corner”, I have been avoiding looking at improvements to legacy (which is still the production version for all build farm use cases).

While adding another gaia branch, I noticed that the conversion path for active branches was both highly variable and frustratingly long. It usually took 40 minutes for a commit to an active branch to trigger a build farm build. And worse, that time could easily be 60 minutes if the stars didn’t align properly. (Actually, that’s the conversion time for git -> hg. There’s an additional 5-7 minutes, worst case, for b2g_bumper to generate the trigger.)

The full details are in bug 1226805, but a simple rearrangement of the jobs removed the 50% variability in the times and cut the average time by 50% as well. That’s a savings of 20-40 minutes per gaia push!

Moral: don’t take your eye off the legacy systems – there still can be some gold waiting to be found!

December 03, 2015 08:00 AM

December 01, 2015

Chris AtLee (catlee)

MozLando Survival Guide

MozLando is coming!

I thought I would share a few tips I've learned over the years of how to make the most of these company gatherings. These summits or workweeks are always full of awesomeness, but they can also be confusing and overwhelming.

#1 Seek out people

It's great to have a (short!) list of people you'd like to see in person. Maybe somebody you've only met on IRC / vidyo or bugzilla?

Having a list of people you want to say "thank you" in person to is a great way to approach this. Who doesn't like to hear a sincere "thank you" from someone they work with?

#2 Take advantage of increased bandwidth

I don't know about you, but I can find it pretty challenging at times to get my ideas across in IRC or on an etherpad. It's so much easier in person, with a pad of paper or whiteboard in front of you. You can share ideas with people, and have a latency/lag-free conversation! No more fighting AV issues!

#3 Don't burn yourself out

A week of full days of meetings, code sprints, and blue sky dreaming can be really draining. Don't feel bad if you need to take a breather. Go for a walk or a jog. Take a nap. Read a book. You'll come back refreshed, and ready to engage again.

That's it!

I look forward to seeing you all next week!

December 01, 2015 09:31 PM

November 27, 2015

Chris AtLee (catlee)

Firefox builds on the Taskcluster Index

RIP FTP?

You may have heard rumblings that FTP is going away...


Over the past few quarters we've been working to migrate our infrastructure off of the ageing "FTP" [1] system to Amazon S3.

We've maintained some backwards compatibility for the time being [2], so that current Firefox CI and release builds are still available via ftp.mozilla.org, or preferably, archive.mozilla.org since we don't support the ftp protocol any more!

Our long term plan is to make the builds available via the Taskcluster Index, and stop uploading builds to archive.mozilla.org

How do I find my builds???


This is a pretty big change, but we really think this will make it easier to find the builds you're looking for.

The Taskcluster Index allows us to attach multiple "routes" to a build job. Think of a route as a kind of hierarchical tag, or directory. Unlike regular directories, a build can be tagged with multiple routes, for example, according to the revision or buildid used.

A great tool for exploring the Taskcluster Index is the Indexed Artifact Browser

Here are some recent examples of nightly Firefox builds:

The latest win64 nightly Firefox build is available via the
gecko.v2.mozilla-central.nightly.latest.firefox.win64-opt route

This same build (as of this writing) is also available via its revision:

gecko.v2.mozilla-central.nightly.revision.47b49b0d32360fab04b11ff9120970979c426911.firefox.win64-opt

Or the date:

gecko.v2.mozilla-central.nightly.2015.11.27.latest.firefox.win64-opt

The artifact browser is simply an interface on top of the index API. Using this API, you can also fetch files directly using wget, curl, python requests, etc.:

https://index.taskcluster.net/v1/task/gecko.v2.mozilla-central.nightly.latest.firefox.win64-opt/artifacts/public/build/firefox-45.0a1.en-US.win64.installer.exe [3]
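For example, here is a minimal Python sketch of the same direct fetch using the requests library. The route and artifact name are simply the ones from the example URL above, and the artifact filename will change as the version number changes:

import requests

# Build the artifact URL from the index route shown above. The artifact
# filename (firefox-45.0a1...) is the one current as of this writing.
index = "https://index.taskcluster.net/v1/task"
route = "gecko.v2.mozilla-central.nightly.latest.firefox.win64-opt"
artifact = "public/build/firefox-45.0a1.en-US.win64.installer.exe"

response = requests.get("{}/{}/artifacts/{}".format(index, route, artifact), stream=True)
response.raise_for_status()

with open("firefox-nightly.win64.installer.exe", "wb") as installer:
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        installer.write(chunk)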

Similar routes exist for other platforms, for B2G and mobile, and for opt/debug variations. I encourage you to explore the gecko.v2 namespace, and see if it makes things easier for you to find what you're looking for! [4]

Can't find what you want in the index? Please let us know!

[1]A historical name referring back to the time when we used the FTP protocol to serve these files. Today, the files are available only via HTTP(S).
[2]In fact, all Firefox builds right now are uploaded to S3. We've just had to implement some compatibility layers to make S3 appear in many ways like the old FTP service.
[3]Yes, you need to know the version number...for now. We're considering stripping that from the filenames. If you have thoughts on this, please get in touch!
[4]Ignore the warning on the right about "Task not found" - that just means there are no tasks with that exact route; kind of like an empty directory.

November 27, 2015 09:21 PM

November 26, 2015

Armen Zambrano G. (@armenzg)

Mozhginfo/Pushlog client released

Hi,
If you've ever spent time trying to query metadata from hg with regards to revisions, you can now use a Python library we've released to do so.

In bug 1203621 [1], our community contributor @MikeLing has helped us release the pushlog.py module we had written for Mozilla CI tools.

You can find the pushlog_client package here [3] and the code here [4].
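To give a feel for the data involved, here is a rough sketch that queries the raw json-pushes endpoint that the client wraps. This is plain requests rather than the pushlog_client API itself, and the exact query parameters shown are assumptions on my part:

import requests

# Ask hg.mozilla.org's pushlog for recent pushes to mozilla-central.
# "full=1" includes changeset metadata; without date filters the
# endpoint returns only the most recent pushes.
response = requests.get("https://hg.mozilla.org/mozilla-central/json-pushes",
                        params={"full": 1})
response.raise_for_status()

for push_id, push in sorted(response.json().items(), key=lambda item: int(item[0])):
    print(push_id, push["user"], "%d changeset(s)" % len(push["changesets"]))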

Thanks MikeLing!

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1203621
[2] https://github.com/MikeLing
[3] https://pypi.python.org/pypi/pushlog_client
[4] https://hg.mozilla.org/hgcustom/version-control-tools/rev/6021c9031bc3


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

November 26, 2015 03:17 PM

November 24, 2015

Armen Zambrano G. (@armenzg)

Welcome F3real, xenny and MikeLing!

As described by jmaher, this week we started the first week of mozci's quarter of contribution.

I want to personally welcome Stefan, Vaibhav and Mike to mozci. We hope you get to learn and we thank you for helping Mozilla move forward in this corner of our automation systems.

I also want to give thanks to Alice for committing to mentoring. This would not be possible without her help.


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

November 24, 2015 05:58 PM

Mozilla CI tools meet up

In order to help the contributors' of mozci's quarter of contribution, we have set up a Mozci meet up this Friday.

If you're interested in learning about Mozilla's CI, how to contribute, or how to build your own scheduling with mozci, come and join us!

9am ET -> other time zones
Vidyo room: https://v.mozilla.com/flex.html?roomdirect.html&key=GC1ftgyxhW2y


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

November 24, 2015 05:52 PM

Kim Moir (kmoir)

USENIX Release Engineering Summit 2015 recap

On November 13th, I attended the USENIX Release Engineering Summit in Washington, DC. This summit ran alongside the larger LISA conference at the same venue. Thanks to Dinah McNutt, Gareth Bowles, Chris Cooper, Dan Tehranian and John O'Duinn for organizing.



I gave two talks at the summit.  One was a long talk on how we have scaled our Android testing infrastructure on AWS, as well as a look back at how it evolved over the years.

Picture by Tim Norris - Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0)
https://www.flickr.com/photos/tim_norris/2600844073/sizes/o/

Scaling mobile testing on AWS: Emulators all the way down from Kim Moir

I gave a second lightning talk in the afternoon on the problems we face with our large distributed continuous integration, build and release pipeline, and how we are working to address them. The theme of this talk was that managing a large distributed system is like being the caretaker for the water system (or, some days, the sewer system) of a city. We are constantly looking for leaks and implementing monitoring, and we will probably have to replace the system with something new while keeping the existing one running.

Picture by Korona Lacasse - Creative Commons 2.0 Attribution 2.0 Generic https://www.flickr.com/photos/korona4reel/14107877324/sizes/l


Distributed Systems at Scale: Reducing the Fail from Kim Moir

In preparation for this talk, I did a lot of reading on complex systems design and designing for recovery from failure in distributed systems. In particular, I read Donella Meadows' book Thinking in Systems. (Cate Huston reviewed the book here.) I also watched several talks by people who discussed the challenges they face managing their distributed systems, including the following:
I'd also like to thank all the members of Mozilla releng/ateam who reviewed my slides and provided feedback before I gave the presentations.
The summit attendees joined the LISA attendees for the same keynote. Jez Humble, well known for his Continuous Delivery and Lean Enterprise books, provided a keynote on Lean Configuration Management which I really enjoyed. (Older versions of the slides, from other conferences, are available here and here.)



In particular, I enjoyed his discussion of the cultural aspects of devops. I especially liked that he stated that "You should not have to have planned downtime or people working outside business hours to release". He also talked a bit about how many of the leaders who are looked up to as visionaries in the tech industry are known for not treating people very well, and how this is not a good example to set for others who believe it to be the key to their success. For instance, he said something like "what more could Steve Jobs have accomplished had he treated his employees less harshly?".

Another concept he discussed which I found interesting was that of the strangler application. When moving away from a large monolithic application, the goal is to split out the existing functionality into services until the original application is left with nothing. This is exactly what Mozilla releng is doing as we migrate from Buildbot to Taskcluster.


http://www.slideshare.net/jezhumble/architecting-for-continuous-delivery-54192503


At the release engineering summit itself, Lukas Blakk from Pinterest gave a fantastic talk, Stop Releasing off Your Laptop—Implementing a Mobile App Release Management Process from Scratch in a Startup or Small Company. This included a grumpy cat picture to depict how Lukas thought the rest of the company felt when a more structured release process was implemented.


Lukas also included a timeline of the tasks she implemented in her first six months working at Pinterest. Very impressive to see the transition!


Another talk I enjoyed was Chaos Patterns - Architecting for Failure in Distributed Systems by Jos Boumans of Krux. (Similar slides from an earlier conference here). He talked about some high profile distributed systems that failed and how chaos engineering can help illuminate these issues before they hit you in production.


For instance, it is impossible for Netflix to model their entire system outside of production, given that they account for around one third of nightly downstream bandwidth consumption in the US.

Evan Willey and Dave Liebreich from Pivotal Cloud Foundry gave a talk entitled "Pivotal Cloud Foundry Release Engineering: Moving Integration Upstream Where It Belongs". I found this talk interesting because they talked about how they built Concourse, a CI system that is more scalable and natively builds pipelines. Travis and Jenkins are good for small projects, but they simply don't scale for large numbers of commits, platforms to test, or complicated pipelines. We followed a similar path, which led us to develop Taskcluster.

There were many more great talks, hopefully more slides will be up soon!

November 24, 2015 03:57 PM

November 19, 2015

Chris Cooper (coop)

Clarification about our “Build and Release Intern - Toronto” position

We’ve had lots of interest already in our advertised internship position, and that’s great. However, many of the applications I’ve looked at won’t pan out because they overlooked a key line in the posting:

*Only local candidates will be considered for this role.*

That’s right, we’re only able to accept interns who are legally able to work in Canada.

The main reason behind this is that all of our potential mentors are in Toronto, and having an engaged, local mentor is one of the crucial determinants of a successful internship. In the past, it was possible for Mozilla to sponsor foreign students to come to Canada for internships, but recent changes to visa and international student programs have made the bureaucratic process (and concomitant costs) a nightmare to manage. Many applicants simply aren't eligible any more under the new rules either.

I’m not particularly happy about this, but it’s the reality of our intern hiring landscape. Some of our best former interns have come from abroad, and I’ve already seen some impressive resumes this year from international students. Hopefully one of the non-Toronto-based positions will still appeal to them.

November 19, 2015 07:17 PM

Armen Zambrano G. (@armenzg)

Buildapi client released - thanks F3real!

When we wrote Mozilla CI tools, we created a module called buildapi.py in order to schedule Buildbot jobs via Self-serve/Buildapi.

We recently ported it as a Python package and released it:
https://pypi.python.org/pypi/buildapi_client

This was all thanks to F3real, who joined us from Mozilla's community and has now released his first Python package. He also brought over the integration tests we had written for it. Here's the issue and PR if you're curious.

F3real will now be looking at removing the buildapi module from mozci and making use of the python package instead.

Thanks F3real!


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

November 19, 2015 04:31 PM

Chris AtLee (catlee)

MozFest 2015

I had the privilege of attending MozFest last week. Overall it was a really great experience. I met lots of really wonderful people, and learned about so many really interesting and inspiring projects.

My biggest takeaway from MozFest was how important it is to provide good APIs and data for your systems. You can't predict how somebody else will be able to make use of your data to create something new and wonderful. But if you're not making your data available in a convenient way, nobody can make use of it at all!

It was a really good reminder for me. We generate a lot of data in Release Engineering, but it's not always exposed in a way that's convenient for other people to make use of.

The rest of this post is a summary of various sessions I attended.

Friday

Friday night started with a Science Fair. Lots of really interesting stuff here. Some of the projects that stood out for me were:

  • naturebytes - a DIY wildlife camera based on the raspberry pi, with an added bonus of aiding conservation efforts.
  • histropedia - really cool visualizations of time lines, based on data in Wikipedia and Wikidata. This was the first time I'd heard of Wikidata, and the possibilities were very exciting to me! More on this later, as I attended a whole session on Wikidata.
  • Several projects related to the Internet-of-Things (IOT)

Saturday

On Saturday, the festival started with some keynotes. Surman spoke about how MozFest was a bit chaotic, but that this was by design. In the same way that the web is an open platform you can use to build your own ideas, MozFest should be an open platform where you can meet, brainstorm, and work on your ideas. This means it can seem a bit disorganized, but that's a good thing :) You get what you want out of it.

I attended several good sessions on Saturday as well:

  • Ending online tracking. We discussed various methods currently used to track users, such as cookies and fingerprinting, and what can be done to combat these. I learned, or re-learned, about a few interesting Firefox extensions as a result:

    • privacybadger. Similar to Firefox's tracking protection, except it doesn't rely on a central blacklist. Instead, it tries to automatically identify third party domains that are setting cookies, etc. across multiple websites. Once identified, these third party domains are blocked.
    • https everywhere. Makes it easier to use HTTPS by default everywhere.
  • Intro to D3JS. d3js is a JS data visualization library. It's quite powerful, but something I learned is that you're expected to do quite a bit of work up-front to make sure it's showing you the things you want. It's not great as a data exploration library, where you're not sure exactly what the data means, and want to look at it from different points of view. The nvd3 library may be more suitable for first time users.

  • 6 kitchen cases for IOT. We discussed the proposed IOT design manifesto briefly, and then split up into small groups to try and design a product using the principles outlined in the manifesto. Our group was tasked with designing some product that would help connect hospitals with amateur chefs in their local area, to provide meals for patients at the hospital. We ended up designing a "smart cutting board" with a built-in display that would show you your recipes as you prepared them, but also collect data on the frequency of your meal preparations and what types of foods you were preparing.

    Going through the exercise of evaluating the product with each of the design principles was fun. You could be pretty evil going into this and try and collect all customer data :)

Sunday

  • How to fight an internet shutdown - we role played how we would react if the internet was suddenly shut down during some political protests. What kind of communications would be effective? What kind of preparation can you have done ahead of time for such an event?

    This session was run by Deji from accessnow. It was really eye opening to see how internet shutdowns happen fairly regularly around the world.

  • Data is beautiful: Introduction to Wikidata. Wikidata is like Wikipedia, but for data. An open database of...stuff. Anybody can edit and query the database. One of the really interesting features of Wikidata is that localization is kind of built-in as part of the design. Each item in the database is assigned an id (prefixed by "Q"), e.g. Q42 is Douglas Adams. The description for each item is simply a table of locale -> localized description. There's no inherent bias towards English, or any other language. The beauty of this is that you can reference the same piece of data from multiple languages, only having to focus on localizing the various descriptions. You can imagine different translations of the same Wikipedia page right now being slightly inconsistent due to each one having to be updated separately. If they could instead reference the data in Wikidata, then there's only one place to update the data, and all the other places that reference that data would automatically benefit from it.

    The query language is quite powerful as well. A simple demonstration was "list all the works of art in the same room in the Louvre as the Mona Lisa."

    It really got me thinking about how powerful open data is. How can we in Release Engineering publish our data so others can build new, interesting and useful tools on top of it?

  • Local web. Various options for purely local webs / networks were discussed. There are some interesting mesh network options available; Commotion was demoed. These kinds of distributions give you file exchange, messaging, email, etc. on a local network that's not necessarily connected to the internet.

November 19, 2015 01:35 PM

November 18, 2015

Chris Cooper (coop)

Welcome back, Mihai!

Mr. Kotter. (This is *not* Mihai.)

I’ve been remiss in (re)introducing our latest hire in release engineering here at Mozilla.

Mihai Tabara is a two-time former intern who joins us again, now in a full-time capacity, after a stint as a release engineer at Hortonworks. He’s in Toronto this week with some other members of our team to sprint on various aspects of release promotion.

After a long hiring drought for releng, it’s great to be able to welcome someone new to the team, and even better to be able to welcome someone back. Welcome, Mihai!

November 18, 2015 10:44 PM

November 16, 2015

Nick Thomas (nthomas)

The latest on firefox/releases/latest

The primary way to download Firefox is at www.mozilla.org, but Mozilla’s Release Engineering team has also maintained directories like

https://ftp.mozilla.org/pub/firefox/releases/latest/

to provide a stable location for scripted downloads. There are similar links for betas and extended support releases for organisations. Read on to learn how these directories have changed, and how you can continue to download the latest releases.

Until recently these directories were implemented using a symlink to the current version, for example firefox/releases/42.0/. The storage backend has now changed to Amazon S3 and this is no longer possible. To implement the same functionality we’d need a duplicate set of keys, which incurs more maintenance overhead. And we already have a mechanism for delivering files independent of the current shipped version – our download redirector Bouncer. For example, here’s the latest release for Windows 32bit, U.S. English:

https://download.mozilla.org/?product=firefox-latest&os=win&lang=en-US

Modifying the product, os, and/or lang parameters allows other combinations. This is described in the README.txt files for beta, release, and esr, as well as the Thunderbird equivalents release and beta.

Please adapt your scripts to use download.mozilla.org links. We hope it will help you simplify at the same time, as scraping to determine the current version is no longer necessary.
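As a rough illustration, a script that used to scrape firefox/releases/latest/ could be reduced to something like the Python sketch below. The parameter values are the ones documented above; other products such as firefox-beta-latest or firefox-esr-latest should work the same way, but check the README.txt files to be sure:

import requests

# Let Bouncer redirect us to the latest Windows 32-bit en-US release.
params = {"product": "firefox-latest", "os": "win", "lang": "en-US"}
response = requests.get("https://download.mozilla.org/", params=params, stream=True)
response.raise_for_status()

# requests follows the redirect, so response.url is the real file location.
print("Downloading", response.url)
with open("Firefox-Setup-latest.exe", "wb") as installer:
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        installer.write(chunk)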

PS. We’ve also removed some latest- directories which were old and crufty, eg firefox/releases/latest-3.6.

November 16, 2015 11:37 PM

November 13, 2015

Hal Wine (hwine)

Complexity & * Practices

Complexity & * Practices

I was fortunate enough to be able to attend Dev Ops Days Silicon Valley this year. One of the main talks was given by Jason Hand, and he made some great points. I wanted to highlight two of them in this post:

  1. Post Mortems are really learning events, so you should hold them when things go right, right? RIGHT!! (Seriously, why wouldn’t you want to spot your best ideas and repeat them?)
  2. Systems are hard – if you’re pushing the envelope, you’re teetering on the line between complexity and chaos. And we’re all pushing the envelope these days - either by getting fancy or getting lean.

Post Mortems as Learning Events

Our industry has talked a lot about “Blameless Post Mortems”, and techniques for holding them. Well, we can call them “blameless” all we want, but if we only hold them when things go wrong, folks will get the message loud and clear.

If they are truly blameless learning events, then you would also hold them when things go right. And go meh. Radical idea? Not really - why else would sports teams study game films when they win? (This point was also made in a great Ignite by Katie Rose: GridIronOps - go read her slides.)

My $0.02 is - this would also give us a chance to celebrate success. That is something we do not do enough, and we all know the dedication and hard work it takes to not have things go sideways.

And, by the way, terminology matters during the learning event. The person who is accountable for an operation is just that: capable of giving an account of the operation. Accountability is not responsibility.

Terminology and Systems – Setting the right expectations

Part way through Jason’s talk, he has this awesome slide about how system complexity relates to monitoring which relates to problem resolution. Go look at slide 19 - here’s some of what I find amazing in that slide:

  • It is not a straight line with a destination. Your most stable system can suddenly display inexplicable behavior due to any number of environmental reasons. And you’re back in the chaotic world with all that implies.
  • Systems can progress out of chaos, but that is an uphill battle. Knowing which stage a system is in (roughly) informs the approach to problem resolution.
  • Note the wording choices: “known” vs “unknowable” – for all but the “obvious” case, it will be confusing. That is a property of the system, not a matter of staff competency.

While not in his slide, Jason spoke to how each level really has different expectations. Or should have, but often the appropriate expectation is not set. Here’s how he related each level to industry terms.

Best Practices:

The only level with enough certainty to be able to expect the “best” is the known and familiar one. This is the “obvious” one, because we’ve all done exactly this before over a long enough time period to fully characterize the system, its boundaries, and abnormal behavior.

Here, cause and effect are tightly linked. Automation (in real time) is possible.

Good Practices:

Once we back away from such certainty, it is only realistic to have less certainty in our responses. With the increased uncertainty, the linkage of cause and effect is more tenuous.

Even if we have all the event history and logs in front of us, more analysis is needed before appropriate corrective action can be determined. Even with automation, there is a latency to the response.

Emergent Practices:

Okay, now we are pushing the envelope. The system is complex, and we are still learning. We may not have all the data at hand, and may need to poke the system to see what parts are stuck.

Cause and effect should be related, but how will not be visible until afterwards. There is much to learn.

Novel Practices:
For chaotic systems, everything is new. A lot is truly unknowable because that situation has never occurred before. Many parts of the system are effectively black boxes. Thus resolution will often be a process of trying something, waiting to see the results, and responding to the new conditions.
Next Steps

There is so much more in that diagram I want to explore. The connecting of problem resolution behavior to complexity level feels very powerful.

<hand_waving caffeine_level=”deprived”>

My experience tells me that many of these subjective terms are highly context sensitive, and in no way absolute. Problem resolution at 0300 local with a bad case of the flu just has a way of making “obvious” systems appear quite complex or even chaotic.

By observing the behavior of someone trying to resolve a problem, you may be able to get a sense of how that person views that system at that time. If that isn’t the consensus view, then there is a gap. And gaps can be bridged with training or documentation or experience.

</hand_waving>

November 13, 2015 08:00 AM

October 30, 2015

Chris Cooper (coop)

RelEng & RelOps Weekly Highlights - October 30, 2015

Much of Q4 is spent planning and budgeting for the next year, so there's been lots of discussion about which efforts will need support next year, which things we can accommodate with existing resources, and which will need additional resources.

And if planning and budgeting doesn’t scare some Hallowe'en spirit into you, I don’t know what will.

Modernize infrastructure: Q got most of the automated installation of w10 working and is now focusing on making sure that jobs run.

Improve CI pipeline: Andrew (with help from Dustin) will be running a bunch of test suites in TaskCluster (based on TaskCluster-built binaries) at Tier 2.

Release: Callek built v1.5 of the OpenH264 plugin and pushed it out to the testing audience. We expect it to go live to our users in the next few weeks.

Callek managed to get “final verify” (an update test with all live urls) working on taskcluster in support of the “release promotion” project.

Firefox 42, our big moment-in-time release for the second half of 2015, gets released to the general public next week. Fingers are crossed.

Operational: Kim and Amy decommissioned about 50% of the remaining panda infrastructure (physical mobile testing boards) after shifting the load to AWS.

We repurposed 30 of our Linux64 talos machines for Windows 7 testing in preparation for turning on some e10s tests.

Kim turned WinXP tests off by default on try to reduce some of our Windows backlog (https://bugzil.la/1219434).

Kim implemented some changes to SETA which would allow us to configure the SETA parameters on a per platform basis (https://bugzil.la/1175568).

Rail performed the mozilla-central train uplifts a week early when the Release Management plans shifted, turning Nightly into Gecko 45. FirefoxOS v2.5 branch based on Gecko 44 has been created as a part of the uplift.

Callek investigated a few hours of nothing being scheduled on try on Tuesday, only to learn there was an issue with a unicode character in a commit message which broke JSON importing of the pushlog. He then did the work to reschedule all those jobs (https://bugzil.la/1218943).

Industry News: In addition to the work we do at Mozilla, a number of our people are leaders in industry and help organize, teach, and speak. These are some of the upcoming events people are involved with:

Until next week, remember: don't take candy from strangers!

October 30, 2015 09:04 PM

October 23, 2015

Nick Thomas (nthomas)

Updates disabled for Android Nightly and Aurora

Due to a bug with the new ftp server we've had to disable updates for Android Nightly and Aurora.

They’ll resume just as soon as we can get the fix landed.

Update (Oct 25th): Updates are re-enabled, thanks to Mike Shal for the fix.

October 23, 2015 08:39 AM

October 21, 2015

Nick Thomas (nthomas)

Try Server – please use up-to-date code to avoid upload failures

Today we started serving an important set of directories on ftp.mozilla.org using Amazon S3, more details on that over in the newsgroups. Some configuration changes landed in the tree to make that happen.

Please rebase your try pushes to use revision 0ee21e8d5ca6 or later, currently on mozilla-inbound. Otherwise your builds will fail to upload, which means they won’t run any tests. No fun for anyone.

October 21, 2015 10:02 AM

October 15, 2015

Morgan Phillips (mrrrgn)

Better Secret Injection with TaskCluster Secrets

Many secret injection services simply store a private key and encrypt data for users. The users then add those encrypted payloads to a job, and the payload is decrypted using the private key associated with their account at run time. I see a few problems with this system. Today we deployed TaskCluster Secrets, a new service I've been working on for the last two weeks which stores encrypted JSON payloads on behalf of taskcluster clients. I'm excited about this service because it's going to form the foundation for a new method of secret injection which solves those problems.
How does it work?

In TaskCluster Secrets, each submitted payload (encrypted at rest) is associated with an OAuth scope. The scope also defines which actions a user may make against the secret. For example, to write a secret named 'someNamespace:foo' you'd need an OAuth scope 'secrets:set:someNamespace:foo,' to read it you'd need 'secrets:get:someNamespace:foo,' and so on.

Tying each secret to a scope, we're able to create an interesting workflow for access from within tasks. In short, we can generate and inject temporary credentials with read-only access. This forces secrets to be accessed via the API. What's more, we can store the temporary OAuth credentials in an HTTP proxy running alongside a task instead of within it, so that even the credentials are not exposed by default. This way someone could have a snapshot of your task at startup and not gain access to any private data. \o/

Case Study: Setting/Getting a secret for TaskCluster GitHub jobs

1.) Submit a write request to the TaskCluster GitHub API : PUT taskcluster-github.com/v1/write-secret/org/myOrg/repo/myRepo/key/myKey {githubToken: xxxx, secret: {password: xxxx}, expires: '2015-12-31'}

2.) GitHub OAuth token is used to verify org/repo ownership. Once verified, we generate a temporary account and write the following secret on behalf of our repo owner : myOrg/myRepo:myKey {secret: {password: xxxx}, expires: '2015-12-31'}

3.) CI jobs are launched alongside HTTP Proxies which will attach an OAuth header to outgoing requests to taskcluster-secrets.com/v1/.... The attached token will have a scope: secrets:get:myOrg/myRepo:* which allows any secret set by the owner of the myOrg/myRepo repository to be accessed.

4.) Within a CI task, a secret may be retrieved by simple HTTP calls such as: curl $SECRETS_PROXY_URL/v1/secret/myOrg/myRepo:myKey

Easy, secure, and 100% logged.
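For reference, the same read from Python inside a task might look like the sketch below, under the same assumptions as the curl call in step 4: SECRETS_PROXY_URL is provided to the task environment and the proxy attaches the temporary credentials.

import os
import requests

# The proxy injects the scoped OAuth header, so the task itself never
# sees the credentials.
proxy_url = os.environ["SECRETS_PROXY_URL"]
response = requests.get(proxy_url + "/v1/secret/myOrg/myRepo:myKey")
response.raise_for_status()

payload = response.json()
password = payload["secret"]["password"]   # the secret written in step 2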

October 15, 2015 08:40 PM

October 13, 2015

Armen Zambrano G. (@armenzg)

mozci 0.15.1: Buildbot Bridge support + bug fixes


It's been a while since our last announced release, and I would like to highlight some of the significant changes since then.
A major highlight of the latest release is support for scheduling Buildbot jobs through TaskCluster, making use of the Buildbot Bridge. For more information about this, please read here.

Contributors

@nikkisquared is our latest contributor who landed more formalized exceptions and error handling. Thanks!
As always, many thanks to @adusca and @vaibhavmagarwal for their endless contributions.

How to update

Run "pip install -U mozci" to update

New Features

  • Query Treeherder for hidden jobs
  • BuildbotBridge support
    • It allows submitting a graph of Buildbot jobs with dependencies
    • The graph is submitted to TaskCluster

Minor changes

  • Fix backfilling issue
  • Password prompt improvements
  • More specific exceptions
  • Skip Windows 10 jobs for talos

All changes

You can see all changes in here:
0.14.0...0.15.1


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

October 13, 2015 07:55 PM

October 09, 2015

Chris Cooper (coop)

RelEng & RelOps Weekly Highlights - October 9, 2015

The beginning of October means autumn in the Northern hemisphere. Animals get ready for winter as the leaves change colour, and managers across Mozilla struggle with deliverables for Q4. Maybe we should just investigate that hibernating thing instead.

Modernize infrastructure: Releng, Taskcluster, and A-team sat down a few weeks ago to hash out an updated roadmap for the buildbot-to-taskcluster migration (https://docs.google.com/document/d/1CfiUQxhRMiV5aklFXkccLPzSS-siCtL00wBCCdS4-kM/edit). As you can see from the document, our nominal goal this quarter is to have 64-bit linux builds *and* tests running side-by-side with the buildbot equivalents, with a stretch goal to actually turn off the buildbot versions entirely. We’re still missing some big pieces to accomplish this, but Morgan and the Taskcluster team are tackling some key elements like hooks and coalescing schedulers over the coming weeks.

Aside from Taskcluster, the most pressing releng concern is release promotion. Release promotion entails taking an existing set of builds that have already been created and passed QA and "promoting" them to be used as a release candidate. This represents a fundamental shift in how we deliver Firefox to end users, and as such is both very exciting and terrifying at the same time. Much of the team will be working on this in Q4 because it will greatly simplify a future transition of the release process to Taskcluster.

Improve CI pipeline: Vlad and Alin have 10.10.5 tests running on try and are working on greening up tests (https://bugzil.la/1203128)

Kim started discussion on dev.planning regarding reducing frequency of linux32 builds and tests https://groups.google.com/forum/#!topic/mozilla.dev.planning/wBgLRXCTlaw (Related bug: https://bugzil.la/1204920)

Windows tests take a long time, in case you hadn't noticed. This is largely due to e10s (https://wiki.mozilla.org/Electrolysis), which has effectively doubled the number of tests we need to run per-push. We've been able to absorb this extra testing on other platforms, but Windows 7 and Windows 8 have been particularly hard hit by the increased demand, often taking more than 24 hours to work through backlog from the try server. While e10s is a product decision and ultimately in the best interest of Firefox, we realize the current situation is terrible in terms of turnaround time for developer changes. Releng will be investigating updating our hardware pool for Windows machines in the new year. In the interim, please be considerate with your try usage, i.e. don't test on Windows unless you really need to. If you can help fix e10s bugs so we can make e10s the default on beta/release ASAP, that would be awesome.

Release: The big “moment-in-time” release of Firefox 42 approaches. Rail is on the hook for releaseduty for this cycle, and is overseeing beta 5 builds currently.

Operational: Kim increased the size of the tst-emulator64 spot pool (https://bugzil.la/1204756), so we'll be able to enable additional Android 4.3 tests on debug once we have SETA data for them (https://bugzil.la/1201236).

Coop (me) spent last week in Romania getting to know our Softvision contractors in person. Everyone was very hospitable and took good care of me. Alin and Vlad took full advantage of the visit to get better insight into how the various releng systems are interconnected. Hopefully this will pay off with them being able to take on more challenging bugs to advance the state of buildduty. Already they’re starting to investigate how they could help contribute to the slave loan tool. Alin and Vlad will also be joining us for Mozlando in December, so look forward to more direct interaction with them there.

See you next week!

October 09, 2015 09:58 PM

October 02, 2015

Hal Wine (hwine)

duo MFA & viscosity no-cell setup

duo MFA & viscosity no-cell setup

The Duo application is nice if you have a supported mobile device, and via TOTP it’s usable even when you have no cell connection. However, getting Viscosity to allow both choices took some work for me.

For various reasons, I don’t want to always use the Duo application, so I would like Viscosity to always prompt for a password. (I had already saved a password - a fresh install likely would not have that issue.) That took a bit of work, and some web searches.

  1. Disable any saved passwords for Viscosity. On a Mac, this means opening up “Keychain Access” application, searching for “Viscosity” and deleting any associated entries.

  2. Ask Viscosity to save the “user name” field (optional). I really don’t need this, as my setup uses a certificate to identify me. So it doesn’t matter what I type in the field. But, I like hints, so I told Viscosity to save just the user name field:

    defaults write com.viscosityvpn.Viscosity RememberUsername -bool true

With the above, you’ll be prompted every time. You have to put “something” in the user name field, so I chose to put “push or TOTP” to remind me of the valid values. You can put anything there, just do not check the “Remember details in my Keychain” toggle.

October 02, 2015 07:00 AM

September 25, 2015

Kim Moir (kmoir)

The mystery of high pending counts

In September, Mozilla release engineering started experiencing high pending counts on our test pools, notably Windows, but also Linux (and consequently Android).  High pending counts mean that there are thousands of jobs queued to run on the machines that are busy running other jobs.  The time developers have to wait for their test results is longer than ideal.


Usually, pending counts clear overnight as less code is pushed during the night (in North America), which invokes fewer builds and tests.  However, as you can see from the graph above, the Windows test pending counts were flat last night. They did not clear up overnight. You will also note that try, which usually comprises 63% of our load, has the highest pending counts compared to other branches.  This is because many people land on try before pushing to other branches, and tests aren't coalesced on try.


The work to determine the cause of high pending counts is always an interesting mystery.
Mystery by ©Stuart Richards, Creative Commons by-nc-sa 2.0

Joel Maher and I looked at the data for this last week and discovered what we believe to be the source of the problem.  We have determined that since the end of August a number of new test jobs were enabled that increased the compute time per push on Windows by 13%, or 2.5 hours per push.  Most of these new test jobs are for e10s.
Increase in seconds that new jobs added to the total compute time per push.  (Some existing jobs also reduced their compute time, for a total difference of about 2.5 more hours per push on Windows.)
The e10s initiative is important for Mozilla to make Firefox performance and security even better.  However, since new e10s and old tests will continue to run in parallel, we need to get creative about how to achieve acceptable wait times given the limitations of our current Windows test pools.  (All of our Windows tests run on bare metal in our datacentre, not on Amazon.)
 
Release engineering is working to reduce these pending counts, given our current hardware constraints, with the following initiatives:

To reduce Linux pending counts:
  • Added 200 new instances to the tst-emulator64 pool (run Android test jobs on Linux emulators) (bug 1204756)
  • In the process of adding more Linux32 and Linux64 buildbot masters (bug 1205409), which will allow us to expand our capacity further

Ongoing work to reduce the Windows pending counts:


How can you help? 

Please be considerate when invoking try pushes and only select the platforms that you explicitly require to test.  Each try push for all platforms and all tests invokes over 800 jobs.

September 25, 2015 07:47 PM

September 22, 2015

Hal Wine (hwine)

Using Password Store

Using Password Store

Password Store (aka “pass”) is a very handy wrapper for dealing with pgp encrypted secrets. It greatly simplifies securely working with multiple secrets. This is still true even if you happen to keep your encrypted secrets in non-password-store managed repositories, although that setup isn’t covered in the docs. I’ll show my setup here. (See the Password Store page for usage: “pass show -c <spam>” & “pass search <eggs>” are among my favorites.)

Short version:
  1. Have gpg installed on your machine.

  2. Install Password Store on your machine. There are OS specific instructions. Be sure to enable tab completion for your shell!

  3. Set up a local password store. Scroll down in the usage section to “Setting it up” for instructions.

  4. Clone your secrets repositories to your normal location. Do not clone inside of ~/.password-store/.

  5. Set up symlinks inside of ~/.password-store/ to directories inside your clone of the secrets repository. I did:

    ln -s ~/path/to/secrets-git/passwords rePasswords
    ln -s ~/path/to/secrets-git/keys reKeys
  6. Enjoy command line search and retrieval of all your secrets. (Use the regular method for your separate secrets repository to add and update secrets.)

Rationale:

  • By using symlinks, pass will not allow me to create or update secrets in the other repositories. That prevents mistakes, as the process is different for each of those alternate stores.
  • I prefer to have just one tree of secrets to search, rather than the “multiple configuration” approach documented on the Password Store site.
  • By using symlinks, I can control the global namespace, and use names that make sense to me.
  • I’ve migrated from using KeePassX to using pass for my personal secret management. That is my “main” password-store setup (backed by a git repo).

Notes:

  • If you’d prefer a GUI, there’s qtpass which also works with the above setup.

September 22, 2015 07:00 AM

September 21, 2015

Armen Zambrano G. (@armenzg)

Minimal job scheduling

One of the biggest benefits of moving the scheduling into the tree is that you can adjust the decisions on what to schedule from within the tree.

As chmanchester and I were recently discussing, this is very important as we believe we can do much better on deciding what to schedule on try.

Currently, too many developers push to try with -p all -u all (which schedules all platforms and all tests). It is understandable: it is the easiest way to reduce the risk of your change being backed out when it lands on one of the integration trees (e.g. mozilla-inbound).

In-tree scheduling analysis

What if your changes were analysed and we determined the best-guess set of platforms and test jobs required to test them, so that they would not be backed out on an integration tree?

For instance, when I push Mozharness changes to mozilla-inbound, I wish I could tell the system that I only need this set of platforms and not those other ones.

If everyone had the minimum set of jobs added to their pushes, our systems would be able to return results faster (less load) and no one would need to take shortcuts.

This would be a best approximation, and we would need to fine-tune the logic over time to get things as right as possible. We would need to find the right balance between some changes being backed out because we did not get the scheduling right on try, and getting results faster for everyone.

Prioritized tests

There is already some code that chmanchester landed where we can tell the infrastructure to run a small set of tests based on the files changed. In this case we hijack one of the jobs (e.g. mochitest-1) to run the tests most relevant to your changes, which would normally be spread across different chunks. Once the prioritized tests are run, we can run the remaining tests as we would normally do. Prioritized tests also apply to suites that are not chunked (run a subset of tests instead of all).
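
To make this more concrete, here is a toy sketch of the idea (in no way chmanchester's actual code; the path prefixes and suite names are invented for illustration):

# Toy sketch: pick the test suites most relevant to a set of changed files.
# The mapping below is invented for the example.
SUITES_BY_PATH = {
    'testing/mozharness/': ['marionette'],
    'dom/': ['mochitest-1'],
    'js/src/': ['jsreftest'],
}

def prioritized_suites(changed_files):
    suites = set()
    for path in changed_files:
        for prefix, candidates in SUITES_BY_PATH.items():
            if path.startswith(prefix):
                suites.update(candidates)
    return sorted(suites)

# A push touching only Mozharness files would prioritize marionette tests:
print(prioritized_suites(['testing/mozharness/scripts/desktop_unittest.py']))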

There are some UI problems in here that we would need to figure out with Treeherder and Buildbot.

Tiered testing

Soon, we will have all the technological pieces to create a multi-tiered job scheduling system.

For instance, we could run things in this order (just a suggestion):

This has the advantage of using prioritized tests as a canary job, which would prevent running the remaining tests if the canary (shorter) job fails.

Post minimal run (automatic) precise scheduling (manual)

This is not specifically about scheduling the right thing automatically, but about extending what gets scheduled automatically.
Imagine that you're not satisfied with what gets scheduled automatically and you would like to add more jobs (e.g. missing platforms or missing suites).
You will be able to add those missing jobs later directly from Treeherder by selecting which jobs are missing.
This will be possible once bug 1194830 lands.

NOTE: Mass scheduling (e.g. all mochitests across all platforms) would be a bit of a pain to do through Treeherder. We might want to do a second version of try-extender.



Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

September 21, 2015 01:15 PM

September 18, 2015

Chris Cooper (coop)

RelEng & RelOps Weekly Highlights - September 18, 2015

Pending job numbers continued to be a concern this week. Investigations are underway to look for slowdowns unrelated to the enabling of e10s tests, which on its own has doubled the number of tests run in many cases. More information below.

Modernize infrastructure: Dustin participated in the TaskCluster work-week, discussing plans for TaskCluster itself and for Releng’s work to port the CI and release processes to run on the TaskCluster platform.

Morgan gave a fantastic presentation on Air Mozilla describing how github / TaskCluster integration works: https://air.mozilla.org/taskcluster-github-continuous-integration-for-mozillians-by-mozillians-2/

Improve CI pipeline: We’re ready to un-hide OS X and Linux64 builds via TaskCluster in TreeHerder, elevating them to “tier 2” status. This is a necessary precursor to replacing the buildbot-generated versions of these builds.

Jordan landed a patch to enable bundleclone for mock-based builds, which may help fix problems with the Android nightly builds. (https://bugzil.la/1191859)

Alin and Vlad are working on releng configs to add new 10.10 hardware to the test pool (https://bugzil.la/1203128)

Release: Ben continues to work out a plan to cope with SHA-1 certificate deprecation. (https://bugzilla.mozilla.org/show_bug.cgi?id=1079858#c64)

We are entering the end-game for Firefox 41. Release candidate builds are underway.

Operational: Kim and Vlad increased the size of the tst-emulator64 pool by 200 instances which has significantly reduced the wait times for Android tests that use this instance type. (https://bugzil.la/1205409)

Kim is also in the process of bringing up four new buildbot masters to serve these expanding pools and reduce some of the buildbot lag we have seen in our monitoring tools (https://bugzil.la/1205409)

We have had high pending counts for the past few weeks which have significantly increased wait times, especially for Windows tests on Try. Joel Maher (from the Developer Productivity team) and Kim analyzed the end-to-end test time data for Windows for the past month. They discovered that total compute time per push has increased by around 13% or 2.5 compute hours on Windows, primarily driven by the addition of new e10s tests. Given that our pool of Windows machines has a fixed size, we are looking at ways to reduce the wait times given existing hardware constraints.

See you again next week!

September 18, 2015 09:00 PM

Armen Zambrano G. (@armenzg)

Mozharness' support for Buildbot Bridge triggered test jobs

Today I landed [1] some code which allows Buildbot *test* jobs triggered through the Buildbot Bridge (BBB) to work properly.

In my previous post I explained a bit about how Mozci works with the Buildbot Bridge.
In this post I will only explain what we fixed on the Mozharness side.

Buildbot Changes

If a Buildbot test job is scheduled through TaskCluster (the Buildbot Bridge supports this), then the generated Buildbot Change associated with the test job does not have the installer and test URLs that Mozharness needs to download for a test job.

What is a Buildbot Change? It is an object which represents the files touched by a code push. For build jobs, this value gets set as part of the process of polling the Mercurial repositories; the test jobs, however, are triggered via a "buildbot sendchange" step that is part of the build job.
This sendchange creates the Buildbot Change for the test job, which Mozharness can then use.
The BBB does not listen to sendchanges, hence jobs triggered via the BBB have an empty changes object. As a result, such test jobs can't download the files they need and fail to execute.

In order to overcome this limitation, we have to detect whether a Buildbot job was triggered normally or through the Buildbot Bridge.
Buildbot Bridge triggered jobs have a 'taskId' property defined (this represents the task associated with the Buildbot job). Through this 'taskId' we can determine the parent task and find a file called properties.json [2], which is uploaded by every BBB triggered job.
In that file we can find both the installer and test URLs.
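
A rough sketch of that logic is shown below. This is not the actual Mozharness patch: the queue URL pattern and the property names are assumptions for illustration, and how the parent task id is derived from 'taskId' is glossed over (see [1] for the real change).

# Python 2 sketch of the detection logic described above; the URL pattern
# and property names are assumptions, not the exact keys Mozharness uses.
import json
import urllib2

QUEUE = 'https://queue.taskcluster.net/v1'

def bbb_parent_properties(parent_task_id):
    # Every BBB-triggered build uploads a properties.json artifact [2];
    # fetch it from the parent (build) task.
    url = '%s/task/%s/artifacts/public/properties.json' % (QUEUE, parent_task_id)
    return json.load(urllib2.urlopen(url))

def installer_and_test_urls(buildbot_properties, parent_task_id=None):
    if 'taskId' not in buildbot_properties:
        # Normally scheduled job: the sendchange already provided the URLs
        # via the Buildbot Change.
        return (buildbot_properties['installer_url'],
                buildbot_properties['test_url'])
    # BBB-triggered job: the changes object is empty, so read the URLs from
    # the parent task's properties.json instead.
    props = bbb_parent_properties(parent_task_id)
    return props['packageUrl'], props['testPackagesUrl']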


[1] https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=03233057f1e6
[2] https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/E7lb9P-IQjmeCOvkNfdIuw/0/public/properties.json

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

September 18, 2015 07:26 PM

Mozilla CI tools: Upcoming support for Buildbot Bridge

What is the Buildbot Bridge?

The Buildbot Bridge (BBB) allows scheduling Buildbot jobs through TaskCluster.
In other words, you can have taskcluster tasks which represent Buildbot jobs.
This allows having TaskCluster graphs composed of tasks which will be executed either on Buildbot or TaskCluster, hence allowing for *all* relationships between tasks to happen in TaskCluster.

Read my recent post on the benefits of scheduling every job via TaskCluster.

The next Mozilla CI tools (mozci) release will have support for BBB.

Brief explanation

You can see in this try push both types of Buildbot jobs [1].
One set of jobs was triggered through Buildbot's analysis of the try syntax in the commit message, while two of the jobs were not scheduled by that syntax.

Those two jobs were triggered off-band via Mozci submitting a task graph.
You can see the TaskCluster graph representing them in here [2].

These jobs were triggered using this simple data structure:
{
  'Linux x86-64 try build': [
    'Ubuntu VM 12.04 x64 try opt test mochitest-1'
  ]
}

Mozci turns this simple graph into a TaskCluster graph.
The graph is composed of tasks which follow this structure [3]
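
Conceptually, the expansion works roughly like the sketch below: each build builder becomes a task, and each of its test builders becomes a task that requires it. The field names here are illustrative only; the real task layout lives in [3].

# Illustrative expansion of the build -> tests mapping into a list of task
# definitions with dependencies; field names are approximate, see [3].
from uuid import uuid4

def expand_graph(builders):
    tasks = []
    for build_name, test_names in builders.items():
        build_id = str(uuid4())  # stand-in for a TaskCluster slug id
        tasks.append({'taskId': build_id,
                      'requires': [],
                      'task': {'buildername': build_name}})
        for test_name in test_names:
            tasks.append({'taskId': str(uuid4()),
                          'requires': [build_id],
                          'task': {'buildername': test_name}})
    return tasks

graph = expand_graph({
    'Linux x86-64 try build': ['Ubuntu VM 12.04 x64 try opt test mochitest-1'],
})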

Notes about the Buildbot Bridge

bhearsum's original post, recording and slides:
http://hearsum.ca/blog/buildbot-taskcluster-bridge-an-overview.html
https://vreplay.mozilla.com/replay/showRecordDetails.html?recId=1879
http://hearsum.ca/slides/buildbot-bridge-overview/#/

Some notes which Selena took about the topic:
http://www.chesnok.com/daily/2015/06/03/taskcluster-migration-about-the-buildbot-bridge/

The repository is in here:
https://github.com/mozilla/buildbot-bridge

---------
[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=9417c6151f2c
[2] https://tools.taskcluster.net/task-graph-inspector/#mAI0H1GyTJSo-YwpklZqng
[3] https://github.com/armenzg/mozilla_ci_tools/blob/mozci_bbb/mozci/sources/buildbot_bridge.py#L37


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

September 18, 2015 07:16 PM

September 17, 2015

Armen Zambrano G. (@armenzg)

Platform Operations lightning talks (Whistler 2015)

You can read about and watch the Platform Operations lightning talks here:
https://wiki.mozilla.org/Platform_Operations/Ligthning_talks

Here are the landing pages for the various Platform Operations teams:
https://wiki.mozilla.org/Platform_Operations

PS: I don't know what makes up platform operations, so feel free to add your team if I'm missing it.


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

September 17, 2015 03:57 PM

September 11, 2015

Chris Cooper (coop)

RelEng & RelOps Weekly highlights - September 11, 2015

Not sure how it was for you, but that was a deceptively busy week.

Modernize infrastructure: Amy created a new OS X 10.10.5 deployment procedure and installed the first 64 of our 200 new mac minis (for Firefox/Thunderbird testing). Further work needs to be done to validate the move to new hardware and upgrade to 10.10.5 and to rebase the timing tests.

Jonas rolled out support for remote signature validation and auth.taskcluster.net.

Jordan is working on adding some Android variant builds to TaskCluster (TC). As part of that process, he’s also documenting his efforts to create a HOWTO for devs so they can self-serve in TC in the future.

Ted hooked up cross-compiled Mac builds running in TC to try. This is the first step to moving Mac build load off of physical hardware. This is huge. (https://bugzil.la/1197154)

Improve CI pipeline: Our intern, Anthony, gave his end-of-internship presentation on Thursday with details about the various improvements he made to TC over the summer. If you missed it, you can watch it over on air.mozilla.org: https://air.mozilla.org/anthony_miyaguchi/

Release: Firefox 41.0 beta 9 is in the pipe this week, along with Thunderbird 41.0 beta 1 (build #2).

Operational: Amy tracked down a bunch of configuration warnings on our puppet servers, filed bugs to get them fixed, and set up some notifications from our log hosts so that we learn about such known problems within 10 minutes.

Greg is rolling out a change to taskcluster-vcs to reduce parallelization for “repo”, and hopefully improve TaskCluster’s behavior relative to git.mo when 500s are thrown. So far, performance changes appear to be a wash, with some jobs taking slightly longer and others finishing slightly faster.

Some faulty puppet changes this week caused tree closures on two separate days: the initial landing caused all POSIX systems to loop indefinitely in runner, and then that same change propagated into the new AMIs for spot instances the next day. Morgan has been working on a way to do tiered roll-outs of new AMIs using “canary” instances to avoid this kind of cascading puppet failure in the future: https://bugzil.la/1146341

See you next week!

September 11, 2015 11:21 PM

Morgan Phillips (mrrrgn)

TaskCluster GitHub Has Landed

TaskCluster based CI has landed for Mozilla developers. One can begin using the system today by simply dropping a .taskcluster.yml file into the base of their repository. For an example configuration file, and other documentation please see: http://docs.taskcluster.net/services/taskcluster-github/

To get started ASAP, steal this config file and replace the npm install . && npm test section with whatever commands will run your project's test suite. :)


September 11, 2015 08:55 PM

September 10, 2015

Armen Zambrano G. (@armenzg)

The benefits of moving per-push Buildbot scheduling into the tree

Some of you may be aware of the Buildbot Bridge (aka BBB) work that bhearsum worked on during Q2 of this year. This system allows scheduling TaskCluster graphs for Buildbot builders. For every Buildbot job, there is a TaskCluster task that represents it.
This is very important as it will help to transition the release process piece by piece to TaskCluster without having to move large pieces of code at once. You can have graphs composed of both Buildbot and TaskCluster tasks.

I recently added to Mozilla CI tools the ability to schedule Buildbot jobs by submitting a TaskCluster graph (the BBB makes this possible).

Even though the initial work for the BBB is intended for Release tasks, I believe there are various benefits if we moved the scheduling into the tree (currently TaskCluster works like this; look for the gecko decision task in Treeherder).

To read another great blog post about try syntax and scheduling, please visit ahal's post "Looking beyond Try Syntax".

NOTE: Try scheduling might not have try syntax in the future so I will not talk much about trychooser and try syntax. Read ahal's post to understand a bit more.

Benefits of in-tree scheduling:

There are various parts that will need to be in place before we can do this. Here are some that I can think of:
  • TaskCluster's big-graph scheduling
    • This is important since it will allow for the concept of coalescing to exist in TaskCluster
  • Task prioritization
    • This is important if we're to have different levels of priority for jobs on TaskCluster
    • On Buildbot we have release repositories with the highest priority and the try repo having the lowest
    • We also currently have the ability to raise/decrease task priorities through self-serve/buildapi. This is used by developers, especially on Try, to allow their jobs to be picked up sooner.
  • Treeherder to support LDAP authentication
    • It is a better security model for scheduling changes
    • If we want to move away from self-serve/buildapi we need this
  • Allow test jobs to find installer and test packages
    • Currently test jobs scheduled through the BBB cannot find the Firefox installer and the test packages.
Can you think of other benefits? Can you think of problems with this model? Are you aware of other pieces needed before moving forward to this model? Please let me know!



Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

September 10, 2015 02:13 PM

September 08, 2015

Chris Cooper (coop)

URES’15 Call for Participation Extended

By request, the deadline for submissions for the third USENIX Release Engineering Summit (URES ‘15) has been extended. URES ‘15 will take place during LISA15, on November 13, 2015, in Washington, D.C.

If you would like to present a full-length or lightning talk on a release engineering topic, you can find more details on the submission process here: https://www.usenix.org/conference/ures15/call-for-participation

September 08, 2015 06:15 PM

September 04, 2015

Chris Cooper (coop)

RelEng & RelOps Weekly highlights - September 04, 2015

catlee and coop returned from PTO, and lo, there was much rejoicing, mostly from Amy who held down the fort while the rest of her management team was off doing other stuff.

Modernize infrastructure: Thanks to Morgan, we can now trigger TaskCluster (TC) jobs based on GitHub pushes and pull requests. See the TC docs for more info: http://docs.taskcluster.net/services/taskcluster-github/

Dustin started a discussion thread about changing how we do Linux builds in TC: https://groups.google.com/forum/#!topic/mozilla.dev.builds/xmJCsSUDywE

Improve CI pipeline: Created a development environment for the Buildbot Taskcluster Bridge, which will make development of it faster and safer. (https://bugzil.la/1199247)

Various improvements to the Buildbot Taskcluster Bridge

We made some upgrades to releng systems to allow them to take advantage of mercurial clones being served from the CDN. See gps’ blog post for more details: http://gregoryszorc.com/blog/2015/09/01/serving-mercurial-clones-from-a-cdn/

Release: We put the finishing touches on a couple of releases this week, namely Thunderbird 38.2.0 and Firefox 41.0b7. Jordan just stepped into the releaseduty role for the Firefox 41 cycle and is doing great work by all accounts.

Operational: The tree-closing window (TCW) came and eventually went over the weekend. A few things went sideways:

Thanks to everyone who helped wrestle our systems back under control: Hal, Rail, Jordan, and especially Nick who spent a substantial portion of his New Zealand weekend getting things working again.

See you again next week!

September 04, 2015 09:22 PM

September 03, 2015

Chris Cooper (coop)

RelEng & RelOps Weekly highlights - August 29, 2015

Happy weekend, friends of releng!

It’s been a very release and ops heavy week for lots of people on our teams. We were heads down building multiple Firefox releases and dealing with some issues related to load on the Mozilla VCS infrastructure and issues due to infrastructure configuration changes. We’re also working jointly with IT, as we speak, to perform a number of maintenance tasks during our regular tree closing window.

Modernize infrastructure: Dustin and Anthony are working on TTL support in tooltool.

Q brought up another hand-built Windows 10 machine to double the available pool size on try. At the same time Callek continues to coordinate with developers in his efforts to green up Windows 10 testing.

Q updated Microsoft Deployment Tools (MDT) to 2013 update 1. XP booting is still being worked on but we took a big step forward in being able to support Windows 10.

Rob determined why our AWS Windows spot instances weren’t spinning up and patched a few bugs in our cloud-tool and slavealloc configurations. Now we need to work on making sure builds function properly. One step closer to having Windows builds in AWS.

Improve CI pipeline: Dustin has been working on running linux builds in a docker image running the latest version of CentOS.

Ben has been working on Buildbot Taskcluster Bridge improvements:

Rail has funsize reporting to production treeherder (https://bugzil.la/725726)

Rail got funsize to generate partial updates 30-50% faster after enabling local diffing cache.

Release: Jordan was point on handling multiple go to build requests and shipping important point releases (40.0.3, 38.2.1).

Operational: Dustin is working on a plan to migrate TreeStatus out of PHX1 and into relengapi in SCL3 as part of the PHX1 exit strategy.

Rail regenerated our git-shared tarball and increased the disk space on our linux64 AWS AMIs in an effort to help with the load issues we’ve been experiencing recently on our internal git servers (https://bugzil.la/1199524).

See you again next week!

September 03, 2015 10:54 PM

RelEng & RelOps Weekly highlights - August 23, 2015

Welcome to a double issue of our weekly highlights email, covering the last two action-packed weeks in RelEng and RelOps!

Modernize infrastructure: Rob debugged the spot deployment process and checked in changes to allow cloud tools to allocate Windows spot instances: b-2008-spot (build) & y-2008-spot (try). Testing is ongoing to validate that these instances are capable of performing build work.

Greg and Jonas have been working closely with Treeherder folks and Ed Morley pushed a change that dramatically improves UX for Sheriffs looking at TC jobs.

Before: https://treeherder.allizom.org/logviewer.html#?job_id=10138081&repo=try

After: https://treeherder.allizom.org/logviewer.html#?job_id=10140320&repo=try

Load testing for separating auth into its own service for TaskCluster is complete. See Jonas for details.

Mike rolled out the new indexing to unify the routes between Taskcluster and Buildbot builds (https://bugzil.la/1133074)

Callek continues to make progress on getting Windows 10 tests green (https://bugzil.la/1192842) on our infra. Special thanks to the developer support he has received thus far in getting the various issues addressed. At the same time, Q has been reworking our infra to support Windows 10 in our deployment toolchains (which will enable us to bring up more Windows 10 machines to begin to meet capacity needs).

Improve release pipeline: Ben and Rail made a breakthrough in how to test the new Release Promotion code, which will let us move faster and with more confidence on that project.

The Ship It/release runner development environment is now fully functional, which lets us easily do end to end testing of Release Promotion.

Improve CI pipeline: Callek disabled the jetpack tests running on try that have been broken since addon signing landed. (https://bugzil.la/1192318, https://bugzil.la/1192873)

Kim disabled Android 4.3 debug robocop tests on try which were broken and hidden (https://bugzil.la/1194474)

Kim changed the instance type that Android 4.3 mochitests run on to c3.xlarge, which will allow us to run media tests on emulators (https://bugzil.la/1195893)

Release: Released Firefox 40.0, 40.0.2, 41.0b1, and 41.0b2 for both Desktop and Android (we built 40.0.1 but did not release it). We also built Thunderbird 38.2.

Operational: Amy reimaged the remaining MacOSX 10.8 machines as 10.10 and Kim deployed them to our Yosemite pool which increased the size of this pool by 10% (https://bugzil.la/1173452). Kim also removed the remaining 10.8 configs from our code base (https://bugzil.la/1166311)

Rail fixed the bug (https://bugzil.la/1141416) that was preventing us from reimaging linux32 talos machines, and Alin was able to reimage these machines and add them to our production pool.

Jordan performed the merge-day [uplift-day] config changes (https://bugzil.la/1178507)

Callek removed support for building esr31 now that it is EOL, allowing us to cleanup a bunch of now-obsolete complexity. (https://bugzil.la/1191349)

After discussions with the A-team and sheriffs, Kim increased the parameters for SETA so that tests that historically don’t identify regressions will only run on every 7th push or every 60 minutes on m-i and m-c. (https://bugzil.la/1195803). We hope to revisit this in a few weeks to increase it back to every 10th push and every 90 minutes. Thanks to Alice and Armen for writing mozci tools to allow the sheriffs to backfill jobs to avoid the backout issues on merges that SETA revealed when it was first implemented.

See you next week!

September 03, 2015 10:38 PM

August 07, 2015

Chris Cooper (coop)

RelEng & RelOps Weekly highlights - August 07, 2015

Wow, what a week. Between tree closures and some unexpected security releases, release engineering was stretched pretty thin. Here’s hoping for a more “normal” week next week as we try to release Firefox 40.

Modernize infrastructure: Greg Arndt deployed a fix to mozilla-taskcluster to eliminate noisy ‘deadline-exceeded’ dependent tasks whose parent tasks fail. This improves sheriffing and is part of the work to make TaskCluster and TreeHerder the best of pals.

Jake Watkins implemented timezone and w32time configurations via puppet and compiled ntpdate for Windows (for stepping the clock on systems where w32time has been disabled). This allows us to control time on Windows machines without need of the AD domain.

Q and Callek have managed to get one Windows 10 host connected to the Try server with buildbot. The machine has successfully run a selection of tests so far, some of which are even passing! We still have more work to do in order to green up the other tests and need to create more machines for the pool before we can open the platform up for general use.

Dustin got a Linux build running in a CentOS 6.6 docker image within TaskCluster. There are lots of things to fix, but this will produce much more compatible builds than the earlier attempts with an Ubuntu 14.04 docker image.

Improve release pipeline: Nick increased the l10n and update test chunking, shaving multiple hours off of the release process. This helped immensely as we prepared the release builds for Firefox 40.

Improve CI pipeline: Morgan tweeted about her latest work on .taskclusterrc, bringing WebMaker CI over: https://twitter.com/mrrrgn/status/628653555214192640

Release: Urgent Firefox security releases shipped on multiple branches in under 24 hours! This all happened in the shadow of next week’s milestone release of Firefox 40, and the regular parade of beta builds. Kudos to those on release duty, specifically catlee and Nick, for getting them all out the door without tripping over each other.

Operational: John Ford and Jonas Jensen debugged a frustrating problem in the TC provisioner causing it to fail unexpectedly. Many yaks were shaved. The cause was an obscure bit of logic calling a buggy library in one of our dependencies. Since deploying the fix last week, the TC provisioner has been stable.

Greg blogged about TaskCluster and try: https://twitter.com/gregarndt/status/629063925338914816

Dustin deployed a new version of relengapi that includes support for database migrations (with Alembic), improvements to Archiver, and a new (but not-yet-used) implementation of treestatus.

The buildduty contractors continue to make strides as we knock down access hurdles for them. They are now able to handle slave loans with minimal intervention from releng or IT, and can now update python packages on our internal servers when requested by developers.

See you next week!

August 07, 2015 09:35 PM

August 03, 2015

Rail Alliev (rail)

Funsize enabled for Nightly builds

Keep calm and update Firefox

Note: this post has been sitting in the drafts queue for some reason. Better to publish it. :)

As of Tuesday, Aug 4, 2015 Funsize has been enabled on mozilla-central. From now on all Nightly builds will get updates for builds up to 4 days in the past (for yesterday, for the day before yesterday, etc). This should make people who don't run their nightlies every day happier.

Firefox Developer Edition partial updates will be enabled after 42.0 hits mozilla-aurora, but early adopters can use the aurora-funsize channel.

Partial updates as a part of builds and L10N repacks on nightly will be disabled as soon as Bug 1173459 is resolved.

As a bonus, you can take a look at the presentation I gave in Whistler during the work week.

Reporting Issues

If you see any issues, please report them to Bugzilla.

August 03, 2015 09:22 PM

July 31, 2015

Chris Cooper (coop)

RelEng & RelOps Weekly highlights - July 31, 2015

Welcome back to the weekly releng Friday update! Here’s what we’ve been up to this week.

Modernize infrastructure: Rob checked in code to integrate Windows with our AWS cloud-tools software so that we now have userdata for deploying spot instances (https://bugzil.la/1166448) as well as creating the golden AMI for those instances.

Mark checked in code to update our puppet-managed Windows machines with a newer version of mercurial, working around some installation oddities (https://bugzil.la/1170588).

Now that Windows 10 has been officially released, Q can more easily tackle the GPOs that configure our test machines, verifying which don’t need changes, and which will need an overhaul (https://bugzil.la/1185844). Callek is working to get buildbot setup on Windows 10 so we can start figuring out which suites are failing and engage developers for help.

Improve CI pipeline: With the last security blockers resolved and a few weeks of testing under his belt, Rail is planning to enable Funsize on mozilla-central next Tuesday (https://bugzil.la/1173452)

Release: Uplift starts next week, followed by the official go-to-build for Firefox 40. Beta 9 is out now.

Operational: Buildduty contractors started this week! Alin (aselagea) and Vlad (vladC) from Softvision are helping releng with buildduty tasks. Kim and Coop are trying to get them up-to-speed as quickly as possible. They’re finding lots of undocumented assumptions built into our existing release engineering documentation.

Dustin has migrated our celery backend for relengapi to mysql since we were seeing reliability issues on the rabbit cluster we had been using (https://bugzil.la/1185507).

Our intern, Anthony Miyaguchi, added database upgrade/downgrade ability to relengapi via alembic, making future schema changes painless. (https://github.com/mozilla/build-relengapi/pull/300)

Amy has finished replacing the two DeployStudio servers with newer hardware, OS, and deployment software, and we are now performing local Time Machine backups of their data (https://bugzil.la/1186197). Offsite backups will follow once Bacula releases a new version of their software that correctly supports TLS 1.2.

The new Windows machines we setup last week are now in production, increasing capacity by 10 machines each in the Windows XP, Windows 7, and Windows 8 test pools (https://bugzil.la/1151591).

See you next week!

July 31, 2015 07:33 PM

July 30, 2015

Hal Wine (hwine)

Decoding Hashed known_hosts Files

Decoding Hashed known_hosts Files

tl;dr: You might find this gist handy if you enable HashKnownHosts

Modern ssh comes with the option to obfuscate the hosts it can connect to, by enabling the HashKnownHosts option. Modern server installs have that as a default. This is a good thing.

The obfuscation occurs by hashing the first field of the known_hosts file - this field contains the hostname, port, and IP address used to connect to a host. Presumably, there is a private ssh key on the host used to make the connection, so this process makes it harder for an attacker to utilize those private keys if the server is ever compromised.

Super! Nifty! Now how do I audit those files? Some services have multiple IP addresses that serve a host, so some updates and changes are legitimate. But which ones? It’s a one way hash, so you can’t decode.

Well, if you had an unhashed copy of the file, you could match host keys and determine the host name & IP. [1] You might just have such a file on your laptop (at least I don’t hash keys locally). [2] (Or build a special file by connecting to the hosts you expect with the options “-o HashKnownHosts=no -o UserKnownHostsFile=/path/to/new_master”.)

I threw together a quick python script to do the matching, and it’s at this gist. I hope it’s useful - as I find bugs, I’ll keep it updated.
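
The gist is the canonical version, but the matching idea is simple enough to show in a few lines. Here is a minimal sketch (not the gist itself) that reports the readable host field for each hashed entry whose key appears in an unhashed reference file:

# Minimal sketch: map entries in a hashed known_hosts file back to readable
# host fields by matching host keys against an unhashed reference copy.
import sys

def load_entries(path):
    entries = []
    with open(path) as f:
        for line in f:
            fields = line.strip().split()
            # host field, key type, key data; skip comments and short lines
            if len(fields) >= 3 and not fields[0].startswith('#'):
                entries.append((fields[0], fields[1], fields[2]))
    return entries

def main(hashed_file, reference_file):
    # (keytype, key) -> readable host field from the unhashed reference
    known = {(kt, key): hosts for hosts, kt, key in load_entries(reference_file)}
    for hosts, kt, key in load_entries(hashed_file):
        match = known.get((kt, key))
        print('%s -> %s' % (hosts, match or '(no match in reference file)'))

if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2])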

Bonus Tip: https://github.com/defunkt/gist is a very nice way to manage gists from the command line.

Footnotes

[1] A lie - you’ll only get the host names and IPs that you have connected to while building your reference known_hosts file.
[2] I use other measures to keep my local private keys unusable.

July 30, 2015 07:00 AM

July 27, 2015

Chris Cooper (coop)

The changing face of buildduty, Summer 2015 edition

Previously

Buildduty is the Mozilla release engineering (releng) equivalent of front-line support. It’s made up of a multitude of small tasks, none of which on their own are particulary complex or demanding, but taken in aggregate can amount to a lot of work.

It’s also non-deterministic. One of the most important buildduty tasks is acting as information brokers during tree closures and outages, making sure sheriffs, developers, and IT staff have the information they need. When outages happen, they supercede all other work. You may have planned to get through the backlog of buildduty tasks today, but congratulations, now you’re dealing with a network outage instead.

Releng has struggled to find a sustainable model for staffing buildduty. The struggle has been two-fold: finding engineers to do the work, and finding a duration for a buildduty rotation that doesn’t keep the engineer out of their regular workflow for too long.

I’m a firm believer that engineers *need* to be exposed to the consequences of the software they write and the systems they design:

I also believe that it’s a valuable skill to be able to design a system and document it sufficiently so that it can be handed off to someone else to maintain.

Starting this week, we’re trying something new. We’re shifting at least part of the burden to management: I am now managing a pair of contractors who will be responsible for buildduty for the rest of 2015.

Alin and Vlad are our new contractors, and are both based in Romania. Their offset from Mozilla Standard Time (aka PST) will allow them to tackle the asynchronous activities of buildduty, namely slave loans, non-urgent developer requests, and maintaining the health of the machine pools.

It will take them a few weeks to find their feet since they are unfamiliar with any of the systems. You can find them on IRC in the usual places (#releng and #buildduty). Their IRC nicks are aselagea and vladC. Hopefully they will both be comfortable enough to append |buildduty to those nicks soon. :)

While Alin and Vlad get up to speed, buildduty continues as usual in #releng. If you have an issue that needs buildduty assistance, please ask in #releng, and someone from releng will assist you as quickly as possible. For less urgent requests, please file a bug.

July 27, 2015 10:07 PM

July 24, 2015

Chris Cooper (coop)

RelEng & RelOps Weekly highlights - July 24, 2015

Welcome back. When we last left our heroes, they were battling the combined forces of technical debt and a lack of self-service options. We join the fight already in progress…

Kapow!

Modernize infrastructure: To pave the way for creating continuous integration (CI) automation for Windows 10, Q is auditing all of our Windows 8 GPOs to determine which will work as-is on WIndows 10, which will no longer be needed, and which will require rewriting to work on the new platform (https://bugzil.la/1185844).

Dustin has completed the taskcluster scope handling audit, reported his findings back to the team, and filed bugs for remediation.

Rail has deployed a change that allows us to specify docker images by their sha256 in TaskCluster, reducing the risk of MITM attacks. This was one of the hard security blockers for Funsize (https://bugzil.la/1175561).

After much discussion, we’ve chosen to move forward with installing hg as an EXE on Windows for the time being. Mark is implementing this method so we can continue progress towards moving Windows 2008 builds into AWS (https://bugzil.la/1170588).

Morgan has a prototype of some github/TaskCluster integration: if your project lives in github/mozilla, you can drop a .taskclusterrc file in the base of your repository and the jobs will just start running after each pull request (http://linuxpoetry.com/blog/23/)

Dustin is migrating treestatus to relengapi, removing one of the blockers to exiting the PHX1 datacenter and centralizing another of our many web apps (https://bugzil.la/1181153).

Amy is working on replacing the servers we use to image mac builders and testers. This will allow us to perform backups of critical information and will prepare us for new OS X 10.10 hardware that’s in the purchasing pipeline now (https://bugzil.la/1186197).

Kim disabled Android 4.0 test jobs by default on Try as another step toward disabling Pandas as a test platform as we move Android 4.3 test jobs to emulators on AWS (https://bugzil.la/1184117)

Today is the last day of Anhad’s internship. :( His end-of-internship presentation is now available on Air Mozilla: http://mzl.la/1JlpT0P This week he met with Anthony to hand off his work on getting Windows builds working with the generic worker in TaskCluster. (https://bugzil.la/1180775)

Improve release pipeline: Ben worked with OpSec to generate a new GPG signing key (replacing our expired one) and deploy it to our Nightly and Release signing servers. We are also working to improve the monitoring around signing key expiry to avoid future fire drills.

Improve CI pipeline: Jordan has re-deployed the change that switches over all future CI and release jobs to using a Gecko-based copy of Mozharness (http://jordan-lund.ghost.io/mozharness-goes-live-in-the-tree/).

Ben has been working towards stopping automated builds of XULRunner for the Firefox 42.0 cycle, starting September 22nd, 2015 (https://groups.google.com/forum/#!topic/mozilla.dev.planning/mmNWxHOt_lw)

Release: Firefox 40 is currently in beta. We’re up to b7 now.

Operational: Amy and Coop have worked with DCops to re-balance the Linux/Windows test pools, removing 30 machines from the Linux talos pools and increasing capacity by 10 machines each in the Windows XP, Windows 7, and Windows 8 test pools (https://bugzil.la/1151591).

We giveth: This week we enabled the B2G 2.2r branch. (https://bugzil.la/1177598)

…and we taketh away: We also disabled many obsolete B2G builds/branches to improve throughput and reclaim capacity.

vcs-sync is now running in AWS! Hal made the official switch this week after running both setups in parallel for a while. This allows us to retire some ancient hardware in the datacenter.

Callek touched over 55 bugs this week as buildduty, many of them during triage and resolution of machine loans. (http://tinyurl.com/nhhpjyr)

Will our heroes emerge victorious? Tune in next week!

July 24, 2015 04:57 PM

July 23, 2015

Morgan Phillips (mrrrgn)

git push origin taskcluster

If you've been around Mozilla during the past two years, you've probably heard talk about TaskCluster - the fancy task execution engine that a few awesome folks built for B2G - which will be used to power our entire CI/Build infrastructure soonish.

Up until now, there have been a few ways for developers to schedule custom CI jobs in TaskCluster, but nothing completely general. However, today I'd like to give a sneak peek at a project I've been working on to make the process of running jobs on our infra extremely simple: TaskCluster GitHub.

Why Should You Care?

1.) The service watches an entire organization at once: if your project lives in github/mozilla, drop a .taskclusterrc file in the base of your repository and the jobs will just start running after each pull request - dead simple.

2.) TaskCluster will give you more control over the platform/environment: you can choose your own docker container by default, but thanks to the generic worker we'll also be able to run your jobs on Windows XP-10 and OSX.

3.) Expect integration with other Mozilla services: For a mozilla developer, using this service over travis or circle should make sense, since it will continue to evolve and integrate with our infrastructure over time.

It's Not Ready Yet: Why Did You Post This? :(

Because today the prototype is working, and I'm very excited! I also feel that there's no harm in spreading the word about what's coming.

When this goes into production I'll do something more significant than a blog post, to let developers know they can start using the system. In the meantime here it is handling a replay of this pull request. \o/ Note: the finished version will do nice things, like automatically leave a comment with a link to the job and its status.

July 23, 2015 04:26 AM

July 22, 2015

Armen Zambrano G. (@armenzg)

Few mozci releases + reduced memory usage

While I was away, adusca published a few releases of mozci.

From the latest release I want to highlight that we're replacing the standard json library with ijson, since it solves some memory leak issues we were facing with pulse_actions (bug 1186232).

This was important to fix since our Heroku instance for pulse_actions has an upper limit of 1GB of RAM.
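
For illustration, the difference boils down to streaming versus loading everything at once (the file name and the counting are made up for the example):

# ijson streams elements one at a time, so memory stays roughly constant;
# json.load() would materialize the whole document at once, which is what
# was pushing pulse_actions past the 1GB limit.
import ijson

def count_items(path):
    # 'item' is ijson's prefix for the elements of a top-level JSON array.
    count = 0
    with open(path) as f:
        for _ in ijson.items(f, 'item'):
            count += 1
    return count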

Here are the release notes and the highlights of them:




Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

July 22, 2015 02:21 PM

July 21, 2015

Jordan Lund (jlund)

Mozharness now lives in Gecko

What's changed?

Continuous-integration and release jobs that use Mozharness will now get Mozharness from the Gecko repo that the job is running against.

How?

Whether the job is a build (requires a full gecko checkout) or a test (only requires a Firefox/Fennec/Thunderbird/B2G binary), automation will first grab a copy of Mozharness from the gecko tree, even before checking out the rest of the tree. This effectively minimizes changes to our current infra.

This is thanks to a new relengapi endpoint, Archiver, and hg.mozilla.org's subdirectory archiving abilities. Essentially, Archiver will get a tarball of Mozharness from within a target gecko repo, rev, and sub-repo-directory and upload it to Amazon's S3.

What's nice about Archiver is that it is not restricted to just grabbing Mozharness. You could, for example, put https://hg.mozilla.org/build-tools in the Gecko tree or, improving on our tests.zip model, simply grab subdirectories from within the testing/* part of the tree and request them on a suite by suite basis.

What does this mean for you?

it depends. if you are...

1) developing on Mozharness

You will need to check out gecko, and patches will now land like any other gecko patch: 1) land on a development tree-branch (e.g. mozilla-inbound) 2) ride the trains. This now means:

This also means:

2) just needing to deploy Mozharness or get a copy of it without gecko

Like the usage docs linked to Archiver above, you could hit the API directly. But I recommend using the client that buildbot uses. The client will wait until the api call is complete, download the archive from a response location, and unpack it to a specified destination.

Let's take a look at that in action: say you want to download and unpack a copy of mozharness based on mozilla-beta at 93c0c5e4ec30 to some destination.

python archiver_client.py mozharness --repo releases/mozilla-beta --rev 93c0c5e4ec30 --destination /home/jlund/downloads/mozharness  

Note: if that was the first time Archiver was polled for that repo + rev, it might take a few seconds as it has to download Mozharness from hgmo and then upload it to S3. Subsequent calls will happen near instantly

Note 2: if your --destination path already exists with a copy of Mozharness or something else, the client won't rm that path, it will merge (just like unpacking a tarball behaves)

3) a Release Engineering service that is still using hg.mozilla.org/build/mozharness

Not all Mozharness scripts are used for continuous integration / release jobs. There are a number of Releng services that are based on Mozharness: e.g. Bumper, vcs-sync, and merge_day. Until these services transition to using Archiver, they will continue to use hgmo/build/mozharness as the Repository of Record (RoR).

If certain services can not use gecko-based Mozharness, then we can fork Mozharness and set up a separate repo. That will of course mean such services won't receive upstream changes from the gecko copy, so we should avoid this if possible.

If you are an owner or major contributor to any of these releng services, we should meet and talk about such a transition. Archiver and its client should make deployments pretty painless in most cases.

Have something that may benefit from Archiver?

If you want to move something into a larger repository or be able to pull something out of such a repository for lightweight deployments, feel free to chat to me about Archiver and Relengapi.

As always, please leave your questions, comments, and concerns below

July 21, 2015 10:28 PM

July 20, 2015

Chris Cooper (coop)

RelEng & RelOps Weekly highlights - July 17, 2015

Welcome to the weekly releng Friday update, only this time on a Monday!

I’ve done away with the gory details section. It was basically a thin filter for bugzilla search results, and we all spend enough time in bugzilla already.

tl;dr

TaskCluster: Funsize is generating partial updates for nightly/aurora builds now! We’re generating partial updates for up to 4 days in the past: link to TreeHerder results, which are hidden by default.

You can set your update channel to ‘nightly-funsize’ to test.

This quarter, we’re working on a scopes and authentication/credentials audit of TaskCluster to make sure it’s secure enough to move build/testing load from buildbot to TaskCluster. Hal is leading this effort with the OpSec team.

Our interns are also hard at work on migrations to TaskCluster. Anhad finished his work migrating spidermonkey builds and tests (https://bugzil.la/1164656), while Anthony is working on uploading symbols via a separate task (https://bugzil.la/1168979).

Modernize infrastructure: Runner is now enabled on all our Windows build machines. One of the biggest benefits of this is that runner is performing most clobber/purge work before buildbot starts and so build jobs don’t need to waste so much time clobbering build directories or freeing up space. https://bugzil.la/1055794

We’re starting to investigate what the requirements are to stand up Windows 10 CI infrastructure. We’re attacking both the build integration side and the OS installation and configuration side simultaneously.

We’ve finished collecting performance data for Windows in AWS and have chosen the c3.2xlarge platform as our base for future 2008 instances.

New proposal for TaskCluster routes for buildbot/TaskCluster uploads: Mike is looking for feedback about how we organize builds in the TaskCluster index. These routes will make it possible to find builds via various parameters like platform, revision, or build date. https://bugzil.la/1133074

Mozharness in-tree: The mozharness archiver was deployed but encountered problems with celery task proliferation. Jordan wrote some code to better track and expire the celery tasks, and deployed it late last week. We hope to resume the in-tree migration this week. https://bugzil.la/1182532

Improve release pipeline: Ben has been working on killing XULRunner builds and replacing them with the Firefox SDK we’re already producing. This will really simplify our release pipeline, and clean up our codebase as well. https://bugzil.la/672509

Improve CI pipeline: Ted got 64-bit OS X cross-compiling in one of the existing docker containers! He still needs to figure out universal builds, but this is a big step forward. https://bugzil.la/921040

Release: Firefox 40 is currently in beta. We’re up to b5 now.

Operational: A bad commit that landed on upstream master for “repo” caused trees to be closed for many hours last Wednesday. We eventually got back in business by stripping commits on the master. There are bugs on file now to improve how we handle these repos in automation going forward to avoid precisely this kind of problem.

I took particular solace in this bug because somewhere, someone decided that naming a git repo “repo” was a good idea. Releng is not the only group that is terrible at naming things. https://bugzil.la/1184422

We’ve fixed some bugs in Metric Collective, our OS-level metrics collection software for Windows, and bundled it into an exe for use with our puppet-managed Windows servers.

We’ve gotten a nuget repo set up on our configuration management servers and work is starting to make that the default package manager for puppet-managed Windows hosts.

There was a big, disruptive, tree-closing window (TCW) over the weekend, and everything went smoothly from our perspective.

See you next week!

July 20, 2015 04:31 PM

July 17, 2015

Kim Moir (kmoir)

Learning Data Science and evidence based teaching methods

This spring, I took several online courses on the topic of data science.  I became interested in expanding my skills in this area because as release engineers, we deal with a lot of data.  I wanted to learn new tools to extract useful information from the distributed systems behemoth we manage.


This xkcd reminded me of the challenges of managing our buildfarm some days :-)
From http://xkcd.com/1546/

I took three courses from Coursera's Data Science track, offered by Johns Hopkins University. As with previous Coursera classes I've taken, all the course material is online (lecture videos and notes). There are quizzes and assignments due each week, and each course below was about four weeks long.

The Data Scientist's Toolbox - This course was pretty easy: basically an introduction to the questions data scientists deal with, plus a primer on installing R and RStudio (an IDE for R) and using GitHub.
R Programming - An introduction to R. Most of the quizzes and programming exercises used publicly available data. I had to do a lot of reading in the R API docs or on Stack Overflow to finish the assignments; the lectures didn't provide all the material needed to complete them. There were lots of techniques for subsetting data in R, which I found quite interesting; it reminded me a lot of querying databases with SQL to conduct analysis (a small Python analogue follows this list).
Getting and Cleaning Data - More advanced techniques in R. Using publicly available data in different formats (XML, Excel spreadsheets, comma- or tab-delimited files), we had to clean the data, answer questions about it, and conduct specific analyses by writing R programs. The assignments were pretty challenging and took a long time. Again, the course material didn't cover everything needed for the assignments, so a lot of additional reading was required.
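Since the actual R exercises aren't reproduced here, the short Python/pandas sketch below is only an analogue of that subsetting-versus-SQL comparison; the CSV file and column names are made up:

# Rough pandas analogue of subsetting data the way you would with SQL.
# 'air_quality.csv' and its columns are hypothetical stand-ins for the
# public datasets used in the course.
import pandas as pd

df = pd.read_csv("air_quality.csv")

# SQL: SELECT city, ozone FROM air_quality WHERE ozone > 50;
high_ozone = df[df["ozone"] > 50][["city", "ozone"]]

# SQL: SELECT city, AVG(ozone) FROM air_quality GROUP BY city;
avg_by_city = df.groupby("city")["ozone"].mean()

print(high_ozone.head())
print(avg_by_city.head())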

There are six more courses in the Data Science track that I'll start tackling in the fall, covering subjects such as reproducible research, statistical inference and machine learning.  My next Coursera class is Introduction to Systems Engineering, which I'll start in a couple of weeks.  I've become really interested in learning more about this subject after reading Thinking in Systems.

The other course I took this spring was the Software Carpentry instructor training course.  The Software Carpentry Foundation teaches researchers basic software skills.  For instance, if you are a biologist analyzing large data sets, it is useful to learn how to use R, Python, and version control so you can store the code you write and share it with others.  These are not skills that many scientists acquire in their formal university training, and learning them allows them to work more productively.  The instructor course was excellent; thanks Greg Wilson for your work teaching us.

We read two books for this course:
Building a Better Teacher: An interesting overview of how teaching is taught in different countries and how to make it more effective. The most important lesson: create more opportunities for other teachers to observe your classroom and give feedback, which I found analogous to how code review makes us better software developers.
How Learning Works: Seven Research-Based Principles for Smart Teaching: A book summarizing research from disciplines such as education, cognitive science and psychology on effective techniques for teaching students new material: how assessing students' prior knowledge can help you design better lessons, how to ask questions that reveal what material students are failing to grasp, how to understand students' motivation for learning, and more.  Really interesting research.

For the instructor course, we met online every couple of weeks; Greg would lead a short discussion of some of the topics on a conference call while we discussed interactively via etherpad.  Later in the week we would meet in smaller groups for practice teaching exercises.  We also submitted example lessons to the course repo on GitHub.  The final project was to teach a short lesson to a group of instructors who gave feedback, and to submit a pull request fixing an existing lesson.  After that we are ready to sign up to teach a Software Carpentry course!

In conclusion, data science is a great skill to have if you are managing large distributed systems.  Also, using evidence based teaching methods to help others learn is the way to go!

Other fun data science examples include
Tracking down the Villains: Outlier Detection at Netflix - detecting rogue servers with machine learning
Finding Shoe Stores in 100k Merchants: Using Data to Group All Things - finding out what Shopify merchants sell shoes using Apache Spark and more
Looking Through Camera Lenses: The Application of Computer Vision at Etsy

July 17, 2015 03:41 PM