Welcome back. As the podiatrist said, lots of exciting stuff is afoot.
The big news from the past few weeks comes from the TaskCluster migration project where we now have nightly updates being served for both Linux and Android builds on the Date project branch. If you’re following along in treeherder, this is the equivalent of “tier 2” status. We’re currently working on polish bugs and a whole bunch of verification work before we attempt to elevate these new nightly builds to tier 1 status on the mozilla-central branch, effectively supplanting the buildbot-generated variants. We hope to achieve that goal before the end of 2017. Even tier 2 is a huge milestone here, so cheers to everyone on the team who has helped make this happen, chiefly Aki, Callek, Kim, Jordan, and Mihai.
Firefox 50 has been released. We are currently in the beta cycle for Firefox 51, which will be extra long to avoid pushing out a major version release during the busy holiday season. We are still on deck to ship a minor security release during this period. Everyone involved in the process applauds this decision.
OK, I’ve given up the charade that these are weekly now. Welcome back.
Thanks to an amazing effort from Ed Morley and the rest of the Treeherder team, Treeherder has been migrated to Heroku, giving us significantly more flexible infrastructure.
Git-internal is no longer a standalone single point of failure (SPOF)! A warm standby host is running, and repository mirroring is in place. We now also have a fully matching staging environment for testing.
Improve Release Pipeline:
Aki and Catlee attended the security offsite and came away with todo items and a list to prioritize to improve release security.
Aki released scriptworker 0.8.0; this gives us signed chain of trust artifacts from scriptworkers, and gpg key management for chain of trust verification.
There is still some remaining setup to be done, mostly around updates and moving artifacts into the proper locations (beetmover). Releng will then begin internal testing of these new nightlies (essentially dogfooding) to ensure that important things like updates are working correctly before we uplift this code to mozilla-central.
We hope to make that switch for Linux/Android nightlies within the next month, with Mac and Windows coming later this quarter.
During a recent tree closing window (TCW), the database team managed to successfully switch the buildbot database from MyISAM to InnoDB format for improved stability. This is something we’ve wanted to do for many years and it’s good to see it finally done.
We’re currently on beta 10 for Firefox 50. This is noteworthy because in the next release cycle, Firefox 52 will be uplifted to Aurora (Developer Edition). Firefox 52 will be the last version of Firefox to support Windows XP, Windows Vista, and universal binaries on Mac; it is due for release in March of 2017. Don’t worry though: all these platforms will move to the Firefox 52 ESR branch, where they will continue to receive security updates for another year beyond that.
However, in practice, a serial (or monotonically increasing) key can be
handy to have around. I was reminded of this during a recent situation
where we (app developers & ops) needed to be highly confident that a
replica was consistent before performing a failover. (None of us had
access to the back end to see what the DB thought the replication lag
was.)
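A crude check along these lines is to compare the maximum value of the serial key on the primary and on the replica. A minimal sketch, assuming a hypothetical `events` table with a serial `id` column (table and column names are invented for illustration):

```python
def replica_caught_up(primary_conn, replica_conn, table="events", key="id"):
    """Compare the max serial key on the primary vs. the replica.

    With a monotonically increasing key, the replica has applied all
    primary writes (as of the moment of the first query) once its max
    key reaches the primary's max key.
    """
    query = f"SELECT COALESCE(MAX({key}), 0) FROM {table}"
    primary_max = primary_conn.execute(query).fetchone()[0]
    replica_max = replica_conn.execute(query).fetchone()[0]
    return replica_max >= primary_max
```

This only proves the replica has seen every row up to the primary's snapshot, not that every row matches; for a failover decision you would pair it with a quiesced primary so the max key stops moving.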
I've had the opportunity to attend the Beyond the Code conference for the past two years. This year the venue moved to Toronto; the previous two events had been held in Ottawa. The conference is organized by Shopify, who again managed to put together a really great speaker lineup on a variety of interesting topics. It was a two-track conference, so I'll summarize some of the talks I attended.
The conference started off with Anna Lambert of Shopify welcoming everyone to the conference.
The first speaker was Atlee Clark, Director of App and Developer relations at Shopify who discussed the wheel of diversity.
The wheel of diversity is a way of mapping the characteristics that you're born with (age, gender, gender expression, race or ethnicity, national origin, mental/physical ability), along with those that you acquire through life (appearance, education, political belief, religion, income, language and communication skills, work experience, family, organizational role). You can see how diverse your team is by mapping different characteristics to different colours. (Of course, some of these characteristics are personal and might not be shared with others.) If you map your team and it's mostly the same colour, then you probably won't bring different perspectives together when you work, because you all have similar backgrounds and life experiences. This is especially important when developing products.
This wheel applies to hiring too. You want different perspectives in the room when you're interviewing someone. Atlee mentioned that when she was hiring for a new role, she mapped out the characteristics of the people who would be conducting the interviews and found there was a lot of yellow.
So she switched up the team that would be conducting the interviews to include people with more diverse perspectives.
She finished by stating that this is just a tool, keep it simple, and practice makes it better.
The next talk was by Erica Joy, who is a build and release engineer at Slack, as well as a diversity advocate. I have to admit, when I saw she was going to speak at Beyond the Code, I immediately pulled out my credit card and purchased a conference ticket. She is one of my tech heroes. Not only did she build the build and release pipeline at Slack from the ground up, she is an amazing writer and advocate for change in the tech industry. I highly recommend reading everything she has written on Medium, her chapter in Lean Out and all her discussions on twitter. So fantastic.
Her talk at the conference was "Building a Diverse Corporate Culture: Diversity and Inclusion in Tech". She talked about how literally thousands of companies say they value inclusion and diversity. However, few talk about what they are willing to give up in order to achieve it. Are you willing to give up your window seat with a great view? Something else, so that others can be paid fairly? She mentioned that change is never free. People need both mentorship and sponsorship in order to progress in their careers.
I really liked her discussion around hiring and referrals. She stated that when you hire people you already know, you're probably excluding equally or better qualified people that you don't know. By default, women of colour are underpaid.
(Chart: the pay gap for white women, African American women, and Hispanic women compared to a white man in the United States.)
Some companies have referral systems that give larger referral bonuses for referring people who are underrepresented in tech; she gave the example of Intel, which has this in place. This is a way to incentivize your referral system so you don't just hire all your white friends.
The average white American has 91 white friends and one black friend so it's not very likely that they will refer non-white people. Not sure what the numbers are like in Canada but I'd guess that they are quite similar.
In addition, don't ask people to work for free, to speak at conferences or do diversity and inclusion work. Her words were "We can't pay rent with exposure".
Spend time talking to diversity and inclusion experts. There are people who have spent their entire lives conducting research in this area, and you can learn from their expertise. Meritocracy is a myth; we are just lucky to be in the right place at the right time. She mentioned that her colleague Duretti Hirpa at Slack points out the need for accomplices, not allies: people who will actually speak up for others, so that people feeling pain or facing a difficult work environment don't have to do all the work of fighting for change.
In most companies, there aren't escalation paths for human issues either. If a person is making sexist or racist remarks, shouldn't that be a firing offense?
If people were really working hard on diversity and inclusion, we would see more women and people of colour on boards and in leadership positions. But we don't.
She closed with a quote from Beyonce:
"If everything was perfect, you would never learn and you would never grow"
The next talk I attended was by Coraline Ada Ehmke, who is an application engineer at GitHub. Her talk was about the "Broken Promise of Open Source". Open source has the core principles of the free exchange of ideas, success through collaboration, shared ownership and meritocracy.
However, meritocracy is a myth. Currently, only 6% of GitHub users are women. The environment can be toxic, which drives a lot of people away. She mentioned that we don't have diversity numbers for open source beyond gender, but GitHub plans to run a survey soon to try to acquire more data.
Gabriel Fayant from Assembly of Seven Generations gave a talk entitled "Walking in Both Worlds: traditional ways of being and the world of technology". I found this quite interesting; she talked about traditional ceremonies and how they promote the idea of living in the moment, and thus looking at your phone during a drum ceremony isn't living the full experience. A question from an audience member who works in the engineering faculty at the University of Toronto was how we can work with indigenous communities to share our knowledge of technology and make youth producers of tech, not just consumers.
The next talk was by Sandi Metz, entitled "Madame Santi tells your future". This was a totally fascinating look at the history of printing text from scrolls all the way to computers.
She gave the same talk at another conference earlier, so you can watch it here. It described the progression of printing technology from 7000 years ago until today. Each new technology disrupted the previous one, and it was difficult for those who worked on the previous technology to make the jump to the new one.
So according to Sandi, what is your future?
What you are working on now probably won't be relevant in 10 years
You will all die
All the people you love will die
Your body will start to fail you
Life is short
Tell people that you love them
Guard your health
Spend time with your kids
Get some exercise (she loves to bike)
We are bigger than tech
Community and schools need help
She gave the example of Habitat for Humanity where she volunteers
These organizations also need help to write code, they might not have the knowledge or time to do it right
The last talk I attended was by Sabrina Geremia of Google Canada. She talked about the factors that encourage a girl to consider computer science (encouragement, career perception, self-perception and academic exposure.)
I found that this talk was interesting but it focused a bit too much on the pipeline argument - that the major problem is that girls are not enrolling in CS courses. If you look at all the problems with environment, culture, lack of pay equity and opportunities for promotion due to bias, maybe choosing a career where there is more diversity is a better choice. For instance, law, accounting and medicine have much better numbers for these issues, despite there still being an imbalance.
At the end of the day, there was a panel to discuss diversity issues:
Moderator: Ariti Sharma, Shopify, Panelists: Mohammed Asaduallah, Format, Katie Krepps, Capital One Canada, Lateesha Thomas, Dev Bootcamp, Ramya Raghavan, Google, Kara Melton, TWG, Gladstone Grant, Microsoft Canada
Some of my notes from the panel
Be intentional about seeking out talent
Fix culture to be more diverse
Recruit from bootcamps. Better diversity today. Don't wait for universities to change the ratios.
Environment impacts retention
Conduct an engagement survey to see if underrepresented groups feel that their voices are being heard.
There is a need for sponsorship, not just mentoring. Define a role that doesn't exist at the company. A sponsor can make that role happen by advocating for it at higher levels
Mentoring works better when mentors are matched by demographics. They will recognize the challenges you will face in the industry better than a white man who has never directly experienced sexism or racism.
Sponsors tend to be men due to the demographics of our industry
At Microsoft, when you reach a certain level you are expected to mentor an underrepresented person
Look at compensation and representation across diverse groups
Attrition is normal and varies by region; it is especially acute in San Francisco.
Women leave companies at 2x the rate of men due to culture
You shouldn't stay at a place if you are burnt out, take care of yourself.
Compared to the previous two iterations of this conference, it seemed that this time it focused a lot more on solutions to have more diversity and inclusion in your company. The previous two conferences I attended seemed to focus more on technical talks by diverse speakers.
As a side note, there were a lot of Shopify folks in attendance because they ran the conference. They sent a bus of people from their head office in Ottawa to attend it. I was really struck by how diverse some of the teams were. I met a group of women who described themselves as a team of "five badass women developers" 💯 As someone who has been the only woman on her team for most of her career, this was beautiful to see and gave me hope for the future of our industry. I've visited the Ottawa Shopify office several times (Mr. Releng works there) and I know that the representation in their office doesn't match the demographics of the Beyond the Code attendees, which skewed more towards women and people of colour. But still, it is refreshing to see a company making a real effort to make their culture inclusive. I've read that it is easier to make your culture inclusive from the start, rather than trying to make difficult culture changes years later when your teams are already homogeneous. So kudos to them for setting an example for other companies.
Thank you Shopify for organizing this conference, I learned a lot and I look forward to the next one!
Amy and Alin decommissioned all but 20 of our OS X 10.6 test machines, and those last few will go away when we perform the next ESR release. The next ESR release corresponds to Firefox 52, and is scheduled for March next year.
Improve Release Pipeline:
Ben finally completed his work on Scheduled Changes in Balrog. With it, we can pre-schedule changes to Rules, which will help minimize the potential for human error when we ship, and make it unnecessary for RelEng to be around just to hit a button.
Lots of other good Balrog work has happened recently too, which is detailed in Ben’s blog post.
Improve CI Pipeline:
Windows TaskCluster builders were split into level-1 (try) and level-3 (m-i, m-c, etc) worker types with sccache buckets secured by level.
Windows 10 AMI generators were added to automation in preparation for Windows 10 testing on TaskCluster. We’ve been looking to switch from testing on Windows 8 to Windows 10, as Windows 8 usage continues to decline. The move to TaskCluster seems like a natural breakpoint to make that switch.
Dustin's massive patch set to enable in-tree config of the various build kinds landed last week. This was no small feat; kudos to him for the testing and review stamina it took to get that done. Those of us working to migrate nightly builds to TaskCluster are now updating – and simplifying – our task graphs to leverage his work.
After work to fix some bugs and make it reliable, Mark re-enabled the cron job that generates our Windows 7 AWS AMIs each night.
Now that many of the Windows 7 tests are being run in AWS, Amy and Q reallocated 20 machines from Windows 7 testing to XP testing to help with the load. We are reallocating 111 additional machines from Windows 7 to XP and Windows 8 in the upcoming week.
Amy created a template for postmortems and created a folder where all of Platform Operations can consolidate their postmortem documents.
Jake and Kendall took swift action on the TCP Challenge ACK side-channel vulnerability. It has been mitigated with a sysctl workaround on all Linux hosts and instances.
Jake pushed a new version of the mig-agent client which was deployed across all Linux and OS X platforms.
Hal implemented the new GitHub feature to require two-factor authentication for several Mozilla organizations on GitHub.
Rail has automated re-generating our SHA-1 signed Windows installers for Firefox, which are served to users on old versions of Windows (XP, Vista). This means that users on those platforms will no longer need to update through a SHA-1 signed, watershed release (we were using Firefox 43 for this) before updating to the most recent version. This will save XP/Vista users some time and bandwidth by creating a one-step update process for them to get the latest Firefox.
PyBay held their first local Python conference this last weekend
(Friday, August 19 through Sunday, August 21). What a great event! I
just wanted to get down some first impressions - I hope to do more after
the slides and videos are up.
Two of our interns finished up their terms recently. The Toronto office feels a little emptier without them, or at least that’s how I imagine it since I don’t actually work in Toronto.
Trunk-based Fennec debug builds+tests have been disabled in buildbot and moved to tier 1 in Taskcluster
Improve Release Pipeline:
Mihai finished migrating release sanity to Release Promotion. Release sanity helps to catch release issues that are susceptible to human process error. It has been live since 48.0b9, and has already helped catch a few issues that could have delayed the releases!
Improve CI Pipeline:
Rob created an automated build using taskcluster-github to create new TaskCluster Windows AMIs whenever the worker type manifest (or opencloudconfig) is updated.
Nearly everything that was indexed on MXR has been indexed on DXR! A few stragglers remain, but they are very low-use trees. Additionally, mxr.mozilla.org now brings up an interstitial page which offers a link to DXR instead of the hardhat.
We’ve released Firefox 48 which brings many long-awaited features to our users.
Our amazing interns Francis Kang and Connor Sheehan have finished their tenure with us. They did grace us with their intern presentations before they left, which you can replay on Air Mozilla:
Thanks to both Connor and Francis for their hard work, and thanks also to their mentors Rail and Kim for helping them become such valuable contributors.
We’ve filled the two release engineering positions we had advertised. It felt like we had an embarrassment of riches at some points during that hiring process. Thanks to everyone who took the time to apply.
Bug 1287604 - Experiment with different AWS instance types for TC linux64 builds
Some initial experiments have shown we can shave 20 minutes off an average linux64 build by using more powerful AWS instances, with a reasonable cost tradeoff. We’ll start the work of migrating to these new instances soon.
Bug 1272083 - Downloading and unzipping should be performed as data is received
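The idea in that bug is to overlap the download with the extraction instead of running them back to back. Since zip's central directory sits at the end of the file and makes true streaming awkward, here is an illustrative sketch using a gzipped tar stream; the archive format and layout are assumptions for illustration, not what the automation actually fetches:

```python
import tarfile

def extract_streaming(fileobj, dest="."):
    """Extract a gzipped tar archive while its bytes are still arriving.

    Mode "r|gz" treats fileobj as a non-seekable sequential stream, so
    it works directly on a network response object (e.g. the file-like
    object returned by urllib.request.urlopen), and each member is
    written to disk as soon as its bytes come in.
    """
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        # Stream mode requires strictly sequential access: extract each
        # member as we encounter it, never seeking backwards.
        for member in tar:
            tar.extract(member, path=dest)
```

With this kind of overlap, the end-to-end time approaches max(download time, extract time) rather than their sum.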
Last night, I attended my first Ottawa Python Authors Meetup. It was the first time I had attended, despite wanting to for a long time. (Mr. Releng also works with Python, and thus every time there's a meetup, we discuss who gets to go and who gets to stay home and take care of little Releng. It depends on whether the talk is more relevant to his work interests or mine.)
The venue was across the street from Confederation Park aka land of Pokemon.
I really enjoyed it. The people I chatted with were very friendly and welcoming. Of course, I ran into some people I used to work with, as seems to happen at any tech event in Ottawa. Nice to catch up!
The venue had the Canada Council for the Arts as a tenant, thus the quintessentially Canadian art.
The speaker that night was Emily Daniels, a developer from Halogen Software, who spoke on Artificial Intelligence with Python. (Slides here, github repo here.) She mentioned that she writes Java during the day but works on fun projects in Python at night. She started the talk by going through some examples of artificial intelligence on the web. Perhaps the most interesting one I found was a recurrent neural network called Benjamin, which generates movie script ideas and was trained on existing sci-fi movies and movie scripts. A short film called Sunspring was made from one of the generated scripts. The dialogue is kind of stilted, but it is an interesting concept.
After the examples, Emily then moved on to how it all works.
Deep learning is a type of machine learning that derives meaning from data using a hierarchy of multiple layers that mimics the neural networks of our brain.
She then spoke about a project she wrote to create generative poetry from an RNN (recurrent neural network). It was based on an RNN tutorial that she heavily refactored to meet her needs. She went through the code that she developed to generate artificial prose from the works of H.G. Wells and Jane Austen. She talked about how she cleaned up the text to remove EOL delimiters, page breaks, chapter numbers and so on. It then took a week to train the model with the data.
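A cleanup pass along the lines she described might look like this. The exact rules here are guesses at what Gutenberg texts typically need, not her code:

```python
import re

def clean_gutenberg_text(text):
    """Normalize a raw Project Gutenberg text for model training:
    strip chapter headings and page breaks, and undo hard line wraps."""
    # Drop chapter headings like "CHAPTER IV" or "Chapter 12"
    text = re.sub(r"(?im)^\s*chapter\s+[ivxlc\d]+\.?\s*$", "", text)
    # Remove form-feed page-break characters
    text = text.replace("\f", "")
    # Collapse hard-wrapped lines within each paragraph into one line,
    # keeping blank lines as paragraph separators
    paragraphs = re.split(r"\n\s*\n", text)
    paragraphs = [" ".join(p.split()) for p in paragraphs if p.strip()]
    return "\n\n".join(paragraphs)
```

Each paragraph then becomes one clean training line, with the book's layout artifacts gone.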
She then talked about another example which used data from Jack Kerouac and Virginia Woolf novels, and posts some of the results to Twitter.
She also created a Twitter account which posts generated text from an RNN trained on the works of Walt Whitman and Emily Dickinson. (I should mention at this point that she chose these authors for her projects because the copyrights on these works have expired and they are available from Project Gutenberg.)
After the talk, she fielded a number of really insightful audience questions. There were discussions on the inherent bias in the data because it was written by humans who are sexist and racist. She mentioned that she doesn't post the model's results to Twitter automatically, because some of them are really inappropriate: the models learned from text written by humans, who are inherently biased.
One thing I found really interesting is that Emily mentioned that she felt a need to ensure that the algorithms and data continue to exist, and that they were faithfully backed up. I began to think about all the Amazon instances that Mozilla releng had automatically killed that day as our capacity had peaked and declined. And of the great joy I feel ripping out code when we deprecate a platform. I personally feel no emotional attachment to bring down machines or deleting used code.
Perhaps the sense that these recurrent neural networks and the data they create need a caretaker comes from the fact that the algorithms output text that is a simulacrum of the work of an author we enjoy reading. And perhaps that is why we aren't as attached to an ephemeral pool of build machines as we are to our phones: the phone provides a sense of human connection to the larger world when we may be sitting alone.
Thank you Emily for the very interesting talk, to the Ottawa Python Authors Group for organizing the meetup, and Shopify for sponsoring the venue. Looking forward to the next one!
I received this very kind email in my inbox this morning.
"David Williams has expired your commit rights to the eclipse.platform.releng project. The reason for this change is:
We have all known this day would come, but it does not make it any easier. It has taken me four years to accept that Kim is no longer helping us with Eclipse. That is how large her impact was, both on myself and Eclipse as a whole. And that is just the beginning of why I am designating her as "Committer Emeritus". Without her, I humbly suggest that Eclipse would not have gone very far. Git shows her active from 2003 to 2012 -- longer than most! She is (still!) user number one on the build machine. (In Unix terms, that is UID 500). The original admin, when "Eclipse" was just the Eclipse Project.
She was not only dedicated to her job as a release engineer, she was passionate about doing all she could to make other committers' jobs easier so they could focus on their code and specialties. She did (and still does) know that release engineering is a field of its own; a specialized profession (not something to "tack on" at the end that just anyone can do), and good, committed release engineers are critical to the success of any project.
For anyone reading this that did not know Kim, it is not too late: you can follow her blog at
You will see that she is still passionate about release engineering and influential in her field.
And, besides all that, she was (I assume still is :) a well-rounded, nice person, that was easy to work with! (Well, except she likes running for exercise. :)
Thanks, Kim, for all that you gave to Eclipse and my personal thanks for all that you taught me over the years (and I mean before I even tried to fill your shoes in the Platform).
We all appreciate your enormous contribution to the success of Eclipse and happy to see your successes continuing.
To honor your contributions to the project, David Williams has nominated you for Committer Emeritus status."
Thank you David! I really appreciate your kind words. I learned so much working with everyone in the Eclipse community. I had intended to keep contributing to Eclipse when I left IBM, but really felt that I had given all I had to give. Few people have the chance to contribute to two fantastic open source communities during their career. I'm lucky to have had that opportunity.
My IBM friends made this neat Eclipse poster when I left. The Mozilla dino displays my IRC handle.
Manual backfilling uses Buildapi to schedule jobs. If we switched to scheduling via TaskCluster/Buildbot-bridge, we would get better results, since we could guarantee proper scheduling of a build plus its associated dependent jobs; Buildapi does not give us this guarantee. This is mainly useful when backfilling PGO test and Talos jobs.
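The guarantee described above, that a backfilled test can never be scheduled without the build it depends on, amounts to expressing the jobs as a small dependency graph. A toy sketch of that idea (function and task naming are invented here, not mozci's API):

```python
def backfill_graph(revision, test_names):
    """Build a task graph for backfilling one push: a build task plus
    test tasks that explicitly depend on it, so a scheduler that
    honors dependencies can never run a test without its build."""
    build_id = f"build-{revision}"
    tasks = {build_id: {"depends_on": []}}
    for name in test_names:
        tasks[f"{name}-{revision}"] = {"depends_on": [build_id]}
    return tasks
```

With Buildapi each job is requested independently, so this structural guarantee simply isn't expressible; a dependency-aware scheduler makes it the default.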
If instead you're interested in contributing to mozci, you can have a look at the issues.
The developer survey conducted by Engineering Productivity last fall indicated that debugging test failures that are reported by automation is a significant frustration for many developers. In fact, it was the biggest deficit identified by the survey. As a result,
the Engineering Productivity Team (aka A-Team) is working on improving the user experience for debugging test failures in our continuous integration and speeding up the turnaround for Try server jobs.
This quarter’s main focus is on:
Debugging tests on interactive workers (Linux on TaskCluster only)
Improving end-to-end times on Try (Thunder Try project)
For all bugs and priorities you can check out the project management page for it:
Which jobs have the longest total wall clock time (i.e. are the largest consumers of resources)
Putting Mozharness steps’ data inside Treeherder’s database for aggregate analysis
TaskCluster Linux builds are currently built using a mix of m3/r3/c3 2xlarge AWS instances, depending on pricing and availability. We’re going to be looking to assess the effects on build speeds of using more powerful AWS instances types, as one potential way of reducing e2e Try times.
Firefox has its own built-in update system. The update system supports
two types of updates: complete and incremental. Completes can be
applied to any older version, unless there are some incompatible changes
in the MAR format.
Incremental updates can be applied only to the release they were generated for.
Usually for the beta and release channels we generate incremental
updates against 3-4 versions. This way we try to minimize bandwidth
consumption for our end users and increase the number of users on the
latest version. For Nightly and Developer Edition builds we generate 5
incremental updates using funsize.
Both methods assume that we know ahead of time what versions should be
used for incremental updates. For releases and betas we use ADI stats
to be as precise as possible. However, these methods are static and
don't use real-time data.
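The ADI-driven choice can be sketched as a simple ranking: generate partials against the previous versions with the most active installs, so they cover as many users as possible. The data shape and the cap of four partials below are assumptions for illustration:

```python
def partial_candidates(adi_by_version, max_partials=4):
    """Pick the previous versions with the highest active daily
    installs (ADI); these are the versions worth generating
    incremental updates against."""
    ranked = sorted(adi_by_version.items(), key=lambda kv: kv[1], reverse=True)
    return [version for version, _adi in ranked[:max_partials]]
```

The static nature of the approach is visible here: the ADI snapshot is taken once at release time, so any shift in the user population afterwards is invisible to it.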
The idea to generate incremental updates on demand has been around for
ages. Some of the challenges are:
Acquiring real-time (or close to real-time) data for making decisions
on incremental update versions
Size of the incremental updates. If the size is very close to the size
of the corresponding complete, there is little reason to serve the
incremental update. One reason is that the updater tries to use the
incremental update first, and then falls back to the complete in case
something goes wrong. In that case the updater downloads both the
incremental and the complete.
Ben and I talked about this today
and to recap some of the ideas we had, I'll put them here.
We still want to "pre-seed" the most likely incremental updates before
we publish any updates
Whenever Balrog serves a
complete-only update, it should generate a structured log entry and/or
an event to be consumed by some service, which should contain all
information required to generate an incremental update.
The "new system" should be able to decide whether to discard an
incremental update, based on its size. These decisions should be
stored, so we don't try to generate the same incremental update again
next time. This information may be stored in Balrog to prevent further
attempts.
Before publishing incremental updates, we should test that they can
be applied without issues, similar to the update verify tests we run
for releases, but without hitting Balrog. After they pass this test,
we can publish them to Balrog and check that Balrog returns the
expected XML with partial info in it.
Minimize the number of served completes if we plan to generate
incremental updates. One of the ideas was to modify the client to
support responses like "Come back in 5 minutes, I may have something
for you".
The only remaining thing is to implement all these changes. :)
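The size-based discard decision above could be as simple as a ratio test: a failed partial makes the client download the complete anyway, so a partial close to the complete's size risks costing users more bandwidth than serving the complete alone. The 80% threshold here is an assumption for illustration, not a number from the discussion:

```python
def worth_serving_partial(partial_size, complete_size, max_ratio=0.8):
    """Decide whether a generated partial MAR is worth publishing.

    If the partial isn't meaningfully smaller than the complete, the
    bandwidth saved on success doesn't cover the cost of the failure
    path, where the updater downloads both the partial and the
    complete.
    """
    return partial_size <= max_ratio * complete_size
```

Storing the boolean per (from-version, to-version) pair, as suggested above, would keep the system from regenerating a partial it has already decided to discard.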
Windows try builds were enabled on TC Windows 2012 worker types in staging (allizom) (win32/win64, opt/debug). If all goes well, this will propagate to production in the coming days. This is the first set of non-Linux tasks we’ve had running reliably in TC, which is obviously a huge step in our migration away from buildbot.
tl;dr: We’ll be shutting down the Firefox mirrors on Bitbucket.
A long time ago we started an experiment to see if there was any
support for developing Mozilla products on social coding sites. Well,
the community-at-large has spoken, with the results many predicted:
This quarter has been a tough one for me. It has been a mix of organizing people and projects and implementing prototypes. It’s easy to forget what you have worked on, especially when plans and ideas change along the way.
Beware! This blog post may be a little unstructured, especially towards the end.
The quarter as a whole
Before the quarter commenced I was planning on adding TaskCluster jobs to Treeherder as my main objective. However, it quickly changed as GSoC submissions came around. We realized that we had two to three candidates interested in helping us. This turned into creating three potential projects. Two projects that came out of it are "refactor SETA and enable TaskCluster support" and "add unscheduled TaskCluster jobs to Treeherder." Both of these bring us closer to feature parity with Buildbot. Once this was settled, a lot of conversations happened with the TaskCluster team to make sure that Dustin's work and ours lined up well (he’s worked on refactoring how the ‘gecko decision task’ schedules tasks).
Around this time a project was completed, which made creating TaskCluster clients an excellent self-serve experience. This was key for me in reducing the number of times I had to interrupt the TaskCluster team to request that they adjust my clients.
Also around this time, another change was deployed that allowed developers’ credentials to assume almost all scopes required to schedule TaskCluster tasks without an intermediary tool with powerful scopes. This is very useful for creating tools that let developers schedule tasks directly from the command line with their personal credentials. I created some prototypes to prove this concept. Here’s a script to schedule a Linux 64 task. Here’s the blog post explaining it.
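The linked script isn’t reproduced here, but scheduling a task with personal credentials boils down to building a task definition and handing it to the Queue service. A minimal sketch follows; the provisioner, worker type, docker image, and metadata values are illustrative, not necessarily what the production graph uses:

```python
from datetime import datetime, timedelta, timezone

def iso(dt):
    """Render a datetime the way TaskCluster expects (UTC, Z-suffixed)."""
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

def linux64_task(now=None):
    """Build a minimal TaskCluster task definition (values illustrative)."""
    now = now or datetime.now(timezone.utc)
    return {
        "provisionerId": "aws-provisioner-v1",
        "workerType": "desktop-test",
        "created": iso(now),
        "deadline": iso(now + timedelta(hours=1)),
        "payload": {
            "image": "ubuntu:16.04",
            "command": ["/bin/bash", "-c", "echo hello from TaskCluster"],
            "maxRunTime": 600,
        },
        "metadata": {
            "name": "linux64 sample task",
            "description": "Scheduled from the command line",
            "owner": "user@example.com",
            "source": "https://example.invalid/sample-script",
        },
    }

# With taskcluster-client.py and your personal credentials configured,
# submission is roughly:
#   queue = taskcluster.Queue()
#   queue.createTask(taskcluster.slugId(), linux64_task())
```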
During this quarter, Dustin refactored how scheduling of tasks was accomplished in the gecko decision task. For the project "adding new TaskCluster jobs," this was a risk as it could have made scheduling tasks either more complicated or not possible without significant changes. After many discussions, it seemed that we were fine to proceed as planned.
Out of these conversations a new idea was born: "action tasks." The beauty of action tasks is that they're atomic units of processing that can make complicated scheduling requests very easy. You can read martianwars’ blog post (under “What are action tasks?”) to learn more about them. Action tasks are defined in-tree to schedule task labels for a push. The project as originally defined had a very big scope (goal: make Treeherder find action task definitions and integrate them in the UI), and some technical issues were encountered that made me worry more would follow (e.g., the limited scopes granted to developers; this is not an issue anymore). My focus switched to making pulse_actions requests visible on Treeherder. When switching deliverables I did not realize that we could have taken the first part of the project and implemented just that. In any case, a reduced scope is being implemented by martianwars since, after Dustin's refactoring, we need to put our graph through "optimizations" that determine which nodes should be removed from it. That code lives in-tree, which makes action tasks the right solution, since they can reuse the in-tree logic where the optimization code lives.
While working on my deliverables, I also discovered various things and created various utility projects.
TaskCluster developer experiments (prototypes):
In this repository I created various prototypes that make scheduling tasks from the command line extremely easy.
This project allows you to dump a Pulse queue into a file. It also allows you to "replay" the messages and process them as if feeding from a real queue. This was crucial to test code changes for pulse_actions.
Treeherder submitter is a Python package which makes it very easy to submit jobs to Treeherder. I made pulse_actions submit jobs to Treeherder with this package. I had to write it because the stock Treeherder client made it easy to shoot myself in the foot. Various co-workers have written similar code; however, none of it was packaged for reuse by others (understandably). This package lets you submit jobs with the minimum amount of data necessary and helps you transition between job states (pending, running, completed).
Unfortunately, I have not had time to upstream this code due to the end of quarter being upon me, however I would like to upstream the code if the team is happy with it. On the other hand, Treeherder will soon be switching to a Pulse-based submission model for ingestion and the Python client might not be used anymore.
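The state-transition help described above can be sketched as a tiny forward-only state machine. The class and method names here are hypothetical, not the package’s real API:

```python
class JobStateMachine:
    """Sketch of the forward-only job lifecycle Treeherder expects."""

    STATES = ["pending", "running", "completed"]

    def __init__(self):
        self.state = "pending"

    def advance(self, new_state):
        """Move to a later state; reject unknown or backwards transitions."""
        if self.STATES.index(new_state) <= self.STATES.index(self.state):
            raise ValueError(f"illegal transition: {self.state} -> {new_state}")
        self.state = new_state
        return self.state

job = JobStateMachine()
job.advance("running")
job.advance("completed")   # pending -> running -> completed is the happy path
```

Guarding transitions like this is what keeps callers from, say, marking a job completed and then re-submitting it as pending.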
This package allows you to upload files to an S3 bucket with a 31-day expiration.
It takes advantage of a cool feature that the TaskCluster team provides.
The first step is an API call requesting temporary S3 credentials for the bucket associated with your TaskCluster credentials. You can then upload to your assigned prefix within that bucket. It is extremely easy!
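A sketch of that flow, assuming taskcluster-client.py’s Auth API and boto3; the bucket and prefix names here are made up:

```python
import posixpath

def object_key(prefix, filename):
    """Keys must live under the prefix your temporary credentials grant."""
    return posixpath.join(prefix.rstrip("/"), filename)

# Sketch of the upload flow (requires taskcluster-client.py and boto3);
# the bucket and prefix are illustrative:
#   auth = taskcluster.Auth()
#   resp = auth.awsS3Credentials("read-write", "my-bucket", "project/armenzg/")
#   creds = resp["credentials"]
#   s3 = boto3.client("s3",
#                     aws_access_key_id=creds["accessKeyId"],
#                     aws_secret_access_key=creds["secretAccessKey"],
#                     aws_session_token=creds["sessionToken"])
#   s3.upload_file("log.txt", "my-bucket",
#                  object_key("project/armenzg/", "log.txt"))

print(object_key("project/armenzg/", "log.txt"))
```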
Writing tests for a large project can be very time-consuming, especially if it calls for "integration tests"
Writing Mocks and patching functions for services like BuildApi, allthethings.json, TaskCluster and Treeherder can take a lot of work
Trying to test a Pulse multi-exchange-based consumer can be hard
This is probably because it is difficult to write integration tests
I developed "pulse replay" to help me here; however, I did not create automated tests for each scenario
Contributors and I don't like writing tests
I'm glad that when doing reviews I can ask for contributors to write tests; otherwise, I don't think that we would have what we have!
Writing tests is not easy, especially integration tests. It takes time to learn how to write them properly.
It also does not give you the satisfaction of thinking, "I built this feature."
The good thing about writing tests this quarter is that I finally learned how to write them.
I also have another post in the works about how to increase test coverage
I also learned that code written by contributors and reviewed by me does not necessarily have the same quality as it would if I had fully focused on it myself. Not that they don't write superb code; rather, due to my experience with the project, I have more context. I noticed this when I started writing tests, which puts me into an “ideal code” or “big picture” mode. While writing tests I can also spot refactoring opportunities to make the code more maintainable and understandable. Writing tests puts you in a very different mindset than reviewing someone else’s code, even though I tried to enter this "maintainable code mindset" while reviewing for others.
I've improved my knowledge of writing tests
I really didn’t have much experience in this before this quarter
I had the opportunity to spend last week in Portland for PyCon 2016. I'd like to share some of my thoughts and
some pointers to good talks I was able to attend. The full schedule can be
found here and all the
videos are here.
Brandon Rhodes' Welcome to PyCon was one of the best introductions to a
conference I've ever seen. Unfortunately, I can't find a link to a
recording. What I liked about it was that he made
everyone feel very welcome to PyCon and to Portland. He explained some of
the simple (but important!) practical details like where to find the
conference rooms, how to take transit, etc. He noted that for the first
time, they have live transcriptions of the talks being done and put up on
screens beside the speaker slides for the hearing impaired.
He also emphasized the importance of keeping questions short during Q&A
after the regular sessions. "Please form your question in the form of a
question." I've been to way too many Q&A sessions where the person asking
the question took the opportunity to go off on a long, unrelated tangent. For
the most part, this advice was followed at PyCon: I didn't see very many
long winded questions or statements during Q&A sessions.
Ned Batchelder gave this great talk about using python's language features
to debug problematic code. He ran through several examples of tricky
problems that could come up, and how to use things like monkey patching
and the debug trace hook to find out where the problem is. One piece of
advice I liked was when he said that it doesn't matter how ugly the code
is, since it's only going to last 10 minutes. The point is to get the
information you need out of the system the easiest way possible, and then
you can undo your changes.
I found this interesting, but a little too theoretical. Object
capabilities are an approach completely orthogonal to access control
lists as a way to model security and permissions. It was hard for me to see how we could
apply this to the systems we're building.
A really cool intro to the Home Assistant
project, which integrates all kinds of IoT type things in your home. E.g.
Nest, Sonos, IFTTT, OpenWrt, light bulbs, switches, automatic sprinkler
systems. I'm definitely going to give this a try once I free up my
Lots of good information about how classes work in Python, including some
details about meta-classes. I think I understand meta-classes better after
having attended this session. I still don't get descriptors though!
(I hope Mike learns soon that __new__ is pronounced "dunder new" and not
"under under new"!)
I really enjoyed this talk. Cory Benfield describes the importance of
keeping a clean separation between your protocol parsing code, and your IO.
It not only makes things more testable, but makes code more reusable.
Nearly every HTTP library in the Python ecosystem needs to re-implement its
own HTTP parsing code, since all the existing code is tightly coupled to
the network IO calls.
Really good talk about failing gracefully. He covered some familiar topics
like adding timeouts and retries to things that can fail, but also
introduced me to the concept of circuit breakers. The idea with a circuit
breaker is to prevent talking to services you know are down. For example,
if you have failed to get a response from service X the past 5 times due
to timeouts or errors, then open the circuit breaker for a set amount of
time. Future calls to service X from your application will be intercepted,
and will fail early. This can avoid hammering a service while it's in an
error state, and works well in combination with timeouts and retries of course.
I was thinking quite a bit about Ben's redo module during this talk. It's a
great module for handling retries!
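A minimal circuit breaker along the lines described above might look like this. The thresholds are arbitrary, and real implementations usually add a "half-open" probe state rather than fully resetting after the window elapses:

```python
import time

class CircuitBreaker:
    """Fail fast after repeated errors instead of hammering a dead service."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Circuit is open: fail early without touching the service.
                raise RuntimeError("circuit open: failing fast")
            # Window elapsed: close the circuit and try the service again.
            self.opened_at = None
            self.failures = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

Wrapping each outbound service call in `breaker.call(...)` combines naturally with the timeouts and retries mentioned in the talk.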
I didn't end up going to this talk, but I did have a chance to chat with
Brian before. magic-wormhole is a tool to safely transfer
files from one computer to another. Think scp, but without needing ssh
keys set up already, or direct network flows. Very neat tool!
The infamous GIL is gone! And your Python programs only run 25x slower!
Larry describes why the GIL was introduced, what it does, and what's
involved with removing it. He's actually got a fork of Python with the GIL
removed, but performance suffers quite a bit when run without the GIL.
Our globetrotting releng heroes are off to London next week for the Mozilla all-hands meeting, but the weekly highlights are back after a brief Nigerian vacation. ;)
Joel, Q, and Catlee have moved another ⅓ of our Windows 7 test load into AWS, bringing us to about ⅔ of our total test load migrated. This further reduces our turnaround time and backlog for that platform.
Mark landed several Puppet patches to support package installs for Windows 7, including the Windows 7 SDK, Apache, and others. This furthers our progress toward programmatically installed and managed Windows 7 systems.
Kendall and Jake spent the week together making significant progress towards phase 1 of moving mozreview and autoland into the cloud as production-grade services. The new architecture has been planned, diagrammed, passed a preliminary security review, and much of the new deployment automation and management code has been written.
We’ve dropped support for architectures that don’t support SSE2. This allows us to switch to VS2015 for compilation on Windows, which brings with it a substantial improvement (>1 hour) in Windows PGO build times. See the thread in dev.platform for details.
Migrated to a new build or continuous integration system
Implemented a new release or deployment pipeline
Implemented tooling to simplify managing your apps in a mobile store
Significantly reduced build time with parallelization or some other interesting optimization!
Moved your build and test system to containers
Refactored your infrastructure code for a live production environment
... we'd love to see your submission to the workshop
We'd like to encourage people new to speaking to apply, as well as those from underrepresented groups in tech. We'd love to hear from some new voices and new companies!
Submissions are due July 1, 2016. If you have questions on the submission process, topics to submit, or anything else, I'm happy to help! I'm kmoir and I work at mozilla.com, or contact me on twitter. Submit early and often!
Last week I attended DevOpsDays Toronto. It was my first time attending a DevOpsDays event and it was quite interesting. It was held at CBC's Glenn Gould studios which is a quick walk from the Toronto Island airport where I landed after an hour flight from Ottawa. This blog post is an overview of some of the talks at the conference.
Glenn Gould Studios, CBC, Toronto.
Statue of Glenn Gould outside the CBC studios that bear his name.
The day started out with an introduction from the organizers and a brief overview of the history of DevOpsDays. They also made a point of reminding everyone that they had agreed to the code of conduct when they bought their ticket. I found this explicit mention of the code of conduct quite refreshing.
The first talk of the day was from John Willis, evangelist at Docker. He gave an overview of the state of enterprise DevOps. I found this a fresh perspective because I really don't know what happens in enterprises with respect to DevOps, since I have been working in open source communities for so long. John providing an overview of what DevOps encompasses.
DevOps is a continuous feedback loop.
He talked a lot about how empathy is so important in our jobs. He mentioned that Netflix has a slide deck that describes the company culture. He doesn't know if this is still the case, but he had heard that if you showed up for an interview at Netflix without having read the culture deck, you would be automatically disqualified from further interviews. Etsy and Spotify have similar open documents describing their culture.
He gave us some reading to do. I've read the "Release It!" book, which is excellent and has some fascinating stories of software failure in it; I've added the other books to my already long reading list.
He stated that it's a long-standing mantra that you can have only two of fast, cheap, and good, but recent research shows that today we can make many changes quickly, and if there is a failure, the mean time to recovery is short.
He left us with some more books to read.
The second talk was a really interesting one by Hany Fahim, CEO of VM Farms. It was a short mystery novella describing how VM Farms' servers suddenly experienced a huge traffic spike when the Brazilian government banned WhatsApp as the result of a legal order. I love a good war story.
Hany discussing the day VM Farms suddenly saw a huge increase in traffic.
This was a really important point. When your system is failing to scale, it's important to decide if it's a valid increase in traffic or malicious.
Looking on Twitter, they found that a court in Brazil had recently ruled that WhatsApp would be blocked for 48 hours. Users started circumventing the block via VPN. Looking at their logs, they determined that most of the traffic resolved to IP addresses in Brazil and that connection times during SSL handshakes were unusually long.
In conclusion, switching to multi-core HAProxy fixed a lot of issues. Also, Twitter was and continues to be a great source of information on activity happening in other countries. WhatsApp was returned to service and then banned a second time, and their servers were able to keep up with the demand.
After lunch, we were back to more talks. The organizers came on stage for a while to discuss the afternoon's agenda. They also remarked that one individual had violated the code of conduct and had been removed from the conference. So, the conference had a code of conduct, and steps were taken when it was violated.
Next up, Bridget Kromhout from Pivotal gave a talk entitled Containers will not Fix your Broken Culture. I first saw Bridget speak at Beyond the Code in Ottawa in 2014 about scaling the streaming services for Drama Fever on AWS. At the time, I was moving our mobile test infrastructure to AWS so I was quite enthralled with her talk because 1) it was excellent 2) I had never seen another woman give a talk about scaling services on AWS. Representation matters.
The summary of the talk was that no matter what tools you adopt, you need to communicate with each other about the cultural changes that are required to implement new services. A new microservices architecture is great, but if the teams implementing these services are not talking to each other, the implementation will not succeed.
Bridget pointing out that the technology we choose to implement is often about what is fashionable.
Shoutout to Jennifer Davis and Katherine Daniels' Effective DevOps book. (Note: I've read it on Safari online and it is excellent. The chapter on hiring is especially good.)
Loved this poster about the wall of confusion between development and operations.
In the afternoon, there were lightning talks and then open spaces. Open spaces are free-flowing discussions where the topics are voted upon ahead of time. I attended ones on infrastructure automation, CI/CD at scale, and my personal favourite, horror stories. I do love hearing how distributed systems can go down and how to recover. I found the conversations useful, but it seemed like some of them were dominated by a few voices. I think it would be better if the person that suggested the topic for the open space also volunteered to moderate the discussion.
He started by giving some key platform characteristics. Stores on Shopify have flash sales that have traffic spikes so they need to be able to scale for these bursts of traffic.
From commit to deploy in 10 minutes. Everyone can deploy. This has two purposes. First, it keeps the developer involved in the deploy process: if it only takes 10 minutes, they can watch to make sure their deploy succeeds, whereas if it takes longer, they might move on to another task. Second, the quick deploy process can delight customers with the speed of deployment. They also deploy in small batches to ensure that the mean time to recover is small if a change needs to be rolled back.
Buildkite is a third-party build and test orchestration service. They wrote a tool called Scrooge that adjusts the number of EC2 nodes based on current demand to reduce their AWS bills. (Similar to what Mozilla releng does with cloud-tools.)
Shopify uses an open source orchestration tool called ShipIt. I was sitting next to my colleague Armen at the conference, and he started chuckling at this point because at Mozilla we also wrote an application called Ship It, which release management uses to kick off Firefox releases. Shopify also has an overall view of the ShipIt deployment process which allows developers to see the percentage of nodes where their change has been deployed. One of the questions after the talk was why they use AWS for their deployment pipeline when they use machines in data centres for their actual customers. Answer: they use AWS where resiliency is not an issue.
Building containers is computationally expensive. He noted that a lot of engineering resources went into optimizing the layers in the Docker containers to isolate changes to the smallest layer. They built a service called Locutus to build the containers on commit and push them to a registry. It employs caching to make the builds smaller.
One key point that John also mentioned is that they had a team dedicated to optimizing their deployment pipeline. It is unreasonable to expect developers working on the core Shopify platform to also optimize the pipeline.
It was an interesting perspective. I've seen quite a few talks about bringing DevOps culture and practices to the operations side of the house, but the perspective of teaching developers about it is discussed less often.
He emphasized the need to empower developers to use DevOps practices by giving them tools and showing them how to use them. For instance, if they need to run Docker to test something, walk them through it so they will know how to do it next time.
The final talk I'll mention is by Will Weaver. He talked about how hard it is to show prospective clients that he has CI and testing experience when that experience is not open to the public. So he implemented tests and CI for his dotfiles on GitHub.
He had excellent advice on how to work on projects outside of work to showcase skills for future employers.
Diversity and Inclusion
As an aside, whenever I'm at a conference I note the number of people in the "not a white guy" group. This conference had an all-male organizing committee, though not all white men. (I recognize that not all diversity is visible, i.e. mental health, gender identity, sexual orientation, immigration status, etc.) There was only one woman speaker, but there were a few non-white speakers. There were very few women attendees. I'm not sure what the process was to reach out to potential speakers other than the CFP.
There were slides that showed diverse developers which was refreshing.
Loved Roderick's ops vs dev slide.
I learned a lot at the conference and am thankful for all the time that the speakers took to prepare their talks. I enjoyed all the conversations I had learning about the challenges people face in the organizations implementing continuous integration and deployment. It also made me appreciate the culture of relentless automation, continuous integration and deployment that we have at Mozilla.
I don't know who said this during the conference, but I really liked it:
Shipping is the heartbeat of your company
It was interesting to learn how all these people are making their companies' heartbeats stronger via DevOps practices and tools.
FIRST, I MUST SOLICIT YOUR STRICTEST CONFIDENCE IN THIS MIGRATION. THIS IS BY VIRTUE OF ITS NATURE AS BEING UTTERLY CONFIDENTIAL AND ‘TOP SECRET’. I AM SURE AND HAVE CONFIDENCE OF YOUR ABILITY AND RELIABILITY TO HANDLE A TASK OF THIS GREAT MAGNITUDE INVOLVING A PENDING MIGRATION REQUIRING MAXIMUM CONFIDENCE.
WE ARE TOP OFFICIAL OF THE RELENG TEAM WHO ARE INTERESTED IN MIGRATION OF TASKS INTO TASKCLUSTER WITH JOBS WHICH ARE PRESENTLY TRAPPED IN BUILDBOT. IN ORDER TO COMMENCE THIS BUSINESS WE SOLICIT YOUR ASSISTANCE TO ENABLE US MIGRATE INTO YOUR TASKGRAPH THE SAID TRAPPED JOBS.
<marked as spam><deleted>
The ongoing work to get TaskCluster building Firefox on Windows reached an important milestone with our first Mozharness-based build going green in the Treeherder dashboard. This represents underlying effort in documenting the dependency chain for Windows builders and producing simple manifests that give greater transparency to changes in this area.
Alin, Amy, and Van brought 192 new OS X 10.10.5 Mac minis online. This should eliminate our Yosemite backlog and allow us to enable more e10s tests.
Q, catlee, and jmaher migrated the first batch of Windows 7 tests to AWS this week. Currently we’re running these suites for all branches of gecko 49 and higher: Web platform tests + reftests, gtest, cppunit, jittest, jsreftest, crashtest. This will reduce our reliance on hardware and allow us to scale dynamically. They are still working on greening larger sets of tests which are more sensitive to their operating environment. Once we have moved a significant portion of tests, we can add additional e10s tests on w7 as well.
Improve Release Pipeline:
In the interest of making progress on migrating Nightly builds to TaskCluster, we had a meeting last week to discuss the security requirements around our nightly release process. Based on the discussions in that meeting, Aki is now iterating on a “one graph” solution for Nightlies (as opposed to a two-graph approach where signing is separate). If this approach works, i.e. we can’t find major security holes in the proposed model, it will simplify our process greatly.
Improve CI Pipeline:
We have achieved our first deprecation milestone in the TaskCluster migration by turning off Linux 64-bit debug builds/tests for aurora/trunk branches in buildbot. These builds are only generated in TaskCluster now.
We're delighted to have Francis Kang and Connor Sheehan join the Mozilla release engineering team as summer interns. Francis is studying at the University of Toronto while Connor attends McMaster University in Hamilton, Ontario. We'll have another intern (Anthony) join us later on in the summer who will be working from our San Francisco office.
They are both already off to a great start and have pull requests merged into production that fixed some release promotion issues. Their code was used in the Firefox 47.0 beta 5 release promotion that we ran last night, so their first week was quite productive.
Mentoring an intern provides an opportunity to see the systems we run from a fresh perspective. They both have lots of great questions, which makes us revisit why design decisions were made and whether we could do things better. Like all teaching roles, I always find that I learn a tremendous amount from the experience, and I hope they have fun learning real-world software engineering concepts with respect to running large distributed systems.
The interns are coming! The interns are coming! It’s too late, they’re already here!
Yes, intern season has begun. Releng welcomes Francis Kang and Connor Sheehan for the summer. They will be working on the long tail of release promotion tasks to start. Kim and Rail will be mentoring them. Our other (returning) intern, Anthony Miyaguchi, joins us next month.
Mozilla continues to discuss the future of XP support. Many more users would be affected than with OS X, but since the OS itself is no longer supported by Microsoft, there is only so much Mozilla can do to provide a secure browser on an inherently insecure platform. It’s also a huge burden on developers to make new features work (or provide an alternative) on an aging/ancient platform. Lots of factors to consider here to find balance.
Improve CI Pipeline:
Aki released version 0.1.0 of scriptworker. Scriptworker is an async python TaskCluster worker, designed for specific Release Engineering needs such as signing and interacting with our update servers (Balrog).
Vlad added five new Mac test masters to spread the load for existing machines as well as providing capacity for the new Mac test machines that will soon be installed in our data centre. We’ve had very high pending counts for Mac tests recently, so having more machines on which to run those tests, as well as more masters to defray that load, should help alleviate the chronic backlog. (https://bugzil.la/1264417)
Shipped Thunderbird 45.1b1, Fennec 47.0b2, Firefox 47.0b2, Firefox 47.0b3, Firefox 46.0.1, Fennec 46.0.1, and Firefox 45.1.1esr. Check out the release notes for more details:
Two weeks worth of awesome crammed into one blog post. Can you dig it?
Kendall and Greg have deployed new hg web nodes! They’re bigger, better, faster! The four new nodes have more processing power than the old ten nodes combined. In addition, all of the web and ssh nodes have been upgraded to CentOS 7, giving us a modern operating system and better security.
Relops and jmaher certified Windows 7 in the cloud for 40% of tests. We’re now prepping to move those tests. The rest should follow soon. From a capacity standpoint, moving any Windows testing volume into the cloud is huge.
Mark deployed new versions of hg and git to the Windows testing infrastructure.
Rob’s new mechanisms for building TaskCluster Windows workers give us transparency on what goes into a builder (single page manifests) and have now been used to successfully build Firefox under mozharness for TaskCluster with an up-to-date toolchain (mozilla-build 2.2, hg 3.7.3, python 2.7.11, vs2015 on win 2012) in ec2.
Improve Release Pipeline:
Firefox 46.0 Release Candidates (RCs) were all done with our new Release Promotion process. All that work in the beta cycle for 46.0 paid off.
Varun began work on improving Balrog’s backend to make multifile responses (such as GMP) easier to understand and configure. Historically it has been hard for releng to enlist much help from the community due to the access restrictions inherent in our systems. Kudos to Ben for finding suitable community projects in the Balrog space, and then more importantly, finding the time to mentor Varun and others through the work.
Improve CI Pipeline:
Aki’s async code has landed in taskcluster-client.py! Version 0.3.0 is now on pypi, allowing us to async all the python TaskCluster things.
With build promotion well sorted for the Firefox 46 release, releng is switching gears and jumping into the TaskCluster migration with both feet this month. Kim and Mihai will be working full-time on migration efforts, and many others within releng have smaller roles. There is still a lot of work to do just to migrate all existing Linux workloads into TaskCluster, and that will be our focus for the next 3 months.
Vlad and Amy landed patches to decommission the old b2g bumper service and its infrastructure.
Alin created a dedicated server to run buildduty tools. This is part of an ongoing effort to separate services and tools that had previously been piggybacking on other hosts.
Amy and Jake beefed up our AWS puppetmasters and tweaked some timeout values to handle the additional load of switching to puppet aspects. This will ensure that our servers stay up to date and in sync.
What’s better than handing stuff off? Turning stuff off. Hal started disabling no-longer-needed vcs-sync jobs.
Shipped Firefox 46.0RC1 and RC2, Fennec 46.0b12, Firefox and Fennec 46.0, ESR 45.1.0 and 38.8.0, Firefox and Fennec 47.0beta1, and Thunderbird 45.0b1. The week before, we shipped Firefox and Fennec 45.0.2 and 46.0b10, Firefox 45.0.2esr and Thunderbird 45.0.
For further details, check out the release notes here:
Last year, the Platform Operations organization was born and it brought together multiple teams across Mozilla which empower development with tools and processes.
This year, we've decided to create a logo that identifies us as an organization and builds our self-identity.
We've filed this issue for a logo design, and we would like to issue a call for community members to propose their designs. We would like to have all submissions in by May 13th. Soon after that, we will figure out a way to narrow it down to one logo (details to be determined).
We would also like to thank whoever creates the logo we pick in the end (details also to be determined).
Looking forward to collaborating with you and seeing what we create!
In my previous post I
introduced the new release process we have been adopting for the 46.0 release cycle.
Release build promotion has been in production since Firefox 46.0 Beta 1. We
have discovered some minor issues; some of them are already fixed, some still remain.
One of the visible bugs is
Bug 1260892. We
generate a big checksums file
which should contain all important checksums. With numerous changes to the
process, the file doesn't represent all required files anymore. Some files are
missing, some have different names.
We are working on fixing the bug, but in the meantime you can use the following workaround to
verify the files.
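Manually verifying a download against a checksums file of `<sha512>  <filename>` lines can be done along these lines (the file names and two-space separator are assumptions about the layout):

```python
import hashlib

def sha512_of(path):
    """Stream a file through SHA-512 and return the hex digest."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(checksums_path, target):
    """Check `target` against a `<hash>  <name>`-style checksums file."""
    with open(checksums_path) as f:
        for line in f:
            expected, _, name = line.strip().partition("  ")
            if name == target:
                return sha512_of(target) == expected
    raise KeyError(f"{target} not listed in {checksums_path}")
```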
This follows the idea that mconley started with The Joy of Coding and his live hacks. At the moment there are only "Unscripted" videos of me hacking away. I hope one day to do live hacks, but for now they're offline videos.
Mistakes I made, which any Platform Ops member wanting to contribute may want to avoid:
Lower the volume of the background music
Find a source of music without ads and with tracks that would not block certain countries from seeing the video (e.g. Germany)
Do not record in .flv format since most video editing software does not handle it
Add an intro screen so you don't see me hiding OBS
Have multiple bugs to work on in case you get stuck in the first one
This is release candidate week, traditionally one of the busiest times for releng. Your patience is appreciated.
Improve CI Pipeline:
With build promotion well underway for the upcoming Firefox 46 release, releng is switching gears and jumping into the TaskCluster migration with both feet. Kim and Mihai will be working full-time on migration efforts, and many others within releng have smaller roles. There is still a lot of work to do just to migrate all existing Linux workloads into TaskCluster, and that will be our focus for the next 3 months.
We started doing the uplifts for the Firefox 46 release cycle late last week. Release candidate builds should be starting soon. As mentioned above, this is the first non-beta release of Firefox to use the new build promotion process.
Last week, we shipped Firefox and Fennec 45.0.2 and 46.0b10, Firefox 45.0.2esr and Thunderbird 45.0. For further details, check out the release notes here:
In an attempt to attract candidates to GSoC, I wanted to make sure that the possible projects were achievable, rather than leading students down a path of pain and struggle. It also helps me picture the order in which it makes the most sense to accomplish the work.
It was also a good exercise for students: they had to read and ask questions about what was not clear, and it gave them lots to read about the project.
I want to share this and another project definition in case it is useful for others.
We want to rewrite SETA to be easy to deploy through Heroku and to support TaskCluster (our new continuous integration system).
Please read this document carefully before starting to ask questions. There is high interest in this project, and it is burdensome to have to re-explain it to every new prospective student.
Main mentor: armenzg (#ateam)
Co-mentor: jmaher (#ateam)
Please read jmaher’s blog post carefully before reading any further.
Now that you have read jmaher’s blog post, I will briefly go into some specifics.
SETA reduces the number of jobs that get scheduled on a developer’s push.
A job is every single letter you see on Treeherder. For every developer’s push there is a number of these jobs scheduled.
On every push, Buildbot decides what to schedule depending on the data that it fetched from SETA.
The purpose of this project is two-fold:
Write SETA as an independent project that is:
automatically deployed as a Heroku app
able to support TaskCluster, our new CI (continuous integration) system
NOTE: The current code of SETA lives within a repository called ouija.
Ouija does the following for SETA:
It has a cronjob which kicks in every 12 hours to scrape information about jobs from every push
It takes the information about jobs (which it grabs from Treeherder) into a database
SETA then queries the database to determine which jobs should be scheduled, choosing jobs that are good at reporting issues introduced by developers. SETA has its own set of tables and adds the data there for quick reference.
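Conceptually, the job-selection step can be thought of as a set-cover problem over historical failures: keep the jobs that, together, would still have caught every past regression. The greedy sketch below is my illustration of that idea only; the data and function name are made up, not SETA's actual code:

```python
# Hypothetical sketch of the idea behind SETA: given which jobs detected
# which past regressions, greedily pick a small set of jobs that would
# still have caught every regression. The data below is illustrative.
def pick_jobs(detections):
    """detections: {job_name: set of regression ids the job caught}."""
    uncaught = set().union(*detections.values())
    chosen = []
    while uncaught:
        # Pick the job that catches the most still-uncaught regressions.
        best = max(detections, key=lambda j: len(detections[j] & uncaught))
        if not detections[best] & uncaught:
            break
        chosen.append(best)
        uncaught -= detections[best]
    return chosen

detections = {
    "mochitest-1": {1, 2, 3},
    "reftest-2":   {3, 4},
    "xpcshell":    {4, 5},
}
print(pick_jobs(detections))  # ['mochitest-1', 'xpcshell']
```

Jobs not chosen can then be scheduled less often, since the chosen subset historically caught the same issues.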
Involved pieces for this project:
Get familiar with deploying apps and using databases in Heroku
The main focus of this post is what I've learned about writing Python tests, using mocks, and patching functions properly. This is not an exhaustive post.
What I'm writing now is something I should have learned many years ago as a Python developer. It can be embarrassing to admit, but I'm sharing this with you because I know it would have helped me earlier in my career, and I hope it might help you as well.
Somebody has probably written about this topic before; if you're aware of a good blog post covering it, please let me know. I would like to see what else I've missed.
The way Mozilla CI tools is designed begs for integration tests; however, I don't think it is worth going beyond unit testing plus mocking, since mozci might not stick around once we have fully migrated away from Buildbot, which was the hard part to solve.
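To give one concrete example of the kind of lesson I mean, here is a minimal, self-contained sketch of the classic patching pitfall: patch the name where it is looked up, not where it was defined. The module and function names here are hypothetical:

```python
# A minimal sketch of the pitfall: patch the name where it is *used*,
# not where it is defined. `mymodule` is a stand-in built inline.
from unittest import mock
import types

# Pretend this is `mymodule`, which imported a helper at import time:
#     from network import fetch_data
mymodule = types.SimpleNamespace()

def fetch_data():
    raise RuntimeError("would hit the real network")

mymodule.fetch_data = fetch_data  # mymodule's own reference to the helper
mymodule.summarize = lambda: len(mymodule.fetch_data())

# Patching the original definition would not help, because summarize()
# looks up mymodule.fetch_data. Patch the reference mymodule actually uses:
with mock.patch.object(mymodule, "fetch_data", return_value=[1, 2, 3]):
    print(mymodule.summarize())  # 3
```

Outside the `with` block, `summarize()` would hit the unpatched helper again and raise, which is exactly the behavior you want tests to isolate.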
I want to call out the recent work being done by the build team to modernize the build system. As David reports in his firefox-dev post, the team has recently managed to realize a drastic reduction in Windows PGO build times. This reduction brings the build time in line with those for Linux PGO builds. Since Windows PGO builds are currently a long pole in both the CI and release process, this allows us to provide more timely feedback about build quality to developers and sheriffs. Pretty graphs are available.
Last week we shipped Firefox 46.0b9, and there are several other releases still in flight. See the weekly post-mortem notes for further details.
Hello from Release Engineering! Once a month we highlight one of our projects to help the Mozilla community discover a useful tool or an interesting contribution opportunity. This month's project is Release Build Promotion.
What is Release Build Promotion?
Release build promotion (or "build promotion", or "release promotion" for short) is the latest release pipeline for Firefox being developed by Release Engineering at Mozilla.
Release build promotion starts with the builds produced and tested by CI (e.g. on mozilla-beta or other branches). We take these builds and use them as the basis to generate all the l10n repacks, partial updates, etc. that are required to release Firefox. We "promote" the CI builds to the release channel.
How is this different?
The previous release pipeline also started with builds produced and tested by CI. However, when it came time to do a release, we would create an entirely new set of builds with slightly different build configuration. These builds would not get the regular CI testing.
Release build promotion improves the process by removing the second set of builds. This drastically improves the total time to do a release, and also increases our confidence in our products, since we are now shipping exactly what's been tested. We also improve visibility of the release process; all the tasks that make up the release are now reported to Treeherder along with the corresponding CI builds.
Release build promotion is in use for Firefox desktop starting with the 46 beta cycle. ESR and release branches have not yet been switched over.
Firefox for Android is also not yet handled. We plan to have this ready for
One of the major motivations for this project was our release end-to-end times. I pulled some data to compare:
One of the Firefox 45 betas took almost 12 hours
One of the Firefox 46 betas took less than 3 hours
Support Firefox for Android
Support release and ESR branches
Extend this process back to the aurora and nightly channels
Can I contribute?
Yes! We still have a lot of things to do and welcome everyone to contribute.
Bug 1253369 - Notifications on release promotion events.
(No bug yet) Redesign and modernize Ship-it to reflect the new release workflow. This will include new UI, multiple sign-offs, new release-runner, etc.
We skipped a week of updates due to the Easter holiday. I’ve also moved the timing of these update emails/posts from Friday afternoon to Monday so that more people will see them. Look for your releng/relops highlights on Mondays going forward.
Callek got the mozharness tests (CI tests for mozharness) running again. They used to run on Travis-CI, and we lost that coverage when we moved mozharness in-tree; they now run from in-tree code via TaskCluster. These tests only run when someone touches mozharness code. (http://bugzil.la/1240184)
Callek closed out a bunch of old bugs related to foopies, pandas, and a few lingering ones about tegras, since we have retired that infrastructure in favor of Android emulators.
Kendall, Jake, and Mark worked to patch much of our infrastructure against the git 0-day vulnerability. They’re finishing up the tail end of those machines that have significantly less exposure/risk.
Rob landed patches to increase the stability of our Windows AWS AMI generation, making that process more robust. There’s some additional work to be done around verifying certificate downloads to fix the remaining issues we know about.
Nick landed some patches to improve our AWS recovery time (by about an hour) when we terminate many instances at once.
Amy has initiated a purchase for another 192 mac minis to expand our existing OS X 10.10 test pool in support of e10s and other load.
I'm a Mozilla Release Engineer, which means I am a strong advocate for remote teams. Our team is very distributed, and I believe we are successful at it too.
Funnily enough, I think a big part of the distributed model includes getting remote people physically together once in a while. When you do, you create a special energy. Unfortunately that energy can sometimes be wasted.
Have you ever had a work week that goes something like the following:
You stumble your way to the office in a strange environment on day one. You arrive, find some familiar faces, hug, and then pull out your laptop. You think, 'okay, first things first, I'll clear my email, bugmail, irc backscroll, and maybe even that task I left hanging from the week before.'
At some point, someone speaks up and suggests you come up with a schedule for the week. A google doc is opened and shared and after a few 'bikeshedding' moments, it's lunch time! A local to the area or advanced yelper in the group advertises a list of places to eat and after the highest rated food truck wins your stomach's approval, you come back to the office and ... quickly check your email.
The above scenario plays out in a similar fashion for the remainder of the week. Granted, I exaggerate and some genuine ideas get discussed. Maybe even a successful side sprint happens. But I am willing to bet that you, too, have been to a team meet up like this.
So can it be done better? Well I was at one recently in Vancouver and this post will describe what I think made the difference.
Prior to putting out burning trees at Mozilla, I put out burning trees as a Forest Firefighter in Canada. BC Forest Protection uses the Incident Command System (ICS). That framework enabled us to safely and effectively suppress wildfires. So how does it work and why am I bringing it up? Well, without this framework, firefighters would spend a lot of time on the fire line deciding where to camp, what to eat, what part of the fire to suppress first, and how to do it. But thanks to ICS, these decisions are already made and the firefighters can get back to doing work that feels productive!
You can imagine how team meet ups could benefit from such organization. With ICS, there are four high level branches: Logistics, Planning, Operations, and Finance & Administration. The last one doesn't really apply to our 'work week' scenario as we use Egencia prior to arriving and Expensify after leaving so it doesn't really affect productivity during the week. However, let's dive into the other three and discover how they correlate to team meet ups.
For each of these branches, someone should be nominated or assigned and complete the branch responsibilities.
Ideally the Logistics lead should be someone who is local to the area or has been there before. This person is required to create an etherpad/Google Doc that:
proposes a hotel near the office
describes the hotel
provides directions from the airport (map screenshots encouraged)
provides directions from the hotel to the office
proposes restaurants to eat at for each day of the week
polls for food restrictions within the team
reserves the restaurants in advance
works with the office Work Place Resource to:
book a room/space within the office
sign the team up for office lunches
get key cards/fobs assigned and ready to be handed out
sends out an email with a link to the doc that contains all this information
Now you might be saying, "wait a second, I can do all those things myself and don't need to be hand held." And while that is true, the benefit here is you reduce the head space required on each individual, the time spent debating, and you get everyone doing the same thing at the same time. This might not sound very flexible or distributed but remember, that's the point; you're centralized and together for the week! You might also be thinking "I really enjoy choosing a hotel and restaurant." That's fine too, but I propose you coordinate with the logistics assignee prior to the work week rather than spend quality work week time on these decisions.
Now that you have logistics sorted, it's time to do the action planning. Traditionally we've had work weeks where we pre-plan the high-level goals we want to accomplish, but we don't actually fill out the schedule until Monday as a group. The downside is that this can chew up a lot of time, and you can easily get sidetracked before completing the schedule. So, like Logistics, assign someone to Planning.
This person is required to create a [insert issue tracker of choice] list and determine the bugs/issues that should be tackled during the week. How this is done depends on the issue tracker, the style of the group, and the type of team meet up, but here is an example we used for finishing a deliverable-related goal.
write a list of issues for each of the following categories:
hard blockers
nice to haves
work in progress
done but needs to be verified
For the above, we used Trello, which is nice as it's really just a board of sticky notes. I could write a whole separate blog post on how to be effective with it by incorporating bugzilla links, assignees to cards, tags, sub-lists, and card voting, but for now, here is a visual example:
The beauty here is that all of the tasks (sticky notes) are done upfront and each team member simply plucks them off the 'hard blockers' and 'nice to have' lists one by one, assigns them to themselves, and moves them into the completed section.
No debating or brainstorming what to do, just sprint!
The Operations assignee here should:
be a proxy to the outside world
be a moderator internally
If you want to take advantage of a successful physical team meetup, forget about the communication tools that are designed for distributed environments.
During the work week I think it is best to ignore email, bug mail, and irc. Treat the week like you are on PTO: change your bugzilla name and create a vacation responder. Have the Operations assignee respond to urgent requests and act as a proxy to the outside world.
It is also nice to have the Operations assignee moderate internally by constantly iterating over the trello board state, grasping what needs to be done, where you are falling behind, and what new things have come up.
Vancouver by accident
This model wasn't planned or agreed upon prior to the Vancouver team meetup. It actually just happened by accident. I (jlund) took on the Logistics, rail handled Planning, and catlee acted as that sort of moderator/proxy role in Operations. Everyone at the meet up finished the week satisfied and I think hungry to try it again.
I'm looking forward to using a framework like this in the future. What are your thoughts?
After the huge push last week to realize our first beta release using build promotion, there’s not a whole lot of new work to report on this week. We continue to polish the build promotion process in the face of a busier-than-normal week in terms of operational work.
Firefox 46.0 beta 1 was finally released to the world last week, and the ninth rebuild was, in fact, the charm. As the first release attempted using the new build promotion process, this is a huge milestone for Mozilla releases.
As proof we’re getting better, Firefox 46.0 beta 2 was released this week using the same build promotion process and only required three build iterations. Progress!
This was also a week for dot releases, with security releases in the pipe for Firefox 45 and our extended support release (ESR) versions.
Kim just stepped into the releaseduty rotation for the Firefox 46 cycle. Kudos to Mihai for fixing up the releaseduty docs during his rotation so the process is easy to step into! We released Firefox 45.0esr, Firefox 46.0b1, Thunderbird 38.7.0 and Firefox 38.7.0esr with several other releases in the pipeline. See the notes for details:
There is new capacity in AWS for our Linux64 and Emulator 64 jobs thanks to Vlad and Kim’s work in bug 1252248.
Alin and Amy moved 10 WinXP machines to the Windows 8 pool to reduce pending counts on that platform. (bug 1255812)
Kim removed all the code used to run our panda infrastructure from our repos in bug 1186617. Amy is in the process of decommissioning the associated hardware in bug 1189476.
Speaking of Amy, she received a much-deserved promotion this week. To quote from Lawrence’s announcement email:
“I’m excited to promote Amy Rich into the newly created position of Head of Operations [for Platform Operations]. This new role, which reports directly to me, expands the purview of her existing systems ops team, and includes assisting me with more management leadership responsibility.
“Amy’s unique mix of skills make her a great fit for this role. She has a considerable systems engineering background, and she and her team have been responsible for greatly improving our release infrastructure over the past five years. As a people manager, her commitment to both individuals and the big picture engenders loyalty, respect, and admiration. She is inquisitive and reflective, bringing strategic perspective to decision-making processes such as setting the relative priority between TaskCluster migration and Windows 7 in AWS. As a leader, she has recently stepped up to shepherd projects aimed at creating a more cohesive Platform Operations team, and she is also assisting with Mozilla’s company-wide Diversity & Inclusion strategy.
“Amy’s team will focus on systems automation, lifecycle, and operations support. This involves taking on systems ops responsibilities for Engineering Productivity (VCS, MozReview, MXR/DXR, Bugzilla, and Treeherder) in addition to those of Release Engineering. The long-term vision for this team is that they will support the systems ops needs for all of Platform Operations.
Please join me in congratulating Amy on her new role!”
With lots of hard work from numerous people, we have expanded the scope of TaskCluster linux builds to include all twig branches as well as Aurora, and are on-track to make these builds Tier-1, and move Buildbot builds to Tier 2, in the next week or two.
Rok is extending the clobberer tool to be able to purge the cache for taskcluster workers (https://bugzil.la/1174263). This should be landing soon.
Aki added buildtime-generated code to the python taskcluster client, for easier code inspection and better stack traces. The code is still pending merge.
Improve Release Pipeline:
Firefox 46.0b1 is the first release we’ve attempted using build promotion, a new release process that multiple team members have been working on since last year. As is typical with new systems, we encountered some issues on this first attempt, and have so far iterated 9 times trying to get it right as we continue to fix bugs. Among the issues we found this week was a discrepancy between how our manual update checks were attempting to invoke Ba'al, the Soul-eater, when compared with our automated tests. This is how the sausage is made, people.
We released Firefox and Fennec 45.0, as well as Fennec 46.0b1. As mentioned, Firefox 46.0b1 was still in progress as we went to press.
Kim disabled the last of the Android 4.0 jobs running on pandas (rack-mounted Android reference cards). We are in the process of cleaning up the code that was associated with them, as well as decommissioning the remaining pandas and associated hardware. Thank you, pandas, for your service; enjoy your well-deserved retirement! Android performance tests will now run via autophone, with results displayed via perfherder, thanks to the hard work of many people on the developer productivity team.
The “TaskCluster login v3” effort is drawing to a close, and everyone can now log in and create their own TaskCluster clients for whatever mad-science automation they want to do. This change makes the TaskCluster authentication system more maintainable and scalable, and will help us encourage other services such as RelengAPI, treeherder, and ship-it to use TaskCluster authentication. Dustin is in touch with the owners of the old “permacreds” issued to interested people over the last few years to help them switch over to the new system.
Two platform support discussions that I want to highlight this week:
It was a busy week with many releases in flight, as well as preparation for running beta 1 with release promotion next week. We also are in the process of adding more capacity to certain test platform pools to lower wait times given all the new e10s tests that have been enabled.
Improve Release Pipeline:
Nick ran a staging release for 46.0b1 to check for issues before the merge, preventing some bustage for Fennec and ensuring we can fall back to the old system if any unexpected issues show up with release promotion.
Dustin deployed a new version of the TaskCluster tools/login system with much improved UI for handling signing in and out and editing clients and roles. He also simplified the existing roles, with the result that the set of roles now fits on one screen, and is entirely composed of human-readable names. All of this works toward two important goals: building a sign-in system that is useful and usable by all mozillians; and configuring the access-control system to give everyone their appropriate permissions and no more.
The releases calendar is getting busier as we get closer to the end of the cycle. Many releases were shipped or are still in-flight:
Fennec 45.0 (in-progress)
Firefox 45.0 (in-progress) - we shipped the RC to the beta channel
It was a busy week for release engineering as several team members travelled to the Vancouver office to sprint on the release promotion project. The goal of the release promotion project is to promote continuous integration builds to release channels, allowing us to ship releases much more quickly.
Improve Release Pipeline:
Chris, Jordan, Callek (remotely), Kim, Mihai and Rail had a sprint on Release Promotion. We made so much progress on this project that we decided to use the new process for Firefox 46.0b1. https://bugzil.la/1118794 So many green jobs!
A quieter week than last in terms of meetings and releases, but a glibc stack overflow exploit made things “fun” from an operational standpoint this week.
Improve CI Pipeline:
Dustin deployed a change to the TaskCluster login service to allow logins by people who have LDAP accounts (e.g,. for try pushes) but do not have access to the company’s single-sign-on provider. This closes a gap that excluded some of our most productive contributors from access to TaskCluster. With this change, anyone who has a Mozillians account or an LDAP account can connect to TaskCluster and have appropriate access based on group membership.
Ben wrote a blog post about using the Balrog agent to streamline the throttled rollout of Firefox releases. This is one of the few remaining interactive activities in the Firefox release process. Being able to automate it will eliminate some email hand-offs, leading to faster turnaround.
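For flavor, here is a toy sketch of what throttled-rollout logic looks like; the function and its parameters are illustrative only, not Balrog's actual implementation:

```python
# Toy sketch of throttled rollout: only a fraction of update requests
# receive the new release. Real Balrog rules are richer than this.
import random

def serve_update(throttle_percent, rng=random.random):
    """Return True if this request should be served the update."""
    return rng() * 100 < throttle_percent

# Deterministic demos with a pinned "random" value:
print(serve_update(25, rng=lambda: 0.10))  # True: 10 < 25
print(serve_update(25, rng=lambda: 0.90))  # False: 90 >= 25
```

Automating the ramp-up then just means raising `throttle_percent` on a schedule instead of by hand, which is the interactive step the agent removes.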
As opposed to last week’s congestion, this week had a rather normal pace. Various releases have been shipped or are still in-flight:
Firefox 45.0b7 (in progress)
Thunderbird 45.0b2 (in progress) picking up the swap back to GTK2 for Linux users
Next week a handful of the people working on “Release Promotion” will be in Vancouver to try and sprint our way to the finish line. Among them are Jlund, Rail, Kmoir, and Mtabara. Callek won’t be able to make it in person, but will be joining them remotely.
Over the course of the week, Jake, Hal, and Amy have worked to patch and reboot our infrastructure to make it safe against the glibc gethostinfo exploit.
Many people from various different teams pitched in to diagnose a bug that was causing our Windows 7 test pool to shut down. Special thanks to philor who finally tracked it down to a Firefox graphics problem. The patch was backed out, and operations are back to normal. (https://bugzil.la/1248347)
Aki wrote a blog post this week about how releng should get better about providing generically packaged tools. Not only would this make our own internal testing story better, but would make easier for contributors outside of releng to hack and help.
This past week, the release engineering managers – Amy, catlee, and coop (hey, that’s me!) – were in Washington, D.C. meeting up with the other managers who make up the platform operations management team. Our goal was to try to improve the ways we plan, prioritize, and cooperate. I thought it was quite productive. I’ll have more to say about next week once I gather my thoughts a little more.
Everyone else was *very* busy while we were away. Details are below.
Dustin deployed a change to the TaskCluster authorization service to support “named temporary credentials”. With this change, credentials can come with a unique name, allowing better identification, logging, and auditing. This is a part of Dustin’s work to implement “TaskCluster Login v3” which should provide a smoother and more flexible way to connect to TaskCluster and create credentials for all of the other tasks you need to perform.
Windows 10 in the cloud is being tested. All the ground work is done to make golden AMIs, mirroring the first stages of work done for Windows 7 in the cloud. Being able to perform some subset of Windows 10 testing in the cloud should allow us to purchase less hardware than we had originally anticipated for this quarter.
Improve CI pipeline:
One of the subjects discussed at Mozlando was improving the overall integration of localization (l10n) builds with our continuous integration (CI) system. Mike fixed an l10n packaging bug this week that I first remember looking at over 4 years ago. This fix allows us to properly test l10n packaging of Mac builds in a release configuration on check-in, thereby avoiding possible headaches later in the release cycle. (https://bugzil.la/700997)
Armen, Joel, Dustin, and Greg worked together to green up even more Linux test jobs in TaskCluster. Among other things, this involved upgrading to the latest Docker (1.10.0) and diagnosing some test runner scripts which use 1.3GB of RAM – not counting the Firefox binaries they run! This project has already been a long slog, but we are constantly making progress and will soon have all jobs in-tree at Tier 2.
Kim landed a patch to enable Mac OS X 10.10.5 testing on try by default and disable 10.6 testing. This allowed us to disable some old r5 machines and install around 30 new 10.10.5 machines and enable them in production. Hooray for increased capacity! (https://bugzil.la/1239731)
This week, Jake and Mark added check_ami.py support to runner for our Windows 2008 instances running in Amazon. This is an important step towards parity with our Linux instances in that it allows our Windows instances to check when a newer AMI is available and terminate themselves to be re-created with the new image. Until now, we’ve needed to manually refresh the whole pool to pick up changes, so this is a great step forward.
Also on the Windows virtualization front, Rob and Mark turned on puppetization of Windows 2008 golden AMIs this week. This particular change has taken a long time to make it to production, but it’s hard to overstate the importance of this development. Windows is definitely *not* designed to manage its configuration via puppet, but being able to use that same configuration system across both our POSIX and Windows systems will hopefully decrease the time required to update our reference platforms by substantially reducing the cognitive overhead required for configuration changes. Anyone who remembers our days using OPSI will hopefully agree.
Improve CI pipeline:
Ben landed a Balrog patch that implements JSONSchemas for Balrog Release objects. This will help ensure that data entering the system is more consistent and accurate, and allows humans and other systems that talk to Balrog to be more confident about the data they’ve constructed before they submit it.
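To illustrate what schema validation buys here, a hand-rolled, simplified sketch follows; the field names and the checker itself are stand-ins, not Balrog's real JSONSchemas:

```python
# Hand-rolled sketch of the idea: validate a release object against a
# tiny schema before accepting it. The fields below are illustrative.
def validate(release, schema):
    """Return a list of problems; an empty list means the data is valid."""
    errors = []
    for field, expected_type in schema.items():
        if field not in release:
            errors.append(f"missing required field: {field}")
        elif not isinstance(release[field], expected_type):
            errors.append(f"{field} has wrong type")
    return errors

schema = {"name": str, "schema_version": int, "platforms": dict}

good = {"name": "Firefox-46.0b1-build1", "schema_version": 4, "platforms": {}}
bad = {"name": "Firefox-46.0b1-build1", "schema_version": "4"}

print(validate(good, schema))  # []
print(validate(bad, schema))   # reports a wrong type and a missing field
```

The payoff is exactly what the paragraph describes: malformed data is rejected at the door, so both humans and submitting systems get immediate feedback instead of corrupting state downstream.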
We’re currently on beta 3 for Firefox 45. After all the earlier work to unhork gtk3 (see last week’s update), it’s good to see the process humming along.
A small number of stability issues have precipitated a dot release for Firefox 44. A Firefox 44.0.1 release is currently in progress.
Kim implemented changes to consume SETA information for Android API 15+ jobs, using data from API 11+ until we have sufficient data for the API 15+ test jobs. This reduced the high pending counts for the AWS instance types used by Android. (https://bugzil.la/1243877)
Coop (hey, that’s me!) did a long-overdue pass of platform support triage. Lots of bugs got closed out (30+), a handful actually got fixed, and a collection of Windows test failures got linked together under a root cause (thanks, philor!). Now all we need to do is find time to tackle the root cause!
In addition to Rok, who also joined us this week, I’m ecstatic to welcome back Aki Sasaki to Mozilla release engineering.
If you’ve been a Mozillian for a while, Aki’s name should be familiar. In his former tenure in releng, he helped bootstrap the build & release process for both Fennec *and* FirefoxOS, and was also the creator of mozharness, the python-based script harness that has allowed us to push so much of our configuration back into the development tree. Essentially he was devops before it was cool.
Aki’s first task in this return engagement will be to figure out a generic way to interact with Balrog, the Mozilla update server, from TaskCluster. You can follow along in bug 1244181.
I’m happy to announce a new addition to Mozilla release engineering. This week, we are lucky to welcome Rok Garbas to the team.
Rok is a huge proponent of Nix and NixOS. Whether we end up using those particular tools or not, we plan to leverage his experience with reproducible development/production environments to improve our service deployment story in releng. To that end, he’s already working with Dustin who has also been thinking about this for a while.
Rok’s first task is to figure out how the buildbot-era version of clobberer, a tool for clearing and resetting caches on build workers, can be rearchitected to work with TaskCluster. You can follow along in bug 1174263 if you’re interested.
Coop (hey, that’s me!) re-enabled partner repacks as part of release automation this week, and was happy to see the partner repacks for the Firefox 44 release get generated and published without any manual intervention. Back in August, we moved the partner repack process and configuration into github from mercurial. This made it trivially easy for Mozilla partners to issue a pull request (PR) when a configuration change was needed. This did require some re-tooling on the automation side, and we took the opportunity to fix and update a lot of partner-related cruft, including moving the repack hosting to S3. I should note that the EME-free repacks are also generated automatically now as part of this process, so those of you who prefer less DRM with your Firefox can now also get your builds on a regular basis.
One of the main reasons why build promotion is so important for releng and Mozilla is that it removes the current disconnect between the nightly/aurora and beta/release build processes, the builds for which are created in different ways. This is one of the reasons why uplift cycles are so frequently “interesting” - build process changes on nightly and aurora don’t often have an existing analog in beta/release. And so it was this past Tuesday when releng started the beta1 process for Firefox 45. We quickly hit a blocker issue related to gtk3 support that prevented us from even running the initial source builder, a prerequisite for the rest of the release process. Nick, Rail, Callek, and Jordan put their heads together and quickly came up with an elegant solution that unblocked progress on all the affected branches, including ESR. In the end, the solution involved running tooltool from within a mock environment, rather than running it outside the mock environment and trying to copy relevant pieces in. Thanks for the quick thinking and extra effort to get this unblocked. Maybe the next beta1 cycle won’t suck quite as much!
The patch that Nick prepared (https://bugzil.la/886543) is now in production and being used to notify users on unsupported versions of GTK why they can’t update. In the past, they would’ve simply received no update with no information as to why.
Dustin made security improvements to TaskCluster, ensuring that expired credentials are not honored.
We had a brief Balrog outage this morning [Fri Jan 29]. Balrog is the server side component of the update system used by Firefox and other Mozilla products. Ben quickly tracked the problem down to a change in the caching code. Big thanks to mlankford, Usul, and w0ts0n from the MOC for their quick communication and help in getting things back to a good state quickly.
On Wednesday, Dustin spoke at Siena College, holding an information session on Google Summer of Code and speaking to a Software Engineering class about Mozilla, open source, and software engineering in the real world.
It’s encouraging to see more progress this week on both the build/release promotion and TaskCluster migration fronts, our two major efforts for this quarter.
In a continuing effort to enable faster, more reliable, and more easily-run tests for TaskCluster components, Dustin landed support for an in-memory, credential-free mock of Azure Table Storage in the azure-entities package. Together with the fake mock support he added to taskcluster-lib-testing, this allows tests for components like taskcluster-hooks to run without network access and without the need for any credentials, substantially decreasing the barrier to external contributions.
All release promotion tasks are now signed by default. Thanks to Rail for his work here to help improve verifiability and chain-of-custody in our upcoming release process. (https://bugzil.la/1239682)
Beetmover has been spotted in the wild! Jordan has been working on this new tool as part of our release promotion project. Beetmover helps move build artifacts from one place to another (generally between S3 buckets these days), but can also be extended to perform validation actions inline, e.g. checksums and anti-virus. (https://bugzil.la/1225899)
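The checksum step is easy to picture. Here's a minimal sketch (function names are hypothetical, not Beetmover's actual API) of verifying an artifact's SHA-512 digest before moving it anywhere:

```python
import hashlib

def sha512_of(path):
    """Compute the SHA-512 digest of a file, reading in 1 MB chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_artifact(path, expected_sha512):
    """Refuse to move an artifact whose checksum doesn't match."""
    if sha512_of(path) != expected_sha512:
        raise ValueError(f"checksum mismatch for {path}")
    return True
```

Anti-virus scanning would slot into the same spot: validate first, move only on success.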
Dustin configured the “desktop-test” and “desktop-build” docker images to build automatically on push. That means that you can modify the Dockerfile under `testing/docker`, push to try, and have the try job run in the resulting image, all without pushing any images. This should enable much quicker iteration on tweaks to the docker images. Note, however, that updates to the base OS images (ubuntu1204-build and centos6-build) still require manual pushes.
Mark landed Puppet code for base windows 10 support including secrets and ssh keys management.
Improve CI pipeline:
Vlad and Amy repurposed 10 Windows XP machines as Windows 7 to improve the wait times in that test pool (https://bugzil.la/1239785)
Armen and Joel have been working on porting the Gecko tests to run under TaskCluster, and have narrowed the failures down to the single digits. This puts us on-track to enable Linux debug builds and tests in TaskCluster as the canonical build/test process.
Ben finished up work on enhanced Release Blob validation in Balrog (https://bugzil.la/703040), which makes it much more difficult to enter bad data into our update server.
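The idea behind blob validation is simply to check structure before anything is written to the update server. A toy sketch of the concept (field names are illustrative; Balrog's real schemas are much richer):

```python
# Illustrative top-level schema; Balrog's real release blobs are richer.
REQUIRED_FIELDS = {
    "name": str,
    "schema_version": int,
    "hashFunction": str,
}

def validate_blob(blob):
    """Return a list of problems with a release blob; empty means valid."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in blob:
            errors.append(f"missing field: {field}")
        elif not isinstance(blob[field], expected_type):
            errors.append(f"wrong type for {field}")
    return errors
```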
You may recall Mihai, our former intern who we just hired back in November. Shortly after joining the team, he jumped into the releaseduty rotation to provide much-needed extra bandwidth. The learning curve here is steep, but over the course of the Firefox 44 release cycle, he’s taken on more and more responsibility. He’s even volunteered to do releaseduty for the Firefox 45 release cycle as well. Perhaps the most impressive thing is that he’s also taken the time to update (or write) the releaseduty docs so that the next person who joins the rotation will be that much further ahead of the game. Thanks for your hard work here, Mihai!
Hal did some cleanup work to remove unused mozharness configs and directories from the build mercurial repos. These resources have long-since moved into the main mozilla-central tree. Hopefully this will make it easier for contributors to find the canonical copy! (https://bugzil.la/1239003)
Every new year gives you an opportunity to sit back, relax, have some scotch, and re-think the past year. The holidays give you enough free time; even if you decide not to take a vacation around the holidays, it's usually calm and peaceful.
This time, I found myself thinking mostly about productivity, being
effective, feeling busy, overwhelmed with work and other related topics.
When I started at Mozilla (almost 6 years ago!), I tried to apply all my GTD and time-management knowledge and techniques. Working remotely and in a different time zone was an advantage: I had close to zero interruptions. It worked perfectly.
Last year I realized that my productivity skills had somehow faded away. 40h+ workweeks, working on weekends, and delivering goals in the last week of the quarter don't sound like good signs. Instead of being productive, I just felt busy. "Every crisis is an opportunity". Time to take a step back and reboot myself. Burning out at work is not a good idea. :)
Here are some ideas/tips that I wrote down for myself; you may find them useful:
Morning exercises. A 20-minute walk will wake your brain up and
generate enough endorphins for the first half of the day.
Meditation. 2x20min a day is ideal; 2x10min would work too. Something like calm.com makes this a piece of cake.
Task #1: make a daily plan. No plan - no work.
Don't start your day by reading emails. Get one (little) thing done
first - THEN check your email.
Try to define outcomes, not tasks. "Ship XYZ" instead of "Work on XYZ".
Meetings are time consuming, so "Set a goal for each meeting".
Consider skipping a meeting if you don't have any goal set, unless it's a
beer-and-tell meeting! :)
Constantly ask yourself if what you're working on is important.
3-4 times a day ask yourself whether you are doing something towards
your goal or just finding something else to keep you busy. If you want
to look busy, take your phone and walk around the office with some
papers in your hand. Everybody will think that you are a busy person!
This way you can take a break and look busy at the same time!
Take breaks! The Pomodoro technique has breaks built in. Taking breaks helps not only to avoid RSI, but also keeps your brain sane and gives you time to ask yourself the questions mentioned above. I use Workrave on my laptop, but you can use a real kitchen timer instead.
Wear headphones, especially at the office. Noise-cancelling ones are even better. White noise, nature sounds, or instrumental music are your friends.
Make sure you enjoy your work environment. Why on earth would you spend your valuable time working without joy?!
De-clutter and organize your desk. Fewer things around means fewer distractions.
Desk, chair, monitor, keyboard, mouse, etc. - don't cheap out on them. Your health is more important and more expensive. Thanks to mhoye for this advice!
Don't check email every 30 seconds. If there is an emergency, they
will call you! :)
Reward yourself at a certain time. "I'm going to have a chocolate at
11am", or "MFBT at 4pm sharp!" are good examples. Don't forget, you
are Pavlov's dog too!
Don't try to read everything NOW. Save it for later and read it in batches.
Capture all creative ideas. You can delete them later. ;)
Prepare for next task before break. Make sure you know what's next, so
you can think about it during the break.
This is the list of things that I try to use every day, and I'm looking forward to refining it. I would appreciate your thoughts on this topic. Feel free to comment or send a private email.
One of releng’s big goals for Q1 is to deliver a beta via build promotion. It was great to have some tangible progress there this week with bouncer submission.
Lots of other stuff in-flight, more details below!
Dustin worked with Armen and Joel Maher to run Firefox tests in TaskCluster on an older EC2 instance type where the tests seem to fail less often, perhaps because they are single-CPU or slower.
Improve CI pipeline:
We turned off automation for b2g 2.2 builds this week, which allowed us to remove some code, reduce some complexity, and regain some small amount of capacity. Thanks to Vlad and Alin on buildduty for helping to land those patches. (https://bugzil.la/1236835 and https://bugzil.la/1237985)
In a similar vein, Callek landed code to disable all b2g desktop builds and tests on all trees. Another win for increased capacity and reduced complexity! (https://bugzil.la/1236835)
Kim finished integrating bouncer submission with our release promotion project. That’s one more blocker out of the way! (https://bugzil.la/1215204)
Happy new year from all of us in releng! Here’s a quick rundown of what’s happened over the holidays.
We are now running 100% of our Windows builds (including try) in AWS. This greatly improves the scalability of our Windows build infrastructure. It turns out these AWS instances are also much faster than the in-house hardware we were using previously. On AWS, we save over 45 minutes per try run on Windows, with more modest improvements on the integration branches. Thanks to Rob, Mark, and Q for making this happen!
Dustin added a UI frontend to the TaskCluster “secrets” service and landed numerous fixes to it and to the hooks service.
Rob implemented some adjustments to 2008 userdata in cloud-tools that allow us to re-enable puppetisation of 2008 golden AMIs.
Callek added buildbot master configuration that enables parallel t-w732 testing prototype instances in EC2. This is an important step as we try to virtualize more of our testing infrastructure to reduce our maintenance burden and improve burst capacity.
Q implemented a working mechanism for building Windows 7/10 cloud instance AMIs that behave as other EC2 Windows instances (EC2Config, Sysprep, etc) and can be configured for test duty.
Mark landed Puppet code for base Windows 7 support including secrets and ssh keys management.
Improve CI pipeline:
Dustin completed modifications to the docker worker to support “coalescing”, the successor to what is now known as queue collapsing or BuildRequest merging.
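The core of coalescing is easy to sketch: when several requests for the same queue are pending, only the newest needs to run. A minimal illustration (field names are hypothetical, not the docker worker's actual data model):

```python
def coalesce(pending):
    """Collapse pending build requests so that only the newest request
    per (branch, platform) queue actually runs.

    `pending` is a list of dicts with 'branch', 'platform', and a
    monotonically increasing 'request_id' (hypothetical field names).
    """
    newest = {}
    for req in pending:
        key = (req["branch"], req["platform"])
        cur = newest.get(key)
        if cur is None or req["request_id"] > cur["request_id"]:
            newest[key] = req
    # Keep arrival order; drop everything superseded by a newer request.
    return [r for r in pending
            if newest[(r["branch"], r["platform"])] is r]
```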
Ben modernized Balrog’s toolchain, including switching it from Vagrant to Docker, enabling us to start looking at a more modern deployment strategy.
Hal introduced the team to Lean Coffee at Mozlando. The team has adopted it wholeheartedly, and is using it with success for both project and team meetings. We’re using Trello for virtual post-it notes in Vidyo meetings.
Rob fixed a problem where our AWS instances in us-west-2 were pulling mercurial bundles from us-east-1. This saves us a little bit of money every month in transfer costs between AWS regions. (bug 1232501)
I'm not a manager (but I interview Mozilla releng candidates)
I'm not looking for a new job.
These are just my observations after working in the tech industry for a long time.
I'm kind of a resume and interview nerd. I like helping friends fix their resumes and write amazing cover letters. In the past year I've helped a few (non-Mozilla) friends fix up their resumes, write cover letters, and prepare for interviews as they searched for new jobs. This post will discuss some things I've found to be helpful in this process.
Preparation
Everyone tends to jump into looking at job descriptions and making their resume look pretty. Or people suddenly realize that they need to get out of their current position and find a new job NOW, and frantically start applying for anything that matches their qualifications. Before you do either, take a step back and make a list of things that are important to you. For example, when I applied at Mozilla, my list was something like this:
learn release engineering at scale + associated tools/languages
work on a team of release engineers (not be the only one)
good team dynamics - people happy to share knowledge and like to ship
work in an organization where release engineering is valued for increasing the productivity of the organization as a whole and is funded (hardware/software/services/training) accordingly
support to attend and present at conferences
People spend a lot of time at work. Life is too short to be unhappy every day. Writing down what is important to you serves as a checklist when you are looking at job descriptions, letting you immediately weed out the ones that don't match your list.
People tend to focus a lot on the technical skills they want to use or the new ones they want to learn. You should also think about the kind of culture you want to work in. Do the goals and ethics of the organization align with your own? Who will you be working with? Will you enjoy working with this team? Are you interested in remote work, or do you want to work in an office? How will a long commute or a relocation impact your quality of life? What is the typical career progression of someone in this role? Are there both management and technical tracks for advancement?
To summarize: itemize the skills you'd like to use or learn, the culture of the company and the team, and why you want to work there.
Your cover letter should succinctly map your existing skills to the role you are applying for and convey enthusiasm and interest. You don't need a long story about how you worked on a project at your current job that has no relevance to your potential new employer. Teams that are looking to hire have problems to solve. Your cover letter needs to paint a picture that you have the skills to solve them.
Picture by Jim Bauer - Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0) https://www.flickr.com/photos/lens-cap/10320891856/sizes/l
Refactoring your resume
Developers have a lot of opportunities these days, but if you intend to move into a tech company from another industry, it can be trickier. The important thing is to convey the skills you have in a way that shows how they can be applied to the problems your prospective employer wants to hire you to fix.
Many people describe their skills and accomplishments in a way that is too company-specific. They may have a list of acronyms and product names on their resume that are unlikely to be known by people outside the company. When describing the work you did in a particular role, describe it in a measurable way that highlights the skills you have. An excellent example of a resume that describes skills without going into company-specific detail is here. (Julie Pagano also has a terrific post about how she approached her new job search.)
Another tip is to leave out general skills that are very common. For instance, if you are a technical writer, omit the fact that you know how to use Windows and Word and focus on highlighting your skills and accomplishments.
Non-technical interview preparation
Every job has different technical requirements and there are many books and blog posts on how to prepare for this aspect of the interview process. So I'm going to just cover the non-technical aspects.
When I interview someone, I like to hear lots of questions: questions about the work we do and about upcoming projects. This indicates that they have taken the time to research the team, the company, and the work that we do. It also shows enthusiasm and interest.
Here is a list of suggestions to prepare for interviews:
1. Research the company and make a list of relevant questions. Not every company is open about the work that they do, but most will have some public information that you can use to formulate questions during the interviews. Do you know anyone who works for the company that you can have coffee or a Skype call with to get some insight? What products/services does the company produce? Is the product nearing end of life? If so, what will it be replaced by? What is the company's market share: is it declining, stable, or growing? Who are their main competitors? What are some of the challenges they face going forward? How will this team help address these challenges?
2. Prepare a list of questions for every person that interviews you ahead of time. Many companies will give you the list of names of people who will interview you. Have they recently given talks? Watch the videos online or read the slides. Does the team have GitHub or other open repositories? What recent projects are they working on? Do they have a blog, or are they active on Twitter? If so, read it and formulate some questions to bring to the interview. Do they use open bug-tracking tools? If so, look at the bugs that have recent activity and add them to the list of questions for your interview. A friend of mine read a book that one of his interviewers had written and asked questions about the book in the interview. That's serious interview preparation!
Photo by https://www.flickr.com/photos/wocintechchat/ https://www.flickr.com/photos/wocintechchat/22506109386/sizes/l
3. Team dynamics and tools. Is the team growing, or are they hiring to replace somebody who left? What's the onboarding process like? Will you have a mentor? How is this group viewed by the rest of the company? You want to be in a role where you can make a valuable contribution. Joining a team whose role is not valued by the company or not funded adequately is a recipe for disappointment. What does a typical day look like? What hours do people usually work? What tools do people use? Are there prescribed tools, or are you free to use what you'd like?
4. Diversity and inclusion. If you're a member of a group underrepresented in tech, the numbers in this industry are lousy, with some notable exceptions. And I say that while recognizing that I'm personally in the group that is the lowest common denominator for diversity in tech.
I don't really have good advice for this area other than do your research to ensure you're not entering a toxic environment. If you look around the office where you're being interviewed and nobody looks like you, it's time for further investigation. Look at the company's website - is the management team page white guys all the way down? Does the company support diverse conferences, scholarships or internships? Ask on a mailing list like devchix if others have experience working at this company and what it's like for underrepresented groups. If you ask in the interview why there aren't more diverse people in the office and they say something like "well, we only hire on merit" this is a giant red flag. If the answer is along the lines of "yes, we realize this and these are the steps we are taking to rectify this situation", this is a more encouraging response.
A final piece of advice: ensure that you meet with the manager you're going to report to as part of your hiring process. You want to ensure that you have a rapport with them and can envision a productive working relationship.
What advice do you have for people preparing to find a new job?
All of Mozilla gathered in Orlando, Florida last week for one of our twice-yearly all-hands meetings. Affectionately called “Mozlando”, it was a chance for Mozilla contributors (paid and not) to come together to celebrate successes and plan for the future…in between riding roller coasters and drinking beer.
Even though I’ve been involved in the day-to-day process, have managed a bunch of people working on the relevant projects, and indeed have been writing these quasi-weekly updates, it wasn’t until Chris AtLee put together a retrospective slide deck of what we had accomplished in releng over the last 6 months that it really sunk in (hint: it’s a lot):
But enough about the “ancient” past, here’s what has been happening since Mozlando:
Modernize infrastructure: There was a succession of meetings at Mozlando to help bootstrap people on using TaskCluster (TC). These were well-attended, at least by people in my org, many of whom have a vested interest in using TC in the ongoing release promotion work.
Speaking of release promotion, the involved parties met in Orlando to map out the remaining work that stands between us and releasing a promoted CI build as a beta, even if just in parallel to an actual release. We hope to have all the build artifacts generated via release promotion by the end of 2015 (l10n repacks are the long pole here), with the remaining accessory service tasks like signing and updates coming online early in 2016.
Improve CI pipeline: Mozilla announced a change of strategy in Orlando with regards to FirefoxOS.
FirefoxOS is alive and strong, but the push through carriers is over. We pivot to IoT and user experience. #mozlando
In theory, the switch from phones to connected devices should improve our CI throughput in the near-term, provided we are actually able to turn off any of the existing b2g build variants or related testing. This will depend on commitments we’ve already made to carriers, and what baseline b2g coverage Mozilla deems important.
During Mozlando, releng held a sprint to remove the jacuzzi code from our various repos. Jacuzzis were once an important way to prevent “bursty,” prolific jobs (namely l10n) from claiming all available capacity in our machine pools. With the recent move to AWS for Windows builds, this is really only an issue for our Mac build platform now, and even that *should* be fixed sometime soon if we’re able to repack Mac builds on Linux. In the interim, the added complexity of the jacuzzi code wasn’t deemed worth the extra maintenance hassle, so we ripped it out. You served your purpose, but good riddance.
Release: Sadly, we are never quite insulated from the ongoing needs of the release process during these all-hands events. Mozlando was no different. In fact, it’s become such a recurrent issue for release engineering, release management, and release QA that we’ve started discussing ways to be able to timeshift the release schedule either forward or backward in time. This would also help us deal with holidays like Thanksgiving and Christmas when many key players in the release process (and devs too) might normally be on vacation. No changes to announce yet, but stay tuned.
With the upcoming deprecation of SHA-1 support by Microsoft in 2016, we’ve been scrambling to make sure we have a support plan for Firefox users on older versions of Windows. We determined that we would need to offer multiple dot releases to our users: a first one to update the updater itself and the related maintenance service to recognize SHA-2, and then a second update where we begin signing Firefox itself with SHA-2. (https://bugzil.la/1079858)
Jordan was on the hook for the Firefox 43.0 final release that went out the door on Tuesday, December 15.
As with any final release, there is a related uplift cycle. These uplift cycles are also problematic, especially between the aurora and beta branches, where there continue to be discrepancies between the nightly- and release-model build processes. The initial beta build (b1) for Firefox 44 was delayed for several days while we resolved a suite of issues around GTK3, general crashes, and FHR submissions on mobile. Much of this work also happened at Mozlando.
Operational: We continue the dance of disabling older r5 Mac minis running 10.10.2 to replace them with new, shiny r7 Mac minis running 10.10.5. As our r7 mini capacity increases, we are also able (and required) to retire some of the *really* old r4 Mac minis running OS X 10.6, mostly because we need the room in the datacenter. The gating factor here has been making sure that tests still work on the various release branches on the new r7 minis. Joel has been tackling this work, and this week he was able to verify the tests on the mozilla-release branch. Only the esr38 branch is still running on the r5 minis. Big thanks to Kim and our stalwart buildduty contractors, Alin and Vlad, for slogging through the buildbot-configs with patches for this.
Speaking of our buildduty contractors, Alin and Vlad both received commit level 2 access to the Mozilla repos in Mozlando. This makes them much more autonomous, and is a result of many months of steady effort with patches and submissions. Good work, guys!
The Mozilla VR Team may soon want a Gecko branch for generating Windows builds with a dedicated update channel. The VR space at Mozilla is getting very exciting!
I can’t promise much content during the Christmas lull, but look for more releng updates in the new year.
You may have noticed that Windows has had no updates for Nightly for the last week or so. We’ve had a few issues with signing the binaries as part of moving from a SHA-1 certificate to SHA-2. This needs to be done because Windows won’t accept SHA-1 signed binaries from January 1 2016 (this is tracked in bug 1079858).
Updates are now re-enabled, and the update path looks like this:
older builds → 20151209095500 → latest Nightly
Some people may have been seeing UAC prompts to run the updater, and there could be one more of those when updating to the 20151209095500 build (which is also the last SHA-1 signed build). Updates from that build should not cause any UAC prompts.
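This two-hop path is a classic "watershed" update: old builds are first routed to a designated intermediate build, and only from there to the latest Nightly. A minimal sketch of the rule (buildids are timestamp strings, so string comparison orders them; the logic is illustrative, not Balrog's actual implementation):

```python
# The last SHA-1 signed Windows Nightly; everything after it is SHA-2 signed.
WATERSHED = "20151209095500"

def next_update(buildid, latest):
    """Return the buildid a client should update to next.

    Builds older than the watershed are routed to the watershed build
    first (the newest build whose SHA-1 signature they can still verify);
    the watershed and anything newer update straight to the latest Nightly.
    """
    if buildid < WATERSHED:
        return WATERSHED
    if buildid < latest:
        return latest
    return None  # already up to date
```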
One of the challenges of maintaining a legacy system is deciding how much effort should be invested in improvements. Since modern vcs-sync is “right around the corner”, I have been avoiding looking at improvements to legacy (which is still the production version for all build farm use cases).
While adding another gaia branch, I noticed that the conversion path for
active branches was both highly variable and frustratingly long. It
usually took 40 minutes for a commit to an active branch to trigger a
build farm build. And worse, that time could easily be 60 minutes if the
stars didn’t align properly. (Actually, that’s the conversion time for
git -> hg. There’s an additional 5-7 minutes, worst case, for b2g_bumper to
generate the trigger.)
The full details are in bug 1226805, but a simple rearrangement of
the jobs removed the 50% variability in the times and cut the average
time by 50% as well. That’s a savings of 20-40 minutes per gaia push!
Moral: don’t take your eye off the legacy systems – there still can be
some gold waiting to be found!
I thought I would share a few tips I've learned over the years of how to make
the most of these company gatherings. These summits or workweeks are always
full of awesomeness, but they can also be confusing and overwhelming.
#1 Seek out people
It's great to have a (short!) list of people you'd like to see in person. Maybe somebody you've only met on IRC, Vidyo, or Bugzilla?
Having a list of people you want to say "thank you" in person to is
a great way to approach this. Who doesn't like to hear a sincere "thank
you" from someone they work with?
#2 Take advantage of increased bandwidth
I don't know about you, but I can find it pretty challenging at times to
get my ideas across in IRC or on an etherpad. It's so much easier in
person, with a pad of paper or whiteboard in front of you. You can share
ideas with people, and have a latency/lag-free conversation! No more
fighting AV issues!
#3 Don't burn yourself out
A week of full days of meetings, code sprints, and blue sky dreaming can
be really draining. Don't feel bad if you need to take a breather. Go for
a walk or a jog. Take a nap. Read a book. You'll come back refreshed, and
ready to engage again.
Over the past few quarters we've been working to migrate our infrastructure
off of the ageing "FTP" system to Amazon S3.
We've maintained some backwards compatibility for the time being, so that current Firefox CI and release builds are still available via ftp.mozilla.org, or preferably, archive.mozilla.org, since we don't support the FTP protocol any more!
This is a pretty big change, but we really think it will make it easier to find the builds you're looking for.
The Taskcluster Index allows us to attach multiple "routes" to a build
job. Think of a route as a kind of hierarchical tag, or directory. Unlike
regular directories, a build can be tagged with multiple routes, for
example, according to the revision or buildid used.
Similar routes exist for other platforms, for B2G and mobile, and for opt/debug variations. I
encourage you to explore the gecko.v2
namespace, and see if it makes things easier for you to find what you're
looking for! 
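For a concrete picture of how routes translate to URLs, here's a small sketch using the index URL scheme of the time (the routes and artifact path below are illustrative examples of the gecko.v2 namespace, not an exhaustive reference):

```python
def index_url(route, artifact="public/build/target.tar.bz2"):
    """Map an index route to the URL of one of the indexed task's artifacts.

    (URL scheme as the Taskcluster index exposed it circa 2015;
    illustrative rather than a current reference.)
    """
    return ("https://index.taskcluster.net/v1/task/"
            f"{route}/artifacts/{artifact}")

# The same build job can carry several routes, e.g. by branch and
# by revision (example routes, not a complete list):
routes = [
    "gecko.v2.mozilla-central.latest.firefox.linux64-opt",
    "gecko.v2.mozilla-central.revision.abcdef123456.firefox.linux64-opt",
]
```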
Can't find what you want in the index? Please let us know!
On November 13th, I attended the USENIX Release Engineering Summit in Washington, DC. This summit was alongside the larger LISA conference at the same venue. Thanks to Dinah McNutt, Gareth Bowles, Chris Cooper, Dan Tehranian and John O'Duinn for organizing.
I gave two talks at the summit. One was a long talk on how we have scaled our Android testing infrastructure on AWS, as well as a look back at how it evolved over the years.
Picture by Tim Norris - Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0) https://www.flickr.com/photos/tim_norris/2600844073/sizes/o/
I gave a second lightning talk in the afternoon on the problems we face with our large distributed continuous integration, build, and release pipeline, and how we are working to address them. The theme of this talk was that managing a large distributed system is like being the caretaker for the water system (or, some days, the sewer system) of a city. We are constantly looking for leaks and implementing monitoring, and we will probably have to replace the system with something new while keeping the existing one running.
Picture by Korona Lacasse - Creative Commons 2.0 Attribution 2.0 Generic https://www.flickr.com/photos/korona4reel/14107877324/sizes/l
In preparation for this talk, I did a lot of reading on complex systems design and designing for recovery from failure in distributed systems. In particular, I read Donella Meadows' book Thinking in Systems. (Cate Huston reviewed the book here.) I also watched several talks by people who discussed the challenges they face managing their distributed systems, including the following:
I'd also like to thank all the members of Mozilla releng/ateam who reviewed my slides and provided feedback before I gave the presentations.
The attendees of the summit attended the same keynote as the LISA attendees. Jez Humble, well known for his Continuous Delivery and Lean Enterprise books, provided a keynote on Lean Configuration Management which I really enjoyed. (Older versions of the slides, from other conferences, are available here and here.)
In particular, I enjoyed his discussion of the cultural aspects of devops. I especially like that he stated that "You should not have to have planned downtime or people working outside business hours to release". He also talked a bit about how many of the leaders that are looked up to as visionaries in the tech industry are known for not treating people very well and this is not a good example to set for others who believe this to be the key to their success. For instance, he said something like "what more could Steve Jobs have accomplished had he treated his employees less harshly".
Another concept he discussed which I found interesting was that of the strangler application. When moving away from a large monolithic application, the goal is to split the existing functionality out into services until the original application is left with nothing. This is exactly what Mozilla releng is doing as we migrate from Buildbot to Taskcluster.
At the release engineering summit itself, Lukas Blakk from Pinterest gave a fantastic talk, Stop Releasing off Your Laptop—Implementing a Mobile App Release Management Process from Scratch in a Startup or Small Company. This included a grumpy cat picture to depict how Lukas thought the rest of the company felt when a more structured release process was implemented.
Lukas also included a timeline of the tasks she implemented in her first six months working at Pinterest. Very impressive to see the transition!
Another talk I enjoyed was Chaos Patterns - Architecting for Failure in Distributed Systems by Jos Boumans of Krux. (Similar slides from an earlier conference here). He talked about some high profile distributed systems that failed and how chaos engineering can help illuminate these issues before they hit you in production.
For instance, it is impossible for Netflix to model their entire system outside of production, given that they account for around one third of nightly downstream bandwidth in the US.
Evan Willey and Dave Liebreich from Pivotal Cloud Foundry gave a talk entitled "Pivotal Cloud Foundry Release Engineering: Moving Integration Upstream Where It Belongs". I found this talk interesting because they talked about how they built Concourse, a CI system that is more scalable and natively builds pipelines. Travis and Jenkins are good for small projects, but they simply don't scale for large numbers of commits, platforms to test, or complicated pipelines. We followed a similar path that led us to develop TaskCluster.
There were many more great talks, hopefully more slides will be up soon!
We’ve had lots of interest already in our advertised internship position, and that’s great. However, many of the applications I’ve looked at won’t pan out because they overlooked a key line in the posting:
*Only local candidates will be considered for this role.*
That’s right, we’re only able to accept interns who are legally able to work in Canada.
The main reason behind this is that all of our potential mentors are in Toronto, and having an engaged, local mentor is one of the crucial determinants of a successful internship. In the past, it was possible for Mozilla to sponsor foreign students to come to Canada for internships, but recent changes to visa and international student programs have made the bureaucratic process (and concomitant costs) a nightmare to manage. Many applicants simply aren’t eligible any more under the new rules either.
I’m not particularly happy about this, but it’s the reality of our intern hiring landscape. Some of our best former interns have come from abroad, and I’ve already seen some impressive resumes this year from international students. Hopefully one of the non-Toronto-based positions will still appeal to them.
This was all thanks to F3real, who joined us from Mozilla's community and released his first Python package. He also brought over the integration tests we wrote for it. Here's the issue and PR if you're curious.
F3real will now be looking at removing the buildapi module from mozci and making use of the python package instead.
I had the privilege of attending MozFest
last week. Overall it was a really great experience. I met lots of really
wonderful people, and learned about so many really interesting and inspiring projects.
My biggest takeaway from MozFest was how important it is to provide good APIs
and data for your systems. You can't predict how somebody else will be able to
make use of your data to create something new and wonderful. But if you're not
making your data available in a convenient way, nobody can make use of it at all.
It was a really good reminder for me. We generate a lot of data in Release
Engineering, but it's not always exposed in a way that's convenient for other
people to make use of.
The rest of this post is a summary of various sessions I attended.
Friday night started with a Science Fair. Lots of really interesting stuff here.
Some of the projects that stood out for me were:
naturebytes - a DIY wildlife camera based on the
raspberry pi, with an added bonus of aiding conservation efforts.
histropedia - really cool visualizations of
time lines, based on data in Wikipedia and Wikidata. This was the first time
I'd heard of Wikidata, and the possibilities were very exciting to me! More on
this later, as I attended a whole session on Wikidata.
Several projects related to the Internet-of-Things (IOT)
On Saturday, the festival started with some keynotes. Mark Surman spoke about how
MozFest was a bit chaotic, but this was by design. In a similar way that the web
is an open platform that you can use as a platform for building your own ideas,
MozFest should be an open platform so you can meet, brainstorm, and work on your
ideas. This means it can seem a bit disorganized, but that's a good thing :) You
get what you want out of it.
I attended several good sessions on Saturday as well:
Ending online tracking. We
discussed various methods currently used to track users, such as cookies and
fingerprinting, and what can be done to combat these. I learned, or
re-learned, about a few interesting Firefox extensions as a result:
privacybadger. Similar to Firefox's
tracking protection, except it doesn't rely on a central blacklist. Instead,
it tries to automatically identify third party domains that are setting
cookies, etc. across multiple websites. Once identified, these third party
domains are blocked.
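That heuristic can be sketched roughly as follows. This is a minimal illustration, not Privacy Badger's actual code: the three-site threshold and all of the names here are assumptions.

```python
from collections import defaultdict

# Hypothetical threshold: treat a third-party domain as a tracker once it
# has been seen setting cookies across this many distinct first-party sites.
BLOCK_THRESHOLD = 3

class TrackerDetector:
    def __init__(self):
        # third-party domain -> set of first-party sites where it set cookies
        self.sightings = defaultdict(set)

    def observe(self, first_party, third_party):
        """Record that `third_party` set a cookie during a visit to `first_party`."""
        self.sightings[third_party].add(first_party)

    def is_blocked(self, third_party):
        # Cross-site presence, not any central blacklist, drives the decision.
        return len(self.sightings[third_party]) >= BLOCK_THRESHOLD

detector = TrackerDetector()
for site in ("news.example", "shop.example", "blog.example"):
    detector.observe(site, "ads.tracker.example")

print(detector.is_blocked("ads.tracker.example"))  # True: seen on three sites
print(detector.is_blocked("cdn.example"))          # False: never observed
```

The appeal of this approach is that the block list emerges from your own browsing, so it adapts to new trackers without waiting for a curated list to update.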
Intro to D3JS. d3js is a JS data visualization library. It's quite powerful,
but something I learned is that you're expected to do quite a bit of work
up-front to make sure it's showing you the things you want. It's not great as
a data exploration library, where you're not sure exactly what the data means,
and want to look at it from different points of view. The nvd3 library may be more suitable for first-time users.
6 kitchen cases for IOT We
discussed the proposed IOT design manifesto
briefly, and then split up into small groups to try and design a product,
using the principles outlined in the manifesto. Our group was tasked with
designing some product that would help connect hospitals with amateur chefs in
their local area, to provide meals for patients at the hospital. We ended up
designing a "smart cutting board" with a built in display, that would show you
your recipes as you prepared them, but also collect data on the frequency of
your meal preparations, and what types of foods you were preparing.
Going through the exercise of evaluating the product with each of the design
principles was fun. You could be pretty evil going into this and try and
collect all customer data :)
How to fight an internet shutdown -
we role-played how we would react if the internet was suddenly shut down
during some political protests. What kind of communications would be
effective? What kind of preparation can you have done ahead of time for such an event?
This session was run by Deji from accessnow.
It was really eye opening to see how internet shutdowns happen fairly
regularly around the world.
Data is beautiful
Introduction to wikidata
Wikidata is like Wikipedia, but for data. An open database of...stuff. Anybody
can edit and query the database. One of the really interesting features of
Wikidata is that localization is kind of built-in as part of the design. Each
item in the database is assigned an id (prefixed by "Q"). E.g. Q42 is Douglas
Adams. The description for each item is simply a table of locale -> localized
description. There's no inherent bias towards English, or any other language.
The beauty of this is that you can reference the same piece of data from
multiple languages, only having to focus on localizing the various
descriptions. You can imagine different translations of the same Wikipedia
page right now being slightly inconsistent due to each one having to be
updated separately. If they could instead reference the data in Wikidata, then
there's only one place to update the data, and all the other places that
reference that data would automatically benefit from it.
The query language is quite powerful as well. A simple demonstration was "list
all the works of art in the same room in the Louvre as the Mona Lisa."
It really got me thinking about how powerful open data is. How can we in
Release Engineering publish our data so others can build new, interesting and
useful tools on top of it?
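As a sketch of what such a query looks like, here is roughly how the Mona Lisa example could be sent to Wikidata's public SPARQL endpoint. The identifiers are assumptions from memory and worth double-checking: Q12418 for the Mona Lisa, P276 for "location".

```python
from urllib.parse import urlencode

# "Everything in the same room as the Mona Lisa" — a sketch, with
# assumed identifiers (Q12418 = Mona Lisa, P276 = location).
query = """
SELECT ?item ?itemLabel WHERE {
  wd:Q12418 wdt:P276 ?room .   # the room where the Mona Lisa is located
  ?item wdt:P276 ?room .       # every item recorded in that same room
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

# Wikidata exposes a public SPARQL endpoint; the query travels as a URL parameter.
endpoint = "https://query.wikidata.org/sparql"
url = endpoint + "?" + urlencode({"query": query, "format": "json"})
print(url[:60])
```

Note how the label service handles localization: the query itself is language-neutral, and you only ask for English labels at the end.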
Local web Various options
for purely local web / networks were discussed. There are some interesting
mesh network options available; commotion
was demoed. These kinds of distributions give you file exchange, messaging,
email, etc. on a local network that's not necessarily connected to the internet.
I’ve been remiss in (re)introducing our latest hire in release engineering here at Mozilla.
Mihai Tabara is a two-time former intern who joins us again, now in a full-time capacity, after a stint as a release engineer at Hortonworks. He’s in Toronto this week with some other members of our team to sprint on various aspects of release promotion.
After a long hiring drought for releng, it’s great to be able to welcome someone new to the team, and even better to be able to welcome someone back. Welcome, Mihai!
We maintain a set of “latest” directories on ftp.mozilla.org to provide a stable location for scripted downloads. There are similar links for betas and extended support releases for organisations. Read on to learn how these directories have changed, and how you can continue to download the latest releases.
Until recently these directories were implemented using a symlink to the current version, for example firefox/releases/42.0/. The storage backend has now changed to Amazon S3 and this is no longer possible. To implement the same functionality we’d need a duplicate set of keys, which incurs more maintenance overhead. And we already have a mechanism for delivering files independent of the current shipped version – our download redirector Bouncer. For example, here’s the latest release for Windows 32bit, U.S. English:
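A quick sketch of how such Bouncer URLs are assembled. The `product`, `os`, and `lang` parameter names match Bouncer's public interface on download.mozilla.org, though the exact set of product aliases available is worth verifying:

```python
from urllib.parse import urlencode

def bouncer_url(product, os_name, lang):
    """Build a Bouncer redirect URL for a given product/platform/locale."""
    return "https://download.mozilla.org/?" + urlencode(
        {"product": product, "os": os_name, "lang": lang}
    )

# Latest Windows 32-bit, U.S. English release:
print(bouncer_url("firefox-latest", "win", "en-US"))
# → https://download.mozilla.org/?product=firefox-latest&os=win&lang=en-US
```

Because Bouncer resolves the alias server-side, a script using this URL keeps working across version bumps with no symlinks required on the storage backend.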
I was fortunate enough to be able to attend DevOpsDays Silicon
Valley this year. One of the main talks was given by
Jason Hand, and he made some great points. I wanted to highlight two
of them in this post:
Post Mortems are really learning events, so you should hold them
when things go right, right? RIGHT!! (Seriously, why
wouldn’t you want to spot your best ideas and repeat them?)
Systems are hard – if you’re pushing the envelope, you’re
teetering on the line between complexity and chaos. And we’re
all pushing the envelope these days - either by getting fancy
or getting lean.
Post Mortems as Learning Events
Our industry has talked a lot about “Blameless Post Mortems”, and
techniques for holding them. Well, we can call them “blameless” all we
want, but if we only hold them when things go wrong, folks will get the
message loud and clear.
If they are truly blameless learning events, then you would also hold
them when things go right. And go meh. Radical idea? Not really - why
else would sports teams study game films when they win? (This point was
also made in a great Ignite talk by Katie Rose: GridIronOps - go read her slides.)
My $0.02 is - this would also give us a chance to celebrate success.
That is something we do not do enough, and we all know the dedication
and hard work it takes to not have things go sideways.
And, by the way, terminology matters during the learning event. The
person who is accountable for an operation is just that: capable of
giving an account of the operation. Accountability is not blame.
Terminology and Systems – Setting the right expectations
Part way through Jason’s talk, he has this awesome slide about how
system complexity relates to monitoring which relates to problem
resolution. Go look at slide 19 - here’s some of what I find
amazing in that slide:
It is not a straight line with a destination. Your most stable
system can suddenly display inexplicable behavior due to any
number of environmental reasons. And you’re back in the chaotic
world with all that implies.
Systems can progress out of chaos, but that is an uphill battle.
Knowing which stage a system is in (roughly) informs the approach
to problem resolution.
Note the wording choices: “known” vs “unknowable” – for all but
the “obvious” case, it will be confusing. That is a property of
the system, not a matter of staff competency.
While not in his slide, Jason spoke to how each level really has
different expectations. Or should have, but often the appropriate
expectation is not set. Here's how he related each level to industry practices:
The only level with enough certainty to be able to expect the “best”
is the known and familiar one. This is the “obvious” one, because
we’ve all done exactly this before over a long enough time period to
fully characterize the system, its boundaries, and abnormal behaviors.
Here, cause and effect are tightly linked. Automation (in real time) is appropriate here.
Once we back away from such certainty, it is only realistic to have
less certainty in our responses. With the increased uncertainty, the
linkage of cause and effect is more tenuous.
Even if we have all the event history and logs in front of us, more
analysis is needed before appropriate corrective action can be
determined. Even with automation, there is a latency to the response.
Okay, now we are pushing the envelope. The system is complex, and
we are still learning. We may not have all the data at hand, and may
need to poke the system to see what parts are stuck.
Cause and effect should be related, but how will not be
visible until afterwards.
There is much to learn.
For chaotic systems, everything is new. A lot is truly unknowable
because that situation has never occurred before. Many parts of the
system are effectively black boxes. Thus resolution will often be a
process of trying something, waiting to see the results, and
responding to the new conditions.
There is so much more in that diagram I want to explore. The connecting
of problem resolution behavior to complexity level feels very powerful.
My experience tells me that many of these subjective terms are
highly context sensitive, and in no way absolute. Problem resolution
at 0300 local with a bad case of the flu just has a way of making
“obvious” systems appear quite complex or even chaotic.
By observing the behavior of someone trying to resolve a problem,
you may be able to get a sense of how that person views that system
at that time. If that isn’t the consensus view, then there is a gap.
And gaps can be bridged with training, documentation, or tooling.
Much of Q4 is spent planning and budgeting for the next year, so there’s been lots of discussion about which efforts will need support next year, which things we can accommodate with existing resources, and which will need additional resources.
And if planning and budgeting doesn’t scare some Hallowe'en spirit into you, I don’t know what will.
Modernize infrastructure: Q got most of the automated installation of Windows 10 working and is now focusing on making sure that jobs run.
Improve CI pipeline: Andrew (with help from Dustin) will be running a bunch of test suites in TaskCluster (based on TaskCluster-built binaries) at Tier 2.
Release: Callek built v1.5 of the OpenH264 Plugin, and pushed it out to the testing audience. Expecting to go live to our users in the next few weeks.
Callek managed to get “final verify” (an update test with all live urls) working on taskcluster in support of the “release promotion” project.
Firefox 42, our big moment-in-time release for the second half of 2015, gets released to the general public next week. Fingers are crossed.
Operational: Kim and Amy decommissioned about 50% of the remaining panda infrastructure (physical mobile testing boards) after shifting the load to AWS.
We repurposed 30 of our Linux64 talos machines for Windows 7 testing in preparation for turning on some e10s tests.
Kim implemented some changes to SETA which would allow us to configure the SETA parameters on a per platform basis (https://bugzil.la/1175568).
Rail performed the mozilla-central train uplifts a week early when the Release Management plans shifted, turning Nightly into Gecko 45. FirefoxOS v2.5 branch based on Gecko 44 has been created as a part of the uplift.
Callek investigated a few hours of nothing being scheduled on try on Tuesday, only to learn there was an issue with a unicode character in a commit message which broke JSON importing of the pushlog. He then rescheduled all of those jobs (https://bugzil.la/1218943).
Industry News: In addition to the work we do at Mozilla, a number of our people are leaders in industry and help organize, teach, and speak. These are some of the upcoming events people are involved with:
Today we started serving an important set of directories on ftp.mozilla.org using Amazon S3, more details on that over in the newsgroups. Some configuration changes landed in the tree to make that happen.
Please rebase your try pushes to use revision 0ee21e8d5ca6 or later, currently on mozilla-inbound. Otherwise your builds will fail to upload, which means they won’t run any tests. No fun for anyone.