Planet Release Engineering

May 17, 2017

Kim Moir (kmoir)

New blog location

I moved my blog to WordPress.

New location is here https://kimmoir.blog/

May 17, 2017 09:12 PM

April 18, 2017

Armen Zambrano G. (@armenzg)

Docker image to generate allthethings.json

I've created a lot of hackery in the past (mozci) based on Release Engineering's allthethings.json file, as well as improving the code that generates it reliably. This file contains metadata about the Buildbot setup and the relationships between builders (e.g. which build triggers which tests).

Now, I had never spent time ensuring that the setup to generate the file is reproducible. As I've moved between laptops over time, I've needed to modify the generation script to fit each new machine's setup.

Today I'm happy to announce that I've published a Docker image at Docker hub to help you generate this file anytime you want.

You can find the documentation and code here.

Please give it a try and let me know if you find any issues!
docker pull armenzg/releng_buildbot_docker
docker run --name allthethings --rm -i -t armenzg/releng_buildbot_docker bash
# This will generate an allthethings.json file; it will take a few minutes
/braindump/community/generate_allthethings_json.sh
# On another tab (once the script is done)
docker cp allthethings:/root/.mozilla/releng/repos/buildbot-configs/allthethings.json .
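
Once you have the file locally, it's easy to poke at from Python. Here's a minimal sketch; the top-level "builders" key is an assumption on my part, so print the keys first and adjust if the structure differs:

# Minimal sketch of inspecting allthethings.json from Python.
# The "builders" key is an assumption; print the top-level keys
# first and adjust if the structure differs.
import json

with open("allthethings.json") as f:
    data = json.load(f)

print("top-level keys:", sorted(data.keys()))
builders = data.get("builders", {})
print("number of builders:", len(builders))
for name in sorted(builders)[:5]:  # sample a few builder names
    print(name)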


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

April 18, 2017 03:14 PM

March 13, 2017

Chris Cooper (coop)

Shameless self (release) promotion: Firefox 53.0b1 from TaskCluster

You may recall two short months ago when we moved Linux and Android nightlies from buildbot to TaskCluster. Due to the train model, this put us (release engineering) on a clock: either we’d be ready to release a beta version of Firefox 53 for Linux and Android using release promotion in TaskCluster, or we’d need to hold back our work for at least the next cycle, causing uplift headaches galore.

I’m happy to report that we were able to successfully release Firefox 53.0b1 for Linux and Android from TaskCluster last week. This is impressive for 3 reasons:

  1. Mac and Windows builds were still promoted from buildbot, so we were able to seamlessly integrate the artifacts of two different continuous integration (CI) platforms.
  2. The process whereby nightly builds are generated has always been different from how we generate release builds. Firefox 53.0b1 represents the first time a beta build was generated using the same taskgraph we use for a nightly, thereby reducing the delta between CI builds and release builds. More work to be done here, for sure.
  3. Nobody noticed. With all the changes under the hood, this may be the most impressive achievement of all.

A round of thanks to Aki, Johan, Kim, and Mihai who worked hard to get the pieces in place for Android, and a special shout-out to Rail who handled the Linux beta while also dealing with the uplift requirements for ESR52. Of course, thanks to everyone else who has helped with the migration thus far. All of that foundational work is starting to pay off.

Much more to do, but I look forward to updating you about Mac and Windows progress soon.

March 13, 2017 07:17 PM

February 21, 2017

Chris Cooper (coop)

RelEng & RelOps highlights - February 21, 2017

It’s been a while. How are you?

Modernize infrastructure:

We finally closed the 9-year-old bug requesting that we redirect all HTTP traffic for hg.mozilla.org to HTTPS! Many thanks to everyone who helped ensure that automation and other tools continued to work normally. It's not every day you get to close bugs that are older than my kids. https://bugzilla.mozilla.org/show_bug.cgi?id=450645

The new TreeStatus page (https://mozilla-releng.net/treestatus) was finally released by garbas, with a proxy in place of the old URL.

Improve Release Pipeline:

Initial work on the Uplift dashboard was done by bastien and released to production by garbas. https://shipit.mozilla-releng.net/release-dashboard

Releng had a workweek in Toronto to plan how release promotion will work in a TaskCluster world. With the uplift for Firefox 52 rapidly approaching (see Release below), we came up with a multi-phase plan that should allow us to release the Linux and Android versions of Firefox 52 from TaskCluster, with the Mac and Windows versions still being created by buildbot.

Improve CI Pipeline:

Alin and Sebastian disabled Windows 10 tests on our CI. Windows 10 tests will be reappearing later this year once we move datacentres and acquire new hardware to support them. https://bugzilla.mozilla.org/show_bug.cgi?id=1330999

Andrei and Relops converted some Windows talos machines to run Linux64 to reduce wait times on this platform. https://bugzilla.mozilla.org/show_bug.cgi?id=1337452

There are some upcoming deadlines involving datacentre moves that, while not currently looming, are definitely focusing our efforts in the TaskCluster migration. As part of the aforementioned workweek, we targeted the next platform that needs to migrate, Mac OS X. We are currently breaking out the packaging and signing steps for Mac so that they can be done on Linux. That work can then be re-used for l10n repacks *and* release promotion.

Operational:

Since most of our Linux64 builds and tests have migrated to TaskCluster, Alin was able to shut down many of our Linux buildbot masters. This will reduce our monthly AWS bill and the complexity of our operational environment. https://bugzilla.mozilla.org/show_bug.cgi?id=1335435

Hal ran our first “hard close” Tree Closing Window (TCW) in quite a while on Saturday, February 11 (https://bugzilla.mozilla.org/show_bug.cgi?id=1324148). It ran about an hour longer than planned due to some strange interactions deep in the back end, which is why it was a “hard close.” The issue may be related to occasional “database glitches” we have seen in the past. This time IT got some data, and have raised a case with our load balancer vendor.

Release:

We are deep in the beta cycle for Firefox 52, with beta 8 coming out this week. Firefox 52 is an important milestone release because it signals the start of another ESR cycle.

See you again soon!

February 21, 2017 04:18 PM

February 17, 2017

Chris Cooper (coop)

Being productive when distributed teams get together, take 2

Every year, hundreds of release engineers swim upstream because they're built that way.

Last week, we (Mozilla release engineering) had a workweek in Toronto to jumpstart progress on the TaskCluster (TC) migration. After the success of our previous workweek for release promotion, we were anxious to try the same format once again and see if we could realize any improvements.

Prior preparation prevents panic

We followed all of the recommendations in the Logistics section of Jordan’s post to great success.

Keeping developers fed & watered is an integral part of any workweek. If you ever want to burn a lot of karma, try building consensus between 10+ hungry software developers about where to eat tonight, and then finding a venue that will accommodate you all. Never again; plan that shit in advance. Another upshot of advance planning is that you can also often go to nicer places that cost the same or less. Someone on your team is a (closet) foodie, or is at least a local. If it’s not you, ask that person to help you with the planning.

What stage are you at?

The workweek in Vancouver benefitted from two things:

  1. A week of planning at the All-Hands in Orlando the month before; and,
  2. Rail flying out to Vancouver a week early to organize much of the work to be done.

For this workweek, it turned out we were still at the planning stage, but that’s totally fine! Never underestimate the power of getting people on the same page. Yes, we did do *some* hacking during the week. Frankly, I think it’s easier to do the hacking bit remotely, but nothing beats a bunch of engineers in a room in front of a whiteboard for planning purposes. As a very distributed team, we rarely have that luxury.

Go with it

…which brings me to my final observation. Because we are a very distributed team, opportunities to collaborate in person are infrequent at best. When you do manage to get a bunch of people together in the same room, you really do need to go with discussions and digressions as they develop.

This is not to say that you shouldn’t facilitate those discussions, timeboxing them as necessary. If I have one nit to pick with Jordan’s post it’s that the “Operations” role would be better described as a facilitator. As a people manager for many years now, this is second-nature to me, but having someone who understands the problem space enough to know “when to say when” and keep people on track is key to getting the most out of your time together.


By and large, everything worked out well in Toronto. It feels like we have a really solid format for workweeks going forward.

February 17, 2017 10:48 PM

January 20, 2017

Chris Cooper (coop)

Nightlies in TaskCluster - go team!

As catlee has already mentioned, yesterday we shipped the first nightly builds for Linux and Android off our next-gen Mozilla continuous integration (CI) system known as TaskCluster. I eventually want to talk more about why this is important and how we got here, but for now I’d like to highlight some of the people who made this possible.

Thanks to Aki’s meticulous work planning and executing on a new chain of trust (CoT) model, the nightly builds we now ship on TaskCluster are arguably more secure than our betas and releases. Don’t worry though, we’re hard at work porting the chain of trust to our release pipeline. Jordan and Mihai tag-teamed the work to get the chain-of-trust-enabled workers doing important things like serving updates and putting binaries in the proper spots. Kim did the lion’s share of the work getting our task graphs sorted to tie together the disparate pieces. Callek wrangled all of the l10n bits. On the testing side, gbrown did some heroic work getting reliable test images setup for our Linux platforms. Finally, I’d be remiss if I didn’t also call out Dustin who kept us all on track with his migration tracker and who provided a great deal of general TaskCluster platform support.

Truly it was a team effort, and thanks to all of you for making this particular milestone happen. Onward to Mac, Windows, and release promotion!

January 20, 2017 07:21 PM

Chris AtLee (catlee)

Nightly builds from Taskcluster

Yesterday, for the very first time, we started shipping Linux Desktop and Android Firefox nightly builds from Taskcluster.


We now have a much more secure, resilient, and hackable nightly build and release process.

It's more secure, because we have developed a chain of trust that allows us to verify all generated artifacts back to the original decision task and docker image. Signing is no longer done as part of the build process, but is now split out into a discrete task after the build completes.
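
To make the idea concrete, here's a toy sketch of what a chain of trust walk looks like: each task publishes signed claims about what it did and which task fed into it, and a verifier follows those links back to the decision task. (The record layout is invented for illustration, and HMAC stands in for the real GPG signatures.)

import hmac, hashlib, json

SECRET = b"demo-key"  # stand-in for real GPG key material

def sign(claims):
    payload = json.dumps(claims, sort_keys=True).encode()
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()

def verify_chain(task_id, records):
    # Verify this task's signed claims, then follow the upstream link.
    rec = records[task_id]
    if not hmac.compare_digest(sign(rec["claims"]), rec["signature"]):
        raise ValueError("bad signature on task %s" % task_id)
    upstream = rec["claims"]["upstream"]
    return True if upstream is None else verify_chain(upstream, records)

# Tiny demo chain: signing task <- build task <- decision task.
records = {}
for tid, upstream in [("decision", None), ("build", "decision"),
                      ("signing", "build")]:
    claims = {"task": tid, "upstream": upstream}
    records[tid] = {"claims": claims, "signature": sign(claims)}
print(verify_chain("signing", records))  # True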

The new process is more resilient because we've split up the monolithic build process into smaller bits: build, signing, symbol upload, upload to CDN, and publishing updates are all done as separate tasks. If any one of these fail, they can be retried independently. We don't have to re-compile the entire build again just because an external service was temporarily unavailable.
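
In sketch form (this is schematic Python, not actual task definitions), the resilience win is that each step gets its own retry loop, so a CDN hiccup re-runs only the upload:

import time

def run_with_retries(name, step, attempts=3, delay=1):
    # Each step retries independently; a failure here never forces
    # an upstream step (like the compile) to re-run.
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception as e:
            print("%s attempt %d failed: %s" % (name, attempt, e))
            time.sleep(delay)
    raise RuntimeError("%s failed after %d attempts" % (name, attempts))

steps = [
    ("build", lambda: "target.tar.bz2"),
    ("signing", lambda: "target.tar.bz2.asc"),
    ("symbol-upload", lambda: "ok"),
    ("cdn-upload", lambda: "ok"),
    ("publish-updates", lambda: "ok"),
]
for name, step in steps:
    print(name, "->", run_with_retries(name, step))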

Finally, it's more hackable - in a good way! All the configuration files for the nightly build and release process are contained in-tree. That means it's easier to inspect and change how nightly builds are done. Changes will automatically ride the trains to aurora, beta, etc.

Ideally you didn't even notice this change! We try and get these changes done quietly, smoothly, in the background.

This is a giant milestone for Mozilla's Release Engineering and Taskcluster teams, and is the result of many months of hard work, planning, coding, reviewing and debugging.

Big big thanks to jlund, Callek, mtabara, kmoir, aki, dustin, sfraser, jlorenzo, coop, jmaher, bstack, gbrown, and everybody else who made this possible!

January 20, 2017 01:35 PM

December 29, 2016

Chris Cooper (coop)

Buildduty year in review: 2016

I’ve just finished my milk

I wanted to post a follow-up to plug an accidental omission in catlee’s 2016 RelEng Retrospective. Buildduty is one of those things we rarely talk about publicly; it is simultaneously on the front-lines of developer support for releng, and yet somehow consistently flies under the radar.

2016 was the first year where we had full-time buildduty support from Softvision. Prior to their involvement, buildduty was a responsibility that rotated through the rest of releng. We tried various models and cadences, but because of the interrupt-driven nature of the work, buildduty was avoided like a pariah or an Australian table wine.

As we approach 2017, I now consider buildduty a solved problem. I can’t imagine what releng life would be like without the buildduty help we get from Alin and Andrei. Yes, there’s still lots of work to do, but I no longer worry about *how* we’ll get it done.

This is largely a testament to the work put in by Softvision. In fact, in the spring of 2016, there was a personnel change with one of our buildduty contractors leaving for another opportunity. Softvision trained up a replacement (Andrei) in the background and subbed them in, and they barely missed a beat.

I’d like to also explicitly call out the increasing amount of buildduty management help provided by kmoir.

How exactly is buildduty helping? The list is long, varied, and growing: machine loans, package uploads, puppet environment changes, config changes, platform capacity work, debugging help, etc.

Here are some aggregate numbers from 2016:

That last number is the one that intrigues me the most: 59 patches submitted means that buildduty has moved past simple maintenance and is now proactively fixing problems for releng. Indeed, over the past year buildduty has taken on an increasing ownership role of our soon-to-be legacy buildbot infrastructure. We now routinely ask them to make changes to the buildbot configs on their own…with proper review, of course!

As buildbot moves further into maintenance mode, I’m excited to see what new skills and responsibilities our buildduty team can pick up to help even more in 2017.

December 29, 2016 08:51 PM

December 28, 2016

Chris AtLee (catlee)

2016 RelEng Retrospective

As 2016 winds down, I wanted to take some time to highlight all the work our Release Engineering team has done this year. Personally, I really enjoy writing these retrospective posts. I think it's good to spend some time remembering how far we've come in a year. It's really easy to forget what you did last month, and 6 months ago seems like ancient history!

People!

We added four people to our team this year!

Aki (:aki) re-joined us in January and has been working hard on developing a security model for Taskcluster for sensitive tasks like signing and publishing binaries.

Rok (:garbas) started in February and has been working on modernizing our web application framework development and deployment processes.

Johan (:jlorenzo) started in August and has been improving our release automation, Balrog, and automatically publishing Android builds to the Google Play Store.

Simon (:sfraser) started in October and has been improving monitoring of our production systems, as well as getting his feet wet with our partial update generation system.

Releases

This year we released 104 desktop versions of Firefox, and 58 Android versions (including Beta, Release and ESR branches).

5 of those releases were just in the week prior to our all hands meeting in Hawaii!

Several other releases this year were special for particular reasons, and required special efforts on our part. We continued to provide SHA-1 signed installers for Windows XP users. We also produced a special 47.0.2 release in order to try and rescue users stuck on 47. We've never shipped a point release for a previous release branch before! We've also generated partial updates to try and help users on 43.0.1 and 47.0.2 get faster updates to the latest version of Firefox.

Release promotion

We couldn't have shipped so many releases so quickly last week if it weren't for release promotion. Previous to Firefox 46, our release process would generate completely new builds after CI was finished. This wasted a lot of time, and also meant we weren't shipping the exact binaries we had tested. Today, we ship the same builds that CI has generated and tested. This saves a ton of time (up to 8 hours!), and gives us a lot more confidence in the quality of the release.

This is one of those major kinds of changes that really transforms how we approach doing releases. I can't really remember what it was like doing releases prior to release promotion!

We also added support in Shipit to allow starting a release before all the en-US builds are done. This lets our Release Management team kick off a release early, assuming all the builds pass. It saves a person having to wait around watching Treeherder for the coveted green builds.

Windows in AWS

This year we completed our migration to AWS for Windows builds. 100% of our Windows builds are now done in AWS. This means that we now have a much faster and more scalable Windows build platform.

In addition, we also migrated most of the Windows 7 unittests to run in AWS. Previously these were running on dedicated hardware in our datacentre. By moving these tests to AWS, we again get a much more scalable test platform, but we also freed up hardware capacity for other test platforms (e.g. Windows XP).

Taskcluster

One of our major focus areas this year was migrating our infrastructure from Buildbot to Taskcluster. As of today, we have:

  • Fully migrated Linux64 and Android debug builds and tests
  • Builds for all other platforms operating as Tier2
  • Linux64 and Android nightly builds, l10n repacks and updates operating as Tier2
  • Tons of security design & implementation work

Balrog

Scheduled Changes in Balrog mean that machines can now set the background update rate to 0% 24 hours after release, instead of having a human do it.
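
The idea, in sketch form (hypothetical code, not Balrog's actual API):

from datetime import datetime, timedelta

scheduled = []

def schedule_change(rule_id, field, value, when):
    # Queue a rule change to be applied automatically at "when".
    scheduled.append({"rule_id": rule_id, "field": field,
                      "value": value, "when": when})

def apply_due_changes(rules, now):
    for change in [c for c in scheduled if c["when"] <= now]:
        rules[change["rule_id"]][change["field"]] = change["value"]
        scheduled.remove(change)

rules = {"firefox-release": {"backgroundRate": 100}}
shipped = datetime(2016, 11, 15, 14, 0)
schedule_change("firefox-release", "backgroundRate", 0,
                shipped + timedelta(hours=24))
apply_due_changes(rules, now=shipped + timedelta(hours=25))
print(rules)  # {'firefox-release': {'backgroundRate': 0}}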

Balrog itself was migrated from our datacentre in SCL3 into AWS. We now have a much more flexible deployment pipeline.

Balrog has also been one of our best projects for getting volunteer contributions! Much of the work done this year was done by contributors!

RIP

Being able to shut off old, crufty and deprecated stuff is an important part of staying agile. This year we were finally able to develop an end of life plan for Windows XP. In addition, we discontinued support for OSX 10.6-10.8, systems without SSE2, and 32-bit OSX systems. Not having to support these old platforms simplifies managing our infrastructure, and also makes product development easier.

We also shut down all the panda mobile testing infrastructure and legacy vcs-sync.

What's next?

2017 is looking like it's going to be another interesting (and busy!) year for RelEng.

Our top priority is to finish the migration to Taskcluster. Hopefully by the end of 2017, the only thing left on buildbot will be the ESR52 branch. This will require some big changes to our release automation, especially for Fennec.

We're also planning to provide some automated processes to assist with the rest of the release process. Releases still involve a lot of human to human handoffs, and places where humans are responsible for triggering automation. We'd like to provide a platform to be able to manage these handoffs more reliably, and allow different pieces of automation to coordinate more effectively.

December 28, 2016 07:35 PM

November 29, 2016

Chris Cooper (coop)

RelEng & RelOps highlights - November 29, 2016

Welcome back. As the podiatrist said, lots of exciting stuff is afoot.

Modernize infrastructure:

The big news from the past few weeks comes from the TaskCluster migration project where we now have nightly updates being served for both Linux and Android builds on the Date project branch. If you’re following along in Treeherder, this is the equivalent of “tier 2” status. We’re currently working on polish bugs and a whole bunch of verification work before we attempt to elevate these new nightly builds to tier 1 status on the mozilla-central branch, effectively supplanting the buildbot-generated variants. We hope to achieve that goal before the end of 2017. Even tier 2 is a huge milestone here, so cheers to everyone on the team who has helped make this happen, chiefly Aki, Callek, Kim, Jordan, and Mihai.

A special shout-out to Dustin who helped organize the above migration work over the past few months by writing a custom dependency-tracking tool. The code is here: https://github.com/taskcluster/migration and you can see its output here: http://migration.taskcluster.net/ It’s been super helpful!

Improve Release Pipeline:

Many improvements to Balrog were put into production this past week, including one from a new volunteer. Ben blogged about them in detail.

Aki shipped several scriptworker releases to stabilize polling and GPG homedir creation. scriptworker 1.0.0b1 enables chain of trust verification.

Aki added multi-signing-format capability to scriptworker and signingscript; this is live on the Date project branch.

Aki added a shared scriptworker puppet module, making it easier to add new instance types. https://bugzilla.mozilla.org/show_bug.cgi?id=1309293

Aki released dephash 0.3.0 with pip>=9.0.0 and hashin>=0.7.0 support.

Improve CI Pipeline:

Nick optimized our requests for AWS spot pricing, shaving several minutes off the runtime of the script which launches new instances in response to pending buildbot jobs.

Kim disabled Windows XP tests on trunk, and winxp talos on all branches (https://bugzilla.mozilla.org/show_bug.cgi?id=1310836 and https://bugzilla.mozilla.org/show_bug.cgi?id=1317716). Now Alin is rebalancing the Windows 8 pools so we can enable e10s testing on Windows 8 with the re-imaged XP machines. Recall that Windows XP is moving to the ESR branch with Firefox 52, which is currently on the Aurora/Developer Edition release branch.

Kim enabled Android x86 nightly builds on the Date project branch: https://bugzilla.mozilla.org/show_bug.cgi?id=1319546

Kim enabled SETA on the graphics projects branch to reduce wait times for test machines: https://bugzilla.mozilla.org/show_bug.cgi?id=1319490

Operational:

Rok has deployed the first service based on the new releng microservices architecture. You can find the new version of TryChooser here: https://mozilla-releng.net/trychooser/ More information about the services and framework itself can be found here: https://docs.mozilla-releng.net/

Release:

Firefox 50 has been released. We’re currently in the beta cycle for Firefox 51, which will be extra long to avoid trying to push out a major version release during the busy holiday season. We are still on-deck to release a minor security release during this period. Everyone involved in the process applauds this decision.

See you next *mumble* *mumble*!

November 29, 2016 09:27 PM

October 28, 2016

Armen Zambrano G. (@armenzg)

Usability improvements for Firefox automation initiative - Status update #8

NEW:
* Debugging on remote workers is no longer a main priority for this quarter since it has been completed

* We've added an investigation into hyper chunking

In this update we will look at the progress made in the last two weeks.

This quarter’s main focus is on improving end to end times on Try (Thunder Try project).

For all bugs and priorities you can check out:
https://wiki.mozilla.org/EngineeringProductivity/Projects/Debugging_UX_improvements

Thunder Try - Improve end to end times on try
---------------------------------------------

Project #1 - Artifact builds on automation
##########################################
Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1284882

Accomplished recently:

Upcoming:

Project #2 - S3 Cloud Compiler Cache
####################################
Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1280641

Accomplished recently:

Project #3 - Metrics
####################
Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1286856

Accomplished recently:

Upcoming:

Project #4 - Build automation improvements
##########################################
Nothing new for this edition

Project #5 - Hyper chunking
###########################
Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1262834

Nothing new for this edition

Project #6 - Run Web platform tests from the source checkout
############################################################
Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1286900

Feature blocked on misconfigured EBS volumes in AWS (bug 1305174)


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

October 28, 2016 01:33 PM

October 25, 2016

Chris Cooper (coop)

RelEng & RelOps highlights - October 24, 2016

OK, I’ve given up the charade that these are weekly now. Welcome back.

Modernize infrastructure:

Thanks to an amazing effort from Ed Morley and the rest of the Treeherder team, Treeherder has been migrated to Heroku, giving us significantly more flexible infrastructure.

Git-internal is no longer a standalone single point of failure (SPOF)! A warm standby host is running, and repository mirroring is in place. We now also have a fully matching staging environment for testing.

Improve Release Pipeline:

Aki and Catlee attended the security offsite and came away with todo items and a list to prioritize to improve release security.

Aki released scriptworker 0.8.0; this gives us signed chain of trust artifacts from scriptworkers, and gpg key management for chain of trust verification.

Improve CI Pipeline:

We now have nightly Linux64, Linux32 and Android 4.0 API15+ builds running on the Date branch in TaskCluster. Kim’s work to refactor the nightly task graph to transform the existing build “kind” into a signing “kind” has made adding new platforms quite straightforward, as the sketch below shows. See https://bugzilla.mozilla.org/show_bug.cgi?id=1267426 and https://bugzilla.mozilla.org/show_bug.cgi?id=1277579 for more details.
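
The transform idea, roughly (a simplified sketch of the pattern, with illustrative field names rather than the exact in-tree schema): each build task is mechanically turned into a matching signing task, so a new platform only needs a new build definition.

def make_signing_tasks(build_tasks):
    # For every build task, emit a signing task that depends on it.
    for build in build_tasks:
        yield {
            "label": build["label"].replace("build-", "signing-"),
            "kind": "signing",
            "dependencies": {"build": build["label"]},
            "upstream-artifacts": build.get("artifacts", []),
        }

builds = [
    {"label": "build-linux64-nightly", "artifacts": ["target.tar.bz2"]},
    {"label": "build-android-api-15-nightly", "artifacts": ["target.apk"]},
]
for task in make_signing_tasks(builds):
    print(task["label"], "depends on", task["dependencies"]["build"])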

There is still some remaining setup to be done, mostly around updates and moving artifacts into the proper locations (beetmover). Releng will then begin internal testing of these new nightlies (essentially dogfooding) to ensure that important things like updates are working correctly before we uplift this code to mozilla-central.

We hope to make that switch for Linux/Android nightlies within the next month, with Mac and Windows coming later this quarter.

Operational:

During a recent tree closing window (TCW), the database team managed to successfully switch the buildbot database from MyISAM to InnoDB format for improved stability. This is something we’ve wanted to do for many years and it’s good to see it finally done.

Release:

We’re currently on beta 10 for Firefox 50. This is noteworthy because in the next release cycle, Firefox 52 will be uplifted to the Aurora (developer edition) and Firefox 52 will be the last version of Firefox to support Windows XP, Windows Vista, and Universal binaries on Mac. Firefox 52 is due for release in March of 2017. Don’t worry though, all these platforms will be moving to the Firefox 52 ESR branch where they will continue to receive security updates for another year beyond that.

See you soon!

October 25, 2016 09:48 PM

October 21, 2016

Hal Wine (hwine)

Using Auto Increment Fields to Your Advantage

I just found, and read, Clément Delafargue’s post “Why Auto Increment Is A Terrible Idea” (via @CoreRamiro). I agree that an opaque primary key is very nice and clean from an information architecture viewpoint.

However, in practice, a serial (or monotonically increasing) key can be handy to have around. I was reminded of this during a recent situation where we (app developers & ops) needed to be highly confident that a replica was consistent before performing a failover. (None of us had access to the back end to see what the DB thought the replication lag was.)
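
For example (a sketch against any Python DB-API connection; the table and column names are placeholders): read the latest id from the primary first, then confirm the replica has reached it.

def latest_id(conn, table="events", id_col="id"):
    # Table/column names are placeholders; don't interpolate
    # untrusted names into SQL in real code.
    cur = conn.cursor()
    cur.execute("SELECT MAX(%s) FROM %s" % (id_col, table))
    return cur.fetchone()[0]

def replica_caught_up(primary_conn, replica_conn):
    # Read the primary first; if the replica's max id has reached it,
    # everything written before that point has replicated.
    target = latest_id(primary_conn)
    return latest_id(replica_conn) >= target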

Read more...

October 21, 2016 07:00 AM

September 30, 2016

Kim Moir (kmoir)

Beyond the Code 2016 recap

I've had the opportunity to attend the Beyond the Code conference for the past two years.  This year, the venue moved to a location in Toronto; the last two events had been held in Ottawa.  The conference is organized by Shopify, who again managed to have a really great speaker lineup this year on a variety of interesting topics.  It was a two-track conference, so I'll summarize some of the talks I attended.

The conference started off with Anna Lambert of Shopify welcoming everyone to the conference.

The first speaker was Atlee Clark, Director of App and Developer Relations at Shopify, who discussed the wheel of diversity.


The wheel of diversity is a way of mapping the characteristics that you're born with (age, gender, gender expression, race or ethnicity, national origin, mental/physical ability), along with those that you acquire through life (appearance, education, political belief, religion, income, language and communication skills, work experience, family, organizational role).  When you look at your team, you can map how diverse it is by assigning a colour to each characteristic.  (Of course, some of these characteristics are personal and might not be shared with others.)  If you map your team and it's mostly the same colour, then you probably will not bring different perspectives together when you work, because you all have similar backgrounds and life experiences.  This is especially important when developing products.

This wheel also applies to hiring too.  You want to have different perspectives when you're interviewing someone.  Atlee mentioned when she was hiring for a new role, she mapped out the characteristics of the people who would be conducting the hiring interviews and found there was a lot of yellow.


So she switched up the team that would be conducting the interviews to include people with more diverse perspectives.

She finished by stating that this is just a tool, keep it simple, and practice makes it better. 

The next talk was by Erica Joy, who is a build and release engineer at Slack, as well as a diversity advocate.  I have to admit, when I saw she was going to speak at Beyond the Code, I immediately pulled out my credit card and purchased a conference ticket.  She is one of my tech heroes.  Not only did she build the build and release pipeline at Slack from the ground up, she is an amazing writer and advocate for change in the tech industry.   I highly recommend reading everything she has written on Medium, her chapter in Lean Out and all her discussions on twitter.  So fantastic.

Her talk at the conference was "Building a Diverse Corporate Culture: Diversity and Inclusion in Tech".  She talked about how literally thousands of companies say they value inclusion and diversity.  However, few talk about what they are willing to give up in order to achieve it.  Are you willing to give up your window seat with a great view?  Something else so that others can be paid fairly?  She mentioned that change is never free.  People need both mentorship and sponsorship in order to progress in their career.

I really liked her discussion around hiring and referrals.  She stated that when you hire people you already know, you're probably excluding equally or better qualified people that you don't know.  By default, women of colour are underpaid.

Pay gap for white women, African American women and Hispanic women compared to a white man in the United States.

Some companies have referral systems that give larger referral bonuses for referring people who are underrepresented in tech; she gave the example of Intel, which has this in place.  This is a way to incentivize your referral system so you don't just hire all your white friends.

The average white American has 91 white friends and one black friend so it's not very likely that they will refer non-white people. Not sure what the numbers are like in Canada but I'd guess that they are quite similar.
  
In addition, don't ask people to work for free, to speak at conferences or do diversity and inclusion work.  Her words were "We can't pay rent with exposure".

Spend time talking to diversity and inclusion experts.  There are people that have spent their entire lives conducting research in this area and you can learn from their expertise.  Meritocracy is a myth; we are just lucky to be in the right place at the right time.  She mentioned that her colleague Duretti Hirpa at Slack points out the need for accomplices, not allies: people who will actually speak up for others, so that people feeling pain or facing a difficult work environment don't have to do all the work of fighting for change.

In most companies, there aren't escalation paths for human issues either.  If a person is making sexist or racist remarks, shouldn't that be a firing offense? 

If people were really working hard on diversity and inclusion, we would see more women and people of colour on boards and in leadership positions.  But we don't.

She closed with a quote from Beyonce:

"If everything was perfect, you would never learn and you would never grow"

💜💜💜

The next talk I attended was by Coraline Ada Ehmke, who is an application engineer at GitHub.  Her talk was about the "Broken Promise of Open Source".  Open source has the core principles of the free exchange of ideas, success through collaboration, shared ownership and meritocracy.


However, meritocracy is a myth.  Currently, only 6% of GitHub users are women.  The environment can be toxic, which drives a lot of people away.  She mentioned that we don't have diversity numbers for open source beyond gender, but GitHub plans to do a survey soon to try to acquire more data.


Gabriel Fayant of Assembly of Seven Generations gave a talk entitled "Walking in Both Worlds: traditional ways of being and the world of technology".  I found this quite interesting; she talked about traditional ceremonies and how they promote the idea of living in the moment, and thus how looking at your phone during a drum ceremony isn't living the full experience.  A question from an audience member who works in the engineering faculty at the University of Toronto was how we can work with indigenous communities to share our knowledge of technology and make youth producers of tech, not just consumers.

The next talk was by Sandi Metz, entitled "Madame Santi tells your future".  This was a totally fascinating look at the history of printing text from scrolls all the way to computers.

She gave the same talk at another conference earlier, so you can watch it here.  It described the progression of printing technology from 7000 years ago until today.  Each new technology disrupted the previous one, and it was difficult for those who worked on the previous technology to make the jump to working on the new one.

So according to Sandi, what is your future?

The last talk I attended was by Sabrina Geremia of Google Canada.  She talked about the factors that encourage a girl to consider computer science (encouragement, career perception, self-perception and academic exposure).


I found that this talk was interesting but it focused a bit too much on the pipeline argument - that the major problem is that girls are not enrolling in CS courses.  If you look at all the problems with environment, culture, lack of pay equity and opportunities for promotion due to bias, maybe choosing a career where there is more diversity is a better choice.  For instance, law, accounting and medicine have much better numbers for these issues, despite there still being an imbalance.

At the end of the day, there was a panel to discuss diversity issues:

Moderator: Ariti Sharma, Shopify, Panelists: Mohammed Asaduallah, Format, Katie Krepps, Capital One Canada, Lateesha Thomas, Dev Bootcamp, Ramya Raghavan, Google, Kara Melton, TWG, Gladstone Grant, Microsoft Canada
Some of my notes from the panel

Compared to the previous two iterations of this conference, it seemed that this time it focused a lot more on solutions to have more diversity and inclusion in your company. The previous two conferences I attended seemed to focus more on technical talks by diverse speakers.


As a side note, there were a lot of Shopify folks in attendance because they ran the conference.  They sent a bus of people from their head office in Ottawa to attend it.  I was really struck by how diverse some of the teams were.  I met a group of women who described themselves as a team of "five badass women developers" 💯 As someone who has been the only woman on her team for most of her career, this was beautiful to see and gave me hope for the future of our industry.   I've visited the Ottawa Shopify office several times (Mr. Releng works there) and I know that the representation in their office doesn't match the demographics of the Beyond the Code attendees, which tended to be more women and people of colour.  But still, it is refreshing to see a company making a real effort to make their culture inclusive.  I've read that it is easier to make your culture inclusive from the start, rather than trying to make difficult culture changes years later when your teams are all homogeneous.  So kudos to them for setting an example for other companies.

Thank you Shopify for organizing this conference, I learned a lot and I look forward to the next one!

September 30, 2016 01:10 PM

September 19, 2016

Chris Cooper (coop)

RelEng & RelOps Weekly highlights - September 19, 2016

Welcome back to our *cough*weekly*cough* updates!

Modernize infrastructure:

Amy and Alin decommissioned all but 20 of our OS X 10.6 test machines, and those last few will go away when we perform the next ESR release. The next ESR release corresponds to Firefox 52, and is scheduled for March next year.

Improve Release Pipeline:

Ben finally completed his work on Scheduled Changes in Balrog. With it, we can pre-schedule changes to Rules, which will help minimize the potential for human error when we ship, and make it unnecessary for RelEng to be around just to hit a button.

Lots of other good Balrog work has happened recently too, which is detailed in Ben’s blog post.

Improve CI Pipeline:

Windows TaskCluster builders were split into level-1 (try) and level-3 (m-i, m-c, etc) worker types with sccache buckets secured by level.

Windows 10 AMI generators were added to automation in preparation for Windows 10 testing on TaskCluster. We’ve been looking to switch from testing on Windows 8 to Windows 10, as Windows 8 usage continues to decline. The move to TaskCluster seems like a natural breakpoint to make that switch.

Dustin’s massive patch set to enable in-tree config of the various build kinds landed last week. This was no small feat. Kudos to him for the testing and review stamina it took to get that done. Those of us working to migrate nightly builds to TaskCluster are now updating – and simplifying – our task graphs to leverage his work.

Operational:

After work to fix some bugs and make it reliable, Mark re-enabled the cron job that generates our Windows 7 AWS AMIs each night.

Now that many of the Windows 7 tests are being run in AWS, Amy and Q reallocated 20 machines from Windows 7 testing to XP testing to help with the load. We are reallocating 111 additional machines from Windows 7 to XP and Windows 8 in the upcoming week.

Amy created a template for postmortems and created a folder where all of Platform Operations can consolidate their postmortem documents.

Jake and Kendall took swift action on the TCP Challenge ACK side-channel vulnerability. This has been fixed with a sysctl workaround on all Linux hosts and instances.

Jake pushed a new version of the mig-agent client which was deployed across all Linux and OS X platforms.

Hal implemented the new GitHub feature to require two-factor authentication for several Mozilla organizations on GitHub.

Release:

Rail has automated re-generating our SHA-1 signed Windows installers for Firefox, which are served to users on old versions of Windows (XP, Vista). This means that users on those platforms will no longer need to update through a SHA-1 signed, watershed release (we were using Firefox 43 for this) before updating to the most recent version. This will save XP/Vista users some time and bandwidth by creating a one-step update process for them to get the latest Firefox.
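
For context, the watershed logic this removes worked roughly like the sketch below (illustrative, not Balrog's actual rule engine). With the freshly re-signed installers, old XP/Vista clients take the one-step path directly.

# Illustrative only: clients below the watershed release had to update
# to it first, then update again from there to the latest release.
WATERSHED = (43, 0)   # the SHA-1 signed release old Windows clients trusted
LATEST = (49, 0)      # example "latest" version for this sketch

def next_update(client_version):
    if client_version < WATERSHED:
        return WATERSHED   # two-step path via the watershed
    return LATEST          # one-step path straight to latest

print(next_update((41, 0)))  # (43, 0), then (49, 0) on the next check
print(next_update((45, 0)))  # (49, 0)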

See you next *cough*week*cough*!

September 19, 2016 06:38 PM

August 22, 2016

Hal Wine (hwine)

Py Bay 2016 - a First Report

Py Bay 2016 - a First Report

Read more...

August 22, 2016 07:00 AM

August 08, 2016

Chris Cooper (coop)

RelEng & RelOps Weekly highlights - August 8, 2016

Two of our interns finished up their terms recently. The Toronto office feels a little emptier without them, or at least that’s how I imagine it since I don’t actually work in Toronto.

Modernize infrastructure:

Trunk-based Fennec debug builds+tests have been disabled in buildbot and moved to tier 1 in TaskCluster.

Improve Release Pipeline:

Mihai finished migrating release sanity to Release Promotion. Release sanity helps to catch release issues that are susceptible to human process error. It has been live since 48.0b9, and has already helped catch a few issues that could have delayed the releases!

Improve CI Pipeline:

Rob created an automated build using taskcluster-github to create new TaskCluster Windows AMIs whenever the worker type manifest (or OpenCloudConfig) is updated.

David Burns posted a status update to dev.platform about the ongoing build system improvements.

Operational:

Balrog has been successfully migrated to the new Docker-based CloudOps infrastructure, which will make development and maintenance much easier.

We’ve ended the experiment of providing Firefox code mirrors on bitbucket. This will allow us to shut down some legacy infrastructure shortly.

We’ve shut down legacy vcs-sync!

Nearly everything that was indexed on MXR has been indexed on DXR! A few stragglers remain, but they are very low-use trees. Additionally, mxr.mozilla.org now brings up an interstitial page which offers a link to DXR instead of the hardhat.

RIP Manny Bothans: Many betas died to bring you e10s.

Release:

We’ve released Firefox 48 which brings many long-awaited features to our users.

Outreach

Our amazing interns Francis Kang and Connor Sheehan have finished their tenure with us. They did grace us with their intern presentations before they left, which you can replay on Air Mozilla:

Thanks to both Connor and Francis for their hard work, and thanks also to their mentors Rail and Kim for helping them become such valuable contributors.

Hiring:

We’ve filled the two release engineering positions we had advertised. It felt like we had an embarrassment of riches at some points during that hiring process. Thanks to everyone who took the time to apply.

See you next week!

August 08, 2016 10:36 PM

August 02, 2016

Armen Zambrano G. (@armenzg)

Usability improvements for Firefox automation initiative - Status update #2

In this update, we will look at the progress made since our initial update.

A reminder that this quarter’s main focus is on:

For all bugs and priorities you can check out the project management page for it:
https://wiki.mozilla.org/EngineeringProductivity/Projects/Debugging_UX_improvements

Debugging tests on interactive workers
Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1262260

Accomplished recently:

Upcoming:

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1285582
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=1288827
[3] https://bugzilla.mozilla.org/show_bug.cgi?id=1289879

Thunder Try - Improve end to end times on try

Project #1 - Artifact builds on automation
Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1284882

No news for this edition and probably the next one.


Project #2 - S3 Cloud Compiler Cache
Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1280641

Accomplished recently:


Project #3 - Metrics
Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1286856

Accomplished recently:

Upcoming:

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1242017
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=1258861

Other

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1287604
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=1272083
[3] https://bugzilla.mozilla.org/show_bug.cgi?id=1247168


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

August 02, 2016 07:48 PM

July 29, 2016

Kim Moir (kmoir)

Ottawa Python Authors Meetup: Artificial Intelligence with Python

Last night, I attended my first Ottawa Python Authors Meetup.  It was the first time that I had attended despite wanting to for a long time.  (Mr. Releng also works with Python and thus every time there's a meetup, we discuss who gets to go and who gets to stay home and take care of little Releng.  It depends on whose work interests the talk is more relevant to.)

The venue was across the street from Confederation Park aka land of Pokemon.


I really enjoyed it.  The people I chatted with were very friendly and welcoming. Of course, I ran into some people I used to work with, as seems to happen at any tech event in Ottawa. Nice to catch up!

The venue had the Canada Council for the Arts as a tenant, thus the quintessentially Canadian art.


The speaker that night was Emily Daniels, a developer from Halogen Software, who spoke on Artificial Intelligence with Python (slides here, github repo here).  She mentioned that she writes Java during the day but works on fun projects in Python at night.  She started the talk by going through some examples of artificial intelligence on the web.  Perhaps the most interesting one I found was a recurrent neural network called Benjamin which generates movie script ideas and was trained on existing sci-fi movies and movie scripts.  A short film called Sunspring was made from one of the generated scripts; the dialogue is kind of stilted, but it's an interesting concept.

After the examples, Emily moved on to how it all works.

Deep learning is a type of machine learning that derives meaning from data using a hierarchy of multiple layers that mimics the neural networks of our brain.

She then spoke about a project she wrote to create generative poetry from an RNN (recurrent neural network).  It was based on an RNN tutorial that she heavily refactored to meet her needs.  She went through the code that she developed to generate artificial prose from the works of H.G. Wells and Jane Austen.  She talked about how she cleaned up the text to remove EOL delimiters, page breaks, chapter numbers and so on.  It then took a week to train the model with the data.
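
The cleanup step might look something like this rough sketch (my guess at the kind of preprocessing described, not Emily's actual code):

import re

def clean_text(text):
    text = text.replace("\f", " ")                          # page breaks
    text = re.sub(r"(?m)^CHAPTER [IVXLC\d]+.*$", "", text)  # chapter headings
    text = re.sub(r"(?m)^\s*\d+\s*$", "", text)             # bare page numbers
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)            # unwrap hard EOLs
    return re.sub(r"[ \t]+", " ", text).strip()

sample = "CHAPTER I\nIt is a truth universally\nacknowledged...\n\n42\n"
print(clean_text(sample))  # "It is a truth universally acknowledged..."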

She then talked about another example which used data from Jack Kerouac and Virginia Woolf novels; she posts some of the results to Twitter.


She also created a Twitter account which posts generated text from her RNN that consumes the content of Walt Whitman and Emily Dickinson. (I should mention at this point that she chose these authors for her projects because the copyrights have expired on these works and they are available from Project Gutenberg.)

After the talk, she fielded a number of audience questions, which were really insightful. There were discussions on the inherent bias in the data because it was written by humans, who can be sexist and racist.  She mentioned that she doesn't post the results of the model automatically to Twitter because some of them are really inappropriate, since the model learned from text written by humans who are inherently biased.

One thing I found really interesting is that Emily mentioned that she felt a need to ensure that the algorithms and data continue to exist, and that they were faithfully backed up.  I began to think about all the Amazon instances that Mozilla releng had automatically killed that day as our capacity had peaked and declined, and of the great joy I feel ripping out code when we deprecate a platform.  I personally feel no emotional attachment to bringing down machines or deleting unused code.
 
Perhaps the sense that these recurrent neural networks and the data they create need a caretaker is related to the fact that the algorithms output text that is a simulacrum of the work of an author we enjoy reading.  And perhaps that is why we aren't as attached to an ephemeral pool of build machines as we are to our phones: the phone provides a sense of human connection to the larger world when we may be sitting alone.

Thank you Emily for the very interesting talk, to the Ottawa Python Authors Group for organizing the meetup, and Shopify for sponsoring the venue.  Looking forward to the next one!

Further reading

July 29, 2016 07:06 PM

Eclipse Committer Emeritus

I received this very kind email in my inbox this morning.

"David Williams has expired your commit rights to the
eclipse.platform.releng project.  The reason for this change is:

We have all known this day would come, but it does not make it any easier.
It has taken me four years to accept that Kim is no longer helping us with
Eclipse. That is how large her impact was, both on myself and Eclipse as a
whole. And that is just the beginning of why I am designating her as
"Committer Emeritus". Without her, I humbly suggest that Eclipse would not
have gone very far. Git shows her active from 2003 to 2012 -- longer than
most! She is (still!) user number one on the build machine. (In Unix terms,
that is UID 500). The original admin, when "Eclipse" was just the Eclipse
Project.

She was not only dedicated to her job as a release engineer she was
passionate about doing all she could to make other committer's jobs easier
so they could focus on their code and specialties. She did (and still does)
know that release engineering is a field of its own; a specialized
profession (not something to "tack on" at the end that just anyone can do),
and good, committed release engineers are critical to the success of any
project.

For anyone reading this that did not know Kim, it is not too late: you can
follow her blog at

http://relengofthenerds.blogspot.com/

You will see that she is still passionate about release engineering and
influential in her field.

And, besides all that, she was (I assume still is :) a well-rounded, nice
person, that was easy to work with! (Well, except she likes running for
exercise. :)

Thanks, Kim, for all that you gave to Eclipse and my personal thanks for
all that you taught me over the years (and I mean before I even tried to
fill your shoes in the Platform).

We all appreciate your enormous contribution to the success of Eclipse and
happy to see your successes continuing.

To honor your contributions to the project, David Williams has nominated
you for Committer Emeritus status."


Thank you David! I really appreciate your kind words.  I learned so much working with everyone in the Eclipse community.  I had the intention to contribute to Eclipse when I left IBM, but really felt that I had given all I had to give.  Few people have the chance to contribute to two fantastic open source communities during their career.  I'm lucky to have had that opportunity.


My IBM friends made this neat Eclipse poster when I left.  The Mozilla dino displays my IRC handle.

July 29, 2016 07:06 PM

July 21, 2016

Armen Zambrano G. (@armenzg)

Mozci and pulse actions contributions opportunities

We've recently finished a season of feature development on pulse_actions, adding TaskCluster support for adding new jobs on Treeherder.

I'm now looking at what optimizations or features are left to complete. If you would like to contribute feel free to let me know.

Here's some highlighted work (based on pulse_actions issues and bugs):
This will help us save money in Heroku since using Buildapi + buildjson files is memory hungry and requires us to use bigger Heroku nodes.
This is important to help us change the behaviour of the Heroku app without having to commit any code. I've used this in the past to modify the logging level when debugging an issue.

This is also useful if we want to have different pipelines in Heroku. 
Having Heroku pipelines helps us to test different versions of the software.
This is useful if we want to have a version running from 'master' against the staging version of Treeherder.
It would also help contributors to have a version of their pull requests running live.
We don't have any tests running. We need to determine how to run a minimal set of tests to have some confidence in the product.

This needs integration tests of Pulse messages.
The comment in the bug is rather accurate, and it shows that there are many small things that need fixing.
Manual backfilling uses Buildapi to schedule jobs. If we switched to scheduling via TaskCluster/Buildbot-bridge we would get better results since we can guarantee proper scheduling of a build + associated dependent jobs. Buildapi does not give us this guarantee. This is mainly useful when backfilling PGO test and talos jobs.

If instead you're interested on contributing to mozci you can have a look at the issues.


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

July 21, 2016 07:47 PM

July 19, 2016

Armen Zambrano G. (@armenzg)

Usability improvements for Firefox automation initiative - Status update #1

The developer survey conducted by Engineering Productivity last fall indicated that debugging test failures that are reported by automation is a significant frustration for many developers. In fact, it was the biggest deficit identified by the survey. As a result, the Engineering Productivity Team (aka A-Team) is working on improving the user experience for debugging test failures in our continuous integration and speeding up the turnaround for Try server jobs.

This quarter’s main focus is on:

For all bugs and priorities you can check out the project management page for it:
https://wiki.mozilla.org/EngineeringProductivity/Projects/Debugging_UX_improvements

In this email you will find the progress we’ve made recently. In future updates you will see a delta from this email.

P.S. These status updates will be fortnightly.


Debugging tests on interactive workers
Accomplished recently:

Upcoming:


Thunder Try - Improve end to end times on try

Project #1 - Artifact builds on automation
Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1284882

Accomplished recently:

Upcoming:


Project #2 - S3 Cloud Compiler Cache
Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1280641

Accomplished recently:

Upcoming:


Project #3 - Metrics
Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1286856

Accomplished recently:

Upcoming:

Other
Upcoming:


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

July 19, 2016 12:44 PM

July 18, 2016

Hal Wine (hwine)

Legacy vcs-sync is dead! Long live vcs-sync!

tl;dr: No need to panic - modern vcs-sync will continue to support the gecko-dev & gecko-projects repositories.

Today’s the day to celebrate! No more bash scripts running in screen sessions providing dvcs conversion experiences. Woot!!!

I’ll do a historical retrospective in a bit. Right now, it’s time to PARTY!!!!!

July 18, 2016 07:00 AM

July 12, 2016

Rail Alliev (rail)

Thoughts about partial updates on demand

Firefox has its own built-in update system. The update system supports 2 types of updates: complete and incremental. Completes can be applied to any older version, unless there are incompatible changes in the MAR format. Incremental updates can be applied only to the release they were generated for.

Usually for the beta and release channels we generate incremental updates against 3-4 versions. This way we try to minimize bandwidth consumption for our end users and increase the number of users on the latest version. For Nightly and Developer Edition builds we generate 5 incremental updates using funsize.

Both methods assume that we know ahead of time what versions should be used for incremental updates. For releases and betas we use ADI stats to be as precise as possible. However, these methods are static and don't use real-time data.

The idea to generate incremental updates on demand has been around for ages. Some of the challenges are:

  • Acquiring real-time (or close to real-time) data for making decisions on incremental update versions
  • Size of the incremental updates. If the size is very close to the size of the corresponding complete, there is no reason to serve incremental updates. One of the reasons is that the updater tries to use the incremental update first, and then falls back to the complete in case something goes wrong. In this case the updater downloads both the incremental and the complete.

Ben and I talked about this today and to recap some of the ideas we had, I'll put them here.

  • We still want to "pre-seed" as many incremental updates as possible before we publish any updates
  • Whenever Balrog serves a complete-only update, it should generate a structured log entry and/or an event to be consumed by some service, containing all the information required to generate an incremental update (see the sketch after this list)
  • The "new system" should be able to decide if we want to discard incremental update generation, based on the size. These decisions should be stored, so we don't try to generate incremental update again next time. This information may be stored in Balrog to prevent further events/logs.
  • Before publishing an incremental update, we should test whether it can be applied without issues, similar to the update verify tests we run for releases, but without hitting Balrog. After it passes this test, we can publish it to Balrog and check that Balrog returns the expected XML with the partial info in it.
  • Minimize the number of completes served if we plan to generate incremental updates. One of the ideas was to modify the client to support responses like "Come back in 5 minutes, I may have something for you!"
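
To make the structured-event idea more concrete, here is a minimal sketch of what such a log entry could look like. This is only an illustration: the event and field names are assumptions, not Balrog's actual schema.

import json
import logging

log = logging.getLogger('balrog.partials')

def log_complete_only_update(product, channel, from_version, to_version,
                             platform, locale):
    # Hypothetical event emitted whenever Balrog serves a complete-only
    # update; a downstream service would aggregate these to decide which
    # incremental updates are worth generating.
    log.info(json.dumps({
        'event': 'complete_only_update',
        'product': product,
        'channel': channel,
        'from_version': from_version,
        'to_version': to_version,
        'platform': platform,
        'locale': locale,
    }))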

The only remaining thing is to implement all these changes. :)

July 12, 2016 07:03 PM

Chris Cooper (coop)

RelEng & RelOps Weekly highlights - July 12, 2016

Q3 planning is in full swing. Chief priorities continue to be the migration to TaskCluster (TC) from buildbot, and release process improvements.

Modernize infrastructure:

Aki finished porting configman to python3 (merged!). https://github.com/mozilla/configman/pull/163

Windows try builds were enabled on TC Windows 2012 worker types in staging (allizom) (win32/win64, opt/debug). If all goes well, this will propagate to production in the coming days. This is the first set of non-Linux tasks we’ve had running reliably in TC, which is obviously a huge step in our migration away from buildbot.

Improve Release Pipeline:

Aki wrote dephash to pin python requirements to versions+hashes: http://escapewindow.dreamwidth.org/247093.html https://github.com/escapewindow/dephash https://pypi.python.org/pypi/dephash

Aki got signtool working with py3 (reproducibly this time), using requests. https://github.com/escapewindow/signtool https://pypi.python.org/pypi/signtool

We implemented a short cache for Balrog rules, which greatly reduced load on the database. https://bugzilla.mozilla.org/show_bug.cgi?id=1111032

Kim stood up beta builds with addon signing preferences disabled. This allows addon developers to test their addons prior to signing on release-equivalent builds. https://bugzilla.mozilla.org/show_bug.cgi?id=1135781

Improve CI Pipeline:

Francis disabled valgrind buildbot builds. Turning stuff off in buildbot #feelsgoodman. https://bugzilla.mozilla.org/show_bug.cgi?id=1278611

Kim enabled Android x86 builds on trunk running in TC. https://bugzilla.mozilla.org/show_bug.cgi?id=1174206

See you next week!

July 12, 2016 04:01 PM

Hal Wine (hwine)

End of an Experiment

tl;dr: We’ll be shutting down the Firefox mirrors on Bitbucket.

A long time ago we started an experiment to see if there was any support for developing Mozilla products on social coding sites. Well, the community-at-large has spoken, with the results many predicted:

  • YES!!! when the social coding site is GitHub
  • No, when the social coding site is Bitbucket
Read more...

July 12, 2016 07:00 AM

June 30, 2016

Armen Zambrano G. (@armenzg)

Adding new jobs to Treeherder is now transparent

As of today, when you add new jobs to a push on Treeherder (among other actions), a job named 'Sch' (for scheduling) will get scheduled.

This gives two main advantages:


You can see in this screenshot what the job looks like:
In this push you can see three 'Sch' jobs. Each one is for a different action taken.
  1. Backfill a job - Link to log
  2. Add new jobs - Link to log
  3. Trigger all talos jobs - Link to log
You can find the link to file a bug after you do this:

  1. Click on job (the job's panel will load at the bottom of the page)
  2. Click on "Job details"
  3. You will see "File bug" (see the screenshot below)
Thanks for reading and feel free to file bugs to make the output more understandable!



Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

June 30, 2016 10:41 PM

June 29, 2016

Armen Zambrano G. (@armenzg)

Q2 retropective

This quarter has been a tough one for me. It has been a mix of organizing people, projects and implementing prototypes. It’s easy to forget what you have worked on, especially when plans and ideas change along the way.

Beware! This blog post may be a little unstructured, especially towards the end.

The quarter as a whole
Before the quarter commenced, I was planning on adding TaskCluster jobs to Treeherder as my main objective. However, that quickly changed as GSoC submissions came around: we realized that we had two to three candidates interested in helping us, which turned into creating three potential projects. Two projects that came out of it are "refactor SETA and enable TaskCluster support" and "add unscheduled TaskCluster jobs to Treeherder." Both of these bring us closer to feature parity with Buildbot. Once this was settled, a lot of conversations happened with the TaskCluster team to make sure that Dustin's work and ours lined up well (he’s worked on refactoring how the ‘gecko decision task’ schedules tasks).

Around this time a project was completed, which made creating TaskCluster clients an excellent self-serve experience. This was key for me in reducing the number of times I had to interrupt the TaskCluster team to request that they adjust my clients.

Also around this time, another change was deployed that allowed developers’ credentials to assume almost all scopes required to schedule TaskCluster tasks without an intermediary tool with power scopes. This is very useful for creating tools which allow developers to schedule tasks directly from the command line with their personal credentials. I created some prototypes to prove this concept. Here’s a script to schedule a Linux 64 task. Here’s the blog post explaining it.
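
As a rough idea of what such a script looks like, here is a sketch using the TaskCluster Python client. The worker type, image and metadata are illustrative only, the client options format may differ between client versions, and credentials are assumed to be exported as TASKCLUSTER_CLIENT_ID / TASKCLUSTER_ACCESS_TOKEN.

import datetime
import os

import taskcluster

queue = taskcluster.Queue({
    'credentials': {
        'clientId': os.environ['TASKCLUSTER_CLIENT_ID'],
        'accessToken': os.environ['TASKCLUSTER_ACCESS_TOKEN'],
    },
})
task_id = taskcluster.slugId()
now = datetime.datetime.utcnow()
queue.createTask(task_id, {
    'provisionerId': 'aws-provisioner-v1',
    'workerType': 'b2gtest',  # illustrative worker type
    'created': taskcluster.stringDate(now),
    'deadline': taskcluster.stringDate(now + datetime.timedelta(hours=1)),
    'payload': {
        'image': 'ubuntu:14.04',
        'command': ['/bin/bash', '-c', 'echo "hello from Linux64"'],
        'maxRunTime': 600,
    },
    'metadata': {
        'name': 'Linux64 demo task',
        'description': 'Task scheduled from the command line',
        'owner': 'you@example.com',
        'source': 'https://example.com/schedule_task.py',
    },
})
print(task_id)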

During this quarter, Dustin refactored how scheduling of tasks is accomplished in the gecko decision task. For the "adding new TaskCluster jobs" project, this was a risk, as it could have made scheduling tasks either more complicated or impossible without significant changes. After many discussions, it seemed that we were fine to proceed as planned.

Out of these conversations a new idea was born: "action tasks". The beauty of action tasks is that they're atomic units of processing that can make complicated scheduling requests very easy. You can read martianwars’ blog post (under “What are action tasks?”) to learn more about them. Action tasks are defined in-tree and schedule task labels for a push. The project as originally defined had a very big scope (goal: make Treeherder find action task definitions and integrate them in the UI), and some technical issues were encountered (e.g. limited scopes granted to developers; this is not an issue anymore) that made me concerned more would follow. My focus switched to making pulse_actions requests visible on Treeherder. When switching deliverables I did not realize that we could have taken just the first part of the project and implemented that. In any case, a reduced scope is being implemented by martianwars: after Dustin's refactoring, we need to put our graph through "optimizations" that determine which nodes need to be removed from the graph. That optimization code lives in-tree, which makes action tasks the right solution, since they run the in-tree logic where the optimizations live.

While working on my deliverables, I also discovered various things and created various utility projects.

New projects

TaskCluster developer experiments (prototypes):
In this repository I created various prototypes that make scheduling tasks from the command line extremely easy.


Replay pulse messages (new package):
This project allows you to dump a Pulse queue into a file. It also allows you to "replay" the messages and process them as if feeding from a real queue. This was crucial for testing code changes to pulse_actions.


Treeherder submitter (new package):
Treeherder submitter is a Python package which makes it very easy to submit jobs to Treeherder. I made pulse_actions submit jobs to Treeherder with this package. I had to write this package since the Treeherder client allowed me to shoot myself in the foot. Various co-workers have written similar code; however, none of it was repackaged for reuse by others (understandably). This package helps you use the minimum amount of data necessary to submit jobs and helps you transition between job states (pending vs. running vs. completed).

Unfortunately, I have not had time to upstream this code, with the end of the quarter being upon me; however, I would like to upstream it if the team is happy with it. On the other hand, Treeherder will soon be switching to a Pulse-based submission model for ingestion, and the Python client might not be used anymore.


TaskCluster S3 uploader (new package):
This package allows you to upload files to an S3 bucket with a 31-day expiration.
It takes advantage of a cool feature that the TaskCluster team provides: an API from which you request temporary S3 credentials for the bucket associated with your TaskCluster credentials. You can then upload to your assigned prefix within that bucket. It is extremely easy!
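
In case it helps picture the flow, here is a sketch of the two steps using the TaskCluster Python client and boto3. The bucket and prefix names are made up, and the exact response shape is my reading of the auth API, so treat this as an outline rather than a definitive recipe.

import os

import boto3
import taskcluster

# Step 1: ask the TaskCluster auth service for temporary S3 credentials.
auth = taskcluster.Auth({
    'credentials': {
        'clientId': os.environ['TASKCLUSTER_CLIENT_ID'],
        'accessToken': os.environ['TASKCLUSTER_ACCESS_TOKEN'],
    },
})
response = auth.awsS3Credentials('read-write', 'mozilla-releng-scratch', 'armenzg/')
creds = response['credentials']

# Step 2: use the temporary credentials with boto3 to upload to your prefix.
session = boto3.Session(
    aws_access_key_id=creds['accessKeyId'],
    aws_secret_access_key=creds['secretAccessKey'],
    aws_session_token=creds['sessionToken'],
)
session.client('s3').upload_file(
    'report.html', 'mozilla-releng-scratch', 'armenzg/report.html')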

Discoveries









Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

June 29, 2016 07:29 PM

June 09, 2016

Chris AtLee (catlee)

PyCon 2016 report

I had the opportunity to spend last week in Portland for PyCon 2016. I'd like to share some of my thoughts and some pointers to good talks I was able to attend. The full schedule can be found here and all the videos are here.

Monday

Brandon Rhodes' Welcome to PyCon was one of the best introductions to a conference I've ever seen. Unfortunately I can't find a link to a recording... What I liked about it was that he made everyone feel very welcome to PyCon and to Portland. He explained some of the simple (but important!) practical details like where to find the conference rooms, how to take transit, etc. He noted that for the first time, they have live transcriptions of the talks being done and put up on screens beside the speaker slides for the hearing impaired.

He also emphasized the importance of keeping questions short during Q&A after the regular sessions. "Please form your question in the form of a question." I've been to way too many Q&A sessions where the person asking the question took the opportunity to go off on a long, unrelated tangent. For the most part, this advice was followed at PyCon: I didn't see very many long winded questions or statements during Q&A sessions.

Machete-mode Debugging

(abstract; video)

Ned Batchelder gave this great talk about using Python's language features to debug problematic code. He ran through several examples of tricky problems that could come up, and how to use things like monkey patching and the debug trace hook to find out where the problem is. One piece of advice I liked was when he said that it doesn't matter how ugly the code is, since it's only going to last 10 minutes. The point is to get the information you need out of the system the easiest way possible, and then you can undo your changes.
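
For instance, a throwaway trace hook along these lines can answer "who calls what?" without touching the code under suspicion; ugly, but you delete it ten minutes later. (My own illustration of the technique, not code from the talk.)

import json
import sys

def tracer(frame, event, arg):
    # Print every function call with its file and line so you can see
    # exactly where execution goes; noisy, but disposable.
    if event == 'call':
        code = frame.f_code
        print('call: %s (%s:%d)' % (code.co_name, code.co_filename, frame.f_lineno))
    return tracer

sys.settrace(tracer)
json.dumps({'who': 'calls what?'})  # any code you want to observe
sys.settrace(None)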

Refactoring Python

(abstract; video)

I found this session pretty interesting. We certainly have lots of code that needs refactoring!

Security with object-capabilities

(abstract; video; slides)

I found this interesting, but a little too theoretical. Object capabilities are a completely different approach from access control lists for modelling security and permissions. It was hard for me to see how we could apply this to the systems we're building.

Awaken your home

(abstract; video)

A really cool intro to the Home Assistant project, which integrates all kinds of IoT type things in your home. E.g. Nest, Sonos, IFTTT, OpenWrt, light bulbs, switches, automatic sprinkler systems. I'm definitely going to give this a try once I free up my raspberry pi.

Finding closure with closures

(abstract; video)

A very entertaining session about closures in Python. Does Python even have closures? (yes!)
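
The canonical demonstration (mine, not the talk's) is a function that keeps state alive after its enclosing scope has returned:

def make_counter():
    count = 0
    def increment():
        nonlocal count  # bind to the enclosing scope (Python 3)
        count += 1
        return count
    return increment

counter = make_counter()
assert counter() == 1
assert counter() == 2  # 'count' lives on in the closure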

Life cycle of a Python class

(abstract; video)

Lots of good information about how classes work in Python, including some details about meta-classes. I think I understand meta-classes better after having attended this session. I still don't get descriptors though!

(I hope Mike learns soon that __new__ is pronounced "dunder new" and not "under under new"!)
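
For reference, here is a tiny example of the kind of machinery discussed: a metaclass whose __new__ runs when the class statement executes, long before any instance exists. This is my own illustration, not code from the session.

class PluginMeta(type):
    plugins = {}

    def __new__(mcls, name, bases, namespace):
        # Runs once per 'class' statement; instances come much later.
        cls = super().__new__(mcls, name, bases, namespace)
        if bases:  # don't register the abstract base itself
            PluginMeta.plugins[name.lower()] = cls
        return cls

class Plugin(metaclass=PluginMeta):
    pass

class Nightly(Plugin):
    pass

assert PluginMeta.plugins == {'nightly': Nightly}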

Deep learning

(abstract; video)

Very good presentation about getting started with deep learning. There are lots of great libraries and pre-trained neural networks out there to get started with!

Building protocol libraries the right way

(abstract; video)

I really enjoyed this talk. Cory Benfield describes the importance of keeping a clean separation between your protocol parsing code, and your IO. It not only makes things more testable, but makes code more reusable. Nearly every HTTP library in the Python ecosystem needs to re-implement its own HTTP parsing code, since all the existing code is tightly coupled to the network IO calls.
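
The idea, as I understood it, looks roughly like this: a parser that consumes bytes and yields messages, with no sockets in sight. (A toy illustration of the sans-IO pattern, not code from the talk.)

class LineProtocol:
    """Parses newline-delimited messages; knows nothing about sockets."""

    def __init__(self):
        self._buffer = b''

    def receive_data(self, data):
        # Feed raw bytes in, get complete messages out; any IO layer
        # (blocking sockets, asyncio, a test) can drive this.
        self._buffer += data
        *messages, self._buffer = self._buffer.split(b'\n')
        return messages

proto = LineProtocol()
assert proto.receive_data(b'HELLO\nWOR') == [b'HELLO']
assert proto.receive_data(b'LD\n') == [b'WORLD']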

Tuesday

Guido's Keynote

(video)

Some interesting notes in here about the history of Python, and a look at what's coming in 3.6.

Click

(abstract; video)

An intro to the click module for creating beautiful command line interfaces.

I like that click helps you to build testable CLIs.
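
For example, click ships a CliRunner that invokes your command in-process, so the CLI can be tested without spawning subprocesses. (A minimal sketch of mine, not from the talk.)

import click
from click.testing import CliRunner

@click.command()
@click.option('--count', default=1, help='Number of greetings.')
def hello(count):
    for _ in range(count):
        click.echo('Hello!')

def test_hello():
    result = CliRunner().invoke(hello, ['--count', '2'])
    assert result.exit_code == 0
    assert result.output == 'Hello!\nHello!\n'

test_hello()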

HTTP/2 and asynchronous APIs

(abstract; video)

A good introduction to what HTTP/2 can do, and why it's such an improvement over HTTP/1.x.

Remote calls != local calls

(abstract; video)

Really good talk about failing gracefully. He covered some familiar topics like adding timeouts and retries to things that can fail, but also introduced to me the concept of circuit breakers. The idea with a circuit breaker is to prevent talking to services you know are down. For example, if you have failed to get a response from service X the past 5 times due to timeouts or errors, then open the circuit breaker for a set amount of time. Future calls to service X from your application will be intercepted, and will fail early. This can avoid hammering a service while it's in an error state, and works well in combination with timeouts and retries of course.
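
A minimal circuit breaker might look like the following (my sketch of the concept, not code from the talk):

import time

class CircuitBreaker:
    """Fail fast after repeated errors instead of hammering a dead service."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                raise RuntimeError('circuit open: failing fast')
            self.opened_at = None  # half-open: let one call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()  # open the breaker
            raise
        self.failures = 0  # success closes the breaker again
        return result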

I was thinking quite a bit about Ben's redo module during this talk. It's a great module for handling retries!

Diving into the wreck

(abstract; video)

A look into diagnosing performance problems in applications. Some neat tools and techniques introduced here, but I felt he blamed the DB a little too much :)

Wednesday

Magic Wormhole

(abstract; video; slides)

I didn't end up going to this talk, but I did have a chance to chat with Brian before. magic-wormhole is a tool to safely transfer files from one computer to another. Think scp, but without needing ssh keys set up already, or direct network flows. Very neat tool!

Computational Physics

(abstract; video)

How to do planetary orbit simulations in Python. A pretty interesting talk; he introduced me to Feynman and some of the important characteristics of the simulation methods presented.

Small batch artisinal bots

(abstract; video)

Hilarious talk about building bots with Python. Definitely worth watching, although unfortunately it's only a partial recording.

Gilectomy

(abstract; video)

The infamous GIL is gone! And your Python programs only run 25x slower!

Larry describes why the GIL was introduced, what it does, and what's involved with removing it. He's actually got a fork of Python with the GIL removed, but performance suffers quite a bit when run without the GIL.

Lars' Keynote

(video)

If you watch only one video from PyCon, watch this. It's just incredible.

June 09, 2016 07:39 PM

June 03, 2016

Kim Moir (kmoir)

Submissions for Releng 2016: due by July 1, 2016

The CFP for Releng 2016 is open!  The workshop will be held November 18, 2016 in Seattle, in conjunction with FSE 2016 (the ACM Foundations of Software Engineering conference).
Picture by howardignatius- Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0)
https://www.flickr.com/photos/howardignatius/14482954049/sizes/l
If you've done something like

We'd like to encourage people new to speaking to apply, as well as those from underrepresented groups in tech. We'd love to hear from some new voices and new companies!


Submissions are due July 1, 2016. If you have questions on the submission process, topics to submit, or anything else, I'm happy to help!  I'm kmoir and I work at mozilla.com, or contact me on twitter. Submit early and often!

June 03, 2016 06:11 PM

June 02, 2016

Kim Moir (kmoir)

DevOpsDays Toronto recap

Last week I attended DevOpsDays Toronto.  It was my first time attending a DevOpsDays event and it was quite interesting.  It was held at CBC's Glenn Gould Studios, which are a quick walk from the Toronto Island airport where I landed after an hour flight from Ottawa.  This blog post is an overview of some of the talks at the conference.
 
Glenn Gould Studios, CBC, Toronto.  

Statue of Glenn Gould outside the CBC studios that bear his name.

Day 1


The day started out with an introduction from the organizers and a brief overview of the history of DevOpsDays. They also reminded everyone that they had agreed to the code of conduct when they bought their ticket. I found this explicit mention of the code of conduct quite refreshing.


The first talk of the day was by John Willis, evangelist at Docker.  He gave an overview of the state of enterprise DevOps. I found this a fresh perspective because I really don't know what happens in enterprises with respect to DevOps, since I have been working in open source communities for so long.  John provided an overview of what DevOps encompasses.


DevOps is a continuous feedback loop.


He talked a lot about how empathy is so important in our jobs.  He mentioned that Netflix has a slide deck that describes company culture.  He doesn't know if this is still the case, but he had heard that if you showed up for an interview at Netflix without having read the company culture deck, you would be automatically disqualified from further interviews.  Etsy and Spotify have similar open documents describing their culture.

Here he discusses the research by Christina Maslach on the six sources of burnout.
Christina Maslach

He gave us some reading to do.  I've read the "Release It!" book which is excellent and has some fascinating stories of software failure in it, I've added the other books to my already long reading list.

He also covered the Rugged Manifesto and the realization that the code you write will always be under attack by malicious actors.  ICE stands for Inclusivity, Complexity and Empathy.

He stated that it's a long-standing mantra that you can only have two of fast, cheap or good, but recent research shows that today we can make many changes quickly, and if there is a failure, the mean time to recovery is short.

He left us with some more books to read.

The second talk was a really interesting one by Hany Fahim, CEO of VM Farms.  It was a short mystery novella describing how VM Farms' servers suddenly experienced a huge traffic spike when the Brazilian government banned WhatsApp as a result of a legal order. I love a good war story.

Hany discussed how one day VM Farms suddenly saw a huge increase in traffic.

This was a really important point: when your system is failing to scale, it's important to decide whether it's a valid increase in traffic or a malicious one.


Looking on Twitter, they found that a court in Brazil had recently ruled that WhatsApp would be blocked for 48 hours.  Users started circumventing this block via VPN.  Looking at their logs, they determined that most of the traffic was resolving to IP addresses from Brazil and that SSL handshakes were taking a long time.
  

The government of Brazil had encouraged the use of open source software over Windows, and thus the users had become more technically literate and able to circumvent blocks via VPN.


In conclusion, switching to multi-core HAProxy fixed a lot of issues. Also, Twitter was and continues to be a great source of information on activity happening in other countries. WhatsApp was returned to service and then banned a second time, and their servers were able to keep up with the demand.

After lunch, we were back to more talks.  The organizers came on stage for a while to discuss the afternoon's agenda.  They also remarked that one individual had violated the code of conduct and had been removed from the conference.  So the conference had a code of conduct, and steps were taken when it was violated.

Next up, Bridget Kromhout from Pivotal gave a talk entitled Containers will not Fix your Broken Culture.
I first saw Bridget speak at Beyond the Code in Ottawa in 2014 about scaling the streaming services for Drama Fever on AWS.  At the time, I was moving our mobile test infrastructure to AWS so I was quite enthralled with her talk because 1) it was excellent 2) I had never seen another woman give a talk about scaling services on AWS.  Representation matters.

The summary of the talk was that no matter what tools you adopt, you need to communicate with each other about the cultural changes that are required to implement new services.  A new microservices architecture is great, but if the teams implementing these services are not talking to each other, the implementation will not succeed.

Bridget pointing out that the technology we choose to implement is often about what is fashionable.


Shoutout to Jennifer Davis and Katherine Daniels' Effective DevOps book. (Note: I've read it on Safari online and it is excellent.  The chapter on hiring is especially good.)

Loved this poster about the wall of confusion between development and operations.  

In the afternoon, there were lightning talks and then open spaces. Open spaces are free-flowing discussions where the topic is voted upon ahead of time.  I attended ones on infrastructure automation, CI/CD at scale and, my personal favourite, horror stories.  I do love hearing how distributed systems can go down and how to recover.  I found the conversations useful, but it seemed like some of them were dominated by a few voices.  I think it would be better if the person that suggested the topic for the open space also volunteered to moderate the discussion.

Day 2

The second day started out with a fantastic talk by John Arthorne of Shopify speaking on scaling their deployment pipeline.  As a side note, John and I worked together for more than a decade on Eclipse while we both worked at IBM so it was great to catch up with him after the talk. 



He started by giving some key platform characteristics.  Stores on Shopify have flash sales that cause traffic spikes, so they need to be able to scale for these bursts of traffic.

From commit to deploy in 10 minutes, and everyone can deploy. The fast turnaround has two purposes: it keeps the developer involved in the deploy process (if it only takes 10 minutes, they can watch to make sure their deploy succeeds; if it takes longer, they might move on to another task), and the quick deploy process can delight customers with the speed of deployment.  They also deploy in small batches to ensure that the mean time to recover is small if a change needs to be rolled back.
 
BuildKite is a third-party build and test orchestration service.  They wrote a tool called Scrooge that adjusts the number of EC2 nodes based on current demand to reduce their AWS bills.  (Similar to what Mozilla releng does with cloud-tools.)


Shopify uses an open source orchestration tool called ShipIt.  I was sitting next to my colleague Armen at the conference and he started chuckling at this point, because at Mozilla we also wrote an application called Ship-it, which release management uses to kick off Firefox releases.   Shopify also has an overall view of the ShipIt deployment process which allows developers to see the percentage of nodes where their change has been deployed. One of the questions after the talk was why they use AWS for their deployment pipeline when they use machines in data centres for their actual customers. Answer: they use AWS where resiliency is not an issue.
 
Building containers is computationally expensive. He noted that a lot of engineering resources went into optimizing the layers in the Docker containers to isolate changes to the smallest layer.  They built a service called Locutus to build the containers on commit and push them to a registry. It employs caching to make the builds smaller.

One key point John also mentioned is that they have a team dedicated to optimizing their deployment pipeline.  It is unreasonable to expect developers working on the core Shopify platform to also optimize the pipeline.

In the afternoon, there were a series of lightning talks. Roderick Randolph from Capital One gave an amazing talk about Supporting Developers through DevOps.


It was an interesting perspective.  I've seen quite a few talks about bringing devops culture and practices to the operations side of the house, but the perspective of teaching developers about it is discussed less often.



He emphasized the need to empower developers to use DevOps practices by giving them tools and showing them how to use them.  For instance, if they need to run docker to test something, walk them through it so they will know how to do it next time.





The final talk I'll mention is by Will Weaver.  He talked about how hard it is to show prospective clients his CI and testing experience when that experience is not open to the public.  So he implemented tests and CI for his dotfiles on GitHub.


He had excellent advice on how to work on projects outside of work to showcase skills for future employers.




Diversity and Inclusion


As an aside, whenever I'm at a conference I note the number of people in the "not a white guy" group. This conference had an all-male organizing committee, but not all white men.  (I recognize that not all diversity is visible, i.e. mental health, gender identity, sexual orientation, immigration status, etc.) There was only one woman speaker, but there were a few non-white speakers.  There were very few women attendees. I'm not sure what the process was to reach out to potential speakers other than the CFP.



 There were slides that showed diverse developers which was refreshing.



Loved Roderick's ops vs dev slide.

I learned a lot at the conference and am thankful for all the time that the speakers took to prepare their talks.  I enjoyed all the conversations I had learning about the challenges people face in the organizations implementing continuous integration and deployment. It also made me appreciate the culture of relentless automation, continuous integration and deployment that we have at Mozilla.

I don't know who said this during the conference, but I really liked it:

Shipping is the heartbeat of your company

It was interesting to learn how all these people are making their companies' heartbeats stronger via DevOps practices and tools.

June 02, 2016 08:28 PM

May 13, 2016

Kim Moir (kmoir)

Welcome Mozilla Releng summer interns

We're delighted to have Francis Kang and Connor Sheehan join the Mozilla release engineering team as summer interns.  Francis is studying at the University of Toronto while Connor attends McMaster University in Hamilton, Ontario.  We'll have another intern (Anthony) join us later in the summer; he will be working from our San Francisco office.

Francis and Connor will be working on implementing some new features in release promotion as well as  migrating some builds to taskcluster.  I'll be mentoring Francis,  while Rail will be mentoring Connor.  If you are in the Toronto office, please drop by to say hi to them.  Or welcome them on irc as fkang or sheehan. 

Kim, Francis, Connor and Rail
They are both already off to a great start and have pull requests merged into production that fixed some release promotion issues.  Their code was used in the Firefox 47.0 beta 5 release promotion that we ran last night, so their first week was quite productive.


Mentoring an intern provides an opportunity to see the systems we run from a fresh perspective.  They both have lots of great questions, which makes us revisit why design decisions were made and whether we could do things better.   Like all teaching roles, I always find that I learn a tremendous amount from the experience, and I hope they have fun learning real-world software engineering concepts with respect to running large distributed systems.

Welcome to Mozilla!

May 13, 2016 03:20 PM

May 03, 2016

Armen Zambrano G. (@armenzg)

Replay Pulse messages

If you know what Pulse is and you would like to write some integration tests for an app that consumes Pulse messages, pulse_replay might make your life a bit easier.

You can learn more about it by reading this quick README.md.


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

May 03, 2016 07:52 PM

May 02, 2016

Armen Zambrano G. (@armenzg)

Open Platform Operations’ logo design

Last year, the Platform Operations organization was born and it brought together multiple teams across Mozilla which empower development with tools and processes.

This year, we've decided to create a logo that identifies us as an organization and builds our self-identity.

We've filed this issue for a logo design [1] and we would like to put out a call for community members to propose their designs. We would like to have all submissions in by May 13th. Soon after that, we will figure out a way to narrow it down to one logo (details to be determined)!

We would also like to thank whoever makes the logo we pick at the end (details also to be determined).

Looking forward to collaborating with you and seeing what we create!

[1] https://github.com/mozilla/Community-Design/issues/62


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

May 02, 2016 05:45 PM

April 27, 2016

Rail Alliev (rail)

Firefox 46.0 and SHA512SUMS

In my previous post I introduced the new release process we have been adopting in the 46.0 release cycle.

Release build promotion has been in production since Firefox 46.0 Beta 1. We have discovered some minor issues; some of them are already fixed, some still waiting.

One of the visible bugs is Bug 1260892. We generate a big SHA512SUMS file, which should contain all important checksums. With numerous changes to the process the file doesn't represent all required files anymore. Some files are missing, some have different names.

We are working on fixing the bug, but in the meantime you can use the following workaround to verify the files.

For example, if you want to verify http://ftp.mozilla.org/pub/firefox/releases/46.0/win64/ach/Firefox%20Setup%2046.0.exe, you need to use the following 2 files:

http://ftp.mozilla.org/pub/firefox/candidates/46.0-candidates/build5/win64/ach/firefox-46.0.checksums

http://ftp.mozilla.org/pub/firefox/candidates/46.0-candidates/build5/win64/ach/firefox-46.0.checksums.asc

Example commands:

# download all required files
$ wget -q http://ftp.mozilla.org/pub/firefox/releases/46.0/win64/ach/Firefox%20Setup%2046.0.exe
$ wget -q http://ftp.mozilla.org/pub/firefox/candidates/46.0-candidates/build5/win64/ach/firefox-46.0.checksums
$ wget -q http://ftp.mozilla.org/pub/firefox/candidates/46.0-candidates/build5/win64/ach/firefox-46.0.checksums.asc
$ wget -q http://ftp.mozilla.org/pub/firefox/releases/46.0/KEY
# Import Mozilla Releng key into a temporary GPG directory
$ mkdir .tmp-gpg-home && chmod 700 .tmp-gpg-home
$ gpg --homedir .tmp-gpg-home --import KEY
# verify the signature of the checksums file
$ gpg --homedir .tmp-gpg-home --verify firefox-46.0.checksums.asc && echo "OK" || echo "Not OK"
# calculate the SHA512 checksum of the file
$ sha512sum "Firefox Setup 46.0.exe"
c2ed64298ac2140d8dbdaed28cabc90b38dd9444e9c0d6dd335a2a32cf043a35314945536a5c75124a88bf418a4e2ba77256be223425380e7fcc45a97da8f479  Firefox Setup 46.0.exe
# lookup for the checksum in the checksums file
$ grep c2ed64298ac2140d8dbdaed28cabc90b38dd9444e9c0d6dd335a2a32cf043a35314945536a5c75124a88bf418a4e2ba77256be223425380e7fcc45a97da8f479 firefox-46.0.checksums
c2ed64298ac2140d8dbdaed28cabc90b38dd9444e9c0d6dd335a2a32cf043a35314945536a5c75124a88bf418a4e2ba77256be223425380e7fcc45a97da8f479 sha512 46275456 install/sea/firefox-46.0.ach.win64.installer.exe

This is just a temporary workaround; the bug will be fixed ASAP.

April 27, 2016 04:47 PM

April 23, 2016

Hal Wine (hwine)

Enterprise Software Writers R US

Someone just accused me of writing Enterprise Software!!!!!

Well, the “someone” is Mahmoud Hashemi from PayPal, and I heard him on the Talk Python To Me podcast (episode 54). That whole episode is quite interesting - go listen to it.

Read more...

April 23, 2016 07:00 AM

April 22, 2016

Armen Zambrano G. (@armenzg)

The Joy of Automation

This post announces The Joy of Automation YouTube channel, where you should be able to watch presentations about automation work by Mozilla's Platform Operations. I hope folks other than me will share their videos here too.

This follows the idea that mconley started with The Joy of Coding and his livehacks.
At the moment there are only "Unscripted" videos of me hacking away. I hope one day to do live hacks, but for now they're offline videos.

Mistakes I made, in case any Platform Ops member wanting to contribute wants to avoid them:




Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

April 22, 2016 02:26 PM

April 17, 2016

Armen Zambrano G. (@armenzg)

Project definition: Give Treeherder the ability to schedule TaskCluster jobs

This is a project definition that I put up for GSoC 2016. This helps students to get started researching the project.

The main things I give in here are:


NOTE: This project has a few parts that carry risk and could change the implementation. It depends on close collaboration with dustin.


-----------------------------------
Mentor: armenzg 
IRC:   #ateam channel

Give Treeherder the ability to schedule TaskCluster jobs

This work will enable "adding new jobs" on Treeherder to work with pushes lacking TaskCluster jobs (our new continuous integration system).
Read this blog post to learn how the project was built for Buildbot jobs (our old continuous integration system).

The main work for this project is tracked in bug 1254325.

In order for this to work we need the following pieces:

A - Generate data source with all possible tasks

B - Teach Treeherder to use the artifact

C - Teach pulse_actions to listen for requests from Treeherder

  • pulse_actions is a pulse listener of Treeherder actions
  • You can see pulse_actions’ workflow in here
  • Once part B is completed, we will be able to listen for messages requesting certain TaskCluster tasks to be scheduled and we will schedule those tasks on behalf of the user
  • RISK: Depending on whether the TaskCluster actions project is completed on time, we might instead make POST requests to an API


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

April 17, 2016 04:01 PM

Project definition: SETA re-write

As an attempt to attract candidates to GSoC, I wanted to make sure that the possible projects were achievable rather than leading students down a path of pain and struggle. It also helps me picture the order in which it makes the most sense to accomplish things.

It was also a good exercise for students, who had to read and ask questions about what was not clear, and it gave them lots to read about the project.

I want to share this and another project definition in case it is useful for others.

----------------------------------
We want to rewrite SETA to be easy to deploy through Heroku and to support TaskCluster (our new continuous integration system) [0].

Please read this document carefully before starting to ask questions. There is high interest in this project and it is burdensome to have to re-explain it to every new prospective student.

Main mentor: armenzg (#ateam)
Co-mentor: jmaher (#ateam)

Please read jmaher’s blog post carefully [1] before reading any further.

Now that you have read jmaher’s blog post, I will briefly go into some specifics.
SETA reduces the number of jobs that get scheduled on a developer’s push.
A job is every single letter you see on Treeherder. For every developer’s push there is a number of these jobs scheduled.
On every push, Buildbot [6] decides what to schedule depending on the data that it fetched from SETA [7].

The purpose of this project is two-fold:
  1. Write SETA as an independent project that is:
    1. maintainable
    2. more reliable
    3. automatically deployed through Heroku app
  2. Support TaskCluster, our new CI (continuous integration system)

NOTE: The current code of SETA [2] lives within a repository called ouija.

Ouija does the following for SETA:
  1. It has a cronjob which kicks in every 12 hours to scrape information about jobs from every push
  2. It loads the information about jobs (which it grabs from Treeherder) into a database

SETA then queries the database to determine which jobs should be scheduled. SETA chooses jobs that are good at reporting issues introduced by developers. SETA has its own set of tables and adds the data there for quick reference.

Involved pieces for this project:
  1. Get familiar with deploying apps and using databases in Heroku
  2. Host SETA in Heroku instead of http://alertmanager.allizom.org/seta.html
  3. Teach SETA about TaskCluster
  4. Change the gecko decision task to reliably use SETA [5][6]
    1. If the SETA service is not available we should fall back to running all tasks/jobs (see the sketch after this list)
  5. Document how SETA works and auto-deployments of docs and Heroku
    1. Write automatically generated documentation
    2. Add auto-deployments to Heroku and readthedocs
  6. Add tests for SETA
    1. Add tox/travis support for tests and flake8
  7. Re-write SETA using ActiveData [3] instead of using data collected by Ouija
  8. Make the current CI (Buildbot) use the new SETA Heroku service
  9. Create SETA data for per test information instead of per job information (stretch goal)
    1. On Treeherder we have jobs that contain tests
    2. Tests re-order between those different chunks
    3. We want to run jobs at a per-directory level or per-manifest
  10. Add priorities into SETA data (stretch goal)
    1. Priority 1 gets triggered on every push
    2. Priority 2 gets triggered on every Yth push
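
To illustrate item 4.1 above, the fallback in the decision task could look something like this sketch; the service URL and response shape are assumptions, not the real SETA API.

import requests

SETA_URL = 'https://seta.example.herokuapp.com/data/setadetails/'  # hypothetical

def tasks_to_schedule(push_id, all_task_labels):
    # Ask SETA which tasks are worth running; on any failure, schedule
    # everything rather than risk missing a regression.
    try:
        response = requests.get(SETA_URL, params={'push': push_id}, timeout=10)
        response.raise_for_status()
        return response.json()['jobtypes']  # assumed response format
    except (requests.RequestException, KeyError, ValueError):
        return all_task_labels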

[0] http://docs.taskcluster.net/
[1] https://elvis314.wordpress.com/tag/seta/
[2] https://github.com/dminor/ouija/blob/master/tools/seta.py
[3] http://activedata.allizom.org/tools/query.html
[4] https://bugzilla.mozilla.org/show_bug.cgi?id=1243123
[5] https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=gecko
[6] testing/taskcluster/mach_commands.py#l280
[7] http://hg.mozilla.org/build/buildbot-configs/file/default/mozilla-tests/config_seta.py
[8] http://alertmanager.allizom.org/seta.html


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

April 17, 2016 03:54 PM

April 13, 2016

Armen Zambrano G. (@armenzg)

Improving how I write Python tests

The main focus of this post is what I've learned about writing Python tests, using mocks and patching functions properly. This is not an exhaustive post.

What I'm writing now is something I should have learned many years ago as a Python developer. It can be embarrassing to admit; however, I've decided to share this with you since I know it would have helped me earlier in my career, and I hope it might help you as well.

Somebody has probably written about this topic before; if you're aware of a good blog post covering similar ground, please let me know. I would like to see what else I've missed.

Also, if you want to start a Python project from scratch or to improve your current one, I suggest you read "Open Sourcing a Python Project the Right Way". Many of the things he mentions are what I follow for mozci.

This post might also be useful for new contributors trying to write tests for your project.

My takeaway

These are some of the things I've learned:

  1. Make running tests easy
    • We use tox to help us create a Python virtual environment, install the dependencies for the project and to execute the tests
    • Here's the tox.ini I use for mozci
  2. If you use py.test learn how to not capture the output
    • Use the -s flag to not capture the output
    • If your project does not print but instead uses logging, add the pytest-capturelog plugin to py.test and it will capture the log output for you
  3. If you use py.test learn how to jump into the debugger upon failures
    • Use --pdb to jump into the Python debugger upon failure
  4. Learn how to use @patch and Mock properly

How I write tests

This is what I do:


@patch properly and use Mocks

What I'm doing now to patch modules is the following:
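
For illustration, here is a minimal example of the key rule: patch the name where the code under test looks it up, not where it is defined. The function and URLs are made up; on Python 2 the standalone mock package provides the same API as unittest.mock.

import requests
from unittest.mock import Mock, patch

def fetch_title(url):
    # code under test: looks up requests.get at call time
    return requests.get(url).json()['title']

# If fetch_title lived in mymodule.py and did "from requests import get",
# you would patch 'mymodule.get' instead of 'requests.get'.
@patch('requests.get')
def test_fetch_title(mock_get):
    mock_get.return_value = Mock(json=lambda: {'title': 'mozci'})
    assert fetch_title('https://example.com/api') == 'mozci'
    mock_get.assert_called_once_with('https://example.com/api')

test_fetch_title()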



The way Mozilla CI tools is designed begs for integration tests; however, I don't think it is worth going beyond unit testing plus mocking. The reason is that mozci might not stick around once we have fully migrated away from Buildbot, which was the hard part to solve.


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

April 13, 2016 07:26 PM

April 12, 2016

Armen Zambrano G. (@armenzg)

mozci-trigger now installs with pip install mozci-scripts

If you use mozci from the command line this applies to you; otherwise, carry on! :)

In order to use mozci from the command line you now have to install with this:
pip install mozci-scripts
instead of:
pip install mozci

This helps to maintain the scripts separately from the core library since we can control which version of mozci the scripts use.

All scripts now live under the scripts/ directory instead of the library:
https://github.com/mozilla/mozilla_ci_tools/tree/master/scripts




Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

April 12, 2016 07:32 PM

April 05, 2016

Rail Alliev (rail)

Release Build Promotion Overview

Hello from Release Engineering! Once a month we highlight one of our projects to help the Mozilla community discover a useful tool or an interesting contribution opportunity. This month's project is Release Build Promotion.

What is Release Build Promotion?

Release build promotion (or "build promotion", or "release promotion" for short), is the latest release pipeline for Firefox being developed by Release Engineering at Mozilla.

Release build promotion starts with the builds produced and tested by CI (e.g. on mozilla-beta or mozilla-release). We take these builds, and use them as the basis to generate all our l10n repacks, partial updates, etc. that are required to release Firefox. We "promote" the CI builds to the release channel.

How is this different?

The previous release pipeline also started with builds produced and tested by CI. However, when it came time to do a release, we would create an entirely new set of builds with slightly different build configuration. These builds would not get the regular CI testing.

Release build promotion improves the process by removing the second set of builds. This drastically improves the total time to do a release, and also increases our confidence in our products since we now are shipping exactly what's been tested. We also improve visibility of the release process; all the tasks that make up the release are now reported to Treeherder along with the corresponding CI builds.

Current status

Release build promotion is in use for Firefox desktop starting with the 46 beta cycle. ESR and release branches have not yet been switched over.

Firefox for Android is also not yet handled. We plan to have this ready for Firefox 47.

Some figures

One of the major motivations for this project was our release end-to-end times. I pulled some data to compare:

  • One of the Firefox 45 betas took almost 12 hours
  • One of the Firefox 46 betas took less than 3 hours

What's next?

  • Support Firefox for Android
  • Support release and ESR branches
  • Extend this process back to the aurora and nightly channels

Can I contribute?

Yes! We still have a lot of things to do and welcome everyone to contribute.

  • Bug 1253369 - Notifications on release promotion events.
  • (No bug yet) Redesign and modernize Ship-it to reflect the new release work flow. This will include new UI, multiple sign-offs, new release-runner, etc.
  • Tracking bug

More information

For more information, please refer to these other resources about build promotion:

There will be multiple blog posts regarding this project. You have probably seen Jordan's blog post on how to be productive when distributed teams get together. It covers some of the experience we had during the project sprint week in Vancouver.

April 05, 2016 01:34 PM

March 31, 2016

Jordan Lund (jlund)

being productive when distributed teams get together

I'm a Mozilla Release Engineer. Which means I am a strong advocate for remote teams. Our team is very distributed and I believe we are successful at it too.

Funnily enough, I think a big part of the distributed model includes getting remote people physically together once in a while. When you do, you create a special energy. Unfortunately that energy can sometimes be wasted.

Have you ever had a work week that goes something like the following:

You stumble your way to the office in a strange environment on day one. You arrive, find some familiar faces, hug, and then pull out your laptop. You think, 'okay, first things first, I'll clear my email, bugmail, irc backscroll, and maybe even that task I left hanging from the week before.'

At some point, someone speaks up and suggests you come up with a schedule for the week. A google doc is opened and shared and after a few 'bikeshedding' moments, it's lunch time! A local to the area or advanced yelper in the group advertises a list of places to eat and after the highest rated food truck wins your stomach's approval, you come back to the office and ... quickly check your email.

The above scenario plays out in a similar fashion for the remainder of the week. Granted, I exaggerate and some genuine ideas get discussed. Maybe even a successful side sprint happens. But I am willing to bet that you, too, have been to a team meet up like this.

So can it be done better? Well I was at one recently in Vancouver and this post will describe what I think made the difference.

forest firefighting

Prior to putting out burning trees at Mozilla, I put out burning trees as a Forest Firefighter in Canada. BC Forest Protection uses the Incident Command System (ICS). That framework enabled us to safely and effectively suppress wildfires. So how does it work and why am I bringing it up? Well, without this framework, firefighters would spend a lot of time on the fire line deciding where to camp, what to eat, what part of the fire to suppress first, and how to do it. But thanks to ICS, these decisions are already made and the firefighters can get back to doing work that feels productive!

You can imagine how team meet ups could benefit from such organization. With ICS, there are four high level branches: Logistics, Planning, Operations, and Finance & Administration. The last one doesn't really apply to our 'work week' scenario as we use Egencia prior to arriving and Expensify after leaving so it doesn't really affect productivity during the week. However, let's dive into the other three and discover how they correlate to team meet ups.

For each of these branches, someone should be nominated or assigned and complete the branch responsibilities.

Logistics

Ideally the Logistics lead should be someone who is local to the area or has been there before. This person is required to create an etherpad/Google Doc that:

Now you might be saying, "wait a second, I can do all those things myself and don't need to be hand-held." And while that is true, the benefit here is you reduce the head space required of each individual and the time spent debating, and you get everyone doing the same thing at the same time. This might not sound very flexible or distributed but remember, that's the point; you're centralized and together for the week! You might also be thinking "I really enjoy choosing a hotel and restaurant." That's fine too, but I propose you coordinate with the logistics assignee prior to the work week rather than spend quality work week time on these decisions.

Planning

Now that you have logistics sorted, it's time to do all the action planning. Traditionally we've had work weeks where we pre-plan high-level goals we want to accomplish, but we don't actually fill out the schedule until Monday as a group. The downside here is that this can chew up a lot of time and you can easily get sidetracked before completing the schedule. So, like Logistics, assign someone to Planning.

This person is required to create a [insert issue tracker of choice] list and determine the bugs/issues that should be tackled during the week. How this is done of course depends on the issue tracker, the style of the group, and the type of team meet up, but here is an example we used for finishing a deliverable-related goal.

write a list of issues for each of the following categories:

For the above, we used Trello, which is nice as it's really just a board of sticky notes. I could write a whole separate blog post on how to be effective with it by incorporating bugzilla links, assignees to cards, tags, sub-lists, and card voting, but for now, here is a visual example:

Trello Work Week Board

The beauty here is that all of the tasks (sticky notes) are created upfront, and each team member simply plucks them off the 'hard blockers' and 'nice to have' lists one by one, assigns them to themselves, and moves them into the completed section.

No debating or brainstorming what to do, just sprint!

Operations

The Operations assignee here should:

If you want to take advantage of a successful physical team meetup, forget about the communication tools that are designed for distributed environments.

During the work week I think it is best to ignore email, bug mail, and irc. Treat the week like you are on PTO: change your bugzilla name and create a vacation responder. Have the Operations assignee respond to urgent requests and act as a proxy to the outside world.

It is also nice to have the Operations assignee moderate internally by constantly iterating over the trello board state, grasping what needs to be done, where you are falling behind, and what new things have come up.

Vancouver by accident

This model wasn't planned or agreed upon prior to the Vancouver team meetup. It actually just happened by accident. I (jlund) took on the Logistics, rail handled Planning, and catlee acted as that sort of moderator/proxy role in Operations. Everyone at the meet up finished the week satisfied and I think hungry to try it again.

I'm looking forward to using a framework like this in the future. What are your thoughts?

March 31, 2016 09:32 PM

March 07, 2016

Kim Moir (kmoir)

RelEng & RelOps Weekly highlights - March 4, 2016

It was a busy week with many releases in flight, as well as preparation for running beta 1 with release promotion next week.  We also are in the process of adding more capacity to certain test platform pools to lower wait times given all the new e10s tests that have been enabled.

Improve Release Pipeline:
Everyone gets a release promotion!  Source: http://i.imgur.com/WMmqSDI.jpg

Improve CI Pipeline:

Release:

The releases calendar is getting busier as we get closer to the end of the cycle. Many releases were shipped or are still in-flight:
As always, you can find more specific release details in our post-mortem minutes:
https://wiki.mozilla.org/Releases:Release_Post_Mortem:2016-03-02 
https://wiki.mozilla.org/Releases:Release_Post_Mortem:2016-03-09

Operational:

Until next time!

March 07, 2016 03:56 PM

February 29, 2016

Kim Moir (kmoir)

RelEng & RelOps Weekly highlights - February 26, 2016

It was a busy week for release engineering as several team members travelled to the Vancouver office to sprint on the release promotion project. The goal of the release promotion project is to promote continuous integration builds to release channels, allowing us to ship releases much more quickly.



Improve Release Pipeline:


Improve CI Pipeline:

Release:

Operational:

February 29, 2016 08:22 PM

January 21, 2016

Rail Alliev (rail)

Rebooting productivity

Every new year gives you an opportunity to sit back, relax, have some scotch and re-think the past year. Holidays give you enough free time. Even if you decide not to take a vacation around the holidays, it's usually calm and peaceful.

This time, I found myself thinking mostly about productivity, being effective, feeling busy, overwhelmed with work and other related topics.

When I started at Mozilla (almost 6 years ago!), I tried to apply all my GTD and time management knowledge and techniques. Working remotely and in a different time zone was an advantage - I had close to zero interruptions. It worked perfectly.

Last year I realized that my productivity skills had faded away somehow. 40h+ workweeks, working on weekends, and delivering goals in the last week of the quarter don't sound like good signs. Instead of being productive I felt busy.

"Every crisis is an opportunity". Time to make a step back and reboot myself. Burning out at work is not a good idea. :)

Here are some ideas/tips that I wrote down for myself you may found useful.

Concentration

  • Task #1: make a daily plan. No plan - no work.
  • Don't start your day by reading emails. Get one (little) thing done first - THEN check your email.
  • Try to define outcomes, not tasks. "Ship XYZ" instead of "Work on XYZ".
  • Meetings are time consuming, so "Set a goal for each meeting". Consider skipping a meeting if you don't have any goal set, unless it's a beer-and-tell meeting! :)
  • Constantly ask yourself if what you're working on is important.
  • 3-4 times a day ask yourself whether you are doing something towards your goal or just finding something else to keep you busy. If you want to look busy, take your phone and walk around the office with some papers in your hand. Everybody will think that you are a busy person! This way you can take a break and look busy at the same time!
  • Take breaks! The Pomodoro technique has this option built in. Taking breaks helps not only to avoid RSI, but also keeps your brain sane and gives you time to ask yourself the questions mentioned above. I use Workrave on my laptop, but you can use a real kitchen timer instead.
  • Wear headphones, especially at the office. Noise-cancelling ones are even better. White noise, nature sounds, or instrumental music are your friends.

(Home) Office

  • Make sure you enjoy your work environment. Why on earth would you spend your valuable time working without joy?!
  • De-clutter and organize your desk. Fewer things around - fewer distractions.
  • Desk, chair, monitor, keyboard, mouse, etc. - don't cheap out on them. Your health is more important - and more expensive. Thanks to mhoye for this advice!

Other

  • Don't check email every 30 seconds. If there is an emergency, they will call you! :)
  • Reward yourself at a certain time. "I'm going to have a chocolate at 11am", or "MFBT at 4pm sharp!" are good examples. Don't forget, you are Pavlov's dog too!
  • Don't try to read everything NOW. Save it for later and read in a batch.
  • Capture all creative ideas. You can delete them later. ;)
  • Prepare for next task before break. Make sure you know what's next, so you can think about it during the break.

This is my list of things that I try to use every day. Looking forward to seeing improvements!

I would appreciate your thoughts on this topic. Feel free to comment or send a private email.

Happy Productive New Year!

January 21, 2016 02:06 AM

January 08, 2016

Kim Moir (kmoir)

Tips from a resume nerd

Before I begin this post a few caveats:
I'm kind of a resume and interview nerd. I like helping friends fix their resumes and write amazing cover letters. In the past year I've helped a few (non-Mozilla) friends fix up their resumes, write cover letters, and prepare for interviews as they searched for new jobs. This post will discuss some things I've found to be helpful in this process.

Picture by GotCredit - Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0)
https://www.flickr.com/photos/jakerust/16223669794/sizes/l

Preparation
Everyone tends to jump into looking at job descriptions and making their resume look pretty. Another scenario is that people have a sudden realization that they need to get out of their current position and find a new job NOW and frantically start applying for anything that matches their qualifications.  Before you do that, take a step back and make a list of things that are important to you.  For example, when I applied at Mozilla, my list was something like this

People spend a lot of time at work. Life is too short to be unhappy every day. Writing a list of what is important serves as a checklist for when you are looking at job descriptions, letting you immediately weed out the ones that don't match your list.

Picture by Mufidah Kassalias - Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0)
https://www.flickr.com/photos/mufidahkassalias/10519774073/sizes/o/
 
People tend to focus a lot on the technical skills they want to use or the new ones they want to learn. You should also think about the kind of culture you want to work in. Do the goals and ethics of the organization align with your own? Who will you be working with? Will you enjoy working with this team? Are you interested in remote work or do you want to work in an office? How will a long commute or relocation impact your quality of life? What is the typical career progression of someone in this role? Are there both management and technical tracks for advancement?


Picture by mugley - Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0) https://www.flickr.com/photos/mugley/4221455156/sizes/o/


To summarize, itemize the skills you'd like to use or learn, the culture of the company and the team and why you want to work there.

Cover letter

Your cover letter should succinctly map your existing skills to the role you are applying for and convey enthusiasm and interest. You don't need a long story about how you worked on a project at your current job that has no relevance to your potential new employer. Teams that are looking to hire have problems to solve. Your cover letter needs to paint a picture that you have the skills to solve them.

Picture by Jim Bauer - Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0) https://www.flickr.com/photos/lens-cap/10320891856/sizes/l


Refactoring your resume

Developers have a lot of opportunities these days, but if you intend to move from another industry into a tech company, it can be trickier. The important thing is to convey the skills you have in a way that shows they can be applied to the problems people want to hire you to fix.

Many people describe their skills and accomplishments in a way that is too company-specific. They may have a list of acronyms and product names on their resume that are unlikely to be known by people outside the company. When describing the work you did in a particular role, describe it in a measurable way that highlights the skills you have. An excellent example of a resume that describes skills without going into company-specific detail is here. (Julie Pagano also has a terrific post about how she approached her new job search.)

Another tip is to leave out general skills that are very common. For instance, if you are a technical writer, omit the fact that you know how to use Windows and Word, and focus on highlighting your skills and accomplishments.


Non-technical interview preparation

Every job has different technical requirements and there are many books and blog posts on how to prepare for this aspect of the interview process. So I'm going to just cover the non-technical aspects.

When I interview someone, I like to hear lots of questions. Questions about the work we do and upcoming projects. This indicates that they have taken the time to research the team, the company, and the work that we do. It also shows enthusiasm and interest.

Here is a list of suggestions to help you prepare for interviews:

1.  Research the company and make a list of relevant questions
Not every company is open about the work that they do, but most will have some public information that you can use to formulate questions during the interviews. Do you know anyone who works for the company that you can have coffee or a Skype call with to provide insight? What products/services does the company produce? Is the product nearing end of life? If so, what will it be replaced by? What is the company's market share, and is it declining, stable or growing? Who are their main competitors? What are some of the challenges they face going forward? How will this team help address these challenges?

2.  Prepare a list of questions for every person that interviews you ahead of time
Many companies will give you the list of names of people who will interview you.
Have they recently given talks? Watch the videos online or read the slides.
Does the team have github or other open repositories?  What are recent projects are they working on? Do they have a blog or are active on twitter? If so, read it and formulate some questions to bring to the interview.
Do they use open bug tracking tools?  If so, look at the bugs that have recent activity and add them to the list of questions for your interview. 
A friend of mine read the book that one of his interviewers had written and asked questions about the book in the interview. That's serious interview preparation!

Photo by https://www.flickr.com/photos/wocintechchat/ https://www.flickr.com/photos/wocintechchat/22506109386/sizes/l


3. Team dynamics and tools
Is the team growing or are you hiring to replace somebody who left?
What's the onboarding process like? Will you have a mentor?
How is this group viewed by the rest of the company? You want to be in a role where you can make a valuable contribution. Joining a team whose role is not valued by the company or not funded adequately is a recipe for disappointment.
What does a typical day look like?  What hours do people usually work?
What tools do people use? Are there prescribed tools or are you free to use what you'd like?

4.  Diversity and Inclusion
If you're a member of an underrepresented group in tech, the numbers are lousy in this industry with some notable exceptions. And I say that while recognizing that I'm personally in the group that is the lowest common denominator for diversity in tech. 

The entire thread on this tweet is excellent  https://twitter.com/radiomorillo/status/589158122108932096


I don't really have good advice for this area other than do your research to ensure you're not entering a toxic environment.  If you look around the office where you're being interviewed and nobody looks like you, it's time for further investigation.   Look at the company's website - is the management team page white guys all the way down?  Does the company support diverse conferences, scholarships or internships? Ask on a mailing list like devchix if others have experience working at this company and what it's like for underrepresented groups. If you ask in the interview why there aren't more diverse people in the office and they say something like "well, we only hire on merit" this is a giant red flag. If the answer is along the lines of "yes, we realize this and these are the steps we are taking to rectify this situation",  this is a more encouraging response.

A final piece of advice: ensure that you meet with the manager you're going to report to as part of your hiring process. You want to ensure that you have rapport with them and can envision a productive working relationship.

What advice do you have for people preparing to find a new job?

Further reading

Katherine Daniels gave a really great talk at Beyond the Code 2014 about how to effectively start a new job: Press start: Beginning a New Adventure Job
She is also the co-author of Effective Devops, which has a fantastic chapter on hiring.
Erica Joy writes amazing articles about the tech industry and diversity.
Cate Huston has some beautiful posts on how to conduct technical interviews and how to be a better interviewer.
Camille Fournier's blog is excellent reading on career progression and engineering management.
Mozilla is hiring!

January 08, 2016 08:37 PM

December 10, 2015

Nick Thomas (nthomas)

Updates for Nightly on Windows

You may have noticed that Windows has had no updates for Nightly for the last week or so. We’ve had a few issues with signing the binaries as part of moving from a SHA-1 certificate to SHA-2. This needs to be done because Windows won’t accept SHA-1 signed binaries from January 1 2016 (this is tracked in bug 1079858).

Updates are now re-enabled, and the update path looks like this

older builds  →  20151209095500  →  latest Nightly

Some people may have been seeing UAC prompts to run the updater, and there could be one more of those when updating to the 20151209095500 build (which is also the last SHA-1 signed build). Updates from that build should not cause any UAC prompts.

December 10, 2015 03:57 PM

December 03, 2015

Hal Wine (hwine)

Tuning Legacy vcs-sync for 2x profit!

Tuning Legacy vcs-sync for 2x profit!

One of the challenges of maintaining a legacy system is deciding how much effort should be invested in improvements. Since modern vcs-sync is “right around the corner”, I have been avoiding looking at improvements to legacy (which is still the production version for all build farm use cases).

While adding another gaia branch, I noticed that the conversion path for active branches was both highly variable and frustratingly long. It usually took 40 minutes for a commit to an active branch to trigger a build farm build. And worse, that time could easily be 60 minutes if the stars didn’t align properly. (Actually, that’s the conversion time for git -> hg. There’s an additional 5-7 minutes, worst case, for b2g_bumper to generate the trigger.)

The full details are in bug 1226805, but a simple rearrangement of the jobs removed the 50% variability in the times and cut the average time by 50% as well. That’s a savings of 20-40 minutes per gaia push!

Moral: don’t take your eye off the legacy systems – there still can be some gold waiting to be found!

December 03, 2015 08:00 AM

December 01, 2015

Chris AtLee (catlee)

MozLando Survival Guide

MozLando is coming!

I thought I would share a few tips I've learned over the years of how to make the most of these company gatherings. These summits or workweeks are always full of awesomeness, but they can also be confusing and overwhelming.

#1 Seek out people

It's great to have a (short!) list of people you'd like to see in person. Maybe somebody you've only met on IRC / vidyo or bugzilla?

Having a list of people you want to say "thank you" in person to is a great way to approach this. Who doesn't like to hear a sincere "thank you" from someone they work with?

#2 Take advantage of increased bandwidth

I don't know about you, but I can find it pretty challenging at times to get my ideas across in IRC or on an etherpad. It's so much easier in person, with a pad of paper or whiteboard in front of you. You can share ideas with people, and have a latency/lag-free conversation! No more fighting AV issues!

#3 Don't burn yourself out

A week of full days of meetings, code sprints, and blue sky dreaming can be really draining. Don't feel bad if you need to take a breather. Go for a walk or a jog. Take a nap. Read a book. You'll come back refreshed, and ready to engage again.

That's it!

I look forward to seeing you all next week!

December 01, 2015 09:31 PM

November 27, 2015

Chris AtLee (catlee)

Firefox builds on the Taskcluster Index

RIP FTP?

You may have heard rumblings that FTP is going away...


Over the past few quarters we've been working to migrate our infrastructure off of the ageing "FTP" [1] system to Amazon S3.

We've maintained some backwards compatibility for the time being [2], so that current Firefox CI and release builds are still available via ftp.mozilla.org, or preferably, archive.mozilla.org since we don't support the ftp protocol any more!

Our long term plan is to make the builds available via the Taskcluster Index, and to stop uploading builds to archive.mozilla.org.

How do I find my builds???


This is a pretty big change, but we really think it will make it easier to find the builds you're looking for.

The Taskcluster Index allows us to attach multiple "routes" to a build job. Think of a route as a kind of hierarchical tag, or directory. Unlike regular directories, a build can be tagged with multiple routes, for example, according to the revision or buildid used.

A great tool for exploring the Taskcluster Index is the Indexed Artifact Browser

Here are some recent examples of nightly Firefox builds:

The latest win64 nightly Firefox build is available via the
gecko.v2.mozilla-central.nightly.latest.firefox.win64-opt route

This same build (as of this writing) is also available via its revision:

gecko.v2.mozilla-central.nightly.revision.47b49b0d32360fab04b11ff9120970979c426911.firefox.win64-opt

Or the date:

gecko.v2.mozilla-central.nightly.2015.11.27.latest.firefox.win64-opt

The artifact browser is simply an interface on top of the index API. Using this API, you can also fetch files directly using wget, curl, python requests, etc.:

https://index.taskcluster.net/v1/task/gecko.v2.mozilla-central.nightly.latest.firefox.win64-opt/artifacts/public/build/firefox-45.0a1.en-US.win64.installer.exe [3]
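For instance, here's a minimal requests-based sketch of that last example (the versioned file name will drift over time, so treat it as an assumption):

import requests

# Fetch a nightly artifact through the Taskcluster index API, using
# the route and artifact path shown in the examples above.
route = 'gecko.v2.mozilla-central.nightly.latest.firefox.win64-opt'
artifact = 'public/build/firefox-45.0a1.en-US.win64.installer.exe'
url = 'https://index.taskcluster.net/v1/task/%s/artifacts/%s' % (route, artifact)

response = requests.get(url, stream=True, timeout=300)
response.raise_for_status()
with open('firefox-nightly-installer.exe', 'wb') as fh:
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        fh.write(chunk)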

Similar routes exist for other platforms, for B2G and mobile, and for opt/debug variations. I encourage you to explore the gecko.v2 namespace, and see if it makes things easier for you to find what you're looking for! [4]

Can't find what you want in the index? Please let us know!

[1]A historical name referring back to the time when we used the FTP protocol to serve these files. Today, the files are available only via HTTP(S).
[2]In fact, all Firefox builds are currently uploaded to S3; we've just had to implement some compatibility layers to make S3 appear in many ways like the old FTP service.
[3]Yes, you need to know the version number... for now. We're considering stripping that from the filenames; if you have thoughts on this, please get in touch!
[4]Ignore the warning on the right about "Task not found" - that just means there are no tasks with that exact route; kind of like an empty directory.

November 27, 2015 09:21 PM

November 26, 2015

Armen Zambrano G. (@armenzg)

Mozhginfo/Pushlog client released

Hi,
If you've ever spent time trying to query metadata from hg about revisions, you can now use a Python library we've released to do so.

In bug 1203621 [1], our community contributor @MikeLing [2] has helped us release the pushlog.py module we had written for Mozilla CI tools.

You can find the pushlog_client package here [3] and the code here [4].
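For a sense of the data involved, here is a minimal sketch that queries hg.mozilla.org's json-pushes endpoint directly with requests; pushlog_client wraps this kind of query, so consult its docs [3] for the package's own API. The revision is just an example.

import requests

# Ask the pushlog which push a given changeset landed in; the
# response maps push ids to user, date and changesets.
url = 'https://hg.mozilla.org/mozilla-central/json-pushes'
response = requests.get(url, params={'changeset': '47b49b0d3236'}, timeout=60)
response.raise_for_status()
for push_id, push in sorted(response.json().items()):
    print(push_id, push['user'], push['changesets'])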

Thanks MikeLing!

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1203621
[2] https://github.com/MikeLing
[3] https://pypi.python.org/pypi/pushlog_client
[4] https://hg.mozilla.org/hgcustom/version-control-tools/rev/6021c9031bc3


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

November 26, 2015 03:17 PM

November 24, 2015

Armen Zambrano G. (@armenzg)

Welcome F3real, xenny and MikeLing!

As described by jmaher, this week we started the first week of mozci's quarter of contribution.

I want to personally welcome Stefan, Vaibhav and Mike to mozci. We hope you learn a lot, and we thank you for helping Mozilla move forward in this corner of our automation systems.

I also want to give thanks to Alice for committing to mentoring. This would not be possible without her help.


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

November 24, 2015 05:58 PM

Mozilla CI tools meet up

In order to help the contributors' of mozci's quarter of contribution, we have set up a Mozci meet up this Friday.

If you're interested in learning about Mozilla's CI, how to contribute, or how to build your own scheduling with mozci, come and join us!

9am ET -> other time zones
Vidyo room: https://v.mozilla.com/flex.html?roomdirect.html&key=GC1ftgyxhW2y


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

November 24, 2015 05:52 PM

Kim Moir (kmoir)

USENIX Release Engineering Summit 2015 recap

On November 13th, I attended the USENIX Release Engineering Summit in Washington, DC. This summit ran alongside the larger LISA conference at the same venue. Thanks to Dinah McNutt, Gareth Bowles, Chris Cooper, Dan Tehranian and John O'Duinn for organizing.



I gave two talks at the summit.  One was a long talk on how we have scaled our Android testing infrastructure on AWS, as well as a look back at how it evolved over the years.

Picture by Tim Norris - Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0)
https://www.flickr.com/photos/tim_norris/2600844073/sizes/o/

Scaling mobile testing on AWS: Emulators all the way down from Kim Moir

I gave a second lightning talk in the afternoon on the problems we face with our large distributed continuous integration, build and release pipeline, and how we are working to address them. The theme of this talk was that managing a large distributed system is like being the caretaker for the water, or some days the sewer, system of a city. We are constantly looking for system leaks and implementing system monitoring. And we will probably have to replace it with something new while keeping the existing one running.

Picture by Korona Lacasse - Creative Commons 2.0 Attribution 2.0 Generic https://www.flickr.com/photos/korona4reel/14107877324/sizes/l


Distributed Systems at Scale: Reducing the Fail from Kim Moir

In preparation for this talk, I did a lot of reading on complex systems design and designing for recovery from failure in distributed systems. In particular, I read Donella Meadows' book Thinking in Systems. (Cate Huston reviewed the book here). I also watched several talks by people who spoke about the challenges they face managing their distributed systems, including the following:
I'd also like to thank all the members of Mozilla releng/ateam who reviewed my slides and provided feedback before I gave the presentations.
The attendees of the summit attended the same keynote as the LISA attendees. Jez Humble, well known for his Continuous Delivery and Lean Enterprise books, provided a keynote on Lean Configuration Management which I really enjoyed. (Older versions of the slides, from other conferences, are available here and here.)



In particular, I enjoyed his discussion of the cultural aspects of devops. I especially like that he stated that "You should not have to have planned downtime or people working outside business hours to release". He also talked a bit about how many of the leaders looked up to as visionaries in the tech industry are known for not treating people very well, and how this is not a good example to set for others who believe that to be the key to their success. For instance, he said something like "what more could Steve Jobs have accomplished had he treated his employees less harshly?"

Another concept he discussed which I found interesting was that of the strangler application. When moving from a large monolithic application, the goal is to split the existing functionality out into services until the original application is left with nothing. This is exactly what Mozilla releng is doing as we migrate from Buildbot to Taskcluster.


http://www.slideshare.net/jezhumble/architecting-for-continuous-delivery-54192503


At the release engineering summit itself, Lukas Blakk from Pinterest gave a fantastic talk, Stop Releasing off Your Laptop—Implementing a Mobile App Release Management Process from Scratch in a Startup or Small Company. It included a grumpy cat picture to depict how Lukas thought the rest of the company felt when a more structured release process was implemented.


Lukas also included a timeline of the tasks she implemented in her first six months working at Pinterest. Very impressive to see the transition!


Another talk I enjoyed was Chaos Patterns - Architecting for Failure in Distributed Systems by Jos Boumans of Krux. (Similar slides from an earlier conference here). He talked about some high profile distributed systems that failed and how chaos engineering can help illuminate these issues before they hit you in production.


For instance, it is impossible for Netflix to model their entire system outside of production, given that they consume around one third of nightly downstream bandwidth in the US.

Evan Willey and Dave Liebreich from Pivotal Cloud Foundry gave a talk entitled "Pivotal Cloud Foundry Release Engineering: Moving Integration Upstream Where It Belongs". I found this talk interesting because they talked about how they built Concourse, a CI system that is more scalable and natively builds pipelines. Travis and Jenkins are good for small projects, but they simply don't scale for large numbers of commits, platforms to test, or complicated pipelines. We followed a similar path, one that led us to develop Taskcluster.

There were many more great talks, hopefully more slides will be up soon!

November 24, 2015 03:57 PM

November 19, 2015

Armen Zambrano G. (@armenzg)

Buildapi client released - thanks F3real!

When we wrote Mozilla CI tools, we created a module called buildapi.py in order to schedule Buildbot jobs via Self-serve/Buildapi.

We recently ported it as a Python package and released it:
https://pypi.python.org/pypi/buildapi_client

This was all thanks to F3real, who joined us from Mozilla's community and released his first Python package. He has also brought over the integration tests we wrote for it. Here's the issue and PR if you're curious.

F3real will now be looking at removing the buildapi module from mozci and making use of the python package instead.

Thanks F3real!


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

November 19, 2015 04:31 PM

Chris AtLee (catlee)

MozFest 2015

I had the privilege of attending MozFest last week. Overall it was a really great experience. I met lots of really wonderful people, and learned about so many really interesting and inspiring projects.

My biggest takeaway from MozFest was how important it is to provide good APIs and data for your systems. You can't predict how somebody else will be able to make use of your data to create something new and wonderful. But if you're not making your data available in a convenient way, nobody can make use of it at all!

It was a really good reminder for me. We generate a lot of data in Release Engineering, but it's not always exposed in a way that's convenient for other people to make use of.

The rest of this post is a summary of various sessions I attended.

Friday

Friday night started with a Science Fair. Lots of really interesting stuff here. Some of the projects that stood out for me were:

  • naturebytes - a DIY wildlife camera based on the raspberry pi, with an added bonus of aiding conservation efforts.
  • histropedia - really cool visualizations of time lines, based on data in Wikipedia and Wikidata. This was the first time I'd heard of Wikidata, and the possibilities were very exciting to me! More on this later, as I attended a whole session on Wikidata.
  • Several projects related to the Internet-of-Things (IOT)

Saturday

On Saturday, the festival started with some keynotes. Surman spoke about how MozFest was a bit chaotic, but this was by design. In a similar way that the web is an open platform that you can use as a platform for building your own ideas, MozFest should be an open platform so you can meet, brainstorm, and work on your ideas. This means it can seem a bit disorganized, but that's a good thing :) You get what you want out of it.

I attended several good sessions on Saturday as well:

  • Ending online tracking. We discussed various methods currently used to track users, such as cookies and fingerprinting, and what can be done to combat these. I learned, or re-learned, about a few interesting Firefox extensions as a result:

    • privacybadger. Similar to Firefox's tracking protection, except it doesn't rely on a central blacklist. Instead, it tries to automatically identify third party domains that are setting cookies, etc. across multiple websites. Once identified, these third party domains are blocked.
    • https everywhere. Makes it easier to use HTTPS by default everywhere.
  • Intro to D3JS. d3js is a JS data visualization library. It's quite powerful, but something I learned is that you're expected to do quite a bit of work up-front to make sure it's showing you the things you want. It's not great as a data exploration library, where you're not sure exactly what the data means, and want to look at it from different points of view. The nvd3 library may be more suitable for first time users.

  • 6 kitchen cases for IOT We discussed the proposed IOT design manifesto briefly, and then split up into small groups to try and design a product, using the principles outlined in the manifesto. Our group was tasked with designing some product that would help connect hospitals with amateur chefs in their local area, to provide meals for patients at the hospital. We ended up designing a "smart cutting board" with a built in display, that would show you your recipes as you prepared them, but also collect data on the frequency of your meal preparations, and what types of foods you were preparing.

    Going through the exercise of evaluating the product with each of the design principles was fun. You could be pretty evil going into this and try and collect all customer data :)

Sunday

  • How to fight an internet shutdown - we role played how we would react if the internet was suddenly shut down during some political protests. What kind of communications would be effective? What kind of preparation can you have done ahead of time for such an event?

    This session was run by Deji from accessnow. It was really eye opening to see how internet shutdowns happen fairly regularly around the world.

  • Data is beautiful: Introduction to Wikidata. Wikidata is like Wikipedia, but for data. An open database of...stuff. Anybody can edit and query the database. One of the really interesting features of Wikidata is that localization is kind of built-in as part of the design. Each item in the database is assigned an id (prefixed by "Q"). E.g. Q42 is Douglas Adams. The description for each item is simply a table of locale -> localized description. There's no inherent bias towards English, or any other language. The beauty of this is that you can reference the same piece of data from multiple languages, only having to focus on localizing the various descriptions. You can imagine different translations of the same Wikipedia page right now being slightly inconsistent due to each one having to be updated separately. If they could instead reference the data in Wikidata, then there's only one place to update the data, and all the other places that reference that data would automatically benefit from it. (See the sketch after this list for what the locale -> description table looks like in practice.)

    The query language is quite powerful as well. A simple demonstration was "list all the works of art in the same room in the Louvre as the Mona Lisa."

    It really got me thinking about how powerful open data is. How can we in Release Engineering publish our data so others can build new, interesting and useful tools on top of it?

  • Local web. Various options for purely local webs / networks were discussed. There are some interesting mesh network options available; commotion was demoed. These kinds of distributions give you file exchange, messaging, email, etc. on a local network that's not necessarily connected to the internet.
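Returning to the Wikidata session above, here's a minimal sketch against Wikidata's public entity-data endpoint showing the locale -> description table for Q42; the exact response layout is an assumption based on that endpoint's JSON format:

import requests

# Fetch the entity record for Q42 (Douglas Adams) and print its
# description in a few locales - the same fact, localized per language.
url = 'https://www.wikidata.org/wiki/Special:EntityData/Q42.json'
entity = requests.get(url, timeout=60).json()['entities']['Q42']
for locale in ('en', 'fr', 'de'):
    description = entity['descriptions'].get(locale, {}).get('value')
    print(locale, '->', description)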

November 19, 2015 01:35 PM

November 16, 2015

Nick Thomas (nthomas)

The latest on firefox/releases/latest

The primary way to download Firefox is at www.mozilla.org, but Mozilla’s Release Engineering team has also maintained directories like

https://ftp.mozilla.org/pub/firefox/releases/latest/

to provide a stable location for scripted downloads. There are similar links for betas and extended support releases for organisations. Read on to learn how these directories have changed, and how you can continue to download the latest releases.

Until recently these directories were implemented using a symlink to the current version, for example firefox/releases/42.0/. The storage backend has now changed to Amazon S3 and this is no longer possible. To implement the same functionality we’d need a duplicate set of keys, which incurs more maintenance overhead. And we already have a mechanism for delivering files independent of the current shipped version – our download redirector Bouncer. For example, here’s the latest release for Windows 32bit, U.S. English:

https://download.mozilla.org/?product=firefox-latest&os=win&lang=en-US

Modifying the product, os, and/or lang allows other combinations. This is described in the README.txt files for beta, release, and esr, as well as the Thunderbird equivalents release and beta.
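For example, a minimal Python sketch of a scripted download (the linux64 os value is an assumption following the product/os/lang pattern described in those README files):

import requests

# Follow the Bouncer redirect to the latest release for a given
# product/os/lang combination and save it locally.
params = {'product': 'firefox-latest', 'os': 'linux64', 'lang': 'en-US'}
response = requests.get('https://download.mozilla.org/', params=params,
                        stream=True, timeout=300)
response.raise_for_status()
with open('firefox-latest.tar.bz2', 'wb') as fh:
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        fh.write(chunk)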

Please adapt your scripts to use download.mozilla.org links. We hope it will help you simplify at the same time, as scraping to determine the current version is no longer necessary.

PS. We’ve also removed some latest- directories which were old and crufty, eg firefox/releases/latest-3.6.

November 16, 2015 11:37 PM

November 13, 2015

Hal Wine (hwine)

Complexity & * Practices

Complexity & * Practices

I was fortunate enough to be able to attend Dev Ops Days Silicon Valley this year. One of the main talks was given by Jason Hand, and he made some great points. I wanted to highlight two of them in this post:

  1. Post Mortems are really learning events, so you should hold them when things go right, right? RIGHT!! (Seriously, why wouldn’t you want to spot your best ideas and repeat them?)
  2. Systems are hard – if you’re pushing the envelope, you’re teetering on the line between complexity and chaos. And we’re all pushing the envelope these days - either by getting fancy or getting lean.

Post Mortems as Learning Events

Our industry has talked a lot about “Blameless Post Mortems”, and techniques for holding them. Well, we can call them “blameless” all we want, but if we only hold them when things go wrong, folks will get the message loud and clear.

If they are truly blameless learning events, then you would also hold them when things go right. And go meh. Radical idea? Not really - why else would sports teams study game films when they win? (This point was also made in a great Ignite by Katie Rose: GridIronOps - go read her slides.)

My $0.02 is - this would also give us a chance to celebrate success. That is something we do not do enough, and we all know the dedication and hard work it takes to not have things go sideways.

And, by the way, terminology matters during the learning event. The person who is accountable for an operation is just that: capable of giving an account of the operation. Accountability is not responsibility.

Terminology and Systems – Setting the right expectations

Part way through Jason’s talk, he has this awesome slide about how system complexity relates to monitoring which relates to problem resolution. Go look at slide 19 - here’s some of what I find amazing in that slide:

  • It is not a straight line with a destination. Your most stable system can suddenly display inexplicable behavior due to any number of environmental reasons. And you’re back in the chaotic world with all that implies.
  • Systems can progress out of chaos, but that is an uphill battle. Knowing which stage a system is in (roughly) informs the approach to problem resolution.
  • Note the wording choices: “known” vs “unknowable” – for all but the “obvious” case, it will be confusing. That is a property of the system, not a matter of staff competency.

While not in his slide, Jason spoke to how each level really has different expectations. Or should have, but often the appropriate expectation is not set. Here’s how he related each level to industry terms.

Best Practices:

The only level with enough certainty to be able to expect the “best” is the known and familiar one. This is the “obvious” one, because we’ve all done exactly this before over a long enough time period to fully characterize the system, its boundaries, and abnormal behavior.

Here, cause and effect are tightly linked. Automation (in real time) is possible.

Good Practices:

Once we back away from such certainty, it is only realistic to have less certainty in our responses. With the increased uncertainty, the linkage of cause and effect is more tenuous.

Even if we have all the event history and logs in front of us, more analysis is needed before appropriate corrective action can be determined. Even with automation, there is a latency to the response.

Emergent Practices:

Okay, now we are pushing the envelope. The system is complex, and we are still learning. We may not have all the data at hand, and may need to poke the system to see what parts are stuck.

Cause and effect should be related, but how will not be visible until afterwards. There is much to learn.

Novel Practices:

For chaotic systems, everything is new. A lot is truly unknowable because that situation has never occurred before. Many parts of the system are effectively black boxes. Thus resolution will often be a process of trying something, waiting to see the results, and responding to the new conditions.

Next Steps

There is so much more in that diagram I want to explore. The connecting of problem resolution behavior to complexity level feels very powerful.

<hand_waving caffeine_level="deprived">

My experience tells me that many of these subjective terms are highly context sensitive, and in no way absolute. Problem resolution at 0300 local with a bad case of the flu just has a way of making “obvious” systems appear quite complex or even chaotic.

By observing the behavior of someone trying to resolve a problem, you may be able to get a sense of how that person views that system at that time. If that isn’t the consensus view, then there is a gap. And gaps can be bridged with training or documentation or experience.

</hand_waving>

November 13, 2015 08:00 AM

October 23, 2015

Nick Thomas (nthomas)

Updates disabled for Android Nightly and Aurora

Due to a bug with the new ftp server we've had to disable updates for Android Nightly and Aurora.

They’ll resume just as soon as we can get the fix landed.

Update (Oct 25th): Updates are re-enabled, thanks to Mike Shal for the fix.

October 23, 2015 08:39 AM

October 21, 2015

Nick Thomas (nthomas)

Try Server – please use up-to-date code to avoid upload failures

Today we started serving an important set of directories on ftp.mozilla.org using Amazon S3, more details on that over in the newsgroups. Some configuration changes landed in the tree to make that happen.

Please rebase your try pushes to use revision 0ee21e8d5ca6 or later, currently on mozilla-inbound. Otherwise your builds will fail to upload, which means they won’t run any tests. No fun for anyone.

October 21, 2015 10:02 AM

October 13, 2015

Armen Zambrano G. (@armenzg)

mozci 0.15.1: Buildbot Bridge support + bug fixes


It's been a while since our last announced release, and I would like to highlight some of the significant changes since then.
A major highlight of the latest release is support for scheduling Buildbot jobs through TaskCluster, making use of the Buildbot Bridge. For more information about this please read here.

Contributors

@nikkisquared is our latest contributor who landed more formalized exceptions and error handling. Thanks!
As always, many thanks to @adusca and @vaibhavmagarwal for their endless contributions.

How to update

Run "pip install -U mozci" to update

New Features

  • Query Treeherder for hidden jobs
  • BuildbotBridge support
    • It allows submitting a graph of Buildbot jobs with dependencies
    • The graph is submitted to TaskCluster

Minor changes

  • Fix backfilling issue
  • Password prompt improvements
  • More specific exceptions
  • Skip Windows 10 jobs for talos

All changes

You can see all changes in here:
0.14.0...0.15.1


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

October 13, 2015 07:55 PM

October 02, 2015

Hal Wine (hwine)

duo MFA & viscosity no-cell setup

duo MFA & viscosity no-cell setup

The Duo application is nice if you have a supported mobile device, and via TOTP it's usable even when you have no cell connection. However, getting Viscosity to allow both choices took some work for me.

For various reasons, I don't want to always use the Duo application, so I would like Viscosity to always prompt for a password. (I had already saved a password - a fresh install likely would not have that issue.) That took a bit of work, and some web searches.

  1. Disable any saved passwords for Viscosity. On a Mac, this means opening up “Keychain Access” application, searching for “Viscosity” and deleting any associated entries.

  2. Ask Viscosity to save the “user name” field (optional). I really don’t need this, as my setup uses a certificate to identify me. So it doesn’t matter what I type in the field. But, I like hints, so I told Viscosity to save just the user name field:

    defaults write com.viscosityvpn.Viscosity RememberUsername -bool true

With the above, you’ll be prompted every time. You have to put “something” in the user name field, so I chose to put “push or TOTP” to remind me of the valid values. You can put anything there, just do not check the “Remember details in my Keychain” toggle.

October 02, 2015 07:00 AM

September 25, 2015

Kim Moir (kmoir)

The mystery of high pending counts

In September, Mozilla release engineering started experiencing high pending counts on our test pools, notably Windows, but also Linux (and consequently Android).  High pending counts mean that there are thousands of jobs queued to run on the machines that are busy running other jobs.  The time developers have to wait for their test results is longer than ideal.


Usually, pending counts clear overnight as less code is pushed during the night (in North America), which invokes fewer builds and tests. However, as you can see from the graph above, the Windows test pending counts were flat last night; they did not clear up overnight. You will also note that try, which usually comprises 63% of our load, has the very highest pending counts compared to other branches. This is because many people land on try before pushing to other branches, and tests aren't coalesced on try.


The work to determine the cause of high pending counts is always an interesting mystery.
Mystery by ©Stuart Richards, Creative Commons by-nc-sa 2.0

Joel Maher and I looked at the data last week and discovered what we believe to be the source of the problem. We determined that since the end of August a number of new test jobs were enabled that increased the compute time per push on Windows by 13%, or 2.5 hours per push. Most of these new test jobs are for e10s.
Increase in seconds that new jobs added to the total compute time per push. (Some existing jobs also reduced their compute time, for a total difference of about 2.5 more hours per push on Windows.)
The e10s initiative is important for Mozilla, making Firefox performance and security even better. However, since new e10s tests and old tests will continue to run in parallel, we need to get creative about how to maintain acceptable wait times given the limitations of our current Windows test pools. (All of our Windows tests run on bare metal in our datacentre, not on Amazon.)
 
Release engineering is working to reduce these pending counts, given our current hardware constraints, with the following initiatives:

To reduce Linux pending counts:
  • Added 200 new instances to the tst-emulator64 pool (run Android test jobs on Linux emulators) (bug 1204756)
  • In process of adding more Linux32 and Linux64 buildbot masters (bug 1205409) which will allow us to expand our capacity more

Ongoing work to reduce the Windows pending counts:


How can you help? 

Please be considerate when invoking try pushes and only select the platforms that you explicitly require to test.  Each try push for all platforms and all tests invokes over 800 jobs.

September 25, 2015 07:47 PM

September 22, 2015

Hal Wine (hwine)

Using Password Store

Using Password Store

Password Store (aka “pass”) is a very handy wrapper for dealing with pgp encrypted secrets. It greatly simplifies securely working with multiple secrets. This is still true even if you happen to keep your encrypted secrets in non-password-store managed repositories, although that setup isn’t covered in the docs. I’ll show my setup here. (See the Password Store page for usage: “pass show -c <spam>” & “pass search <eggs>” are among my favorites.)

Short version:
  1. Have gpg installed on your machine.

  2. Install Password Store on your machine. There are OS specific instructions. Be sure to enable tab completion for your shell!

  3. Setup a local password store. Scroll down in the usage section to “Setting it up” for instructions.

  4. Clone your secrets repositories to your normal location. Do not clone inside of ~/.password-store/.

  5. Set up symlinks inside of ~/.password-store/ to directories inside your clone of the secrets repository. I did:

    ln -s ~/path/to/secrets-git/passwords rePasswords
    ln -s ~/path/to/secrets-git/keys reKeys
  6. Enjoy command line search and retrieval of all your secrets. (Use the regular method for your separate secrets repository to add and update secrets.)

Rationale:

  • By using symlinks, pass will not allow me to create or update secrets in the other repositories. That prevents mistakes, as the process is different for each of those alternate stores.
  • I prefer to have just one tree of secrets to search, rather than the “multiple configuration” approach documented on the Password Store site.
  • By using symlinks, I can control the global namespace, and use names that make sense to me.
  • I’ve migrated from using KeePassX to using pass for my personal secret management. That is my “main” password-store setup (backed by a git repo).

Notes:

  • If you’d prefer a GUI, there’s qtpass which also works with the above setup.

September 22, 2015 07:00 AM

September 21, 2015

Armen Zambrano G. (@armenzg)

Minimal job scheduling

One of the biggest benefits of moving the scheduling into the tree is that you can adjust the decisions on what to schedule from within the tree.

As chmanchester and I were recently discussing, this is very important as we believe we can do much better at deciding what to schedule on try.

Currently, too many developers push to try with -p all -u all (which schedules all platforms and all tests). It is understandable: it is the easiest way to reduce the risk of your change being backed out when it lands on one of the integration trees (e.g. mozilla-inbound).

In-tree scheduling analysis

What if your changes were analysed and we determined a best-educated-guess set of platforms and test jobs required to test your changes, so that you would not be backed out on an integration tree?

For instance, when I push Mozharness changes to mozilla-inbound, I wish I could tell the system that I only need this set of platforms and not those other ones.

If everyone had the minimum number of jobs added to their pushes, our systems would be able to return results faster (less load) and no one would need to take shortcuts.

This would be a best approximation, and we would need to fine-tune the logic over time to get things as right as possible. We would need to find the right balance between some changes being backed out because we did not get the scheduling right on try, and getting results faster for everyone.

Prioritized tests

There is already some code that chmanchester landed where we can tell the infrastructure to run a small set of tests based on the files changed. In this case we hijack one of the jobs (e.g. mochitest-1) to run the tests most relevant to your changes, which would normally be tested in different chunks. Once the prioritized tests are run, we can run the remaining tests as we normally would. Prioritized tests also apply to suites that are not chunked (run a subset of tests instead of all).

There are some UI problems in here that we would need to figure out with Treeherder and Buildbot.

Tiered testing

Soon, we will have all the technological pieces to create a multi-tiered job scheduling system.

For instance, we could run things in this order (just a suggestion):

This has the advantage of using the prioritized tests as a canary job, preventing the remaining tests from running if the canary (shorter) job fails.

Post minimal run (automatic): precise scheduling (manual)

This is not specifically about scheduling the right thing automatically but about extending what gets scheduled automatically.
Imagine that you're not satisfied with what gets scheduled automatically and you would like to add more jobs (e.g. missing platforms or missing suites).
You will be able to add those missing jobs later, directly from Treeherder, by selecting which jobs are missing.
This will be possible once bug 1194830 lands.

NOTE: Mass scheduling (e.g. all mochitests across all platforms) would be a bit of a pain to do through Treeherder. We might want to do a second version of try-extender.



Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

September 21, 2015 01:15 PM

September 18, 2015

Armen Zambrano G. (@armenzg)

Mozharness' support for Buildbot Bridge triggered test jobs

Today I landed [1] some code which allows Buildbot *test* jobs triggered through the Buildbot Bridge (BBB) to work properly.

In my previous post I explained a bit about how Mozci works with the Buildbot Bridge.
In this post I will only explain what we fixed on the Mozharness side.

Buildbot Changes

If a Buildbot test job is scheduled through TaskCluster (the Buildbot Bridge supports this), then the generated Buildbot Change associated with the test job does not have the installer and test URLs necessary for Mozharness to download for the test job.

What is a Buildbot Change? It is an object which represents the files touched by a code push. For build jobs, this value gets set as part of polling the Mercurial repositories; test jobs, however, are triggered via a "buildbot sendchange" step that is part of the build job.
This sendchange creates the Buildbot Change for the test job, which Mozharness can then use.
The BBB does not listen to sendchanges; hence, jobs triggered via the BBB have an empty Changes object. Therefore, we can't download the files needed to run a test job, and it fails to execute.

In order to overcome this limitation, we have to detect whether a Buildbot job was triggered normally or through the Buildbot Bridge.
Buildbot Bridge triggered jobs have a 'taskId' property defined (this represents the task associated with this Buildbot job). Through this 'taskId' we can determine the parent task and find a file called properties.json [2], which is uploaded by every BBB-triggered job.
In that file, we can find both the installer and test URLs.
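As a rough sketch of that lookup, assuming the parent task's ID is already known (the taskId below comes from the example artifact in footnote [2], and the queue URL shape is an assumption based on Taskcluster's v1 artifact API; the real Mozharness code also has to walk the graph to find the parent):

import requests

# Fetch the properties.json artifact uploaded by a BBB-triggered
# build job; the installer and test package URLs live in there.
parent_task_id = 'E7lb9P-IQjmeCOvkNfdIuw'
url = ('https://queue.taskcluster.net/v1/task/%s'
       '/artifacts/public/properties.json' % parent_task_id)
response = requests.get(url, timeout=60)
response.raise_for_status()
print(response.json())  # inspect for the installer/test URLs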


[1] https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=03233057f1e6
[2] https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/E7lb9P-IQjmeCOvkNfdIuw/0/public/properties.json

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

September 18, 2015 07:26 PM

Mozilla CI tools: Upcoming support for Buildbot Bridge

What is the Buildbot Bridge?

The Buildbot Bridge (BBB) allows scheduling Buildbot jobs through TaskCluster.
In other words, you can have taskcluster tasks which represent Buildbot jobs.
This allows having TaskCluster graphs composed of tasks which will be executed either on Buildbot or TaskCluster, hence allowing *all* relationships between tasks to happen in TaskCluster.

Read my recent post on the benefits of scheduling every job via TaskCluster.

The next Mozilla CI tools (mozci) release will have support for BBB.

Brief explanation

You can see both types of Buildbot jobs in this try push [1].
One set of jobs was triggered through Buildbot's analysis of the try syntax in the commit message, while two of the jobs were not requested by that try syntax at all.

Those two jobs were triggered off-band via Mozci submitting a task graph.
You can see the TaskCluster graph representing them in here [2].

These jobs were triggered using this simple data structure:
{
  'Linux x86-64 try build': [
    'Ubuntu VM 12.04 x64 try opt test mochitest-1'
  ]
}

Mozci turns this simple graph into a TaskCluster graph.
The graph is composed of tasks which follow this structure [3].

Notes about the Buildbot Bridge

bhearsum's original post, recording and slides:
http://hearsum.ca/blog/buildbot-taskcluster-bridge-an-overview.html
https://vreplay.mozilla.com/replay/showRecordDetails.html?recId=1879
http://hearsum.ca/slides/buildbot-bridge-overview/#/

Some notes which Selena took about the topic:
http://www.chesnok.com/daily/2015/06/03/taskcluster-migration-about-the-buildbot-bridge/

The repository is in here:
https://github.com/mozilla/buildbot-bridge

---------
[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=9417c6151f2c
[2] https://tools.taskcluster.net/task-graph-inspector/#mAI0H1GyTJSo-YwpklZqng
[3] https://github.com/armenzg/mozilla_ci_tools/blob/mozci_bbb/mozci/sources/buildbot_bridge.py#L37


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

September 18, 2015 07:16 PM

September 17, 2015

Armen Zambrano G. (@armenzg)

Platform Operations lightning talks (Whistler 2015)

You can read about and watch the Platform Operations lightning talks here:
https://wiki.mozilla.org/Platform_Operations/Ligthning_talks

Here are the landing pages for the various Platform Operations teams:
https://wiki.mozilla.org/Platform_Operations

PS: I don't know what composes Platform Operations, so feel free to add your team if I'm missing it.


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

September 17, 2015 03:57 PM

September 10, 2015

Armen Zambrano G. (@armenzg)

The benefits of moving per-push Buildbot scheduling into the tree

Some of you may be aware of the Buildbot Bridge (aka BBB) work that bhearsum worked on during Q2 of this year. This system allows scheduling TaskCluster graphs for Buildbot builders. For every Buildbot job, there is a TaskCluster task that represents it.
This is very important as it will help transition the release process to TaskCluster piece by piece, without having to move large chunks of code at once. You can have graphs of tasks that run on both Buildbot and TaskCluster.

I recently added to Mozilla CI tools the ability to schedule Buildbot jobs by submitting a TaskCluster graph (the BBB makes this possible).

Even though the initial work for the BBB is intended for Release tasks, I believe there are various benefits if we moved the scheduling into the tree (currently TaskCluster works like this; look for the gecko decision task in Treeherder).

To read another great blog post about try syntax and scheduling, please visit ahal's post "Looking beyond Try Syntax".

NOTE: Try scheduling might not have try syntax in the future so I will not talk much about trychooser and try syntax. Read ahal's post to understand a bit more.

Benefits of in-tree scheduling:

There are various parts that will need to be in place before we can do this. Here are some that I can think of:
  • TaskCluster's big-graph scheduling
    • This is important since it will allow for the concept of coalescing to exist in TaskCluster
  • Task prioritization
    • This is important if we're to have different levels of priority for jobs on TaskCluster
    • On Buildbot we have release repositories with the highest priority and the try repo having the lowest
    • We also currently have the ability to raise/decrease task priorities through self-serve/buildapi. This is used by developers, especially on Try, to allow their jobs to be picked up sooner.
  • Treeherder to support LDAP authentication
    • It is a better security model for scheduling changes
    • If we want to move away from self-serve/buildapi we need this
  • Allow test jobs to find installer and test packages
    • Currently test jobs scheduled through the BBB cannot find the Firefox installer and the test packages
Can you think of other benefits? Can you think of problems with this model? Are you aware of other pieces needed before moving forward to this model? Please let me know!



Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

September 10, 2015 02:13 PM

August 03, 2015

Rail Alliev (rail)

Funsize enabled for Nightly builds

Keep calm and update Firefox

Note: this post has been sitting in the drafts queue for some reason. Better to publish it. :)

As of Tuesday, Aug 4, 2015, Funsize has been enabled on mozilla-central. From now on all Nightly builds will get updates for builds up to 4 days in the past (for yesterday, for the day before yesterday, etc.). This should make people who don't run their nightlies every day happier.

Firefox Developer Edition partial updates will be enabled after 42.0 hits mozilla-aurora, but early adopters can use the aurora-funsize channel.

Partial updates as a part of builds and L10N repacks on nightly will be disabled as soon as Bug 1173459 is resolved.

As a bonus, you can take a look at the presentation I gave in Whistler during the work week.

Reporting Issues

If you see any issues, please report them to Bugzilla.

August 03, 2015 09:22 PM

July 30, 2015

Hal Wine (hwine)

Decoding Hashed known_hosts Files

Decoding Hashed known_hosts Files

tl;dr: You might find this gist handy if you enable HashKnownHosts

Modern ssh comes with the option to obfuscate the hosts it can connect to, by enabling the HashKnownHosts option, and modern server installs enable it by default. This is a good thing.

The obfuscation occurs by hashing the first field of the known_hosts file - this field contains the hostname, port, and IP address used to connect to a host. Presumably, there is a private ssh key on the host used to make the connection, so this process makes it harder for an attacker to utilize those private keys if the server is ever compromised.

Super! Nifty! Now how do I audit those files? Some services have multiple IP addresses that serve a host, so some updates and changes are legitimate. But which ones? It's a one-way hash, so you can't decode it.

Well, if you had an unhashed copy of the file, you could match host keys and determine the host name & IP. [1] You might just have such a file on your laptop (at least I don’t hash keys locally). [2] (Or build a special file by connecting to the hosts you expect with the options “-o HashKnownHosts=no -o UserKnownHostsFile=/path/to/new_master”.)

I threw together a quick python script to do the matching, and it's at this gist. I hope it's useful - as I find bugs, I'll keep it updated.
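For the curious, a hashed first field has the form |1|<salt>|<hash>, where the hash is an HMAC-SHA1 of the hostname keyed by the salt. A minimal sketch of the matching idea (not the gist's actual code):

import base64
import hashlib
import hmac

def entry_matches(hashed_field, hostname):
    """Check a hashed known_hosts first field against a plaintext host.

    hashed_field looks like "|1|<base64 salt>|<base64 HMAC-SHA1 digest>".
    """
    _, magic, salt_b64, digest_b64 = hashed_field.split("|")
    if magic != "1":
        raise ValueError("unsupported hash scheme: " + magic)
    salt = base64.b64decode(salt_b64)
    digest = hmac.new(salt, hostname.encode(), hashlib.sha1).digest()
    return hmac.compare_digest(digest, base64.b64decode(digest_b64))

# e.g. entry_matches("|1|kRDF...|9lZK...", "hg.mozilla.org")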

Bonus Tip: https://github.com/defunkt/gist is a very nice way to manage gists from the command line.

Footnotes

[1]A lie - you'll only get the host names and IPs that you have connected to while building your reference known_hosts file.
[2]I use other measures to keep my local private keys unusable.

July 30, 2015 07:00 AM

July 22, 2015

Armen Zambrano G. (@armenzg)

Few mozci releases + reduced memory usage

While I was away, adusca shipped a few releases of mozci.

From the latest release I want to highlight that we're replacing the standard json library with ijson, since it solves some memory leak issues we were facing in pulse_actions (bug 1186232).

This was important to fix since our Heroku instance for pulse_actions has an upper limit of 1GB of RAM.
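The difference is that json.load() materializes the entire document in memory, while ijson walks the file as a stream. The pattern looks roughly like this (a sketch - the "item" prefix assumes a top-level JSON array; the right prefix depends on the document's shape):

import ijson

# json.load(f) would build the whole structure in RAM before we could
# touch it; ijson yields one element at a time as it streams the file.
with open("big.json", "rb") as f:
    for record in ijson.items(f, "item"):
        print(record)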

Here are the release notes and their highlights:




Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

July 22, 2015 02:21 PM

July 21, 2015

Jordan Lund (jlund)

Mozharness now lives in Gecko

What's changed?

Continuous-integration and release jobs that use Mozharness will now get Mozharness from the Gecko repo that the job is running against.

How?

Whether the job is a build (requires a full gecko checkout) or a test (only requires a Firefox/Fennec/Thunderbird/B2G binary), automation will first grab a copy of Mozharness from the gecko tree, even before checking out the rest of the tree, effectively minimizing changes to our current infra.

This is thanks to a new relengapi endpoint, Archiver, and hg.mozilla.org's sub-directory archiving abilities. Essentially, Archiver will get a tarball of Mozharness from a target gecko repo, rev, and sub-directory, and upload it to Amazon's S3.

What's nice about Archiver is that it is not restricted to just grabbing Mozharness. You could, for example, put https://hg.mozilla.org/build-tools in the Gecko tree or, improving on our tests.zip model, simply grab subdirectories from within the testing/* part of the tree and request them on a suite-by-suite basis.

What does this mean for you?

It depends. If you are...

1) developing on Mozharness

You will need to check out gecko, and patches will now land like any other gecko patch: 1) land on a development tree-branch (e.g. mozilla-inbound) 2) ride the trains. This now means:

This also means:

2) just needing to deploy Mozharness or get a copy of it without gecko

As the Archiver usage docs linked above show, you could hit the API directly. But I recommend using the client that buildbot uses. The client will wait until the api call is complete, download the archive from a response location, and unpack it to a specified destination.

Let's take a look at that in action: say you want to download and unpack a copy of mozharness based on mozilla-beta at 93c0c5e4ec30 to some destination.

python archiver_client.py mozharness --repo releases/mozilla-beta --rev 93c0c5e4ec30 --destination /home/jlund/downloads/mozharness  

Note: if that was the first time Archiver was polled for that repo + rev, it might take a few seconds as it has to download Mozharness from hgmo and then upload it to S3. Subsequent calls will happen near-instantly.

Note 2: if your --destination path already exists with a copy of Mozharness or something else, the client won't rm that path; it will merge the contents (just as unpacking a tarball would).
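In other words, the unpack step behaves like Python's tarfile extracting over an existing directory (a sketch of the behaviour, not the client's actual code):

import tarfile

# Extracting into an existing directory merges the archive's contents
# with whatever is already on disk; the directory is never removed first.
with tarfile.open("mozharness.tar.gz", "r:gz") as archive:
    archive.extractall(path="/home/jlund/downloads/mozharness")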

3) a Release Engineering service that is still using hg.mozilla.org/build/mozharness

Not all Mozharness scripts are used for continuous integration / release jobs. There are a number of Releng services that are based on Mozharness: e.g. Bumper, vcs-sync, and merge_day. Until these services transition to using Archiver, they will continue to use hgmo/build/mozharness as the Repository of Record (RoR).

If certain services cannot use gecko-based Mozharness, then we can fork Mozharness and set up a separate repo. That will of course mean such services won't receive upstream changes from the gecko copy, so we should avoid this if possible.

If you are an owner or major contributor to any of these releng services, we should meet and talk about such a transition. Archiver and its client should make deployments pretty painless in most cases.

Have something that may benefit from Archiver?

If you want to move something into a larger repository, or be able to pull something out of such a repository for lightweight deployments, feel free to chat with me about Archiver and Relengapi.

As always, please leave your questions, comments, and concerns below.

July 21, 2015 10:28 PM

July 17, 2015

Kim Moir (kmoir)

Learning Data Science and evidence based teaching methods

This spring, I took several online courses on the topic of data science. I became interested in expanding my skills in this area because, as release engineers, we deal with a lot of data. I wanted to learn new tools to extract useful information from the distributed systems behemoth we manage.


This xkcd reminded me of the challenges of managing our buildfarm some days :-)
From http://xkcd.com/1546/

I took three courses from Coursera's Data Science track from Johns Hopkins University. As with previous Coursera classes I took, all the course material is online (lecture videos and notes). There are quizzes and assignments that are due each week. Each course below was about four weeks long.

The Data Scientist's Toolbox - This course was pretty easy. Basically an introduction to the questions that data scientists deal with, as well as a primer on installing R, RStudio (an IDE for R), and using GitHub.
R Programming - Introduction to R. Most of the quizzes and examples used publicly available data for the programming exercises. I found I had to do a lot of reading in the R API docs or on Stack Overflow to finish the assignments; the lectures didn't provide a lot of the material needed to complete them. Lots of techniques for subsetting data using R, which I found quite interesting - it reminded me a lot of querying databases with SQL to conduct analysis.
Getting and Cleaning Data - More advanced techniques using R. Using publicly available data sources, we had to clean data in different formats: XML, Excel spreadsheets, comma- or tab-delimited. Given this data, we had to answer many questions and conduct specific analysis by writing R programs. The assignments were pretty challenging and took a long time. Again, the course material didn't really cover everything needed for the assignments, so a lot of additional reading was required.

There are six more courses in the Data Science track that I'll start tackling in the fall, covering subjects such as reproducible research, statistical inference, and machine learning. My next Coursera class is Introduction to Systems Engineering, which I'll start in a couple of weeks. I've really become interested in learning more about this subject after reading Thinking in Systems.

The other course I took this spring was the Software Carpentry instructor training course. The Software Carpentry Foundation teaches researchers basic software skills. For instance, if you are a biologist analyzing large data sets, it is useful to learn how to use R, Python, and version control to store the code you write and share it with others. These are not skills that many scientists acquire in their formal university training, and learning them allows them to work more productively. The instructor course was excellent - thanks, Greg Wilson, for your work teaching us.

We read two books for this course:
Building a Better Teacher: An interesting overview of how teachers are trained in different countries and how to make teaching more effective. Most important: have more opportunities for other teachers to observe your classroom and provide feedback, which I found analogous to how code review makes us better software developers.
How Learning Works: Seven Research-Based Principles for Smart Teaching: A book summarizing the research in disciplines such as education, cognitive science, and psychology on effective techniques for teaching students new material: how assessing students' prior knowledge can help you better design your lessons, how to ask questions to determine what material students are failing to grasp, how to understand students' motivation for learning, and more. Really interesting research.

For the instructor course, we met online every couple of weeks, where Greg would conduct a short discussion on some of the topics via conference call and we would discuss interactively via etherpad. We would then meet in smaller groups later in the week for practice teaching exercises. We also submitted example lessons to the course repo on GitHub. The final project for the course was to teach a short lesson to a group of instructors who gave feedback, and to submit a pull request updating an existing lesson with a fix. Then we are ready to sign up to teach a Software Carpentry course!

In conclusion, data science is a great skill to have if you are managing large distributed systems. Also, using evidence-based teaching methods to help others learn is the way to go!

Other fun data science examples include
Tracking down the Villains: Outlier Detection at Netflix - detecting rogue servers with machine learning
Finding Shoe Stores in 100k Merchants: Using Data to Group All Things - finding out what Shopify merchants sell shoes using Apache Spark and more
Looking Through Camera Lenses: The Application of Computer Vision at Etsy

July 17, 2015 03:41 PM