Planet Taskcluster

December 14, 2018

Wander Lairson Costa

Running packet.net images in qemu

For the past few months, I have been working on adding Taskcluster support for the packet.net cloud provider. The reason for that is to get faster Firefox for Android CI tests: tests showed that jobs run up to 4x faster on bare-metal machines than on EC2.

I set up 25 machines to run a small subset of the production tasks, and so far the results are excellent. The problem is that those machines are up 24/7 and there is no dynamic provisioning; if we need more machines, I have to manually change the terraform script to scale up. We need a smarter way to do that, so we are going to build something similar to the aws-provisioner. That, however, requires a custom packet.net image to speed up instance startup.

The problem is that if you can’t ssh into the machine, there is no way to get access to it and see what’s wrong. In this post, I am going to show how you can run a packet image locally with qemu.

You can find documentation about creating custom packet images here and here.

Let’s create a sample image for the post. After you clone the packet-images repo, run:

$ ./tools/build.sh -d ubuntu_14_04 -p t1.small.x86 -a x86_64 -b ubuntu_14_04-t1.small.x86-dev

This creates the image.tar.gz file, which is your packet image. The goal of this post is not to guide you through creating a custom image; refer to the documentation linked above for that. The goal here is to show how, once you have your image, you can run it locally with qemu.

The first step is to create a qemu disk image to install the packet image into:

$ qemu-img create -f raw linux.img 10G

This command creates a raw qemu image. We now need to create a disk partition:

$ cfdisk linux.img

Select dos for the partition table, create a single primary partition and make it bootable. We now need to create a loop device to handle our image:

$ sudo losetup -Pf linux.img

The -f option looks for the first free loop device and attaches it to the image file. The -P option instructs losetup to read the partition table and create a loop device for each partition found; this saves us from having to play with disk offsets. Now let’s find our loop device:

$ sudo losetup -l
NAME        SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE                                         DIO LOG-SEC
/dev/loop1          0      0         1  1 /var/lib/snapd/snaps/gnome-calculator_260.snap      0     512
/dev/loop8          0      0         1  1 /var/lib/snapd/snaps/gtk-common-themes_818.snap     0     512
/dev/loop6          0      0         1  1 /var/lib/snapd/snaps/core_5662.snap                 0     512
/dev/loop4          0      0         1  1 /var/lib/snapd/snaps/gtk-common-themes_701.snap     0     512
/dev/loop11         0      0         1  1 /var/lib/snapd/snaps/gnome-characters_139.snap      0     512
/dev/loop2          0      0         1  1 /var/lib/snapd/snaps/gnome-calculator_238.snap      0     512
/dev/loop0          0      0         1  1 /var/lib/snapd/snaps/gnome-logs_45.snap             0     512
/dev/loop9          0      0         1  1 /var/lib/snapd/snaps/core_6034.snap                 0     512
/dev/loop7          0      0         1  1 /var/lib/snapd/snaps/gnome-characters_124.snap      0     512
/dev/loop5          0      0         1  1 /var/lib/snapd/snaps/gnome-3-26-1604_70.snap        0     512
/dev/loop12         0      0         0  0 /home/walac/work/packet-images/linux.img            0     512
/dev/loop3          0      0         1  1 /var/lib/snapd/snaps/gnome-system-monitor_57.snap   0     512
/dev/loop10         0      0         1  1 /var/lib/snapd/snaps/gnome-3-26-1604_74.snap        0     512

We see that our loop device is /dev/loop12. If we look in the /dev directory:

$ ls -l /dev/loop12*
brw-rw---- 1 root   7, 12 Dec 17 10:39 /dev/loop12
brw-rw---- 1 root 259,  0 Dec 17 10:39 /dev/loop12p1

We see that, thanks to the -P option, losetup created the loop12p1 device for the partition we have. It is time to set up the filesystem:

$ sudo mkfs.ext4 -b 1024 /dev/loop12p1 
mke2fs 1.44.4 (18-Aug-2018)
Discarding device blocks: done                            
Creating filesystem with 10484716 1k blocks and 655360 inodes
Filesystem UUID: 2edfe9f2-7e90-4c35-80e2-bd2e49cad251
Superblock backups stored on blocks: 
        8193, 24577, 40961, 57345, 73729, 204801, 221185, 401409, 663553, 
        1024001, 1990657, 2809857, 5120001, 5971969

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (65536 blocks): done
Writing superblocks and filesystem accounting information: done     

Ok, finally we can mount our device and extract the image to it:

$ mkdir mnt
$ sudo mount /dev/loop12p1 mnt/
$ sudo tar -xzf image.tar.gz -C mnt/

The last step is to install the bootloader. As we are running an Ubuntu image, we will use grub2 for that.

Firstly we need to install grub in the boot sector:

$ sudo grub-install --boot-directory mnt/boot/ /dev/loop12
Installing for i386-pc platform.
Installation finished. No error reported.

Notice we point to the boot directory of our image. Next, we have to generate the grub.cfg file:

$ cd mnt/
$ for i in /proc /dev /sys; do sudo mount -B $i .$i; done
$ sudo chroot .
# cd /etc/grub.d/
# chmod -x 30_os-prober
# update-grub
Generating grub configuration file ...
Warning: Setting GRUB_TIMEOUT to a non-zero value when GRUB_HIDDEN_TIMEOUT is set is no longer supported.
Found linux image: /boot/vmlinuz-3.13.0-123-generic
Found initrd image: /boot/initrd.img-3.13.0-123-generic
done

We bind mount the host’s /dev, /proc, and /sys mount points inside the Ubuntu image, then chroot into it. Next, to prevent grub from creating entries for our host OSes, we disable the 30_os-prober script. Finally, we run update-grub, which creates the /boot/grub/grub.cfg file. Now the only thing left is cleanup:

# exit
$ for i in dev/ sys/ proc/; do sudo umount $i; done
$ cd ..
$ sudo umount mnt/

The commands are self-explanatory. Now let’s run our image:

$ sudo qemu-system-x86_64 -enable-kvm -hda /dev/loop12

[Screenshot: the packet image booting in qemu]

And that’s it, you now can run your packet image locally!
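When you are done experimenting, remember to detach the loop device (using whichever device number losetup assigned on your machine):

$ sudo losetup -d /dev/loop12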

December 14, 2018 12:00 AM

August 29, 2018

John Ford

Shrinking Go Binaries

As part of the efforts to build a new artifact system, I wrote a CLI program to handle Taskcluster Artifact upload and download.  This is written in Go, and as a result the binaries are quite large.  Since I'd like this utility to be used broadly within Mozilla CI, a reasonably sized binary is required, so I was curious what the various size-reduction methods are and what the trade-offs of each would be.

A bit of background is that Go binaries are static binaries which have the Go runtime and standard library built into them.  This is great if you don't care about binary size but not great if you do.

The graph (not reproduced here) plots the binary size on the left Y axis, using a linear scale, in blue, and the number of nanoseconds each byte of reduction takes to compute on the right Y axis, using a logarithmic scale.

To reproduce my results, you can do the following:

go get -u -t -v github.com/taskcluster/taskcluster-lib-artifact-go
cd $GOPATH/src/github.com/taskcluster/taskcluster-lib-artifact-go
git checkout 6f133d8eb9ebc02cececa2af3d664c71a974e833
time (go build) && wc -c ./artifact
time (go build && strip ./artifact) && wc -c ./artifact
time (go build -ldflags="-s") && wc -c ./artifact
time (go build -ldflags="-w") && wc -c ./artifact
time (go build -ldflags="-s -w") && wc -c ./artifact
time (go build && upx -1 ./artifact) && wc -c ./artifact
time (go build && upx -9 ./artifact) && wc -c ./artifact
time (go build && strip ./artifact && upx -1 ./artifact) && wc -c ./artifact
time (go build && strip ./artifact && upx --brute ./artifact) && wc -c ./artifact
time (go build && strip ./artifact && upx --ultra-brute ./artifact) && wc -c ./artifact
time (go build && strip ./artifact && upx -9 ./artifact) && wc -c ./artifact

Since I was removing a lot of debugging information, I figured it'd be worthwhile checking that stack traces are still working. To ensure that I could definitely crash, I decided to panic with an error immediately on program startup.
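In other words, the check amounts to building with the most aggressive settings and confirming that the deliberate panic still produces a readable trace (a sketch of that check):

$ go build && strip ./artifact && upx -9 ./artifact
$ ./artifact    # panics immediately; the goroutine stack trace should still show real function names and source locations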


Even with binary stripping and the maximum compression, I'm still able to get valid stack traces.  A reduction from 9 MB to 2 MB is definitely significant.  The binaries are still large, but they're much smaller than what we started with.  I'm curious whether we can apply this same configuration to other areas of the Taskcluster Go codebase with similar success, and whether the reduction in size is worthwhile there.

I think that using strip and upx -9 is probably the best path forward.  This combination provides enough of a benefit over the non-upx options that the time tradeoff is likely worth the effort.

August 29, 2018 09:35 PM

August 28, 2018

John Ford

Taskcluster Artifact API extended to support content verification and improve error detection

Background

At Mozilla, we're developing the Taskcluster environment for doing Continuous Integration, or CI.  One of the fundamental concerns in a CI environment is being able to upload and download files created by each task execution.  We call them artifacts.  For Mozilla's Firefox project, an example of how we use artifacts is that each build of Firefox generates a product archive containing a build of Firefox, an archive containing the test files we run against the browser, and an archive containing the compiler's debug symbols, which can be used to generate stacks when unit tests hit an error.

The problem

In the old Artifact API, we had an endpoint which generated a signed S3 url that was given to the worker which created the artifact.  This worker could upload anything it wanted at that location.  This is not to suggest malicious usage, but that any errors or early termination of uploads could result in a corrupted artifact being stored in S3 as if it were a correct upload.

If you created an artifact with the local contents "hello-world\n", but your internet connection dropped midway through, the S3 object might only contain "hello-w".  This would go silently uncaught until something much further down the pipeline (hopefully!) complained that the file it got was corrupted.  This corruption is the cause of many orange-factor bugs, but we had no way to figure out exactly where the corruption was happening.

Our old API also made artifact handling in tasks very challenging.  It would often require a task writer to use one of our client libraries to generate a Taskcluster signed URL and then Curl to do uploads.  For a lot of cases, this is fraught with hazards: Curl doesn't fail on errors by default (!!!), and Curl doesn't automatically handle "Content-Encoding: gzip" responses without "Accept: gzip", which we sometimes need to serve.  It requires each user to figure all of this out for themselves, each time they want to use artifacts.

We also had a "Completed Artifact" pulse message which didn't actually convey anything useful: it was sent when the artifact was allocated in our metadata tables, not when the artifact was actually complete.  We could also mark a task as completed before all of its artifacts had finished uploading.  In practice this was avoided by not calling the task-completion endpoint until the uploads were done, but that was only a convention.

Our solution

We wanted to address a lot of issues with Taskcluster Artifacts.  Specifically the following issues are ones which we've tackled:
  1. Corruption during upload should be detected
  2. Corruption during download should be detected
  3. Corruption of artifacts should be attributable
  4. S3 Eventual Consistency error detection
  5. Caches should be able to verify whether they are caching valid items
  6. Completed Artifact messages should only be sent when the artifact is actually complete
  7. Tasks should be unresolvable until all uploads are finished
  8. Artifacts should be really easy to use
  9. Artifacts should be able to be uploaded with browser-viewable gzip encoding

Code

Here's the code we wrote for this project:
  1. https://github.com/taskcluster/remotely-signed-s3 -- A library which wraps the S3 APIs using the lower level S3 REST API and uses the aws4 request signing library
  2. https://github.com/taskcluster/taskcluster-lib-artifact -- A light wrapper around remotely-signed-s3 to enable JS based uploads and downloads
  3. https://github.com/taskcluster/taskcluster-lib-artifact-go -- A library and CLI written in Go
  4. https://github.com/taskcluster/taskcluster-queue/commit/6cba02804aeb05b6a5c44134dca1df1b018f1860 -- The final Queue patch to enable the new Artifact API

Upload Corruption

If an artifact is uploaded with a different set of bytes from those which were expected, we should fail the upload.  The S3 V4 signature system allows us to sign a request's headers, which include an X-Amz-Content-Sha256 and a Content-Length header.  This means that the request headers we get back from signing can only be used for a request which sets the X-Amz-Content-Sha256 and Content-Length to the values provided at signing.  S3 checks that the Sha256 checksum of each request's body matches the value provided in this header, and does the same for the Content-Length.

The requests we get from the Taskcluster Queue can only be used to upload the exact file we asked permission to upload.  This means that the only set of bytes that will allow the request(s) to S3 to complete successfully will be the ones we initially told the Taskcluster Queue about.
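Conceptually (the libraries handle this for you), the two values the Queue signs into the upload request can be computed up front; here browser.zip just stands in for whatever file is being uploaded:

$ sha256sum browser.zip    # becomes the signed X-Amz-Content-Sha256 header
$ wc -c < browser.zip      # becomes the signed Content-Length header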

The two main cases we're protecting against here are disk and network corruption.  The file ends up being read twice, once to hash and once to upload.  Since we have the hash calculated, we can be sure to catch corruption if the two hashes or sizes don't match.  Likewise, the possibility of network interruption or corruption is handled because the S3 server will report an error if the connection is interrupted or corrupted before data exactly matching the Sha256 hash is uploaded.

This does not protect against all broken files from being uploaded.  This is an important distinction to make.  If you upload an invalid zip file, but no corruption occurs once you pass responsibility to taskcluster-lib-artifact, we're going to happily store this defective file, but we're going to ensure that every step down the pipeline gets an exact copy of this defective file.

Download Corruption

Like corruption during upload, we could experience corruption or interruptions during downloading.  In order to combat this, we set some metadata on the artifacts in S3.  We set some extra headers during uploading:
  1. x-amz-meta-taskcluster-content-sha256 -- The Sha256 of the artifact passed into a library -- i.e. without our automatic gzip encoding
  2. x-amz-meta-taskcluster-content-length -- The number of bytes of the artifact passed into a library -- i.e. without our automatic gzip encoding
  3. x-amz-meta-taskcluster-transfer-sha256 -- The Sha256 of the artifact as passed over the wire to S3 servers.  In the case of identity encoding, this is the same value as x-amz-meta-taskcluster-content-sha256.  In the case of Gzip encoding, it is almost certainly not identical.
  4. x-amz-meta-taskcluster-transfer-length -- The number of bytes of the artifact as passed over the wire to S3 servers.  In the case of identity encoding, this is the same value as x-amz-meta-taskcluster-content-length.  In the case of Gzip encoding, it is almost certainly not identical.
You would be right to question whether we can trust these values once created.  The good news is that headers on S3 objects cannot be changed after upload.  These headers are also part of the S3 request signing we do on the queue.  This means that the only values which can be set are those which the Queue expects, and that they are immutable.

Important to note is that because these are non-standard headers, verification requires explicit action on the part of the artifact downloader.  That's a big part of why we've written supported artifact downloading tools.
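For illustration, a manual spot-check of a public, identity-encoded artifact might look something like this (the URL variable is a placeholder; the supported download tools do all of this for you, and for Gzip-encoded artifacts the transfer-sha256 header applies to the wire bytes instead):

$ curl -sSL -D headers.txt -o artifact.bin "$ARTIFACT_URL"
$ grep -i x-amz-meta-taskcluster-content-sha256 headers.txt
$ sha256sum artifact.bin    # should match the value in the header above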

Attribution of Corruption

Corruption is inevitable in a massive system like Taskcluster.  What's really important is that when corruption happens we detect it and we know where to focus our remediation efforts.  In the new Artifact API, we can zero in on the culprit for corruption.

With the old Artifact API, we don't have any way to figure out if an artifact is corrupted or where that happened.  We never know what the artifact was on the build machine, we can't verify corruption in caching systems and when we have an invalid artifact downloaded on a downstream task, we don't know whether it is invalid because the file was defective from the start or if it was because of a bad transfer.

Now, if the Sha256 checksum of a downloaded artifact matches the recorded value but the content is still invalid, we know the original file was broken before it was uploaded.  We can build caching systems which ensure that the value that they're caching is valid and alert us to corruption.  We can track corruption to detect issues in our underlying infrastructure.

Completed Artifact Messages and Task Resolution

Previously, as soon as the Taskcluster Queue stored the metadata about the artifact in its internal tables and generated a signed url for the S3 object, the artifact was marked as completed.  This behaviour resulted in a slightly deceptive message being sent.  Nobody cares when this allocation occurs, but someone might care about an artifact becoming available.

On a related theme, we also allowed tasks to be resolved before the artifacts were uploaded.  This meant that a task could be marked as "Completed -- Success" without actually uploading any of its artifacts.  Obviously, we would always be writing workers with the intention of avoiding this error, but having it built into the Queue gives us a stronger guarantee.

We achieved this result by adding a new method to the flow of creating and uploading an artifact and adding a 'present' field in the Taskcluster Queue's internal Artifact table.  For those artifacts which are created atomically, and the legacy S3 artifacts, we just set the value to true.  For the new artifacts, we set it to false.  When you finish your upload, you have to run a complete artifact method.  This is sort of like a commit.

In the complete artifact method, we verify that S3 sees the artifact as present, and only once it's complete do we send the artifact completed message.  Likewise, in the complete task method, we ensure that all artifacts have a present value of true before allowing the task to complete.

S3 Eventual Consistency and Caching Error Detection

S3 works on an eventual consistency model for some operations in some regions.  Caching systems also have a certain level of tolerance for corruption.  We're now able to determine whether the bytes we're downloading are those which we expect.  We can now rely on more than the HTTP status code to know whether a request worked.

In both of these cases we can programmatically check whether the download is corrupt and try again as appropriate.  In the future, we could even build smarts into our download libraries and tools to ask the caches involved to drop their data, or to try bypassing caches as a last resort.

Artifacts should be easy to use

Right now, if you're working with artifacts directly, you're probably having a hard time.  You have to use something like Curl and build URLs or signed URLs yourself.  You've probably hit pitfalls like Curl not exiting with an error on a non-200 HTTP status.  You're not getting any content verification.  Basically, it's hard.

Taskcluster is about enabling developers to do their job effectively.  Something so critical to CI usage as artifacts should be simple to use.  To that end, we've implemented libraries for interacting with artifacts in Javascript and Go.  We've also implemented a Go based CLI for interacting with artifacts in the build system or shell scripts.
Javascript
The Javascript client uses the same remotely-signed-s3 library that the Taskcluster Queue uses internally.  It's a really simple wrapper which provides a put() and get() interface.  All of the verification of requests is handled internally, as is decompression of Gzip resources.  This was primarily written to enable integration in Docker-Worker directly.
Go
We also provide a Go library for downloading and uploading artifacts.  This is intended to be used in the Generic-Worker, which is written in Go.  The Go Library uses the minimum useful interface in the Standard I/O library for inputs and outputs.  We're also doing type assertions to do even more intelligent things on those inputs and outputs which support it.
CLI
For all other users of Artifacts, we provide a CLI tool.  This provides a simple interface to interact with artifacts.  The intention is to make it available in the path of the task execution environment, so that users can simply call "artifact download --latest $taskId $name --output browser.zip".

Artifacts should allow serving to the browser in Gzip

We want to enable large text files which compress extremely well with Gzip to be rendered by web browsers.  An example is displaying and transmitting logs.  Because of limitations in S3 around Content-Encoding and its complete lack of content negotiation, we have to decide when we upload an artifact whether or not it should be Gzip compressed.

There's an option in the libraries to support automatic Gzip compression of things we're going to upload.  We chose Gzip over possibly-better encoding schemes because this is a one time choice at upload time, so we wanted to make sure that the scheme we used would be broadly implemented.
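To get a sense of the payoff, compare the raw and gzip-compressed sizes of a typical task log (the file name here is only an example):

$ gzip -k live_backing.log
$ wc -c live_backing.log live_backing.log.gz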

Further Improvements

As always, there are still some things around artifact handling that we'd like to improve upon.  For starters, we should work on splitting artifact handling out of our Queue.  We've already agreed on a design of how we should store artifacts.  This involves splitting all of the artifact handling out of the Queue into a different service and having the Queue track only which artifacts belong to each task run.

We're also investigating an idea to store each artifact in the region it is created in.  Right now, all artifacts are stored in EC2's US West 2 region.  We could have a situation where a build VM and a test VM are running on the same hypervisor in US East 1, but each artifact has to be uploaded and downloaded via US West 2.

Another area we'd like to work on is supporting other clouds.  Taskcluster ideally supports whichever cloud provider you'd like to use.  We want to support other storage providers than S3, and splitting out the low level artifact handling gives us a huge maintainability win.

Possible Contributions

We're always open to contributions!  A great one that we'd love to see is allowing concurrency of multipart uploads in Go.  It turns out that this is a lot more complicated than I'd like it to be in order to support passing in the low level io.Reader interface.  We'd want to do some type assertions to see if the input supports io.ReaderAt, and if not, use a per-go-routine offset and file mutex to guard around seeking on the file.  I'm happy to mentor this project, so get in touch if that's something you'd like to work on.

Conclusion

This project has been a really interesting one for me.  It gave me an opportunity to learn the Go programming language and work with the underlying AWS Rest API.  It's been an interesting experience after being heads down in Node.js code and has been a great reminder of how to use static, strongly typed languages.  I'd forgotten how nice a real type system was to work with!

Integration into our workers is still ongoing, but I wanted to give an overview of this project to keep everyone in the loop.  I'm really excited to see a reduction in the number of artifact corruptions.

August 28, 2018 03:30 PM

August 22, 2018

Dustin Mitchell

Introducing CI-Admin

A major focus of recent developments in Firefox CI has been putting control of the CI process in the hands of the engineers working on the project. For the most part, that means putting configuration in the source tree. However, some kinds of configuration don’t fit well in the tree. Notably, configuration of the trees themselves must reside somewhere else.

CI-Configuration

This information is collected in the ci-configuration repository. This is a code-free repository containing YAML files describing various aspects of the configuration – currently the available repositories (projects.yml) and actions.

This repository is designed to be easy to modify by anyone who needs to modify it, through the usual review processes. It is even Phabricator-enabled!

CI-Admin

Historically, we’ve managed this sort of configuration by clicking around in the Taskcluster Tools. The situation is analogous to clicking around in the AWS console to set up a cloud deployment – it works, and it’s quick and flexible. But it gets harder as the configuration becomes more complex, it’s easy to make a mistake, and it’s hard to fix that mistake. Not to mention, the tools UI shows a pretty low-level view of the situation that does not make common questions (“Is this scope available to cron jobs on the larch repo?”) easy to answer.

The devops world has faced down this sort of problem, and the preferred approach is embodied in tools like Puppet or Terraform:

  • write down the desired configuration in human-parsable text files
  • check it into a repository and use the normal software-development processes (CI, reviews, merges, ...)
  • apply changes with a tool that enforces the desired state

This “desired state” approach means that the tool examines the current configuration, compares it to the configuration expressed in the text files, and makes the necessary API calls to bring the current configuration into line with the text files. Typically, there are utilities to show the differences, partially apply the changes, and so on.

The ci-configuration repository contains those human-parsable text files. The tool to enforce that state is ci-admin. It has some generic resource-manipulation support, along with some very Firefox-CI-specific code to do weird things like hashing .taskcluster.yml.

Making Changes

The current process for making changes is a little cumbersome. In part, that’s intentional: this tool controls the security boundaries we use to separate try from release, so its application needs to be carefully controlled and subject to significant human review. But there’s also some work to do to make it easier (see below).

The process is this:

  • make a patch to either or both repos, and get review from someone in the “Build Config - Taskgraph” module
  • land the patch
  • get someone with the proper access to run ci-admin apply for you (probably the reviewer can do this; see the sketch below)
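For the reviewer, that last step looks roughly like this (a sketch using the two sub-commands mentioned in this post):

$ ci-admin diff     # show how the live configuration differs from ci-configuration
$ ci-admin apply    # make the API calls needed to bring them back in line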

Future Plans

Automation

We are in the process of setting up some automation around these repositories. This includes Phabricator, Lando, and Treeherder integration, along with automatic unit test runs on push.

More specific to this project, we also need to check that the current and expected configurations match. This needs to happen on any push to either repo, but also in between pushes: someone might make a change “manually”, or some of the external data sources (such as the Hg access-control levels for a repo) might change without a commit to the ci-configuration repo. We will do this via a Hook that runs ci-admin diff periodically, notifying relevant people when a difference is found. These results, too, will appear in Treeherder.

Grants

One of the most intricate and confusing aspects of configuration for Firefox CI is the assignment of scopes to various jobs. The current implementation uses a cascade of role inheritance and * suffixes which, frankly, no human can comprehend. The new plan is to “grant” scopes to particular targets in a file in ci-configuration. Each grant will have a clear purpose, with accompanying comments if necessary. Then, ci-admin will gather all of the grants and combine them into the appropriate role definitions.

Worker Configurations

At the moment, the configuration of, say, aws-provisioner-v1/gecko-t-large is a bit of a mystery. It’s visible to some people in the AWS-provisioner tool, if you know to look there. But that definition also contains some secret data, so it is not publicly visible like roles or hooks are.

In the future, we’d like to generate these configurations based on ci-configuration. That both makes it clear how a particular worker type is configured (instance type, capacity configuration, regions, disk space, etc.), and allows anyone to propose a modification to that configuration – perhaps to try a new instance type.

Terraform Provider

As noted above, ci-admin is fairly specific to the needs of Firefox CI. Other users of Taskcluster would probably want something similar, although perhaps a bit simpler. Terraform is already a popular tool for configuring cloud services, and supports plug-in “providers”. It would not be terribly difficult to write a terraform-provider-taskcluster that can create roles, hooks, clients, and so on.

This is left as an exercise for the motivated user!

August 22, 2018 03:00 PM

June 15, 2018

Dustin Mitchell

Actions as Hooks

You may already be familiar with in-tree actions: they allow you to do things like retrigger, backfill, and cancel Firefox-related tasks. They implement any "action" on a push that occurs after the initial hg push operation.

This article goes into a bit of detail about how this works, and a major change we’re making to that implementation.

History

Until very recently, actions worked like this: First, the decision task (the task that runs in response to a push and decides what builds, tests, etc. to run) creates an artifact called actions.json. This artifact contains the list of supported actions and some templates for tasks to implement those actions. When you click an action button (in Treeherder or the Taskcluster tools, or any UI implementing the actions spec), code running in the browser renders that template and uses it to create a task, using your Taskcluster credentials.

I talk a lot about functionality being in-tree. Actions are yet another example. Actions are defined in-tree, using some pretty straightforward Python code. That means any engineer who wants to change or add an action can do so – no need to ask permission, no need to rely on another engineer’s attention (aside from review, of course).

There’s Always a Catch: Security

Since the beginning, Taskcluster has operated on a fairly simple model: if you can accomplish something by pushing to a repository, then you can accomplish the same directly. At Mozilla, the core source-code security model is the SCM level: try-like repositories are at level 1, project (twig) repositories at level 2, and release-train repositories (autoland, central, beta, etc.) are at level 3. Similarly, LDAP users may have permission to push to level 1, 2, or 3 repositories. The current configuration of Taskcluster assigns the same scopes to users at a particular level as it does to repositories.

If you have such permission, check out your scopes in the Taskcluster credentials tool (after signing in). You’ll see a lot of scopes there.

The Release Engineering team has made release promotion an action. This is not something that every user who can push to a level-3 repository – hundreds of people – should be able to do! Since it involves signing releases, this means that every user who can push to a level-3 repository has scopes involved in signing a Firefox release. It’s not quite as bad as it seems: there are lots of additional safeguards in place, not least of which is the “Chain of Trust” that cryptographically verifies the origin of artifacts before signing.

All the same, this is something we (and the Firefox operations security team) would like to fix.

In the new model, users will not have the same scopes as the repositories they can push to. Instead, they will have scopes to trigger specific actions on task-graphs at specific levels. Some of those scopes will be available to everyone at that level, while others will be available only to more limited groups. For example, release promotion would be available to the Release Management team.

Hooks

This makes actions a kind of privilege escalation: something a particular user can cause to occur, but could not do themselves. The Taskcluster-Hooks service provides just this sort of functionality: a hook creates a task using scopes assigned by a role, without requiring the user calling triggerHook to have those scopes. The user must merely have the appropriate hooks:trigger-hook:.. scope.

So, we have added a “hook” kind to the action spec. The difference from the original “task” kind is that actions.json specifies a hook to execute, along with well-defined inputs to that hook. The user invoking the action must have the hooks:trigger-hook:.. scope for the indicated hook. We have also included some protection against clickjacking, preventing someone with permission to execute a hook from being tricked into executing one maliciously.

Generic Hooks

There are three things we may wish to vary for an action:

  • who can invoke the action;
  • the scopes with which the action executes; and
  • the allowable inputs to the action.

Most of these are configured within the hooks service (using automation, of course). If every action is configured uniquely within the hooks service, then the self-service nature of actions would be lost: any new action would require assistance from someone with permission to modify hooks.

As a compromise, we noted that most actions should be available to everyone who can push to the corresponding repo, have fairly limited scopes, and need not limit their inputs. We call these “generic” actions, and creating a new such action is self-serve. All other actions require some kind of external configuration: allocating the scope to trigger the task, assigning additional scopes to the hook, or declaring an input schema for the hook.

Hook Configuration

The hook definition for an action hook is quite complex: it involves a complex task definition template as well as a large schema for the input to triggerHook. For decision tasks, cron tasks, and “old” actions, this is defined in .taskcluster.yml, and we wanted to continue that with hook-based actions. But this creates a potential issue: if a push changes .taskcluster.yml, that push will not automatically update the hooks – such an update requires elevated privileges and must be done by someone who can sanity-check the operation. To solve this, ci-admin creates hooks based on the .taskcluster.yml it finds in any Firefox repository, naming each after a hash of the file’s content. Thus, once a change is introduced, it can “ride the trains”, using the same hash in each repository.

Implementation and Implications

As of this writing, two common actions are operating as hooks: retrigger and backfill. Both are “generic” actions, so the next step is to start to implement some actions that are not generic. Ideally, nobody notices anything here: it is merely an implementation change.

Once all actions have been converted to hooks, we will begin removing scopes from users. This will have a more significant impact: lots of activities such as manually creating tasks (including edit-and-create) will no longer be allowed. We will try to balance the security issues against user convenience here. Some common activities may be implemented as actions (such as creating loaners). Others may be allowed as exceptions (for example, creating test tasks). But some existing workflows may need to change to accommodate this improvement.

We hope to finish the conversion process in July 2018, with that time largely taken up by a slow rollout to accommodate unforeseen implications. When the project is finished, Firefox releases and other sensitive operations will be better-protected, with minimal impact to developers’ existing workflows.

June 15, 2018 03:00 PM

May 21, 2018

Dustin Mitchell

Redeploying Taskcluster: Hosted vs. Shipped Software

The Taskcluster team’s work on redeployability means switching from a hosted service to a shipped application.

A hosted service is one where the authors of the software are also running the main instance of that software. Examples include Github, Facebook, and Mozillians. By contrast, a shipped application is deployed multiple times by people unrelated to the software’s authors. Examples of shipped applications include Gitlab, Joomla, and the Rust toolchain. And, of course, Firefox!

Hosted Services

Operating a hosted service can be liberating. Blog posts describe the joys of continuous deployment – even deploying the service multiple times per day. Bugs can be fixed quickly, either by rolling back to a previous deployment or by deploying a fix.

Deploying new features on a hosted service is pretty easy, too. Even a complex change can be broken down into phases and accomplished without downtime. For example, changing the backend storage for a service can be accomplished by modifying the service to write to both old and new backends, mirroring existing data from old to new, switching reads to the new backend, and finally removing the code to write to the old backend. Each phase is deployed separately, with careful monitoring. If anything goes wrong, rollback to the old backend is quick and easy.

Hosted service developers are often involved with operation of the service, and operational issues can frequently be diagnosed or even corrected with modifications to the software. For example, if a service is experiencing performance issues due to particular kinds of queries, a quick deployment to identify and reject those queries can keep the service up, followed by a patch to add caching or some other approach to improve performance.

Shipped Applications

A shipped application is sent out into the world to be used by other people. Those users may or may not use the latest version, and certainly will not update several times per day (the heroes running Firefox Nightly being a notable exception). So, many versions of the application will be running simultaneously. Some applications support automatic updates, but many users want to control when – and if – they update. For example, upgrading a website built with a CMS like Joomla is a risky operation, especially if the website has been heavily customized.

Upgrades are important both for new features and for bugfixes, including for security bugs. An instance of an application like Gitlab might require an immediate upgrade when a security issue is discovered. However, especially if the deployment is several versions old, that critical upgrade may carry a great deal of risk. Producers of shipped software sometimes provide backported fixes for just this purpose, at least for long term support (LTS) or extended support release (ESR) versions, but this has a substantial cost for the application developers.

Upgrading services like Gitlab or Joomla is made more difficult because there is lots of user data that must remain accessible after the upgrade. For major upgrades, that often requires some kind of migration as data formats and schemas change. In cases where the upgrade spans several major versions, it may be necessary to apply several migrations in order. Tools like Alembic help with this by maintaining and applying step-by-step database migrations.

Taskcluster

Today, Taskcluster is very much a hosted application. There is only one “instance” of Taskcluster in the world, at taskcluster.net. The Taskcluster team is responsible for both development and operation of the service, and also works closely with the Firefox build team as a user of the service.

We want to make Taskcluster a shipped application. As the descriptions above suggest, this is not a simple process. The following sections highlight some of the challenges we are facing.

Releases and Deployment

We currently deploy Taskcluster microservices independently. That is, when we make a change to a service like taskcluster-hooks, we deploy an upgrade to that service without modifying the other services. We often sequence these changes carefully to ensure continued compatibility: we expect only specific combinations of services to run together.

This is a far more intricate process than we can expect users to follow. Instead, we will ship Taskcluster releases comprised of a set of built Docker images and a spec file identifying those images and how they should be deployed. We will test that this particular combination of versions works well together.

Deploying a release involves combining that spec file with some deployment-specific configuration and some infrastructure information (implemented via Terraform) to produce a set of Kubernetes resources for deployment with kubectl. Kubernetes and Terraform both have limited support for migration from one release to another: Terraform will only create or modify changed resources, and Kubernetes will perform a phased roll-out of any modified resources.
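In the abstract, the final step of a deployment might then look something like this (a sketch only; the real taskcluster-installer workflow and file names will differ):

$ terraform apply                             # create or update the underlying infrastructure
$ kubectl apply -f taskcluster-release.yaml   # roll out the generated Kubernetes resources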

By the way, all of this build-and-release functionality is implemented in the new taskcluster-installer.

Service Discovery

The string taskcluster.net appears quite frequently in the Taskcluster source code. For any other deployment, that hostname is not valid – but how will the service find the correct hostname? The question extends to determining pulse exchange names, task artifact hostnames, and so on. There are also security issues to consider: misconfiguration of URLs might enable XSS and CSRF attacks from untrusted content such as task artifacts.

The approach we are taking is to define a rootUrl from which all other URLs and service identities can be determined. Some are determined by simple transformations encapsulated in a new taskcluster-lib-urls library. Others are fetched at runtime from other services: pulse exchanges from the taskcluster-pulse service, artifact URLs from the taskcluster-queue service, and so on.

The rootUrl is a single domain, with all Taskcluster services available at sub-paths such as /api/queue. Users of the current Taskcluster installation will note that this is a change: queue is currently at https://queue.taskcluster.net, not https://taskcluster.net/queue. We have solved this issue by special-casing the rootUrl https://taskcluster.net to generate the old-style URLs. Once we have migrated all users out of the current installation, we will remove that special-case.
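To make the shape of this concrete, here is roughly how the same API endpoint resolves under each scheme (the hostname in the first example is made up, and the exact paths are whatever taskcluster-lib-urls computes):

$ curl https://taskcluster.example.com/api/queue/v1/ping    # a new-style deployment
$ curl https://queue.taskcluster.net/v1/ping                # the special-cased legacy rootUrl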

The single root domain is implemented using routing features supplied by Kubernetes Ingress resources, based on an HTTP proxy. This has the side-effect that when one microservice contacts another (for example, taskcluster-hooks calling queue.createTask), it does so via the same Ingress, a more circuitous journey than is strictly required.

Data Migrations

The first few deployments of Taskcluster will not require great support for migrations. A staging environment, for example, can be completely destroyed and re-created without any adverse impact. But we will soon need to support users upgrading Taskcluster from earlier releases with no (or at least minimal) downtime.

Our Azure tables library (azure-entities) already has rudimentary support for schema updates, so modifying the structure of table rows is not difficult, although refactoring a single table into multiple tables would be difficult.

As we transition to using Postgres instead of Azure, we will need to adopt some of the common migration tools. Ideally we can support downtime-free upgrades like azure-entities does, instead of requiring downtime to run DB migrations synchronously. Bug 1431783 tracks this work.

Customization

As a former maintainer of Buildbot, I’ve had a lot of experience with CI applications as they are used in various organizations. The surprising observation is this: every organization thinks that their approach to CI is the obvious and only way to do things; and every organization does things in a radically different way. Developers gonna develop, and any CI framework will get modified to suit the needs of each user.

Lots of Buildbot installations are heavily customized to meet local needs. That has caused a lot of Buildbot users to get “stuck” at older versions, since upgrades would conflict with the customizations. Part of this difficulty is due to a failure of the Buildbot project to provide strong guidelines for customization. Recent versions of Buildbot have done better by providing clearly documented APIs and marking other interfaces as private and subject to change.

Taskcluster already has strong APIs, so we begin a step ahead. We might consider additional guidelines:

  • Users should not customize existing services, except to make experimental changes that will eventually be merged upstream. This frees the Taskcluster team to make changes to services without concern that those will conflict with users’ modifications.

  • Users are encouraged, instead, to develop their own services, either hosted within the Taskcluster deployment as a site-specific service, or hosted externally but following Taskcluster API conventions. A local example is the tc-coalesce service, developed by the release engineering team to support Mozilla-specific task-superseding needs and hosted outside of the Taskcluster installation. On the other hand, taskcluster-stats-collector is deployed within the Firefox Taskcluster deployment, but is Firefox-specific and not part of a public Taskcluster release.

  • While a Taskcluster release will likely encompass some pre-built worker images for various cloud platforms, sophisticated worker deployment is the responsibility of individual users. That may mean deploying workers to hardware where necessary, perhaps with modifications to the build configurations or even entirely custom-built worker implementations. We will provide cloud-provisioning tools that can be used to dynamically instantiate user-specified images.

Generated Client Libraries

The second point above raises an interesting quandary: Taskcluster uses code generation to create its API client libraries. Historically, we have just pushed the “latest” client to the package repository and carefully choreographed any incompatible changes. For users who have not customized their deployment, this is not too much trouble: any release of Taskcluster will have a client library in the package repository corresponding to it. We don’t have a great way to indicate which version that is, but perhaps we will invent something.

But when Taskcluster installations are customized by adding additional services, progress is no longer linear: each user has a distinct “fork” of the Taskcluster API surface containing the locally-defined services. Development of Taskcluster components poses a similar challenge: if I add a new API method to a service, how do I call that method from another service without pushing a new library to the package repository?

The question is further complicated by the use of compiled languages. While Python and JS clients can simply load a schema reference file at runtime (for example, a file generated at deploy time), the Go and Java clients “bake in” the references at compile time.

Despite much discussion, we have yet to settle on a good solution for this issue.

Everything is Public!

Mozilla is Open by Design, and so is Taskcluster: with the exception of data that must remain private (passwords, encryption keys, and material covered by other companies’ NDAs), everything is publicly accessible. While Taskcluster does have a sophisticated and battle-tested authorization system based on scopes, most read-only API calls do not require any scopes and thus can be made with a simple, un-authenticated HTTP request.

We take advantage of the public availability of most data by passing around simple, authentication-free URLs. For example, the action specification describes downloading a decision task’s public/action.json artifact. Nowhere does it mention providing any credentials to fetch the decision task, nor to fetch the artifact itself.

This is a rather fundamental design decision, and changing it would be difficult. We might embark on that process, but we might also declare Taskcluster an open-by-design system, and require non-OSS users to invent other methods of hiding their data, such as firewalls and VPNs.

Transitioning from taskcluster.net

Firefox build, test, and release processes run at massive scale on the existing Taskcluster instance at https://taskcluster.net, along with a number of smaller Mozilla-associated projects. As we work on this “redeployability” project, we must continue to deploy from master to that service as well – the rootUrl special-case mentioned above is a critical part of this compatibility. We will not be running either new or old instances from long-living Git branches.

Some day, we will need to move all of these projects to a newly redeployed cluster and delete the old. That day is still in the distant future. It will likely involve some running of tasks in parallel to expunge any leftover references to taskcluster.net, then a planned downtime to migrate everything over (we will want to maintain task and artifact history, for example). We will likely finish up by redeploying a bunch of permanent redirects from taskcluster.net domains.

Conclusion

That’s just a short list of some of the challenges we face in transmuting a hosted service into a shipped application.

All the while, of course, we must “keep the lights on” for the existing deployment, and continue to meet Firefox’s needs. At the moment that includes a project to deploy Taskcluster workers on arm64 hardware in https://packet.net, development of the docker-engine to replace the aging docker worker, using hooks for actions to reduce the scopes afforded to level-3 users, improving taskcluster-github to support defining decision tasks, and the usual assortment of contributed pull requests, issue debugging, and service requests.

May 21, 2018 03:00 PM

May 01, 2018

Dustin Mitchell

Design of Task-Graph Generation

Almost two years ago, Bug 1258497 introduced a new system for generating the graph of tasks required for each push to a Firefox source-code repository. Work continues to modify the expected tasks and add features, but the core design is stable. Lots of Firefox developers have encountered this system as they add or modify a job or try to debug why a particular task is failing. So this is a good time to review the system design at a high level.

A quick note before beginning: the task-graph generation system is implemented entirely in the Firefox source tree, and is administered as a sub-module of the Build Config module. While it is designed to interface with Taskcluster, and some of the authors are members of the Taskcluster team, it is not a part of Taskcluster itself.

Requirements

A task is a unit of work in the aptly-named Taskcluster platform. This might be a Firefox build process, or a run of a chunk of a test suite. More esoteric tasks include builds of the toolchains and OS environments used by other tasks; cryptographic signing of Firefox installers; configuring Balrog, the service behind Firefox’s automatic updates; and pushing APKs to the Google Play Store.

A task-graph is a collection of tasks linked by their dependencies. For example, a test task cannot run until the build it is meant to test has finished, and that build cannot run until the compiler toolchain it requires has been built.

The task-graph generation system, then, is responsible for generating a task-graph containing the tasks required to test a try push, a landing on a production branch, a nightly build, and a full release of Firefox. That task graph must be minimal (for example, not rebuilding a toolchain if it has already been built) and specific to the purpose (some tasks only run on mozilla-central, for example).

Firefox has been using some CI system – Tinderbox, then Buildbot, and now Taskcluster – for decades, so the set of requirements is quite deep and shrouded in historical mystery.

While the resulting system may seem complex, it is a relatively simple expression of the intricate requirements it must meet. It is also designed with approachability in mind: many common tasks can be accomplished without fully understanding the design.

System Design

The task-graph generation process itself runs in a task, called the Decision Task. That task is typically created in response to a push to version control, and is typically the first task to appear in Treeherder, with a “D” symbol. The decision task begins by checking out the pushed revision, and then runs the task-graph generation implementation in that push. That means the system can be tested in try, and can ride the trains just like any other change to Firefox.

Task-Graph Generation Process

The decision task proceeds in a sequence of steps:

  1. Generate a graph containing all possible tasks (the full task graph). As of this writing, the full task graph contains 10,972 tasks!

  2. Filter the graph to select the required tasks for this situation. Each project (a.k.a. “branch” or “repo”) has different requirements. Try pushes are a very flexible kind of filtering, selecting only the tasks indicated by the (arcane!) try syntax or the (awesome!) try-select system (more on this below). The result is the target task graph.

  3. “Optimize” the graph, by trimming unnecessary tasks. Some tasks, such as tests, can simply be dropped if they are not required. Others, such as toolchain builds, must be replaced by an existing task containing the required data. The result is the optimized task graph.

  4. Create each of the tasks using the Taskcluster API.

The process is a bit more detailed but this level of detail will do for now.

Kinds and Loaders

We’ll now focus on the first step: generating the full task graph. In an effort to segment the mental effort required, tasks are divided into kinds. There are some obvious kinds – build, test, toolchain – and a lot of less obvious kinds. Each kind has a directory in taskcluster/ci.
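For example, each of the kinds mentioned above shows up as a directory there (an abbreviated, illustrative listing):

$ ls taskcluster/ci
build  test  toolchain  ...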

Each kind is responsible for generating a list of tasks and their dependencies. The tasks for all kinds are combined to make the full task graph. Each kind can generate its tasks in a different way; this is the job of the kind’s loader. Each kind has a kind.yml which points to a Python function that acts as its loader.

Most loaders just load task definitions from YAML files in the kind directory. There are a few more esoteric loaders – for example, the test loader creates one copy of each test for each platform, allowing a single definition of, say, mochitest-chrome to run on all supported platforms.

Transforms

A “raw” task is designed for execution by a Taskcluster worker. It has all sorts of details of the task baked into environment variables, the command to run, routes, and so on. We do not want to write expressions to generate that detail over and over for each task, so we design the inputs in the YAML files to be much more human-friendly. The system uses transforms to bridge the gap: each task output by the loader is passed through a series of transforms, in the form of Python generator functions, to produce the final, raw task.

To bring some order to the process, there are some specific forms defined, with schemas and sets of transforms to turn one into the next:

  • Test Description - how to perform a test, including suite and flavor, hardware features required, chunking configuration, and so on.

  • Job Description - how to perform a job; essentially “run Mozharness with these arguments” or “run the Debian package-building process with these inputs”

  • Task Description - how to build a task description; this contains all of the command arguments, environment variables, and so on but is not specific to a particular worker implementation.

There are several other “descriptions”, but these give the general idea.

The final effect is that a relatively concise, readable build description like this:

linux64/debug:
    description: "Linux64 Debug"
    index:
        product: firefox
        job-name: linux64-debug
    treeherder:
        platform: linux64/debug
        symbol: B
    worker-type: aws-provisioner-v1/gecko-{level}-b-linux
    worker:
        max-run-time: 36000
    run:
        using: mozharness
        actions: [get-secrets build check-test update]
        config:
            - builds/releng_base_firefox.py
            - builds/releng_base_linux_64_builds.py
        script: "mozharness/scripts/fx_desktop_build.py"
        secrets: true
        custom-build-variant-cfg: debug
        tooltool-downloads: public
        need-xvfb: true
    toolchains:
        - linux64-clang
        - linux64-gcc
        - linux64-sccache
        - linux64-rust

Can turn into a much larger task definition like this.

Cron

We ship “nightlies” of Firefox twice a day (making the name “nightly” a bit of a historical artifact). This, too, is controlled in-tree, and is general enough to support other time-based task-graphs such as Valgrind runs or Searchfox updates.

The approach is fairly simple: the hooks service creates a “cron task” for each project every 15 minutes. This task checks out the latest revision of the project and runs a mach command that examines .cron.yml in the root of the source tree. It then creates a decision task for each matching entry, with a custom task-graph filter configuration to select only the desired tasks.

Actions

For the most part, the task-graph for a push (or cron task) is defined in advance. But developers and sheriffs often need to modify a task-graph after it is created, for example to retrigger a task or run a test that was accidentally omitted from a try push. Taskcluster defines a generic notion of an “action” for just this purpose: acting on an existing task-graph.

Briefly, the decision task publishes a description of the actions that are available for the tasks in the task-graph. Services like Treeherder and the Taskcluster task inspector then use that description to connect user-interface elements to those actions. When an action is executed, the user interface creates a new task called an action task that performs the desired action.

Action tasks are similar to decision and cron tasks: they clone the desired revision of the source code, then run a mach command to do whatever the user has requested.

Multiple Projects

The task-graph generation code rides the trees, with occasional uplifts, just like the rest of the Firefox codebase. That means that the same code must work correctly for all branches; we do not have a different implementation for the mozilla-beta branch, for example.

While it might seem like, to run a new task on mozilla-central, you would just land a patch adding that task on mozilla-central, it’s not that simple: without adjusting the filtering, that task would eventually be merged to all other projects and execute everywhere.

This also makes testing tricky: since the task-graph generation is different for every project, it’s possible to land code which works fine in try and inbound, but fails on mozilla-central. It is easy to test task-graph generation against specific situations (all inputs to the process are encapsulated in a parameters.yml file easily downloaded from a decision task). The artistry is in figuring out which situations to test.

Try Pushes

Pushes to try trigger decision tasks just like any other project, but the filtering process is a little more complex.

If the push comes with legacy try syntax (-b do -p win64,linux64 -u all[linux64-qr,windows10-64-qr] -t all[linux64-qr,windows10-64-qr] - clear as mud, right?), we do our best to emulate the behavior of the Buildbot try parser in filtering out tasks that were not requested. The legacy syntax is deeply dependent on some Buildbot-specific features, and does not cover new functionality like optimization, so there are lots of edge cases where it behaves strangely or does not work at all.

The better alternative is try-select, where the push contains a try_task_config.json listing exactly which tasks to run, along with desired modifications to those tasks. The command ./mach try fuzzy creates such a file. In this case, creating the target task-graph is as simple as filtering for tasks that match the supplied list.
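As a sketch of that filtering step, assuming a try_task_config.json shaped like {"tasks": [...]} (the real file may carry more fields), the target-task selection amounts to a set lookup:

import json

def target_tasks_from_try_config(full_task_labels, path="try_task_config.json"):
    with open(path) as f:
        requested = set(json.load(f).get("tasks", []))
    return [label for label in full_task_labels if label in requested]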

Conclusion

This has been a long post! The quote “make everything as simple as possible and no simpler”, commonly attributed to Einstein, holds the reason – the task-graph generation system satisfies an incredibly complex set of requirements. In designing the system, we considered these requirements holistically and with a deep understanding of how they developed and why they exist, and then designed a system that was as simple as possible. The remaining complexity is inherent to the problem it solves.

The task-graph generation is covered in the Firefox source-docs and its source is in the /taskcluster directory in the Firefox source tree.

May 01, 2018 03:00 PM

February 23, 2018

Dustin Mitchell

Internship Applications: Make the First Move

There’s an old story about Charles Proteus Steinmetz, a famous GE engineer in the early 20th century. He was called to one of Henry Ford’s factories, where a huge generator was having problems that the local engineers could not solve. After some investigation and calculation, Steinmetz made a mark on the shell of the generator and told the factory engineers to open that spot and replace the windings there. He later sent a bill for his services to Henry Ford: $10,000. Ford demanded an itemized bill – after all, Steinmetz had only made a single mark on the generator. The bill came back: “Making chalk mark on generator: $1. Knowing where to make mark: $9,999.”

Like electrical engineering, software development is more than just writing code. Sometimes it can take hours to write a 3-line patch. The hard part is knowing what patch to write.

It takes time to understand the system you’re developing and the systems it interacts with. Just understanding the problem you’re trying to solve can take some lengthy pondering. There are often new programming languages involved, or new libraries or tools. Once you start writing the code, new complications come up, and you must adjust course.

Experienced software engineers can make this look easy. They have an intuitive sense of what is important and what can be safely ignored, and of what problems might come up later. This is probably the most important skill for newcomers to the field to work on.

Make the First Move

Lately, I’ve gotten dozens of emails from Google Summer of Code and Outreachy applicants that go like this:

Dear Sir,

I am interested in the Outreachy project “…”. I have a background in JavaScript, HTML, CSS, and Java. Please connect me with a mentor for this project.

I’ve also seen dozens of bug comments like this:

I would like to work on this bug. Please guide me in what steps to take.

There is nothing inherently wrong with these messages. It’s always OK to ask for help.

What’s missing is evidence that the applicant has made any effort to get started. In the first case, the applicant did not even read the full project description, which indicates that the next step is to make a contribution and has links to tools for finding those contributions. In the second case, it seems that the applicant has not even taken the first steps toward solving the bug. In most cases, they have not even read the bug!

If my first instructions to an applicant are “start by reading the bug” or “taskcluster-lib-app is at https://github.com/taskcluster/taskcluster-lib-app” (something Google will happily tell you in 0.55 seconds), that suggests the applicant’s problem-solving skills need some serious work. While GSoC and Outreachy are meant to be learning experiences, we look for applicants who are able to make the most of the experience by learning and growing on their own. A participant who asks “what is the next step” at every step, without ever trying to figure out what steps to take, is not going to learn very much.

Advice

If you are applying to a program like Google Summer of Code or Outreachy, take the time to try to problem-solve before asking for help. There is nothing wrong with asking for help. But when you do, show what you have already figured out, and ask a specific question. For example:

I would like to work on this bug. It seems that this would require modifying the taskcluster-lib-scopes library to add a formatter function. I can see how this formatter would handle anyOf and allOf, but how should it format a for loop?

This comment shows that the applicant has done some thinking about the problem already, and I can see exactly where they have gotten stuck.

February 23, 2018 03:00 PM

January 19, 2018

Dustin Mitchell

Taskcluster Redeployability

Taskcluster To Date

Taskcluster has always been open source: all of our code is on Github, and we get lots of contributions to the various repositories. Some of our libraries and other packages have seen some use outside of a Taskcluster context, too.

But today, Taskcluster is not a project that could practically be used outside of its single incarnation at Mozilla. For example, we hard-code the name taskcluster.net in a number of places, and we include our config in the source-code repositories. There’s no legal or contractual reason someone else could not run their own Taskcluster, but it would be difficult and almost certainly break next time we made a change.

The Mozilla incarnation is open to use by any Mozilla project, although our focus is obviously Firefox and Firefox-related products like Fennec. This was a practical decision: our priority is to migrate Firefox to Taskcluster, and that is an enormous project. Maintaining an abstract ability to deploy additional instances while working on this project was just too much work for a small team.

The good news is, the focus is now shifting. The migration from Buildbot to Taskcluster is nearly complete, and the remaining pieces are related to hardware deployment, largely by other teams. We are returning to work on something we’ve wanted to do for a long time: support redeployability.

Redeployability

Redeployability means that Taskcluster can be deployed easily, multiple times, similar to OpenStack or Hadoop. If, when we finish this effort, there exist several distinct “instances” of Taskcluster in the world, then we have been successful. We will start by building a “staging” deployment of the Firefox instance, then move on to deploy instances that see production usage, first for other projects in Mozilla, and later for projects outside of Mozilla.

In deciding to pursue this approach, we considered three options:

  • Taskcluster as a service (TCaaS) – we run the single global Taskcluster instance, providing that service to other projects just like Github or Travis-CI.
  • Firefox CI – Taskcluster persists only as a one-off tool to support Firefox’s continuous integration system
  • Redeployability (redeployability) – we provide means for other projects to run dedicated Taskcluster instances

TCaaS allows us to provide what we believe is a powerful platform for complex CI systems to a broad audience. While not quite as easy to get started with, Taskcluster’s flexibility extends far beyond what even a paid plan with CircleCI or Travis-CI can provide. However, this approach would represent a new and different business realm for Mozilla. While the organization has lots of public-facing services like MDN and Addons, other organizations do not depend on these services for production usage, nor do they pay us a fee for use of those services. Defining and meeting SLAs, billing, support staffing, abuse response – none of these are areas of expertise within Mozilla, much less the Taskcluster team. TCaaS would also require substantial changes to the platform itself to isolate paying customers from one another, hide confidential data, accurately meter usage, and so on.

Firefox CI is, in a sense, a scaled-back, internal version of TCaaS: we provide a service, but to only one customer (Firefox Engineering). It would mean transitioning the team to an operations focus, with little or no further development on the platform. It would also open the doors to Firefox-specific design within Taskcluster, such as checking out the Gecko source code in the workers or sorting queued tasks by Gecko branch. This would also shut the door to other projects such as Rust relying on Taskcluster.

Redeployability represents something of a compromise between the other two options. It allows us to make Taskcluster available outside of the narrow confines of Firefox CI without diving into a strange new business model. We’re Mozilla – shipping open source software is right in our wheelhouse.

It comes with some clear advantages, too:

  • Like any open-source project, users will contribute back, focusing on the parts of the system most related to their needs. Most Taskcluster users will be medium- to large-scale engineering organizations, and thus able to dedicate the resources to design and develop significant new features.

  • A well-designed deployment system will help us improve operations for Firefox CI (many of our outages today are caused by deployment errors) and enable deployment by teams focused on operations.

  • We can deploy an entire staging instance of Firefox’s Taskcluster, allowing thorough testing before deploying to production. The current approach to staging changes is ad-hoc and differs between services, workers, and libraries.

Challenges

Of course, the redeployability project is not going to be easy. The next few sections highlight some of the design challenges we are facing. We have begun solving all of these and more, but as none of the solutions are set in stone I will focus just on the challenges themselves.

Deployment Process

Deploying a set of microservices and backend services like databases is pretty easy: tools like Kubernetes are designed for the purpose. Taskcluster, however, is a little more complicated. The system uses a number of cloud providers (packet.net, AWS, and Azure), each of which needs to be configured properly before use.

Worker deployment is a complicated topic: workers must be built into images that can run in cloud services (such as AMIs), and those images must be capable of starting and contacting the Queue to fetch work without further manual input. We already support a wide array of worker deployments on the single instance of Taskcluster, and multiple deployments would probably see an even greater diversity, so any deployment system will need to be extremely flexible.

We want to use the deployment process for all deployments, so it must be fast and reliable. For example, to deploy a fix to the Secrets service, I would modify the configuration to point to the new version and initiate a full re-deploy of the Taskcluster instance. If the deployment process causes downtime by restarting every service, or takes hours to complete, we will find ourselves “cheating” and deploying things directly.

Client Libraries

The Taskcluster client libraries contain code that is generated from the API specification for the Taskcluster services. That means that the latest taskcluster package on PyPi corresponds to the APIs of the services as they are currently deployed. If an instance of Taskcluster is running an older version of those services, then the newer client may not be able to call them correctly. Likewise, an instance created for development purposes might have API methods that aren’t defined in any released version of the client libraries.

A related issue is service discovery: how does a client library find the right URL for a particular service? For platform services like the Queue and Auth, this is fairly simple, but grows more complex for services which might be deployed several times, such as the AWS provisioner.

Configuration and Customization

No two deployments of Taskcluster will be exactly alike – that would defeat the purpose. We must support a limited amount of flexibility: which services are enabled, what features of those services are enabled, and credentials for the various cloud services we use.

In some cases the configuration for a service relies on values derived from another service that must already be started. For example, the Queue needs Taskcluster credentials generated by calling createClient on a running Auth service.

Upgrades

Many of the new features we have added in Taskcluster have been deployed through a carefully-choreographed, manual process. For example, to deploy parameterized roles support, which involved a change to the Auth service’s backend support, I disabled writes to the backend, carefully copied the data to the new backend, then landed a patch to begin using the new backend with the old frontend, and so on. We cannot expect users to follow hand-written instructions for such delicate dances.

Conclusion

The Taskcluster team has a lot of work to do. But this is a direction many of us have been itching to move for several years now, so we are eagerly jumping into it. Look for more updates on the redeployability project in the coming months!

January 19, 2018 03:00 PM

December 08, 2017

Dustin Mitchell

Parameterized Roles

The roles functionality in Taskcluster is a kind of “macro expansion”: given the roles

group:admins -> admin-scope-1
                admin-scope-2
                assume:group:devs
group:devs   -> dev-scope

the scopeset ["assume:group:admins", "my-scope"] expands to

[
    "admin-scope-1",
    "admin-scope-2",
    "assume:group:admins",
    "assume:group:devs",
    "dev-scope",
    "my-scope",
]

because the assume:group:admins expanded the group:admins role, and that recursively expanded the group:devs role.

However, this macro expansion did not allow any parameters: it was like a language that supports function calls but not arguments.

The result is that we have a lot of roles that look the same. For example, project-admin:.. roles all have similar scopes (with the project name included in them), and a big warning in the description saying “DO NOT EDIT”.

Role Parameters

Now we can do better! A role’s scopes can now include <..>. When expanding, this string is replaced by the portion of the scope that matched the * in the roleId. An example makes this clear:

project-admin:* -> assume:hook-id:project-<..>/*
                   assume:project:<..>:*
                   auth:create-client:project/<..>/*
                   auth:create-role:hook-id:project-<..>/*
                   auth:create-role:project:<..>:*
                   auth:delete-client:project/<..>/*
                   auth:delete-role:hook-id:project-<..>/*
                   auth:delete-role:project:<..>:*
                   auth:disable-client:project/<..>/*
                   auth:enable-client:project/<..>/*
                   auth:reset-access-token:project/<..>/*
                   auth:update-client:project/<..>/*
                   auth:update-role:hook-id:project-<..>/*
                   auth:update-role:project:<..>:*
                   hooks:modify-hook:project-<..>/*
                   hooks:trigger-hook:project-<..>/*
                   index:insert-task:project.<..>.*
                   project:<..>:*
                   queue:get-artifact:project/<..>/*
                   queue:route:index.project.<..>.*
                   secrets:get:project/<..>/*
                   secrets:set:project/<..>/*

With the above parameterized role in place, we can delete all of the existing project-admin:.. roles: this one will do the job. A client that has assume:project-admin:bugzilla in its scopes will have assume:hook-id:project-bugzilla/* and all the rest in its expandedScopes.

There’s one caveat: a client with assume:project-admin:nss* will have assume:hook-id:project-nss* – note the loss of the trailing /. The * consumes any parts of the scope after the <..>. In practice, as in this case, this is not an issue, but could certainly cause surprise for the unwary.
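To make the substitution concrete, here is a hedged sketch of expanding a single parameterized role. It is not the Taskcluster-Auth implementation, and it does not reproduce the trailing-* behaviour just described; it only illustrates the <..> replacement.

def expand_one(role_id, role_scopes, scope):
    # If `scope` matches `role_id` (which must end in *), return the role's
    # scopes with <..> replaced by the matched portion of the scope.
    prefix = "assume:" + role_id[:-1]
    if not role_id.endswith("*") or not scope.startswith(prefix):
        return []
    param = scope[len(prefix):]  # the part of the scope that the * matched
    return [s.replace("<..>", param) for s in role_scopes]

print(expand_one(
    "project-admin:*",
    ["assume:hook-id:project-<..>/*", "secrets:get:project/<..>/*"],
    "assume:project-admin:bugzilla"))
# ['assume:hook-id:project-bugzilla/*', 'secrets:get:project/bugzilla/*']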

Implementation

Parameterized roles seem pretty simple, but they’re not!

Efficiency

Before parameterized roles the Taskcluster-Auth service would pre-compute the full expansion of every role. That meant that any API call requiring expansion of a set of scopes only needed to combine the expansion of each scope in the set – a linear operation. This avoided a (potentially exponential-time!) recursive expansion, trading some up-front time pre-computing for a faster response to API calls.

With parameterized roles, such pre-computation is not possible. Depending on the parameter value, the expansion of a role may or may not match other roles. Continuing the example above, the scope assume:project:focus:xyz would be expanded when the parameter is focus, but not when the parameter is bugzilla.

The fix was to implement the recursive approach, but in such a way that non-pathological cases have reasonable performance. We use a trie which, given a scope, returns the set of scopes from any matching roles along with the position at which those scopes matched a * in the roleId. In principle, then, we resolve a scopeset by using this trie to expand (by one level) each of the scopes in the scopeset, substituting parameters as necessary, and recursively expand the resulting scopes.

To resolve a scope set, we use a queue to “flatten” the recursion, and keep track of the accumulated scopes as we proceed. We already had some utility functions that allow us to make a few key optimizations. First, it’s only necessary to expand scopes that start with assume: (or, for completeness, things like * or assu*). More importantly, if a scope is already included in the seen scopeset, then we need not enqueue it for recursion – it has already been accounted for.
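A rough sketch of that queue-based resolution, treating the roles as a simple dict rather than a trie and ignoring *-satisfaction between scopes; this is purely illustrative, not the Taskcluster-Auth code.

def expand_scopeset(scopes, roles):
    # roles maps a roleId ending in * to its (possibly parameterized) scopes
    seen = set(scopes)
    queue = [s for s in scopes if s.startswith("assume:")]
    while queue:
        scope = queue.pop(0)
        for role_id, role_scopes in roles.items():
            prefix = "assume:" + role_id[:-1]
            if not scope.startswith(prefix):
                continue
            param = scope[len(prefix):]
            for expanded in (s.replace("<..>", param) for s in role_scopes):
                if expanded not in seen:           # already accounted for? skip
                    seen.add(expanded)
                    if expanded.startswith("assume:"):
                        queue.append(expanded)     # recurse via the queue
    return sorted(seen)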

In the end, the new implementation is tens of milliseconds slower for some of the more common queries. While not ideal, in practice that has not been problematic. If necessary, some simple caching might be added, as many expansions repeat exactly.

Loops

An advantage of the pre-computation was that it could seek a “fixed point” where further expansion does not change the set of expanded scopes. This allowed roles to refer to one another:

some-role -> assume:another-role
another*  -> assume:some-role

A naïve recursive resolver might loop forever on such an input, but could easily track already-seen scopes and avoid recursing on them again. The situation is much worse with parameterized roles. Consider:

some-role-*    -> assume:another-role-<..>x
another-role-* -> assume:some-role-<..>y

A simple recursive expansion of assume:some-role-abc would result in an infinite set of roles:

assume:another-role-abcx
assume:some-role-abcxy
assume:another-role-abcxyx
assume:some-role-abcxyxy
...

We forbid such constructions using a cycle check, configured to reject only cycles that involve parameters. That permits the former example while prohibiting the latter.

Atomic Modifications

But even that is not enough! The existing implementation of roles stored each role in a row in Azure Table Storage. Azure provides concurrent access to and modification of rows in this storage, so it’s conceivable that two roles which together form a cycle could be added simultaneously. Cycle checks for each row insertion would each see only one of the rows, but the result after both insertions would cause a cycle. Cycles will crash the Taskcluster-Auth service, which will bring down the rest of Taskcluster. Then a lot of people will have a bad day.

To fix this, we moved roles to Azure Blob Storage, putting all roles in a single blob. This service uses ETags to implement atomic modifications, so we can perform a cycle check before committing and be sure that no cyclical configuration is stored.

What’s Next

The parameterized role support is running in production now, but we have not yet updated any roles, aside from a few test roles, to use it. The next steps are to use the support to address a few known weak points in role configuration, including the project administration roles used as an example above.

December 08, 2017 03:00 PM

July 27, 2017

Dustin Mitchell

Taskgraph Optimization

For every push to a Gecko repository, and for periodic things like nightlies, we generate a task-graph containing all of the tasks to run in response. This is all controlled in-tree, making it easy for any developer to add new tasks when necessary.

The full task-graph is currently just over 8000 tasks, and growing. Running 8000 tasks for every push would be slow and wasteful, so we have two mechanisms to limit the tasks we run. The first is “target tasks”, which selects the desired tasks based on try syntax or the tree to which the push took place. The second is the topic of this post: optimization.

Why Optimization?

Before diving into the details, let’s address why optimization is important.

First and foremost, in many cases it gets important results back to developers sooner, increasing development velocity. Not displaying feedback unrelated to a push also helps to focus attention on the important results.

Optimization also helps to ease real resource issues Mozilla faces. Tasks such as Talos or OS X tests must run on our own hardware, and we have a finite (but large - about 500 just for OS X!) quantity of that hardware. When the quantity of work to do exceeds the capacity of that hardware pool, tasks begin to queue, delaying important feedback to developers. Buying more hardware is always an option, but the fixed costs and long lead times mean that effort spent optimizing the work done on the existing assets has a big return.

Most tasks take place in the cloud (AWS), which provides a more elastic environment that is able to burst when load is high. We run well over 10,000 spot instances simultaneously every business day, producing terabytes of data. Even at pennies per hour and gigabyte, that quickly adds up to “real money”. Reducing that cost, or limiting its growth, is an important goal in its own right.

Optimization has risks, too: over-optimization can skip tasks with important information. That might mean that a try push looks fine but has failures when it lands, or that a push containing an error appears green but causes a failure on a subsequent push. In any case, time-consuming bug hunts and backouts ensue.

The try server causes a lot of consternation - it’s difficult to figure out what syntax to use to run all of the tasks that might be relevant to a push, resulting in either over-estimation (and thus wasted capacity) or under-estimation (risking missed bugs and backouts). Ideally, machines could figure this out: a push to try with no syntax would run just the necessary jobs, no more and no less.

Optimization Today

Back to the task-graph generation process. During the optimization step, each task is examined for reasons that it might not be run. For some tasks, such as toolchain or docker image builds, it is possible to find and substitute an existing task that used the same inputs. SETA also applies at this stage.

Finally, some jobs are annotated with “when.files-changed”, a list of filename patterns. When the set of files changed in a push does not match this list, the job is optimized away.

This works well for some tasks. For example, the eslint task lists all files that might contain Javascript, along with configuration and source for the linting process. But it doesn’t scale to more complex or common jobs. For example, which files should be included for a mochitest run on OS X? In general, this approach is verbose and couples the task descriptions tightly to the source code.
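A hedged sketch of the idea behind when.files-changed (the matching rules used in-tree may differ; this version uses Python's fnmatch):

from fnmatch import fnmatch

def should_run(changed_files, patterns):
    # keep the task if any changed file matches any of its patterns
    return any(fnmatch(path, pattern)
               for path in changed_files
               for pattern in patterns)

print(should_run(["tools/lint/eslint/.eslintrc.js"],
                 ["tools/lint/**", "**/*.js"]))       # True: run eslint
print(should_run(["layout/reftests/fonts/test.html"],
                 ["tools/lint/**", "**/*.js"]))       # False: optimize it away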

Optimization Tomorrow

Fine-tuning when we run eslint is only 1/8000’th of the problem. We need an approach to optimization that can take some “big bites” out of task graphs, such as omitting entire platforms or test suites. We have done a little of this for servo, e10s, and a few other projects, but using ad-hoc approaches.

While pursuing large impact, we need to ensure we do not over-optimize, so the approach must fail open: if in doubt, a task should run. Once wasteful runs are observed, it should be fairly simple to define the circumstance and represent that in code. This approach is the opposite of “when.files-changed”, which over-optimizes unless the author has named every file that might affect the task.

Tagging

The proposed approach is to tag each source file with named task groups that it “affects”. In this case, “affects” is taken strictly. For example, a change to a chrome file is best tested by the browser-chrome suite, but could potentially affect other tests or even builds if it contains a syntax error. However, a push containing only changes under layout/reftests cannot possibly affect anything but reftests. Similarly, changes limited to mobile/android cannot possibly affect any platform but Android.

The tagging is done using familiar clauses in moz.build files:

with Files('mobile/android/**'):
    AFFECTS_TASKS += ['android']

(note that this syntax is illustrative; the details are not yet decided)

To “fail open”, files which do not have any tags are treated as having all tags. Given these annotations, for a given push, it is straightforward to calculate the set of affected tags.
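As a sketch of how the affected tags for a push might be computed under that fail-open rule (the tag data and matching here are illustrative, not the proposed moz.build machinery):

from fnmatch import fnmatch

ALL_TAGS = {"android", "reftests", "servo"}
FILE_TAGS = [
    ("mobile/android/**", {"android"}),
    ("layout/reftests/**", {"reftests"}),
    ("servo/**", {"servo"}),
]

def affected_tags(changed_files):
    affected = set()
    for path in changed_files:
        tags = set()
        for pattern, file_tags in FILE_TAGS:
            if fnmatch(path, pattern):
                tags |= file_tags
        # fail open: an untagged file is treated as affecting everything
        affected |= tags or ALL_TAGS
    return affected

print(affected_tags(["mobile/android/config/mozconfig"]))  # {'android'}
print(affected_tags(["xpcom/string/nsString.cpp"]))        # all tags (fail open)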

Task Configuration

Task configuration can contain an optimizations element specifying the set of tags that affect this task:

label: android-test-android-4.3-arm7-api-15/debug-reftest-1
optimizations:
    - [skip-unless-affects, [android, reftests]]

Meaning, skip unless the changes affect either android or reftests. In most cases, this value would be calculated by the transforms rather than included directly in the YAML source files.
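Evaluating that optimization then reduces to a set intersection; a hypothetical sketch:

def skip_unless_affects(task_tags, affected):
    # return True if the task can be optimized away for this push
    return not (set(task_tags) & set(affected))

print(skip_unless_affects(["android", "reftests"], {"reftests"}))  # False: keep it
print(skip_unless_affects(["android", "reftests"], {"servo"}))     # True: skip it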

Practical Examples

Directories such as mobile/android or browser/ that are only used on certain platforms are an easy target. Changes limited to these directories can omit entire platforms.

Test-only changes can also be aggressively optimized. While each test file technically affects only one task, we can achieve a good balance of effectiveness and detailed configuration by tagging by test suite (browser-chrome, wpt, mochitest, etc.)

Servo synchronizes its changes into the servo/ directory in the Gecko source tree, and changes to that directory only realistically affect the linux64 and linux64-servo platforms. Tagging servo/** with servo, and adding servo to skip-unless-affects for tasks on those two platforms would be enough to achieve this result, with the advantage that tasks for other platforms can be re-added easily using the “add jobs” or “backfill” actions in Treeherder.

July 27, 2017 07:00 PM

July 19, 2017

Chinmay Kousik

Livelog Proxy (WebhookTunnel): Final Work Product

The project was initially named Livelog Proxy, but during the community bonding period was renamed to Webhooktunnel, as it more accurately captured the full scope of the project. The Webhooktunnel repository can be found here.

Tasks Completed:

Webhooktunnel Details:

Webhooktunnel works by multiplexing HTTP requests over a WebSocket connection. This allows clients to connect to the proxy and serve webhooks over the WebSocket connection instead of exposing a port to the internet.

The connection process for clients (workers) is explained in the diagram below:

The client (worker) needs an ID and JWT to connect to the proxy. These are supplied by tc-auth. The proxy (whtunnel) responds by upgrading the HTTP(S) connection to a websocket connection and supplies the client’s base URL in a response header.

An example of request forwarding works as follows:

Webhooktunnel can also function as a websocket proxy.

Webhooktunnel has already been integrated into Taskcluster worker and is used for serving livelogs from task builds.

The core of Webhooktunnel is the multiplexing library wsmux. Wsmux allows creating client and server sessions over a WebSocket connection and creates multiplexed streams over the connection. These streams are exposed using a net.Conn interface.

Webhooktunnel also consists of a command line client, which can forward incoming connections from the proxy to a local port. This is useful as it can be used by servers which are behind a NAT/Firewall.

July 19, 2017 05:00 PM

June 22, 2017

Dustin Mitchell

Taskcluster Manual Revamp

As the Great Taskcluster Migration draws near the finish line, we are seeing people new to Taskcluster and keen to take advantage of its new features every day. It’s exciting to build something with such expressive power: easy-to-use loaners, automatic toolchain builds, and a simple process for adding new tests, to name just a few.

We have long had a thorough reference section, with technical details of the various microservices and workers that comprise Taskcluster, but that information is a bit too deep for a newcomer. A few years ago, we introduced a tutorial to guide the beginning user to the knowledge they need for their use-case, but the tutorial only goes so far.

Daniele Procida gave a great talk at PyCon 2017 about structuring documentation, which came down to this diagram:

 Tutorials   | How-To Guides 
-------------|---------------
 Discussions | Reference     

This shows four types of documentation. The top is practical, while the bottom is more theoretical. The left side is useful for learning, while the right side is useful when trying to solve a problem. So the missing components are “discussion” and “how-to guides”. Daniele’s “discussions” means prose-style expositions of a system, organized to increase the reader’s understanding of the system as a whole.

Taskcluster has had a manual for quite a while, but it did not really live up to this promise. Instead, it was a collection of documents that didn’t fit anywhere else.

Over the last few months, we have refashioned the manual to fit this form. It now starts out with a gentle but thorough description of tasks (the core concept of Taskcluster), then explains how tasks are executed before delving into the design of the system. At the end, it includes a bunch of use-cases with advice on how to solve them, filling the “how-to guides” requirement.

If you’ve been looking to learn more about Taskcluster, check it out!

June 22, 2017 11:22 AM

June 16, 2017

Chinmay Kousik

WebSocket Multiplexer Overview

General Idea

WebSocket multiplexer enables creation of multiple tcp-like streams over a WebSocket connection. Since each stream can be treated as a separate net.Conn instance, it is used by other components to proxy HTTP requests. A new stream can be opened for each request, and they can be handled in a manner identical to tcp streams. Wsmux contains two components: Sessions and Streams. Sessions wrap WebSocket connections and allow management of streams over the connection. Session implements net.Listener, and can be used by an http.Server instance to serve requests. Streams are the interface which allow users to send and receive multiplexed data. Streams implement net.Conn. Streams have internal mechanisms for buffering and congestion control.

Why WebSocket?

The decision to use WebSocket (github.com/gorilla/websocket) instead of supporting a net.Conn was made for the following reasons:

  • WebSocket handshakes can be used for initiating a connection instead of writing a custom handshake. Wsmux can be used as a subprotocol in the WebSocket handshake. This greatly simplifies the implementation of wsmux.
  • WebSocket convenience methods (ReadMessage and WriteMessage) simplify sending and receiving of data and control frames.
  • Control messages such as ping, pong, and close need not be implemented separately in wsmux. WebSocket control frames can be used for this purpose.
  • Adding another layer of abstraction over WebSocket enables connections to be half-closed. WebSocket does not allow for half-closed connections, but wsmux streams can be half-closed, thus simplifying usage.
  • Since WebSocket frames already contain the length of the message, the length field can be dropped from wsmux frames. This reduces the size of the wsmux header to 5 bytes.

Framing

WebSocket multiplexer implements a very simple framing technique. The total header size is 5 bytes.

[ message type - 8 bits ][ stream ID - 32 bits ]

Messages can have the following types:

  • msgSYN: Used to initiate a connection.
  • msgACK: Used to acknowledge bytes read on a connection.
  • msgDAT: Signals that data is being sent.
  • msgFIN: Signals stream closure.
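For illustration, here is a sketch of packing and unpacking that 5-byte header in Python; the numeric type values and byte order are assumptions for the example, not taken from the wsmux source.

import struct

MSG_SYN, MSG_ACK, MSG_DAT, MSG_FIN = range(4)  # illustrative values only

def pack_frame(msg_type, stream_id, payload=b""):
    # 1-byte message type + 4-byte stream ID, followed by the payload
    return struct.pack(">BI", msg_type, stream_id) + payload

def unpack_frame(frame):
    msg_type, stream_id = struct.unpack(">BI", frame[:5])
    return msg_type, stream_id, frame[5:]

print(unpack_frame(pack_frame(MSG_DAT, 7, b"hello")))  # (2, 7, b'hello')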

Sessions

A Session wraps a WebSocket connection and enables usage of the wsmux subprotocol. A Session is typically used to manage the various wsmux streams. Sessions are of two types: Server and Client. The only difference between a Server Session and a Client Session is that the ID of a stream created by a Server Session will be even numbered while the ID of a stream created by a Client will be odd numbered. A connection must have only one Server and one Client. Sessions read messages from the WebSocket connection and forward the data to the appropriate stream. Streams are responsible for buffering and framing of data. Streams must send data by invoking the send method of their Session.

Streams

Streams allow users to interface with data tagged with a particular ID, called the stream ID. Streams contain circular buffers for congestion control and are also responsible for generating and sending msgACK frames to the remote session whenever data is read. Streams handle framing of data when data is being sent, and also allow setting of deadlines for Read and Write calls. Internally, streams are represented using a Finite State Machine, which has been described in a previous blog post. Performance metrics for streams have also been measured and are available here.

Conclusion

Wsmux is being used by the two major components of Webhooktunnel: Webhook Proxy, and Webhook Client. It has been demonstrated that wsmux can be used to multiplex HTTP requests reliably, and can also support WebSocket over wsmux streams.

The repository can be found here.

June 16, 2017 02:30 PM

May 31, 2017

Dustin Mitchell

TaskCluster-Github: Post Comments and Status Live!

The Taskcluster team maintains a Github App named, appropriately enough, “Taskcluster”. When a pull request is created or a change is pushed to a repository, this app can be configured to start tasks automatically. It’s typically used to build or run tests, but the sky is the limit: the full expressive power of Taskcluster is available!

Recently, Alexandre Poirot added support for making updates in the Github UI while a task is running:

  • createStatus allows updates to the commit status, such as to indicate the current phase or to rapidly indicate test failure before the entire task is complete.
  • createComment allows the task (or anything with the proper scopes, really) to comment on Github issues and pull requests.

This has been a common request, and the Taskcluster team is excited that Alexandre took the time to implement it.

There’s one fly in the ointment: when we set the app up, we did not configure it with permission to comment on issues. That permission is “read/write issues”, and gives the app permission to manipulate all issues in any configured repositories. It does not give the app permission to modify the source code in a repo.

While we can modify the app’s permissions, the result is not actually available for use until the relevant Github org administrators “OK” the change. Thus, this feature may not be available for your organization. Org admins will get an email when we modify the permission with instructions as to how to accept the new permissions.

May 31, 2017 06:54 PM

May 24, 2017

Chinmay Kousik

Stream Metrics

In the previous post, I gave a brief explanation of how the stream has been refactored to resemble a finite state machine. This post elaborates on the performance metrics of streams based on buffer size and number of concurrent streams.

Buffers

Each stream has an internal buffer which is used to store data which is sent to it from the remote side. The default buffer size is 1024 bytes as of now. The buffer size is immutable and cannot be changed once the stream has been created. The buffer is internally implemented as a circular queue of bytes, and implements io.ReadWriter. When a stream is created, the stream assumes the remote buffer capacity to be zero. When the stream is accepted, the remote connection informs the stream of its buffer size and the remote buffer capacity is updated. Streams are set up to track remote capacity, unblocking bytes when an ACK frame arrives, and reducing remote capacity when a certain number of bytes are written. Streams can only write as many bytes as the remote capacity allows, and will block writes until further bytes are unblocked. Thus, buffer size has a significant effect on performance.
The following plot shows the time taken for a Write() call over 100 concurrent streams as a function of buffer size.

1500 bytes are sent and echoed back over each stream. It is clear that the time taken decreases exponentially as a function of buffer size. This is because smaller buffers require more messages to be sent over the websocket connection. A stream with a 1024 byte buffer needs to exchange a minimum of 3 messages for the data to be completely sent to the remote connection: write 1024 bytes, receive ACK >= 476 bytes, write 476 bytes. A stream with a large enough buffer can write data using a single message. The intended buffer size is 4k.

Concurrent Streams

Each session is capable of handling concurrent streams. This test keeps the buffer size constant at 1024 bytes and varies the number of concurrent streams.
The following plot describes the time taken to echo 1500 bytes over each stream with a buffer size of 1k as a function of number of concurrent streams:

A quadratic curve fits this data well. A reason for this could be a limit on the throughput of the WebSocket connection.

May 24, 2017 02:30 PM

May 16, 2017

Chinmay Kousik

Stream States Part 1

Streams can be modelled as an FSM by determining the different states a stream can be in and all valid state transitions. Initially, a stream is in the created state. This state signifies that the stream has been created. This is possible in two different ways: a) The stream was created by the session, added to the stream map, and a SYN packet with the stream’s ID was sent to the remote session, or b) A SYN packet was received from the remote session and a new stream was created and added to the stream map. In case of (a) the stream waits for an ACK packet from the remote session and as soon as the ACK packet arrives, it transitions to the accepted state. In case of (b) the session sends an ACK packet, and the stream transitions to the accepted state.

Once in the accepted state, the stream can read and write data. When a DAT packet arrives, the data is pushed to the stream’s buffer. When data is read out of the buffer using a Read() call, an ACK packet is sent to the remote stream with the number of bytes read. When an ACK packet is received in the accepted state, the number of bytes unblocked (the number of bytes the remote session is willing to accept) is updated. If the stream is closed by a call to Close(), then the stream transitions to the closed state and sends a FIN packet to the remote stream. When a FIN packet is received, the stream transitions to the remoteClosed state.

In the closed state, the stream cannot write any data to the remote connection. All Write() calls return an ErrBrokenPipe error. The stream can still receive data, and can read data from the buffer.

The remoteClosed state signifies that the remote stream will not send any more data to the stream. Read() calls can still read data from the buffer. If the buffer is empty then the Read() calls return EOF. The stream can write data to the remote session.

If a FIN packet is received when in the closed state, or Close() is called in the remoteClosed state, the stream transitions to the dead state. All Write() calls fail in the dead state, but Read() can retrieve data from the buffer. If the stream is in the dead state, and the buffer is empty, then the stream is removed by its Session.
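The same transitions can be written down as a small table; here is a sketch in Python, where the state and event names mirror the prose above rather than the Go implementation.

TRANSITIONS = {
    ("created", "ACK received"):        "accepted",
    ("created", "ACK sent"):            "accepted",
    ("accepted", "Close() called"):     "closed",
    ("accepted", "FIN received"):       "remoteClosed",
    ("closed", "FIN received"):         "dead",
    ("remoteClosed", "Close() called"): "dead",
}

def next_state(state, event):
    # events not listed for a state leave the stream where it is
    return TRANSITIONS.get((state, event), state)

print(next_state("accepted", "FIN received"))  # remoteClosed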

The state transitions can be summed up in the following diagram:

stream states

May 16, 2017 05:00 PM

May 08, 2017

Chinmay Kousik

GSOC Project: Webhook Tunnel

I got accepted to Google Summer of Code (GSoC) 2017. I will be working with Mozilla Taskcluster, and my project is Webhook Tunnel (we changed the name from livelog proxy). TaskCluster workers are hosted on services such as EC2 and currently expose ports to the internet to allow clients to call API endpoints. This may not be feasible in a data center setup. Webhook proxy aims to mitigate this problem by allowing workers to connect to a proxy (part of webhook tunnel) over an outgoing WebSocket connection; the proxy in turn exposes API endpoints to the internet. This is implemented as a distributed system for handling high loads.

This is similar to ngrok, or localtunnel, but a key difference is that instead of providing a port that clients can connect to, webhook tunnel exposes APIs as “<worker-id>.taskcluster-proxy.net/<endpoint>”. This is a much more secure way of exposing endpoints.

The initial plan is to deploy this on Docker Cloud. Details will follow in further posts.

May 08, 2017 04:01 PM

November 30, 2016

Pete Moore

Task Execution on Windows™

Objectives

Windows 7

November 30, 2016 07:25 PM

August 03, 2016

Selena Deckelmann

TaskCluster 2016Q2 Retrospective

The TaskCluster Platform team worked very hard in Q2 to support the migration off Buildbot, bring new projects into our CI system and look forward with experiments that might enable fully-automated VM deployment on hardware in the future.

We also brought on 5 interns. For a team of 8 engineers and one manager, this was a tremendous team accomplishment. We are also working closely with interns on the Engineering Productivity and Release Engineering teams, resulting in a much higher communication volume than in months past.

We continued our work with RelOps to land Windows builds, and those are available in pushes to Try. This means people can use “one click loaners” for Windows builds as well as Linux (through the Inspect Task link for jobs)! Work on Windows tests is proceeding.

We also created try pushes for Mac OS X tests, and integrated them with the Mac OS X cross-compiled builds. This also meant deep diving into the cross-compiled builds to green them up in Q3 after some compiler changes.

A big part of the work for our team and for RelEng was preparing to implement a new kind of signing process. Aki and Jonas spent a good deal of time on this, as did many other people across PlatformOps. What came out of that work was a detailed specification for TaskCluster changes and for a new service from RelEng. We expect to see prototypes of these ideas by the end of August, and the major blocking changes to the workers and provisioner to be complete then too.

This all leads to being able to ship Linux Nightlies directly from TaskCluster by the end of Q3. We’re optimistic that this is possible, with the knowledge that there are still a few unknowns and a lot has to come together at the right time.

Much of the work on TaskCluster is like building a 747 in-flight. The microservices architecture enables us to ship small changes quickly and without much pre-arranged coordination. As time has gone on, we have consolidated some services (the scheduler is deprecated in favor of the “big graph” scheduling done directly in the queue), separated others (we’ve moved Treeherder-specific services into its own component, and are working to deprecate mozilla-taskcluster in favor of a taskcluster-hg component), and refactored key parts of our systems (in-tree scheduling last quarter was an important change for usability going forward). This kind of change is starting to slow down as the software and the team adapts and matures.

I can’t wait to see what this team accomplishes in Q3!

Below is the team’s partial list of accomplishments and changes. Please drop by #taskcluster or drop an email to our tools-taskcluster@lists.mozilla.org mailing list with questions or comments!

Things we did this quarter:

  • initial investigation and timing data around using sccache for linux builds
  • released update for sccache to allow working in a more modern python environment
  • created taskcluster managed s3 buckets with appropriate policies
  • tested linux builds with patched version of sccache
  • tested docker-worker on packet.net for on-hardware testing
  • worked with jmaher on talos testing with docker-worker on releng hardware
  • created livelog plugin for taskcluster-worker (just requires tests now)
  • added reclaim logic to taskcluster-worker
  • converted gecko and gaia in-tree tasks to use new v2 treeherder routes
  • Updated gaia-taskcluster to allow github repos to use new taskcluster-treeherder reporting
  • move docs, schemas, references to https
  • refactor documentation site into tutorial / manual / reference
  • add READMEs to reference docs
  • switch from a * certificate to a SAN certificate for taskcluster.net
  • increase accessibility of AWS provisioner by separating bar-graph stuff from workerType configuration
  • use roles for workerTypes in the AWS provisioner, instead of directly specifying scopes
  • allow non-employees to login with Okta, improve authentication experience
  • named temporary credentials
  • use npm shrinkwrap everywhere
  • enable coalescing
  • reduce the artifact retention time for try jobs (to reduce S3 usage)
  • support retriggering via the treeherder API
  • document azure-entities
  • start using queue dependencies (big-graph-scheduler)
  • worked with NSS team to have tasks scheduled and displayed within treeherder
  • Improve information within docker-worker live logs to include environment information (ip address, instance type, etc)
  • added hg fingerprint verification to decision task
  • Responded and deployed patches to security incidents discovered in q2
  • taskcluster-stats-collector running with signalfx
  • most major services using signalfx and sentry via new monitoring library taskcluster-lib-monitor
  • Experimented with QEMU/KVM and libvirt for powering a taskcluster-worker engine
  • QEMU/KVM engine for taskcluster-worker
  • Implemented Task Group Inspector
  • Organized efforts around front-end tooling
  • Re-wrote and generalized the build process for taskcluster-tools and future front-end sites
  • Created the Migration Dashboard
  • Organized efforts with contractors to redesign and improve the UX of the taskcluster-tools site
  • First Windows tasks in production – NSS builds running on Windows 2012 R2
  • Windows Firefox desktop builds running in production (currently shown on staging treeherder)
  • new features in generic worker (worker type metadata, retaining task users/directories, managing secrets in secrets store, custom drive for user directories, installing as a startup item rather than service, improved syscall integration for logins and executing processes as different users)
  • many firefox desktop build fixes including fixes to python build scripts, mozconfigs, mozharness scripts and configs
  • CI cleanup https://travis-ci.org/taskcluster
  • support for relative definitions in jsonschema2go
  • schema/references cleanup

Paying down technical debt

  • Fixed numerous issues/requests within mozilla-taskcluster
  • properly schedule and retrigger tasks using new task dependency system
  • add more supported repositories
  • Align job state between treeherder and taskcluster better (i.e cancels)
  • Add support for additional platform collection labels (pgo/asan/etc)
  • fixed retriggering of github tasks in treeherder
  • Reduced space usage on workers using docker-worker by removing temporary images
  • fixed issues with gaia decision task that prevented it from running since March 30th.
  • Improved robustness of image creation
  • Fixed all linter issues for taskcluster-queue
  • finished rolling out shrinkwrap to all of our services
  • began trial of having travis publish our libraries (rolled out to 2 libraries now. talking to npm to fix a bug for a 3rd)
  • turned on greenkeeper everywhere then turned it off again for the most part (it doesn’t work with shrinkwrap, etc)
  • “modernized” (newer node, lib-loader, newest config, directory structure, etc) most of our major services
  • fix a lot of subtle background bugs in tc-gh and improve logging
  • shared eslint and babel configs created and used in most services/libraries
  • instrumented taskcluster-queue with statistics and error reporting
  • fixed issue where task dependency resolver would hang
  • Improved error message rendering on taskcluster-tools
  • Web notifications for one-click-loaner UI on taskcluster-tools
  • Migrated stateless-dns server from tutum.co to docker cloud
  • Moved provisioner off azure storage development account
  • Moved our npm package to a single npm organization

August 03, 2016 07:56 PM

June 30, 2016

Ben Hearsum

Building and Pushing Docker Images with Taskcluster-Github

Earlier this year I spent some time modernizing and improving Balrog's toolchain. One of my goals in doing so was to switch from Travis CI to Taskcluster, both to give us more flexibility in our CI configuration and to help dogfood Taskcluster-Github. One of the most challenging aspects of this was how to build and push our Docker image, and I'm hoping this post will make it easier for other folks who want to do the same in the future.

The Task Definition

Let's start by breaking down the Task definition from Balrog's .taskcluster.yml. Like other Taskcluster-Github jobs, we use the standard taskcluster.docker provisioner and worker.

  - provisionerId: "{{ taskcluster.docker.provisionerId }}"
    workerType: "{{ taskcluster.docker.workerType }}"

Next, we have something a little different. This section grants the Task access to a secret (managed by the Secrets Service). More on this later.

    scopes:
      - secrets:get:repo:github.com/mozilla/balrog:dockerhub

The payload has a few things of note. Because we're going to be building Docker images it makes sense to use Taskcluster's image_builder Docker image as well as enabling the docker-in-docker feature. The taskclusterProxy feature is needed to access the Secrets Service.

    payload:
      maxRunTime: 3600
      image: "taskcluster/image_builder:0.1.3"
      features:
        dind: true
        taskclusterProxy: true
      command:
        - "/bin/bash"
        - "-c"
        - "git clone $GITHUB_HEAD_REPO_URL && cd balrog && git checkout $GITHUB_HEAD_BRANCH && scripts/push-dockerimage.sh"

The extra section has some metadata for Taskcluster-Github. Unlike CI tasks, we limit this to only running on pushes (not pull requests) to the master branch of the repository. Because only a few people can push to this branch, it means that only these can trigger Docker builds.

    extra:
      github:
        env: true
        events:
          - push
        branches:
          - master

Finally, we have the metadata, which is just standard Taskcluster stuff.

    metadata:
      name: Balrog Docker Image Creation
      description: Balrog Docker Image Creation
      owner: "{{ event.head.user.email }}"
      source: "{{ event.head.repo.url }}"

Secrets

I mentioned the "Secrets Service" earlier, and it's the key piece that enables us to securely push Docker images. Putting our Dockerhub password in it means access is limited to those who have the right scopes. We store it in a secret with the key "repo:github.com/mozilla/balrog:dockerhub", which means that anything with the "secrets:get:repo:github.com/mozilla/balrog:dockerhub" scope is granted access to it. My own personal Taskcluster account has it, which lets me set or change the password:

We also have a Role called "repo:github.com/mozilla/balrog:branch:master" which has that scope:

You can see from its name that this Role is associated with the Balrog repository's master branch. Because of this, any Tasks created as a result of pushes to that branch of that repository may assign the scopes that Role has - like we did above in the "scopes" section of the Task.

Building and Pushing

The last piece of the puzzle here is the actual script that does the building and pushing. Let's look at a few specific parts of it.

To start with, we deal with retrieving the Dockerhub password from the Secrets Service. Because we enabled the taskclusterProxy earlier, "taskcluster" resolves to the hosted Taskcluster services. Had we forgotten to grant the Task the necessary scope, this would return a 403 error.

password_url="taskcluster/secrets/v1/secret/repo:github.com/mozilla/balrog:dockerhub"
dockerhub_password=$(curl ${password_url} | python -c 'import json, sys; a = json.load(sys.stdin); print a["secret"]["dockerhub_password"]')

We build, tag, and push the image, which is very similar to building it locally. If we'd forgotten to enable the dind feature, this would throw errors about not being able to run Docker.

docker build -t mozilla/balrog:${branch_tag} .
docker tag mozilla/balrog:${branch_tag} "mozilla/balrog:${date_tag}"
docker login -e $dockerhub_email -u $dockerhub_username -p $dockerhub_password
docker push mozilla/balrog:${branch_tag}
docker push mozilla/balrog:${date_tag}

Finally, we attach an artifact to our Task containing the sha256 of the Docker images. This allows consumers of the Docker image to verify that they're getting exactly what we built, and not something that may have been tampered with on Dockerhub or in transit.

sha256=$(docker images --no-trunc mozilla/balrog | grep "${date_tag}" | awk '/^mozilla/ {print $3}')
put_url=$(curl --retry 5 --retry-delay 5 --data "{\"storageType\": \"s3\", \"contentType\": \"text/plain\", \"expires\": \"${artifact_expiry}\"}" ${artifact_url} | python -c 'import json; import sys; print json.load(sys.stdin)["putUrl"]')
curl --retry 5 --retry-delay 5 -X PUT -H "Content-Type: text/plain" --data "${sha256}" "${put_url}"

The Result

Now that you've seen how it's put together, let's have a look at the end result. This is the most recent Balrog Docker build Task. You can see the sha256 artifact created on it.

And of course, the newly built image has shown up on the Balrog Dockerhub repo.

June 30, 2016 07:29 PM

June 27, 2016

Wander Lairson Costa

The taskcluster-worker Mac OSX engine

In this quarter, I worked on implementing the taskcluster-worker Mac OSX engine. Before talking about this specific implementation, let me explain what a worker is and how taskcluster-worker differs from docker-worker, currently the main worker in Taskcluster.

The role of a Taskcluster worker

When a user submits a task graph to Taskcluster, contrary to common sense (at least if you are used to how OS schedulers usually work), the tasks are submitted to the scheduler first, which is responsible for processing dependencies and enqueuing them. The Taskcluster manual page has a clear picture illustrating this concept.

The provisioner is responsible for looking at the queue and determining how many pending tasks exist; based on that, it launches worker instances to run these tasks.

Then comes the figure of the worker. The worker is responsible for actually executing the task. It claims a task from the queue, runs it, uploads the generated artifacts and submits the status of the finished task, using the Taskcluster APIs.
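
To make that claim/run/report cycle concrete, here is a minimal, illustrative sketch of such a loop in Python. This is not actual worker code (docker-worker is node, taskcluster-worker is go), and the endpoint paths and payload shapes below are assumptions standing in for the real Queue API:

# Illustrative worker loop; endpoint paths and payload shapes are assumptions,
# not the real Taskcluster Queue API.
import time

import requests

QUEUE = "https://queue.taskcluster.net/v1"  # hosted queue base URL

def worker_loop(provisioner_id, worker_type, run_task, upload_artifacts):
    while True:
        # 1. Claim pending tasks for this provisionerId/workerType pair.
        resp = requests.post(f"{QUEUE}/claim-work/{provisioner_id}/{worker_type}",
                             json={"workerGroup": "example", "workerId": "worker-1"})
        for claim in resp.json().get("tasks", []):
            task_id = claim["status"]["taskId"]
            # 2. Execute the task payload (this part is engine/worker specific).
            result = run_task(claim["task"]["payload"])
            # 3. Upload the artifacts the task generated.
            upload_artifacts(task_id, result["artifacts"])
            # 4. Report the final status of the finished task back to the queue.
            requests.post(f"{QUEUE}/task/{task_id}/completed")
        time.sleep(5)  # back off briefly when there was nothing to claim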

docker-worker is a worker that runs task commands inside a docker container. The task payload specifies a docker image as well as a command line to run, among other environment parameters. docker-worker pulls the specified docker image and runs the task commands inside it.

taskcluster-worker and the OSX engine

taskcluster-worker is a generic and modularized worker under active development by the Taskcluster team. The worker delegates the task execution to one of the available engines. An engine is a component of taskcluster-worker responsible for running a task under a specific system environment. Other features, like environment variable setting, live logging, artifact uploading, etc., are handled by worker plugins.

I am implementing the Mac OSX engine, which will mainly be used to run Firefox automated tests in the Mac OSX environment. There is a macosx branch in my personal Github taskcluster-worker fork in which I push my commits.

One specific aspect of the engine implementation is the ability to run more than one task at the same time. For this, we need some kind of task isolation. In docker-worker, each task runs in its own docker container, so tasks are isolated by definition. But there is no such thing as a container for the OSX engine. Our earlier tries with chroot failed miserably, due to incompatibilities with the OSX graphics system. Our final solution was to create a new user on the fly and run the task with that user’s credentials. This not only provides some task isolation, but also prevents privilege escalation attacks, because tasks run as a different user than the worker.

Instead of dealing with the poorly documented Open Directory Framework, we chose to spawn the dscl command to create and configure users. Tasks usually take a long time to execute and spawn loads of subprocesses, so a few extra spawns of the dscl command won’t have any practical performance impact.
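
As a rough illustration only (this is not the engine’s actual code), creating a throwaway task user by shelling out to dscl could look something like the Python sketch below; the user name, UID, group and password are made-up example values:

# Rough sketch: create a throwaway task user via dscl (example values only).
import subprocess

def create_task_user(name="task-user-1", uid=1050, password="changeme"):
    def dscl(*args):
        # Run "dscl ." against the local directory node.
        subprocess.run(["dscl", ".", *args], check=True)

    dscl("-create", f"/Users/{name}")                          # new user record
    dscl("-create", f"/Users/{name}", "UserShell", "/bin/bash")
    dscl("-create", f"/Users/{name}", "UniqueID", str(uid))
    dscl("-create", f"/Users/{name}", "PrimaryGroupID", "20")  # the "staff" group
    dscl("-create", f"/Users/{name}", "NFSHomeDirectory", f"/Users/{name}")
    dscl("-passwd", f"/Users/{name}", password)
    subprocess.run(["mkdir", "-p", f"/Users/{name}"], check=True)  # home directory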

One final aspect is how we bootstrap task execution. A task boils down to a script that carries out the task duties. But where does this script come from? It doesn’t live on the machine that runs the worker. The OSX engine provides a link field in the task payload in which a task can specify an executable to download and execute.
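
For example, a payload for the OSX engine might look roughly like the sketch below; only the link field is described above, the other fields are assumptions for illustration:

# Hypothetical OSX engine payload: "link" points at an executable the engine
# downloads and runs; the other fields are illustrative assumptions.
payload = {
    "link": "https://example.com/run-firefox-tests.sh",
    "env": {"MOZ_TEST_SUITE": "mochitest"},
    "maxRunTime": 3600,
}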

Running the worker

The OSX engine will primarily be used to execute Firefox tests on Mac OSX, and the environment is expected to have a very specific set of tools and configurations. Because of that, I am testing the code on a loaner machine. To start the worker, it is just a matter of opening a terminal and typing:

$ ./taskcluster-worker work macosx --logging-level debug

The worker connects to the Taskcluster queue, then claims and executes the available tasks. At the time of writing, all tests but the “Firefox UI functional tests” were green, running on optimized Firefox OSX builds. We intend to land Firefox tests in taskcluster-worker as Tier-2 next quarter, running them in parallel with Buildbot.

June 27, 2016 12:00 AM

May 02, 2016

Maja Frydrychowicz

Not Testing a Firefox Build (Generic Tasks in TaskCluster)

A few months ago I wrote about my tentative setup of a TaskCluster task that was neither a build nor a test. Since then, gps has implemented “generic” in-tree tasks so I adapted my initial work to take advantage of that.

Triggered by file changes

All along I wanted to run some in-tree tests without having them wait around for a Firefox build or any other dependencies they don’t need. So I originally implemented this task as a “build” so that it would get scheduled for every incoming changeset in Mozilla’s repositories.

But forget “builds”, forget “tests” — now there’s a third category of tasks that we’ll call “generic” and it’s exactly what I need.

In base_jobs.yml I say, “hey, here’s a new task called marionette-harness — run it whenever there’s a change under (branch)/testing/marionette/harness”. Of course, I can also just trigger the task with try syntax like try: -p linux64_tc -j marionette-harness -u none -t none.

When the task is triggered, a chain of events follows.

For Tasks that Make Sense in a gecko Source Checkout

As you can see, I made the build.sh script in the desktop-build docker image execute an arbitrary in-tree JOB_SCRIPT, and I created harness-test-linux.sh to run mozharness within a gecko source checkout.

Why not the desktop-test image?

But we can also run arbitrary mozharness scripts thanks to the configuration in the desktop-test docker image! Yes, and all of that configuration is geared toward testing a Firefox binary, which implies downloading tools that my task either doesn’t need or already has access to in the source tree. Now we have a lighter-weight option for executing tests that don’t exercise Firefox.

Why not mach?

In my lazy work-in-progress, I had originally executed the Marionette harness tests via a simple call to mach, yet now I have this crazy chain of shell scripts that leads all the way to mozharness. The mach command didn’t disappear — you can run Marionette harness tests with ./mach python-test .... However, mozharness provides clearer control of Python dependencies, appropriate handling of return codes to report test results to Treeherder, and I can write a job-specific script and configuration.

May 02, 2016 04:00 AM

April 01, 2016

Wander Lairson Costa

Overcoming browser same origin policy

One of my goals for 2016 Q1 was to write a monitoring dashboard for Taskcluster. It basically pings Taskcluster services to check if they are alive and also acts as a feed aggregator for services Taskcluster depends on. One problem with this approach is the same origin policy, under which web pages are only allowed to make requests to their own domain. Web servers for which it is safe to make these cross-domain requests can either implement jsonp or CORS. CORS is the preferred way, so we will focus on it in this post.

Cross-origin resource sharing

CORS is a mechanism that allows the web server to tell the browser that it is safe to accomplish a cross-domain request. It consists of a set of HTTP headers detailing the conditions under which the request may be accomplished. The main response header is Access-Control-Allow-Origin, which contains either a list of allowed domains or a *, indicating any domain can make a cross-domain request to this server. In a CORS request, only a small set of headers is exposed to the response object. The server can tell the browser to expose additional headers through the Access-Control-Expose-Headers response header.
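
As a concrete illustration, a service that opts in to cross-domain reads could answer as in the sketch below; the header names are the standard CORS ones, while the tiny Python handler itself (and the X-Service-Status header) is just an example:

# Minimal illustrative server that sends CORS headers with its responses.
from http.server import BaseHTTPRequestHandler, HTTPServer

class PingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        # Any origin may read this response from a browser context.
        self.send_header("Access-Control-Allow-Origin", "*")
        # Expose an extra response header beyond the small default set.
        self.send_header("Access-Control-Expose-Headers", "X-Service-Status")
        self.send_header("X-Service-Status", "alive")
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(b'{"alive": true}')

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), PingHandler).serve_forever()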

But what if the web server doesn’t implement CORS? The only solution is to provide a proxy that will make the actual request and add the CORS headers.

cors-proxy

To allow the monitoring dashboard to make requests for the status of remote services that do not implement CORS, we created the cors-proxy. It exports a /request endpoint that allows you to make requests to any remote host. cors-proxy forwards the request to the remote URL and sends the response back, with the appropriate CORS headers set.

Let’s see an example:

$.ajax({
  url: 'https://cors-proxy.taskcluster.net/request',
  method: 'POST',
  contentType: 'application/json',
  data: {
    url: 'https://queue.taskcluster.net/v1/ping',
  }
}).done(function(res) {
  console.log(res);
});

The information about the remote request is sent in the proxy request body. All parameter fields are documented on the project page.

Before you think of using the hosted server to proxy your own requests, cors-proxy only honors requests from a whitelist. So, only some subdomains under the Taskcluster domain can use cors-proxy.

April 01, 2016 12:00 AM

March 30, 2016

Pete Moore

Walkthrough installing Cygwin SSH Daemon on AWS EC2 instances

One of the challenges we face at Mozilla is supporting Windows in an organisational environment which is predominantly *nix oriented. Furthermore, historically our build and test infrastructure has only provided a very limited ssh daemon, with an antiquated shell, and outdated unix tools.

With the move to hosting Windows environments in AWS EC2, the opportunity arose to review our current SSH daemon, and see if we couldn’t do something a little bit better.

When creating Windows environments in EC2, it is possible to launch a “vanilla” Windows instance, from an AMI created by Amazon. This instance is based on a standard installation of a given version of Windows, with a couple of AWS EC2 tools preinstalled.

One of the features of the preinstalled tools is that they allow you to specify powershell and/or batch script snippets inside the instance User Data, which will be executed upon launch.

This makes it quite trivial to customise a Windows environment, by providing all of the customisation steps as a PowerShell snippet in the instance User Data.

In this Walkthrough, we will set up a Windows Server 2012 R2 machine, with the cygwin ssh daemon preinstalled. In order to follow this walkthrough, you will need an AWS account, and the ability to spawn an instance.

Install AWS CLI

Although all of these steps can be performed via the web console, typically we would want to automate them. Therefore in this walkthrough, I’m using the AWS CLI to perform all of the actions, to make it easier should you want to script any of the setup.

Windows installation

Download and run the 64-bit or 32-bit Windows installer.

Mac and Linux installation

Requires Python 2.6.5 or higher.

Install using pip.

$ pip install awscli

Further help

See the AWS CLI guide if you get stuck.

Configuring AWS credentials

If this is your first time running the AWS CLI tool, configure your credentials with:

$ aws configure

See the AWS credentials configuration guide if you need more help.

Locate latest Windows Server 2012 R2 AMI (64bit)

The following command line will find you the latest Windows 2012 R2 stock image, provided by AWS, in your default region.

$ AMI="$(aws ec2 describe-images --owners self amazon --filters \
"Name=platform,Values=windows" \
"Name=name,Values=Windows_Server-2012-R2_RTM-English-64Bit-Base*" \
--query 'Images[*].{A:CreationDate,B:ImageId}' --output text \
| sort -u | tail -1 | cut -f2)"

Now we can see what the current AMI is, in our default region, with:

$ echo "Windows AMI: ${AMI}"
Windows AMI: ami-1719f677

Note, the actual AMI provided by AWS changes from week to week, and from region to region, so don’t be surprised if you get a different result to the one above.

Create a Security Group

We need our instance to be in a security group that allows us to SSH onto it.

First create a security group:

$ SECURITY_GROUP="$(aws ec2 create-security-group --group-name ssh-only \
--description "SSH only" --output text)"

And then update it to only allow inbound SSH traffic:

$ [ -n "${SECURITY_GROUP}" ] && aws ec2 authorize-security-group-ingress \
--group-id "${SECURITY_GROUP}" \
--ip-permissions '[{"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
"IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]'

Create a unique Client Token

We should create a unique client token that will allow us to make idempotent requests, should there be any failures. We will also use this as our “name” for the instance until we get the real instance name back.

$ TOKEN="$(date +%s)"

Create a dedicated Key Pair

We’ll need to specify a key pair in order to retrieve the Windows Password. Let’s create a dedicated one just for this instance.

$ aws ec2 create-key-pair --key-name "${TOKEN}" --query 'KeyMaterial' \
--output text > "${TOKEN}.pem" && chmod 400 "${TOKEN}.pem"

Create custom post-installation script

Typically, you’ll want to customise the cygwin environment, for example:

  • Changing the bash prompt
  • Setting vim options
  • Adding ssh authorized keys
  • ….

Let’s do this in a post installation bash script, which we can download as part of the installation.

In order to be able to authenticate with our new key, we’ll need to get the public part. Note, we could generate separate keys for ssh’ing to our machine, but we might as well reuse the key we just created.

$ PUB_KEY="$(ssh-keygen -y -f "${TOKEN}.pem")"

Create User Data

The AWS Windows Guide advises us that Windows PowerShell commands can be executed if supplied as part of the EC2 User Data. We’ll use this userdata to install cygwin and the ssh daemon from scratch.

Create a file userdata to store the User Data:

$ cat > userdata << 'EOF'
<powershell>

# needed for making http requests
$client = New-Object system.net.WebClient

# download cygwin
$client.DownloadFile("https://www.cygwin.com/setup-x86_64.exe", `
"C:\cygwin-setup-x86_64.exe")

# install cygwin
# complete package list: https://cygwin.com/packages/package_list.html
Start-Process "C:\cygwin-setup-x86_64.exe" -ArgumentList ("--quiet-mode " +
"--wait --root C:\cygwin --site http://cygwin.mirror.constant.com " +
"--packages openssh,vim,curl,tar,wget,zip,unzip,diffutils,bzr") -wait `
-NoNewWindow -PassThru -RedirectStandardOutput "C:\cygwin_install.log" `
-RedirectStandardError "C:\cygwin_install.err"

# open up firewall for ssh daemon
New-NetFirewallRule -DisplayName "Allow SSH inbound" -Direction Inbound `
-LocalPort 22 -Protocol TCP -Action Allow

# workaround for https://www.cygwin.com/ml/cygwin/2015-10/msg00036.html
# see:
#   1) https://www.cygwin.com/ml/cygwin/2015-10/msg00038.html
#   2) https://goo.gl/EWzeVV
$env:LOGONSERVER = "\\" + $env:COMPUTERNAME

# configure sshd
Start-Process "C:\cygwin\bin\bash.exe" -ArgumentList "--login
-c `"ssh-host-config -y -c 'ntsec mintty' -u 'cygwinsshd' \
-w 'qwe123QWE!@#'`"" -wait -NoNewWindow -PassThru -RedirectStandardOutput `
"C:\cygrunsrv.log" -RedirectStandardError "C:\cygrunsrv.err"

# start sshd
Start-Process "net" -ArgumentList "start sshd" -wait -NoNewWindow -PassThru `
-RedirectStandardOutput "C:\net_start_sshd.log" `
-RedirectStandardError "C:\net_start_sshd.err"

# download bash setup script
$client.DownloadFile(
"https://raw.githubusercontent.com/petemoore/myscrapbook/master/setup.sh",
"C:\cygwin\home\Administrator\setup.sh")

# run bash setup script
Start-Process "C:\cygwin\bin\bash.exe" -ArgumentList `
"--login -c 'chmod a+x setup.sh; ./setup.sh'" -wait -NoNewWindow -PassThru `
-RedirectStandardOutput "C:\Administrator_cygwin_setup.log" `
-RedirectStandardError "C:\Administrator_cygwin_setup.err"

# add SSH key
Add-Content "C:\cygwin\home\Administrator\.ssh\authorized_keys" "%{SSH-PUB-KEY}%"
</powershell>
EOF

Fix SSH key

We need to replace the SSH public key placeholder we just referenced in userdata with the actual public key:

$ USERDATA="$(cat userdata | sed "s_%{SSH-PUB-KEY}%_${PUB_KEY}_g")"

Launch new instance

We’re now finally ready to launch the instance. We can do this with the following commands:

$ {
echo "Please be patient, this can take a long time."
INSTANCE_ID="$(aws ec2 run-instances --image-id "${AMI}" --key-name "${TOKEN}" \
--security-groups 'ssh-only' --user-data "${USERDATA}" \
--instance-type c4.2xlarge --block-device-mappings \
DeviceName=/dev/sda1,Ebs='{VolumeSize=75,DeleteOnTermination=true,VolumeType=gp2}' \
--instance-initiated-shutdown-behavior terminate --client-token "${TOKEN}" \
--output text --query 'Instances[*].InstanceId')"
PUBLIC_IP="$(aws ec2 describe-instances --instance-id "${INSTANCE_ID}" --query \
'Reservations[*].Instances[*].NetworkInterfaces[*].Association.PublicIp' \
--output text)"
unset PASSWORD
until [ -n "$PASSWORD" ]; do
    PASSWORD="$(aws ec2 get-password-data --instance-id "${INSTANCE_ID}" \
    --priv-launch-key "${TOKEN}.pem" --output text \
    --query PasswordData)"
    sleep 10
    echo -n "."
done
echo
echo "SSH onto your new instance (${INSTANCE_ID}) with:"
echo "    ssh -i '${TOKEN}.pem' Administrator@${PUBLIC_IP}"
echo
echo "Note, the Administrator password is \"${PASSWORD}\", but it"
echo "should not be needed when connecting with the ssh key."
echo
}

You should get some output similar to this:

Please be patient, this can take a long time.
................
SSH onto your new instance (i-0fe79e45ffb2c34db) with:
    ssh -i '1459795270.pem' Administrator@54.200.218.155

Note, the Administrator password is "PItDM)Ph*U", but it
should not be needed when connecting with the ssh key.

March 30, 2016 11:33 AM

March 11, 2016

Selena Deckelmann

[workweek] tc-worker workweek recap

Sprint recap

We spent this week sprinting on the tc-worker, engines and plugins. We merged 19 pull requests and had many productive discussions!

tc-worker core

We implemented the task loop! This basic loop should start when the worker is invoked. It spins up a task claimer and manager responsible for claiming as many tasks as its available capacity allows and running them to completion. You can find details in this commit. We’re still working on some high level documentation.

We did some cleanups to make it easier to download and get started with builds. We fixed up the packages related to generating go types from json schemas, and the types now conform to the linting rules.

We also implemented the webhookserver. The package provides implementations of the WebHookServer interface which allows attachment and detachment of web-hooks to an internet exposed server. This will support both the livelog and interactive features. Work is detailed in PR 37.

engine: hello, world

Greg created a proof of concept and pushed a successful task to emit a hello, world artifact. Greg will be writing up something to describe this process next week.

plugin: environment variables

Wander landed this plugin this week to support environment variable setting. The work is described in PR 39.

plugin: artifact uploads

This plugin will support artifact uploads for all engines to S3 and is based on generic-worker code. This work is started in PR 55.

TaskCluster design principles

We discussed as a team the ideas behind the design of TaskCluster. The umbrella principle we try to stick to is: Getting Things Built. We felt it was important to say that first because it helps us remember that we’re here to provide features to users, not just design systems. The four key design principles were distilled to:

  • Self-service
  • Robustness
  • Enable rapid change
  • Community friendliness

One surprising connection (to me) we made was that our privacy and security features are driven by community friendliness.

We plan to add our ideas about this to a TaskCluster “about” page.

TaskCluster code review

We discussed our process for code review, and how we’d like to do them in the future. We covered issues around when to do architecture reviews and how to get “pre-reviews” for ideas done with colleagues who will be doing our reviews. We made an outline of ideas and will be giving them a permanent home on our docs site.

Q2 Planning

We made a first pass at our 2016q2 goals. The main theme is to add OS X engine support to taskcluster-worker, continue work on refactoring intree config and build out our monitoring system beyond InfluxDB. Further refinements to our plan will come in a couple weeks, as we close out Q1 and get a better understanding of work related to the Buildbot to TaskCluster migration.

March 11, 2016 11:48 PM

March 08, 2016

Selena Deckelmann

Tier-1 status for Linux 64 Debug build jobs on March 14, 2016

I sent this to dev-planning, dev-platform, sheriffs and tools-taskcluster today. I added a little more context for a non-Mozilla audience.

The time has come! We are planning to switch to Tier-1 on Treeherder for TaskCluster Linux 64 Debug build jobs on March 14. At the same time, we will hide the Buildbot build jobs, but continue running them. This means that these jobs will become what Sheriffs use to determine the health of patches and our trees.

On March 21, we plan to switch the Linux 64 Debug tests to Tier-1 and hide the related Buildbot test jobs.

After about 30 days, we plan to disable and remove all Buildbot jobs related to Linux Debug.

Background:

We’ve been running Linux 64 Debug builds and tests using TaskCluster side-by-side with Buildbot jobs since February 18th. Some of the project work that was done to green up the tests is documented here.

The new tests are running in Docker-ized environments, and the Docker images we use are defined in-tree and publicly accessible.

This work was the culmination of many months of effort, with Joel Maher, Dustin Mitchell and Armen Zambrano primarily focused on test migration this quarter. Thank you to everyone who responded to NEEDINFOs, emails and pings on IRC to help with untangling busted test runs.

On performance, we’re taking a 14% hit across all the new test jobs vs. the old jobs in Buildbot. We ran two large-scale tests to help determine where slowness might still be lurking, and were able to find and fix many issues. There are a handful of jobs remaining that seem significantly slower, while others are significantly faster. We decided that it was more important to deprecate the old jobs and start exclusively maintaining the new jobs now, rather than wait to resolve the remaining performance issues. Over time we hope to address issues with the owners of the affected test suites.

March 08, 2016 10:20 PM

March 07, 2016

Selena Deckelmann

[portland] taskcluster-worker Hello, World

The TaskCluster Platform team is in Portland this week, hacking on the taskcluster-worker.

Today, we all sync’d up on the current state of our worker, and what we’re going to hack on this week. We started with the current docs.

The reason why we’re investing so much time in the worker is twofold:

  • The worker code previously lived in two code bases – docker-worker and generic-worker. We need to unify these code bases so that multiple engineers can work on it, and to help us maintain feature parity.
  • We need to get a worker that supports Windows into production. For now, we’re using the generic-worker, but we’d like to switch over to taskcluster-worker in late Q2 or early Q3. This timeline lines up with when we expect the Windows migration from Buildbot to happen.

One of the things I asked this team to do was come up with some demos of the new worker. The first demo today, from Greg Arndt, was to simply output a log and upload it.

The rest of the team is getting their Go environments set up to run tests and get hacking on crucial plugins, like our environment variable handling and additional artifact uploading logic we need for our production workers.

We’re also taking the opportunity to sync up with our Windows environment guru. Our goal for Buildbot to TaskCluster migration this quarter is focused on Linux builds and tests. Next quarter, we’ll be finishing Linux and, I hope, landing Windows builds in TaskCluster. To do that, we have a lot of details to sort out with how we’ll build Windows AMIs and deploy them. It’s a very different model because we don’t have the same options with Docker as we have on Linux.

March 07, 2016 11:51 PM

March 01, 2016

Jonas Finnemann Jensen

One-Click Loaners with TaskCluster

Last summer Edgar Chen (air.mozilla.org) built an interactive shell for TaskCluster Linux workers, so developers can get an SSH-like session into a task container from their browser. We’ve slowly been improving this, and prior to Mozlando I added support for opening a VNC-like session connected to an X-session inside a task container. I’ll admit I was mostly motivated by the prospect of giving an impressive demo, and the implementation details are likely to change as we improve it further. Consequently, we haven’t got many guides on how to use these features in their current state.

However, with people asking for TaskCluster “loaners” on IRC, I figure now is a good time to explain how these interactive features can be used to provide a loaner-on-demand flow for TaskCluster workers. At least on Linux, but hopefully we can do a similar thing on other platforms too. Before we dive in, I want to note that all of our Linux tasks run under docker with one container per task. Hence, you can pull down the docker image and play with it locally; the process, and caveats such as setting up loopback video and audio devices, are beyond the scope of this post. But feel free to ask on IRC (#taskcluster) — I’m sure Greg Arndt has all the details, and some of them are already present in the “Run Locally” script displayed in the task-inspector.

Quick Start

If you can’t wait to play, here are the bullet points:

  1. You’ll need commit level 1 access (and an LDAP login)
  2. Go to treeherder.mozilla.org and pick a task that runs on TaskCluster (I tried “[TC] Linux64 reftest-3”; build tasks don’t have X.org)
  3. Under “Job details” click “Inspect Task” (this will open the task-inspector)
  4. In the top right corner of the task-inspector click “Login” (this opens login.taskcluster.net in a new tab)
    1. “Sign-in with LDAP” or “Sign-in with Okta” (Okta only works for employees)
    2. Click the “Grant Access” button (to grant tools.taskcluster.net access)
  5. In the task-inspector under the “Task” tab, scroll down and click the “One-Click Loaner” button
  6. Click again to confirm and create a one-click loaner task (this takes you to a “Waiting for Loaner” page)
    1. Just wait… 30s to 5 min (you can open the task-inspector for your loaner task to see the live log, if you are impatient)
    2. Eventually you should see two big buttons to open an interactive shell or display
  7. You should now have an interactive terminal (and display) into a running task container.

Warning: These loaners run on EC2 spot-nodes; they may disappear at any time. Use them for quickly trying something, not for writing patches.

Given all these steps, in particular the “Click again” in step (6), I recognize that it might take more than one click to get a “One-Click Loaner”. But we are just getting started, and all of this should be considered a moving target. The instructions above can also be found on MDN, where we will try to keep them up to date.

Implementation Details

To support interactive shell sessions the worker has an end-point that accepts websocket connections. For each new websocket the worker spawns a sh or bash inside the task container and pipes stdin, stdout and stderr over the websocket. In the browser we then have the websocket reading from and writing to hterm (from the chromium project), giving us a nice terminal emulator in the browser. There are still a few issues with the TTY emulation in docker, but it works reasonably well for small things.
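
The worker itself is written in node (and is being rewritten in go), but the idea is simple enough to sketch. The following is an illustrative Python approximation using the third-party websockets package (recent versions), not the real implementation:

# Illustrative only: pipe a shell's stdio over a websocket, roughly what the
# worker does for interactive shell sessions.
import asyncio

import websockets  # third-party package: pip install websockets

async def shell_session(ws):
    proc = await asyncio.create_subprocess_exec(
        "bash", "-i",
        stdin=asyncio.subprocess.PIPE,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )

    async def pump_output():
        while True:
            data = await proc.stdout.read(1024)
            if not data:
                break
            await ws.send(data)  # shell output -> browser terminal (hterm)

    async def pump_input():
        async for message in ws:  # keystrokes from the browser terminal
            data = message if isinstance(message, bytes) else message.encode()
            proc.stdin.write(data)
            await proc.stdin.drain()

    await asyncio.gather(pump_output(), pump_input())

async def main():
    async with websockets.serve(shell_session, "0.0.0.0", 8765):
        await asyncio.Future()  # serve forever

asyncio.run(main())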

For interactive display sessions (VNC-like sessions in the browser) the worker has an end-point which accepts both websocket connections and ordinary GET requests for listing displays. For each GET request the worker runs a small statically linked binary that lists all the X-sessions inside the task container; the result is then transformed to JSON and returned in the response. Once the user has picked a display, a websocket connection is opened with the display identifier in the query-string. On the worker the websocket is piped to a statically linked instance of x11vnc running inside the task container. In the browser we then use noVNC to give the user an interactive remote display right in the browser.

As with the shell, there are also a few quirks to the interactive display: some graphical artifacts and other “interesting” issues. When streaming a TCP connection over a websocket we might not be handling buffering all too well, which I suspect introduces additional latency and possibly bugs. I hope these things will get better in future iterations of the worker, which is currently undergoing an experimental rewrite from node to go.

Future Work

As mentioned in the “Quick Start” section, all of this is still a bit of a moving target. Access to any loaner is effectively granted to anyone with commit level 1 or any employee. So your friends can technically hijack the interactive task you created. Obviously, we have to make that more fine-grained. At the moment, the “one-click loaner” button is also very specific to our Linux worker. As we add more platforms we will have to extend support and find a way to abstract the platform-dependent aspects. So it’s very likely that this will break on occasion.

We also recently introduced a hack defining the environment variable TASKCLUSTER_INTERACTIVE when a loaner task is created. A quick hack that we might refactor later, but for now it’s enabling Armen Zambrano to customize how the docker image used for tests runs in loaner-mode. In bug 1250904 there is on-going work to ensure that a loaner will setup the test environment, but not start running tests until a user connects and types the right command. I’m sure there are many other things we can do to make the task environment more useful in loaner-mode, but this is certainly a good start.

Anyways, much of this is still quick hacks, with rough edges that need to be resolved. So don’t be surprised if it breaks while we improve stability and attempt to add support for multiple platforms. With a bit of time and resources I’m fairly confident that the “one-click loaner” flow could become the preferred method for debugging issues specific to the test environment.

March 01, 2016 06:02 AM

February 24, 2016

John Ford

cloud-mirror – Platform Engineering Operations Project of the Month

Hello from Platform Engineering Operations! Once a month we highlight one of our projects to help the Mozilla community discover a useful tool or an interesting contribution opportunity. This month's project is our cloud-mirror.

The cloud-mirror is something that we've written to reduce the cost and time of inter-region S3 transfers. Cloud-mirror was designed for use in the Taskcluster system, but it is possible to run it independently. Taskcluster, which is the new automation environment for Mozilla, can support passing artifacts between dependent tasks. An example of this is that when we do a build, we want to make the binaries available to the test machines. We originally hosted all of our artifacts in a single AWS region. This meant that every time a test was done in a region outside of the main region, we would incur an inter-region transfer for each test run. This is expensive and slow compared to in-region transfers.

We decided that a better idea would be to transfer the data from the main region to the other regions the first time it was requested in that region and then have all subsequent requests be inside of the region. This means that for the small overhead of an extra in-region copy of the file, we lose the cost and time overhead of doing inter-region transfers every single time.

Here's an example. We use us-west-2 as our main region for storing artifacts. A test machine in eu-central-1 requires "firefox-50.tar.bz2" for use in a test. The test machine in eu-central-1 will ask cloud mirror for this file. Since this is the first test to request this artifact in eu-central-1, cloud mirror will first copy "firefox-50.tar.bz2" into eu-central-1 then redirect to the copy of that file in eu-central-1. The second test machine in eu-central-1 will then ask for a copy of "firefox-50.tar.bz2" and because it's already in the region, the cloud mirror will immediately redirect to the eu-central-1 copy.
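
The decision logic boils down to "copy once per region, then always redirect". Here is a simplified sketch of that idea; the real service is written in Node, and the helper names, bucket naming scheme and expiry below are assumptions rather than cloud-mirror's actual implementation:

# Simplified sketch of the redirect-or-copy decision, using Redis to remember
# which artifacts have already been mirrored into a region.
import redis

CACHE = redis.Redis()
SOURCE_REGION = "us-west-2"

def regional_url(region, key):
    # Hypothetical per-region bucket naming scheme.
    return f"https://s3-{region}.amazonaws.com/cloud-mirror-{region}/{key}"

def handle_request(region, key, copy_to_region):
    cache_key = f"{region}:{key}"
    if not CACHE.exists(cache_key):
        # First request for this artifact in this region: do one copy...
        copy_to_region(SOURCE_REGION, region, key)
        CACHE.set(cache_key, 1, ex=30 * 24 * 3600)  # expire to bound storage costs
    # ...then this and every later request just redirects to the regional copy.
    return "302 Found", {"Location": regional_url(region, key)}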

We expire artifacts from the destination regions so that we don't incur too high storage costs. We also use a redis cache configured to expire keys which have been used least recently first. Cloud mirror is written with Node 5 and uses Redis for storage. We use the upstream aws-sdk library for doing our S3 operations.

We're in the process of deploying this system to replace our original implementation, called 's3-copy-proxy'. This earlier version was a much simpler take on the idea which we've been using in production. One of the main reasons for the rewrite was to be able to abstract the core concepts to allow anyone to write a backend for their storage type, as well as to support more AWS regions and move towards a completely HTTPS-based chain.

If this is a project that's interesting to you, we have lots of ways that you could contribute! Here are some:

  • Switch polling for pending copy operations to use redis's pub/sub features
  • Write an Azure or GCE storage backend
  • Modify the API to determine which cloud storage pool a request should be redirected to instead of having to encode that into the route
  • Write a localhost storage backend for testing that serves content on 127.0.0.1

If you have any ideas or find some bugs in this system, please open an issue at https://github.com/taskcluster/cloud-mirror/issues. For the time being, you will need an AWS account to run our integration tests (`npm test`). We would love to have a storage backend that allows running the non-service-specific portions of the system without any extra permissions.

If you're interested in contributing, please ping me (jhford) in #taskcluster on irc.mozilla.org.

For more information about all Platform Ops projects, visit our wiki. If you're interested in helping out, http://ateam-bootcamp.readthedocs.org/en/latest/guide/index.html has resources for getting started.

February 24, 2016 03:13 PM

February 16, 2016

Maja Frydrychowicz

First Experiment with TaskCluster

TaskCluster is a new-ish continuous integration system made at Mozilla. It manages the scheduling and execution of tasks based on a graph of their dependencies. It’s a general CI tool, and could be used for any kind of job, not just Mozilla things.

However, the example I describe here refers to a Mozilla-centric use case of TaskCluster [1]: tasks are run per check-in on the branches of Mozilla’s Mercurial repository and then results are posted to Treeherder. For now, the tasks can be configured to run in Docker images (Linux), but other platforms are in the works [2].

So, I want to schedule a task! I need to add a new task to the task graph that’s created for each revision submitted to hg.mozilla.org. (This is part of my work on deploying a suite of tests for the Marionette Python test runner, i.e. testing the test harness itself.)

The rest of this post describes what I learned while making this work-in-progress.

There are builds and there are tests

mozilla-taskcluster operates based on the info under testing/taskcluster/tasks in Mozilla’s source tree, where there are yaml files that describe tasks. Specific tasks can inherit common configuration options from base yaml files.

The yaml files are organized into two main categories of tasks: builds and tests. This is just a convention in mozilla-taskcluster about how to group task configurations; TC itself doesn’t actually know or care whether a task is a build or a test.

The task I’m creating doesn’t quite fit into either category: it runs harness tests that just exercise the Python runner code in marionette_client, so I only need a source checkout, not a Firefox build. I’d like these tests to run quickly without having to wait around for a build. Another example of such a task is the recently-created ESLint task.

Scheduling a task

Just adding a yaml file that describes your new task under testing/taskcluster/tasks isn’t enough to get it scheduled: you must also add it to the list of tasks in base_jobs.yml, and define an identifier for your task in base_job_flags.yml. This identifier is used in base_jobs.yml, and also by people who want to run your task when pushing to try.

How does scheduling work? First a decision task generates a task graph, which describes all the tasks and their relationships. More precisely, it looks at base_jobs.yml and other yaml files in testing/taskcluster/tasks and spits out a json artifact, graph.json [3]. Then, graph.json gets sent to TC’s createTask endpoint, which takes care of the actual scheduling.

In the excerpt below, you can see a task definition with a requires field and you can recognize a lot of fields that are in common with the ‘task’ section of the yaml files under testing/taskcluster/tasks/.

{
"tasks": [
    {
      "requires": [
        // id of a build task that this task depends on
        "fZ42HVdDQ-KFFycr9PxptA"  
      ], 
      "task": {
        "taskId": "c2VD_eCgQyeUDVOjsmQZSg"
        "extra": {
          "treeherder": {
              "groupName": "Reftest", 
              "groupSymbol": "tc-R", 
          }, 
        }, 
        "metadata": {
          "description": "Reftest test run 1", 
          "name": "[TC] Reftest", 
        //...
  ]
}

For now at least, a major assumption in the task-graph creation process seems to be that test tasks can depend on build tasks and build tasks don’t really [4] depend on anything. So:

  • If you want your tasks to run for every push to a Mozilla hg branch, add it to the list of builds in base_jobs.yml.
  • If you want your task to run after certain build tasks succeed, add it to the list of tests in base_jobs.yml and specify which build tasks it depends on.
  • Other than the above, I don’t see any way to specify a dependency between task A and task B in testing/taskcluster/tasks.

So, I added marionette-harness under builds. Recall, my task isn’t a build task, but it doesn’t depend on a build, so it’s not a test, so I’ll treat it like a build.

# in base_job_flags.yml
builds:
  # ...
  - marionette-harness

# in base_jobs.yml
builds:
  # ...
  marionette-harness:
    platforms:
      - Linux64
    types:
      opt:
        task: tasks/tests/harness_marionette.yml

This will allow me to trigger my task with the following try syntax: try: -b o -p marionette-harness. Cool.

Make your task do stuff

Now I have to add some stuff to tasks/tests/harness_marionette.yml. Many of my choices here are based on the work done for the ESLint task. I created a base task called harness_test.yml by mostly copying bits and pieces from the basic build task, build.yml and making a few small changes. The actual task, harness_marionette.yml inherits from harness_test.yml and defines specifics like Treeherder symbols and the command to run.

The command

The heart of the task is in task.payload.command. You could chain a bunch of shell commands together directly in this field of the yaml file, but it’s better not to. Instead, it’s common to call a TaskCluster-friendly shell script that’s available in your task’s environment. For example, the desktop-test docker image has a script called test.sh through which you can call the mozharness script for your tests. There’s a similar build.sh script on desktop-build. Both of these scripts depend on environment variables set elsewhere in your task definition, or in the Docker image used by your task. The environment might also provide utilities like tc-vcs, which is used for checking out source code.

# in harness_marionette.yml
payload:
  command:
    - bash
    - -cx
    - >
        tc-vcs checkout ./gecko {{base_repository}} {{head_repository}} {{head_rev}} {{head_ref}} &&
        cd gecko &&
        ./mach marionette-harness-test

My task’s payload.command should be moved into a custom shell script, but for now it just chains together the source checkout and a call to mach. It’s not terrible of me to use mach in this case because I expect my task to work in a build environment, but most tests would likely call mozharness.

Configuring the task’s environment

Where should the task run? What resources should it have access to? This was probably the hardest piece for me to figure out.

docker-worker

My task will run in a docker image using a docker-worker [5]. The image, called desktop-build, is defined in-tree under testing/docker. There are many other images defined there, but I only considered desktop-build versus desktop-test. I opted for desktop-build because desktop-test seems to contain mozharness-related stuff that I don’t need for now.

# harness_test.yml
image:
   type: 'task-image'
   path: 'public/image.tar'
   taskId: '{{#task_id_for_image}}desktop-build{{/task_id_for_image}}'

The image is stored as an artifact of another TC task, which makes it a ‘task-image’. Which artifact? The default is public/image.tar. Which task do I find the image in? The magic incantation '{{#task_id_for_image}}desktop-build{{/task_id_for_image}}' somehow [6] obtains the correct ID, and if I look at a particular run of my task, the above snippet does indeed get populated with an actual taskId.

"image": {
  "path": "public/image.tar",
  // Mystery task that makes a desktop-build image for us. Thanks, mystery task!
  "taskId": "aqt_YdmkTvugYB5b-OvvJw", 
  "type": "task-image"
}

Snooping around in the handy Task Inspector, I found that the magical mystery task is defined in image.yml and runs build_image.sh. Fun. It’s also quite convenient to define and test your own custom image.

Other details that I mostly ignored

# in harness_test.yml
scopes:
  # Nearly all of our build tasks use tc-vcs
  - 'docker-worker:cache:level-{{level}}-{{project}}-tc-vcs'
cache:
   # The taskcluster-vcs tooling stores the large clone caches in this
   # directory and will reuse them for new requests this saves about 20s~
   # and is the most generic cache possible.
   level-{{level}}-{{project}}-tc-vcs: '/home/worker/.tc-vcs'

  • Routes allow your task to be looked up in the task index. This isn’t necessary in my case so I just omitted routes altogether.
  • Scopes are permissions for your tasks, and I just copied the scope that is used for checking out source code.
  • workerType is a configuration for managing the workers that run tasks. To me, this was a choice between b2gtest and b2gbuild, which aren’t specific to b2g anyway. b2gtest is more lightweight, I hear, which suits my harness-test task fine.
  • I had to include a few dummy values under extra in harness_test.yml, like build_name, just because they are expected in build tasks. I don’t use these values for anything, but my task fails to run if I don’t include them.

Yay for trial and error

  • If you have syntax errors in your yaml, the Decision task will fail. If this happens during a try push, look under Job Details > Inspect Task to find useful error messages.
  • Iterating on your task is pretty easy. Aside from pushing to try, you can run tasks locally using vagrant and you can build a task graph locally as well with mach taskcluster-graph.

Resources

Blog posts from other TaskCluster users at Mozilla:

There is lots of great documentation at docs.taskcluster.net, but these sections were especially useful to me:

Acknowledgements

Thanks to dustin, pmoore and others for corrections and feedback.


  1. This is accomplished in part thanks to mozilla-taskcluster, a service that links Mozilla’s hg repo to TaskCluster and creates each decision task. More at TaskCluster at Mozilla 

  2. Run tasks on any platform thanks to generic worker 

  3. To look at a graph.json artifact, go to Treeherder, click a green ‘D’ job, then Job details > Inspect Task, where you should find a list of artifacts. 

  4. It’s not really true that build tasks don’t depend on anything. Any task that uses a task-image depends on the task that creates the image. I’m sorry for saying ‘task’ five times in every sentence, by the way. 

  5. …as opposed to a generic worker

  6. {{#task_id_for_image}} is an example of a predefined variable that we can use in our TC yaml files. Where do they come from? How do they get populated? I don’t know. 

February 16, 2016 05:00 AM

October 12, 2015

John Ford

Splitting out taskcluster-base into component libraries

The intended audience of this post is people who either work with taskcluster-base now or are interested in implementing taskcluster services in the future.

Taskcluster serverside components are currently built using the suite of libraries in the taskcluster-base npm package. This package is many things: config parsing, data persistence, statistics, json schema validators, pulse publishers, a rest api framework and some other useful tools. Having these all in one single package means that each time a contributor wants to hack on one part of our platform, she'll have to figure out how to install and run all of our dependencies. This is annoying when it means waiting for a libxml.so library build, but it's just about impossible for contributors who aren't on the Taskcluster platform team. You need Azure, Influx and AWS accounts to be able to run the full test suite. You also might experience confusing errors in a part of the library you're not even touching.

Additionally, we are starting to get to the point where some services must upgrade one part of taskcluster-base without upgrading other parts. This is generally frowned upon, but sometimes we just need to put a bandaid on a broken system that's being turned off soon. We deal with this currently by exporting base.Entity and base.LegacyEntity. I'd much rather we just export a single base.Entity and have people who need to keep using the old Entity library use taskcluster-lib-legacyentity directly.

We're working on fixing this! The structure of taskcluster-base is really primed and ready to be split up since it's already a bunch of independent libraries that just so happen to be collocated. The new component loader that landed was the first library to be included in taskcluster-base this way and I converted our configs and stats libraries last week.

The naming convention that we've settled on is that taskcluster libraries will be prefixed with taskcluster-lib-X. This means we have taskcluster-lib-config and taskcluster-lib-stats. We'll continue to name services taskcluster-Y, like taskcluster-auth or taskcluster-confabulator. The best way to get the current supported set of taskcluster libraries is still going to be to install the taskcluster-base npm module.

Some of our libraries are quite large and have a lot of history in them. I didn't really want to just create a new repository, copy in the files we care about and destroy the history. Instead, I wrote a simple and ugly tool (https://github.com/jhford/taskcluster-base-split) which does the pedestrian tasks involved in this split by filtering out irrelevant history for each project, moving files around and doing some preliminary cleanup work on the new library.

This tooling gets us 90% of the way to a split-out repository, but as always, a human is required to take it the last step of the way. Imports need to be fixed, dependencies must be verified and tests need to be fixed. I'm also taking this opportunity to implement babel-transpiling support in as many libraries as I can. We use babel everywhere in our application code, so it'll be nice to have it available in our platform libraries as well. I'm using the babel-runtime package instead of requiring the direct use of babel. The code produced by our babel setup is tested using the node 0.12 binary without any wrappers at all.

Having different libraries will introduce the risk of our projects having version number hell. We're still going to have a taskcluster-base npm package. This package will simply be a package.json file which specifies the supported versions of the taskcluster-lib-* packages we ship as a release and an index.js file which imports and re-exports the libraries that we provide. If we have two libraries that have codependent changes, we can land new versions in those repositories and use taskcluster-base as the synchronizing mechanism.

A couple of open questions that I'd love to get input on are how we should share package.json snippets and babel configurations. We mostly have a solution for eslint, but we'd love to be able to share as much as possible in our .babelrc configuration files. If you have a good idea for how we can do that, please get in touch!

One of the goals in doing this is to make taskcluster components easier to write. We'd love to see components written by other teams use our framework, since we know it's tested to work well with Taskcluster. It also makes it easier for the taskcluster team to advise on design and maintenance concerns.

Once a few key changes have landed, I will write a series of blog posts explaining how core taskcluster services are structured.

October 12, 2015 01:18 PM

October 09, 2015

Wander Lairson Costa

In tree tasks configuration

This post is about our plans for representing Taskcluster tasks inside the gecko tree. Jonas, Dustin and I had a discussion in Berlin about this; here I summarize what we have so far. We currently store tasks in a yaml file which is translated to json using the mach command. The syntax we have now is not the most flexible one: it is hard to parameterize tasks and very difficult to represent task relationships.

Let us illustrate the shortcomings with two problems we currently have. Both apply to B2G.

B2G (as in Android) has three different build variants: user, userdebug and eng. Each one has slightly different task configurations. As there is no flexible way to parameterize tasks, we end up with a different task file for each build variant.

When doing nightly builds, we must send update data to the OTA server. We have plans to run a build task, then run the test tasks on that build, and if all tests pass, run a task responsible for updating the OTA server. The point is that today we have no way to represent this relationship inside the task files.

For the first problem Jonas has a prototype for json parameterization. There were discussions at the Berlin work week about whether we should stick with yaml files or use Python files for task configuration. We do want to keep the syntax declarative, which favors yaml; storing configurations in Python files brings much more expressiveness and flexibility, but can result in the same configuration hell we have with Buildbot.

The second problem is more complex, and we still haven’t reached a final design. The first question is how we describe task dependencies: top-down, i.e., we specify which task(s) should run after a completed task, or bottom-up, where a task specifies which tasks it depends on. In general, we all agreed to go with a top-down syntax, since most scenarios beg for a top-down approach. The other question is whether the description of task relationships should live inside the task files or in a separate configuration file. We would like to represent task dependencies inside the task files; the problem is how to determine the root task of the task graph. One suggestion is having a task file called root.yml which only contains root tasks.

October 09, 2015 12:00 AM

October 05, 2015

Selena Deckelmann

[berlin] TaskCluster Platform: A Year of Development

Back in September, the TaskCluster Platform team held a workweek in Berlin to discuss upcoming feature development, focus on platform stability and monitoring and plan for the coming quarter’s work related to Release Engineering and supporting Firefox Release. These posts are documenting the many discussions we had there.

Jonas kicked off our workweek with a brief look back on the previous year of development.

Prototype to Production

In the last year, TaskCluster went from an idea with a few tasks running to running all of FirefoxOS aka B2G continuous integration, which is about 40 tasks per minute in the current environment.

Architecture-wise, not a lot of major changes were made. We went from CloudAMQP to Pulse (in-house RabbitMQ). And shortly, Pulse itself will be moving its backend to CloudAMQP! We introduced task statuses, and then simplified them.

On the implementation side, however, a lot changed. We added many features and addressed a ton of docker worker bugs. We killed Postgres and added Azure Table Storage. We rewrote the provisioner almost entirely, and moved to ES6. We learned a lot about babel-node.

We introduced the first alternative to the Docker worker, the Generic worker. We for the first time had Release Engineering create a worker, the Buildbot Bridge.

We have several new users of TaskCluster! Brian Anderson from Rust created a system for testing all Cargo packages for breakage against release versions. We’ve had a number of external contributors create builds for FirefoxOS devices. We’ve had a few Github-based projects jump on taskcluster-github.

Features that go beyond BuildBot

One of the goals of creating TaskCluster was to not just get feature parity, but go beyond and support exciting, transformative features to make developer use of the CI system easier and fun.

Some of the features include:

Features coming in the near future to support Release

Release is a special use case that we need to support in order to take on the Firefox production workload. The focus of development work in Q4 and beyond includes:

  • Secrets handling to support Release and ops workflows. In Q4, we should see secrets.taskcluster.net go into production and UI for roles-based management.
  • Scheduling support for coalescing, SETA and cache locality. In Q4, we’re focusing on an external data solution to support coalescing and SETA.
  • Private data hosting. In Q4, we’ll be using a roles-based solution to support these.

October 05, 2015 06:38 PM

TaskCluster Platform: 2015Q3 Retrospective

Welcome to TaskCluster Platform’s 2015Q3 Retrospective! I’ve been managing this team this quarter and thought it would be nice to look back on what we’ve done. This report covers what we did for our quarterly goals. I’ve linked to “Publications” at the bottom of this page, and we have a TaskCluster Mozilla Wiki page that’s worth checking out.

High level accomplishments

  • Dramatically improved stability of TaskCluster Platform for Sheriffs by fixing TreeHerder ingestion logic and regexes, adding better logging and fixing bugs in our taskcluster-vcs and mozilla-taskcluster components
  • Created and Deployed CI builds on three major platforms:
    • Added Linux64 (CentOS), Mac OS X cross-compiled builds as Tier2 CI builds
    • Completed and documented prototype Windows 2012 builds in AWS and task configuration
  • Deployed auth.taskcluster.net, enabling better security, better support for self-service authorization and easier contributions from outside our team
  • Added region biasing based on cost and availability of spot instances to our AWS provisioner
  • Managed the workload of two interns, and significantly mentored a third
  • Onboarded Selena as a new manager
  • Held a workweek to focus attention on bringing our environment into production support of Release Engineering

Goals, Bugs and Collaborators

We laid out our Q3 goals in this etherpad. Our chosen themes this quarter were:

  • Improve operational excellence — focus on sheriff concerns, data collection,
  • Facilitate self-serve consumption — refactoring auth and supporting roles for scopes, and
  • Exploit opportunities to differentiate from other platforms — support for interactive sessions, docker images as artifacts, github integration and more blogging/docs.

We had 139 Resolved FIXED bugs in the TaskCluster product.

Link to graph of resolved bugs

We also resolved 7 bugs in FirefoxOS, TreeHerder and RelEng products/components.

We received significant contributions from other teams: Morgan (mrrrgn) designed, created and deployed taskcluster-github; Ted deployed Mac OS X cross compiled builds; Dustin reworked the Linux TC builds to use CentOS, and resolved 11 bugs related to TaskCluster and Linux builds.

An additional 9 people contributed code to core TaskCluster, in-tree build scripts and task definitions: aus, rwood, rail, mshal, gerard-majax, mihneadb@gmail.com, htsai, cmanchester, and echen.

The Big Picture: TaskCluster integration into Platform Operations

Moving from B2G to Platform was a big shift. The team had already made a goal of enabling Firefox Release builds, but it wasn’t entirely clear how to accomplish that. We spent a lot of this quarter learning things from RelEng and prioritizing. The whole team spent the majority of our time supporting others’ use of TaskCluster through training and support, developing task configurations and resolving infrastructure problems. At the same time, we shipped docker-worker features, provisioner biasing and a new authorization system. One tricky infra issue that John and Jonas worked on early in the quarter was a strange AWS Provisioner failure that came down to an obscure missing dependency. We had a few git-related tree closures that Greg worked closely on and ultimately committed fixes to taskcluster-vcs to help resolve. Everyone spent a lot of time responding to bugs filed by the sheriffs and requests for help on IRC.

It’s hard to overstate how important the Sheriff relationship and TreeHerder work was. A couple teams had the impression that TaskCluster itself was unstable. Fixing this was a joint effort across TreeHerder, Sheriffs and TaskCluster teams.

When we finished, useful errors were finally being reported by tasks and starring became much more specific and actionable. We may have received a partial compliment on this from philor. The extent of artifact upload retries, for example, was made much clearer and we’ve prioritized fixing this in early Q4.

Both Greg and Jonas spent many weeks meeting with Ed and Cam, designing systems, fixing issues in TaskCluster components and contributing code back to TreeHerder. These meetings also led to Jonas and Cam collaborating more on API and data design, and this work is ongoing.

We had our own “intern” who was hired on as a contractor for the summer, Edgar Chen. He did some work with the docker-worker, implementing Interactive Sessions, and did analysis on our provisioner/worker efficiency. We made him give a short, sweet presentation on the interactive sessions. Edgar is now at CMU for his sophomore year and has referred at least one friend back to Mozilla to apply for an internship next summer.

Pete completed a Windows 2012 prototype build of Firefox that’s available from Try, with documentation and a completely automated process for creating AMIs. He hasn’t created a narrated video with dueling, British-English accented robot voices for this build yet.

We also invested a great deal of time in the RelEng interns. Jonas and Greg worked with Anhad on getting him productive with TaskCluster. When Anthony arrived, we also onboarded him. Jonas worked closely to get him working on a new project, hooks.taskcluster.net. To take these two bits of work from RelEng on, I pushed TaskCluster’s roadmap for generic-worker features back a quarter and Jonas pushed his stretch goal of getting the big graph scheduler into production to Q4.

We worked a great deal with other teams this quarter on taskcluster-github, supporting new Firefox and B2G builds, RRAs for the workers and generally telling Mozilla about TaskCluster.

Finally, we spent a significant amount of time interviewing, and then creating a more formal interview process that includes a coding challenge and structured-interview type questions. This is still in flux, but the first two portions are being used and refined currently. Jonas, Greg and Pete spent many hours interviewing candidates.

Berlin Work Week

TaskCluster Platform Team in Berlin

Toward the end of the quarter, we held a workweek in Berlin to focus our next round of work on critical RelEng and Release-specific features as well as production monitoring planning. Dustin surprised us with delightful laser cut acrylic versions of the TaskCluster logo for the team! All team members reported that they benefited from being in one room to discuss key designs, get immediate code review, and demonstrate work in progress.

We came out of this with 20+ detailed documents from our conversations, greater alignment on the priorities for Platform Operations and a plan for trainings and tutorials to give at Orlando. Dustin followed this up with a series of ‘TC Topics’ Vidyo sessions targeted mostly at RelEng.

Our Q4 roadmap is focused on key RelEng features to support Release.

Publications

Our team published a few blog posts and videos this quarter:

October 05, 2015 05:39 PM

Wander Lairson Costa

Running phone builds on Taskcluster

In this post I am going to talk about my work for phone builds inside the Taskcluster infrastructure. Mozilla is gradually moving from Buildbot to Taskcluster. Here I am going to give a survival guide on Firefox OS phone builds.

Submitting tasks

A task is nothing more than a JSON file containing the description of the job to execute. But you don’t need to handle the JSON directly; all tasks are written in YAML, which is then processed by the mach command. The in-tree tasks are located at testing/taskcluster/tasks and the build tasks are inside the builds/ directory.

My favorite command to try out a task is the mach taskcluster-build command. It allows you to process a single task and output the json formatted task ready for Taskcluster submission.

$ ./mach taskcluster-build \
    --head-repository=https://hg.mozilla.org/mozilla-central \
    --head-rev=tip \
    --owner=foobar@mozilla.com \
    tasks/builds/b2g_desktop_opt.yml

Although we specify a Mercurial repository, Taskcluster also accepts git repositories interchangeably.

This command will print out the task to the console output. To run the task, you can copy the generated task and paste it in the task creator tool. Then just click on Create Task to schedule it to run. Remember that you need Taskcluster Credentials to run Taskcluster tasks. If you have taskcluster-cli installed, you can then pipe the mach output to taskcluster run-task.

The tasks are effectively executed inside a docker image.

Mozharness

Mozharness is what we use to actually build stuff. Mozharness architecture, despite its code size, is quite simple. Under the scripts directory you find the harness scripts. We are specifically interested in the b2g_build.py script. As the script name says, it is responsible for B2G builds. The B2G harness configuration files are located at the b2g/config directory. Not surprisingly, all files starting with “taskcluster” are for Taskcluster-related builds.

Here are the most common configurations:

default_vcs
This is the default vcs used to clone repositories when no other is given. tc_vcs (https://tc-vcs.readthedocs.org/en/latest/) allows mozharness to clone either git or mercurial repositories transparently, with repository caching support.
default_actions
The actions to execute. They must be present and in the same order as in the build class `all_actions` attribute.
balrog_credentials_file
The credentials to send update data to the OTA server.
nightly_build
`True` if this is a nightly build.
upload
Upload info. Not used for Taskcluster.
repo_remote_mappings
Maps external repositories to the mozilla domain (https://git.mozilla.org).
env
Environment variables for commands executed inside mozharness.

The listed actions map to Python methods inside the build class, with - replaced by _. For example, the action checkout-sources maps to the method checkout_sources. That’s where the mozharness simplicity comes from: everything boils down to a sequence of method calls; that’s it, no secrets.
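
To make that concrete, here is a contrived sketch of the dispatch pattern just described. This is not the real mozharness code; the class name and actions are made up purely for illustration:

# A contrived sketch of action-to-method dispatch; not the actual
# mozharness implementation (class and action names are made up).
class FakeB2GBuild:
    # every supported action, in order, like the build class `all_actions`
    all_actions = ['clobber', 'checkout-sources', 'build', 'upload-files']

    def __init__(self, default_actions=None):
        self.default_actions = default_actions or self.all_actions

    def run(self):
        for action in self.default_actions:
            # 'checkout-sources' -> self.checkout_sources()
            getattr(self, action.replace('-', '_'))()

    def clobber(self):
        print('clobbering...')

    def checkout_sources(self):
        print('checking out sources...')

    def build(self):
        print('building...')

    def upload_files(self):
        print('uploading files...')


# run only a subset of the actions, in order
FakeB2GBuild(['checkout-sources', 'build']).run()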

For example, here is how you run mozharness to build a flame image:

python <gecko-dir>/testing/mozharness/scripts/b2g_build.py \
  --config b2g/taskcluster-phone.py \
  --disable-mock \
  --variant=user \
  --work-dir=B2G \
  --gaia-languages-file locales/languages_all.json \
  --log-level=debug \
  --target=flame-kk \
  --b2g-config-dir=flame-kk \
  --repo=https://hg.mozilla.org/mozilla-central

Remember you need your flame connected to the machine so the build system can extract the blobs.

In general you don’t need to worry about the mozharness command line because it is wrapped by the build scripts.

Hacking Taskcluster B2G builds

All Taskcluster tasks run inside a docker container. Desktop and emulator B2G builds run inside the builder docker image. Phone builds are more complex, because:

  1. Mozilla is not allowed to publicly redistribute phone binaries.

  2. Phone build tasks need to access the Balrog server to send OTA update data.

  3. Phone build tasks need to upload symbols to the crash reporter.

Due to (1), only users authenticated with a @mozilla account are allowed to download phone binaries (this works the same way as private builds). And because of (1), (2) and (3), the phone-builder docker image is secret, so only authorized users can submit tasks to it.

If you need to create a build task for a new phone, most of the time you will start from an existing task (Flame and Aries tasks are preferred) and then make your customizations. You might need to add new features to the build scripts, which currently are not the most flexible scripts around.

If you need to customize mozharness, make sure your changes are Python 2.6 compatible, because mozharness is used to run Buildbot builds too, and the Buildbot machines run Python 2.6. The best way to minimize risk of breaking stuff is to submit your patches to try with “-p all -b do” flags.

Need help? Ask at the #taskcluster channel.

October 05, 2015 12:00 AM

September 30, 2015

Pete Moore

Building Firefox for Windows™ on Try using TaskCluster

Firefox on Windows screenshot

Try them out for yourself!

Here are the try builds we have created. They were built from the official in-tree mozconfigs that we use for the builds running in Buildbot.

Set up your own Windows™ Try tasks

We are porting over all of Mozilla’s CI tasks to TaskCluster, including Windows™ builds and tests.

Currently Windows™ and OS X tasks still run on our legacy Buildbot infrastructure. This is about to change.

In this post, I am going to talk you through how I set up Firefox Desktop builds in TaskCluster on Try. In future, the TaskCluster builds should replace the existing Buildbot builds, even for releases. Getting them running on Try was the first in a long line of many steps.

Spoiler alert: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fc4b30cc56fb

Using the right Worker

In TaskCluster, Linux tasks run in a docker container. This doesn’t work on Windows, so we needed a different strategy.

TaskCluster defines the role of a Worker as a component that is able to claim tasks from the Queue, execute them, publish artifacts, and report back status to the Queue.

For Linux, we have the Docker Worker. This is the component that takes care of executing Linux tasks inside a docker container. Since everything takes place in a container, consecutive tasks cannot interfere with each other, and you are guaranteed a clean environment.

This year I have been working on the Generic Worker. This takes care of running TaskCluster tasks on other platforms.

For Windows, we have a different isolation strategy: since we cannot yet easily run inside a container, the Generic Worker will create a new Windows user for each task it runs.

This user will have its own home directory, and will not have privileged access to the host OS. This means it should not be able to make any persistent changes to the host OS that will outlive the lifetime of the task. The user is only able to affect HKEY_CURRENT_USER registry settings and write to its home folder, both of which are purged after task completion.

In other words, although not running in a container, the Generic Worker offers isolation to TaskCluster tasks by virtue of running each task as a different, custom created OS user with limited privileges.

Creating a Worker Type

TaskCluster considers a Worker Type as an entity which belongs to a Provisioner, and represents a host environment and hardware context for running one or more Workers. This is the Worker Type that I set up:

{
  "workerType": "win2012r2",
  "minCapacity": 0,
  "maxCapacity": 4,
  "scalingRatio": 0,
  "minPrice": 0.5,
  "maxPrice": 2,
  "canUseOndemand": false,
  "canUseSpot": true,
  "instanceTypes": [
    {
      "instanceType": "m3.2xlarge",
      "capacity": 1,
      "utility": 1,
      "secrets": {},
      "scopes": [],
      "userData": {},
      "launchSpec": {}
    }
  ],
  "regions": [
    {
      "region": "us-west-2",
      "secrets": {},
      "scopes": [],
      "userData": {},
      "launchSpec": {
        "ImageId": "ami-db657feb"
      }
    }
  ],
  "lastModified": "2015-09-30T10:15:30.349Z",
  "userData": {},
  "launchSpec": {
    "SecurityGroups": [
      "rdp-only"
    ]
  },
  "secrets": {},
  "scopes": [
    "*"
  ]
}

Not everybody has permission to create worker types - but there again, you only really need to do this if you are:

  • using Windows (or anything else non-linux)
  • not able to use an existing worker type

If you would like to create a new Worker Type, please contact the taskcluster team on irc.mozilla.org in #taskcluster channel.

The Worker Type above boils down to some AWS hardware specs, and an ImageId ami-db657feb. But where did this come from?

Generating the AMI for the Worker Type

It is a Windows 2012 R2 AMI, and it was generated with this code checked in to the try branch. This is not automatically run, but is checked in for reference purposes.

Here is the code. The first is a script that creates the AMI:

#!/bin/bash -exv

# cd into directory containing script...
cd "$(dirname "${0}")"

# generate a random slugid for aws client token...
# you need either go installed (https://golang.org/) and $GOPATH configured to run this,
# or alternatively download the 'slug' binary; see
# http://taskcluster.github.io/slugid-go/#installing-command-line-tool
go get github.com/taskcluster/slugid-go/slug
SLUGID=$("${GOPATH}/bin/slug")

# aws cli docs lie, they say userdata must be base64 encoded, but cli encodes for you, so just cat it...
USER_DATA="$(cat aws_userdata)"

# create base ami, and apply user-data
# filter output, to get INSTANCE_ID
# N.B.: ami-4dbcb67d referenced below is *the* Windows 2012 Server R2 ami offered by Amazon in us-west-2 - it is nothing we have made
# note, you'll need aws tool installed, access to the taskcluster AWS account, and your own private key file
INSTANCE_ID="$(aws --region us-west-2 ec2 run-instances --image-id ami-4dbcb67d --key-name pmoore-oregan-us-west-2 --security-groups "RDP only" --user-data "${USER_DATA}" --instance-type c4.2xlarge --block-device-mappings DeviceName=/dev/sda1,Ebs='{VolumeSize=75,DeleteOnTermination=true,VolumeType=gp2}' --instance-initiated-shutdown-behavior terminate --client-token "${SLUGID}" | sed -n 's/^ *"InstanceId": "\(.*\)", */\1/p')"

# sleep an hour, the installs take forever...
sleep 3600

# now capture the AMI - feel free to change the tags
IMAGE_ID="$(aws --region us-west-2 ec2 create-image --instance-id "${INSTANCE_ID}" --name "win2012r2 mozillabuild pmoore version ${SLUGID}" --description "firefox desktop builds on windows - taskcluster worker - version ${SLUGID}" | sed -n 's/^ *"ImageId": *"\(.*\)" *$/\1/p')"

# TODO: now update worker type...
# You must update the existing win2012r2 worker type with the new ami id generated ($IMAGE_ID var above)
# At the moment this is a manual step! It can be automated following the docs:
# http://docs.taskcluster.net/aws-provisioner/api-docs/#workerType
# http://docs.taskcluster.net/aws-provisioner/api-docs/#updateWorkerType

echo "Worker type ami to be used: '${IMAGE_ID}' - don't forget to update https://tools.taskcluster.net/aws-provisioner/#win2012r2/edit"' !!!'

This script works by exploiting the fact that when you spawn a Windows instance in AWS, using one of the AMIs that Amazon provides, you can include a Powershell snippet for additional setup. This gets executed automatically when you spawn the instance.

So we simply spawn an instance, passing through this powershell snippet, and then wait. A LONG time (an hour). And then we snapshot the image, and we have our new AMI. Simple!
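
If you prefer Python to the aws CLI, the same flow can be sketched with boto3. This is an assumption on my part (the checked-in script only uses the CLI); the AMI, key name and security group below are the same placeholders used in the script above:

import time
import boto3

ec2 = boto3.client('ec2', region_name='us-west-2')

# the <powershell> user-data snippet shown below
with open('aws_userdata') as f:
    user_data = f.read()

# spawn the stock Windows 2012 R2 AMI with our user-data attached
reservation = ec2.run_instances(
    ImageId='ami-4dbcb67d',
    InstanceType='c4.2xlarge',
    KeyName='pmoore-oregan-us-west-2',
    SecurityGroups=['RDP only'],
    UserData=user_data,
    MinCount=1,
    MaxCount=1,
    InstanceInitiatedShutdownBehavior='terminate',
)
instance_id = reservation['Instances'][0]['InstanceId']

# sleep an hour, the installs take forever...
time.sleep(3600)

# snapshot the instance into a new AMI
image = ec2.create_image(
    InstanceId=instance_id,
    Name='win2012r2 mozillabuild worker',
    Description='firefox desktop builds on windows - taskcluster worker',
)
print("Worker type ami to be used: '%s'" % image['ImageId'])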

Here is the Powershell snippet that it uses:

<powershell>

# needed for making http requests
$client = New-Object system.net.WebClient
$shell = new-object -com shell.application

# utility function to download a zip file and extract it
function Expand-ZIPFile($file, $destination, $url)
{
    $client.DownloadFile($url, $file)
    $zip = $shell.NameSpace($file)
    foreach($item in $zip.items())
    {
        $shell.Namespace($destination).copyhere($item)
    }
}

# allow powershell scripts to run
Set-ExecutionPolicy Unrestricted -Force -Scope Process

# install chocolatey package manager
Invoke-Expression ($client.DownloadString('https://chocolatey.org/install.ps1'))

# download mozilla-build installer
$client.DownloadFile("https://api.pub.build.mozilla.org/tooltool/sha512/03b4ca2bebede21a29f739165030bfc7058a461ffe38113452e976193e382d3ba6df8a48ac843b70429e23481e6327f43c86ffd88e4ce16263d072ef7e14e692", "C:\MozillaBuildSetup-2.0.0.exe")

# run mozilla-build installer in silent (/S) mode
$p = Start-Process "C:\MozillaBuildSetup-2.0.0.exe" -ArgumentList "/S" -wait -NoNewWindow -PassThru -RedirectStandardOutput "C:\MozillaBuild-2.0.0_install.log" -RedirectStandardError "C:\MozillaBuild-2.0.0_install.err"

# install Windows SDK 8.1
choco install -y windows-sdk-8.1

# install Visual Studio community edition 2013
choco install -y visualstudiocommunity2013
# $client.DownloadFile("https://go.microsoft.com/fwlink/?LinkId=532495&clcid=0x409", "C:\vs_community.exe")

# install June 2010 DirectX SDK for compatibility with Win XP
$client.DownloadFile("http://download.microsoft.com/download/A/E/7/AE743F1F-632B-4809-87A9-AA1BB3458E31/DXSDK_Jun10.exe", "C:\DXSDK_Jun10.exe")

# prerequisite for June 2010 DirectX SDK is to install ".NET Framework 3.5 (includes .NET 2.0 and 3.0)"
Install-WindowsFeature NET-Framework-Core -Restart

# now run DirectX SDK installer
$p = Start-Process "C:\DXSDK_Jun10.exe" -ArgumentList "/U" -wait -NoNewWindow -PassThru -RedirectStandardOutput C:\directx_sdk_install.log -RedirectStandardError C:\directx_sdk_install.err

# install PSTools
md "C:\PSTools"
Expand-ZIPFile -File "C:\PSTools\PSTools.zip" -Destination "C:\PSTools" -Url "https://download.sysinternals.com/files/PSTools.zip"

# install nssm
Expand-ZIPFile -File "C:\nssm-2.24.zip" -Destination "C:\" -Url "http://www.nssm.cc/release/nssm-2.24.zip"

# download generic-worker
md "C:\generic-worker"
$client.DownloadFile("https://github.com/taskcluster/generic-worker/releases/download/v1.0.12/generic-worker-windows-amd64.exe", "C:\generic-worker\generic-worker.exe")

# enable DEBUG logs for generic-worker install
$env:DEBUG = "*"

# install generic-worker
$p = Start-Process "C:\generic-worker\generic-worker.exe" -ArgumentList "install --config C:\\generic-worker\\generic-worker.config" -wait -NoNewWindow -PassThru -RedirectStandardOutput C:\generic-worker\install.log -RedirectStandardError C:\generic-worker\install.err

# add extra config needed
$config = [System.Convert]::FromBase64String("UEsDBAoAAAAAAA2hN0cIOIW2JwAAACcAAAAJAAAAZ2FwaS5kYXRhQUl6YVN5RC1zLW1YTDRtQnpGN0tNUmtoVENJYkcyUktuUkdYekpjUEsDBAoAAAAAACehN0cVjoCGIAAAACAAAAAVAAAAY3Jhc2gtc3RhdHMtYXBpLnRva2VuODhmZjU3ZDcxMmFlNDVkYmJlNDU3NDQ1NWZjYmNjM2VQSwMECgAAAAAANKE3RxYFa6ViAAAAYgAAABQAAABnb29nbGUtb2F1dGgtYXBpLmtleTE0NzkzNTM0MzU4Mi1qZmwwZTBwc2M3a2gxbXV0MW5mdGI3ZGUwZjFoMHJvMC5hcHBzLmdvb2dsZXVzZXJjb250ZW50LmNvbSBLdEhDRkNjMDlsdEN5SkNqQ3dIN1pKd0cKUEsDBAoAAAAAAEShN0ctdLepZAAAAGQAAAAYAAAAZ29vZ2xlLW9hdXRoLWFwaS5rZXlfYmFr77u/MTQ3OTM1MzQzNTgyLWpmbDBlMHBzYzdraDFtdXQxbmZ0YjdkZTBmMWgwcm8wLmFwcHMuZ29vZ2xldXNlcmNvbnRlbnQuY29tIEt0SENGQ2MwOWx0Q3lKQ2pDd0g3Wkp3R1BLAwQKAAAAAABYoTdHJ3EEFiQAAAAkAAAADwAAAG1vemlsbGEtYXBpLmtleTNiNGQyN2RkLTcwM2QtNDA5NC04Mzk4LTRkZTJjNzYzNTA1YVBLAwQKAAAAAABkoTdHMi/H2yQAAAAkAAAAHgAAAG1vemlsbGEtZGVza3RvcC1nZW9sb2MtYXBpLmtleTdlNDBmNjhjLTc5MzgtNGM1ZC05Zjk1LWU2MTY0N2MyMTNlYlBLAwQKAAAAAABxoTdHJ3EEFiQAAAAkAAAAHQAAAG1vemlsbGEtZmVubmVjLWdlb2xvYy1hcGkua2V5M2I0ZDI3ZGQtNzAzZC00MDk0LTgzOTgtNGRlMmM3NjM1MDVhUEsDBBQAAAAIAHyhN0fa715hagAAAHMAAAANAAAAcmVsZW5nYXBpLnRva0ut9MpIck/O9M/08gyt8jT0y/Sy1Eut9CpINvYFCVZGhnhm+jh7Faa4Z4P4Br4QvkFqhCOIX56ca5CZFqiXU5VoWeaSm20S6eblE+rpXJDiFxoRVBphnFFZUmrpkphd7m4aVWXsFxQeCABQSwECHgMKAAAAAAANoTdHCDiFticAAAAnAAAACQAAAAAAAAABAAAApIEAAAAAZ2FwaS5kYXRhUEsBAh4DCgAAAAAAJ6E3RxWOgIYgAAAAIAAAABUAAAAAAAAAAQAAAKSBTgAAAGNyYXNoLXN0YXRzLWFwaS50b2tlblBLAQIeAwoAAAAAADShN0cWBWulYgAAAGIAAAAUAAAAAAAAAAEAAACkgaEAAABnb29nbGUtb2F1dGgtYXBpLmtleVBLAQIeAwoAAAAAAEShN0ctdLepZAAAAGQAAAAYAAAAAAAAAAEAAACkgTUBAABnb29nbGUtb2F1dGgtYXBpLmtleV9iYWtQSwECHgMKAAAAAABYoTdHJ3EEFiQAAAAkAAAADwAAAAAAAAABAAAApIHPAQAAbW96aWxsYS1hcGkua2V5UEsBAh4DCgAAAAAAZKE3RzIvx9skAAAAJAAAAB4AAAAAAAAAAQAAAKSBIAIAAG1vemlsbGEtZGVza3RvcC1nZW9sb2MtYXBpLmtleVBLAQIeAwoAAAAAAHGhN0cncQQWJAAAACQAAAAdAAAAAAAAAAEAAACkgYACAABtb3ppbGxhLWZlbm5lYy1nZW9sb2MtYXBpLmtleVBLAQIeAxQAAAAIAHyhN0fa715hagAAAHMAAAANAAAAAAAAAAEAAACkgd8CAAByZWxlbmdhcGkudG9rUEsFBgAAAAAIAAgAEQIAAHQDAAAAAA==")
md "C:\builds"
Set-Content -Path "C:\builds\config.zip" -Value $config -Encoding Byte
$zip = $shell.NameSpace("C:\builds\config.zip")
foreach($item in $zip.items())
{
    $shell.Namespace("C:\builds").copyhere($item)
}
rm "C:\builds\config.zip"

# initial clone of mozilla-central
$p = Start-Process "C:\mozilla-build\python\python.exe" -ArgumentList "C:\mozilla-build\python\Scripts\hg clone -u null https://hg.mozilla.org/mozilla-central C:\gecko" -wait -NoNewWindow -PassThru -RedirectStandardOutput "C:\hg_initial_clone.log" -RedirectStandardError "C:\hg_initial_clone.err"

</powershell>

Hopefully this Powershell script is quite self-explanatory. It installs the required build tool chains for building Firefox Desktop, and then installs the parts it needs for running the Generic Worker on this instance. It sets up some additional config that is needed by the build process, and then takes an initial clone of mozilla-central, as an optimisation, so that future jobs only need to pull changes since the image was created.

The caching strategy is to have a clone of mozilla-central live under C:\gecko, which is updated with an hg pull from mozilla central each time a job runs. Then when a task needs to pull from try, it is only ever a few commits behind, and should pull updates very quickly.

Defining Tasks

Once we have our AMI created, and we’ve published our Worker Type, we need to submit tasks to get the Provisioner to spawn instances in AWS, and execute our tasks.

The next piece of the puzzle is working out how to get these jobs added to Try. Again, luckily for us, this is just a matter of in-tree config.

For this, most of the magic exists in testing/taskcluster/tasks/builds/firefox_windows_base.yml:

$inherits:
  from: 'tasks/windows_build.yml'
  variables:
    build_product: 'firefox'

task:
  metadata:
    name: "[TC] Firefox {{arch}} ({{build_type}})"
    description: Firefox {{arch}} {{build_type}}

  payload:
    env:
      ExtensionSdkDir: "C:\\Program Files (x86)\\Microsoft SDKs\\Windows\\v8.1\\ExtensionSDKs"
      Framework40Version: "v4.0"
      FrameworkDir: "C:\\Windows\\Microsoft.NET\\Framework64"
      FrameworkDIR64: "C:\\Windows\\Microsoft.NET\\Framework64"
      FrameworkVersion: "v4.0.30319"
      FrameworkVersion64: "v4.0.30319"
      FSHARPINSTALLDIR: "C:\\Program Files (x86)\\Microsoft SDKs\\F#\\3.1\\Framework\\v4.0\\"
      INCLUDE: "C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\INCLUDE;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\ATLMFC\\INCLUDE;C:\\Program Files (x86)\\Windows Kits\\8.1\\include\\shared;C:\\Program Files (x86)\\Windows Kits\\8.1\\include\\um;C:\\Program Files (x86)\\Windows Kits\\8.1\\include\\winrt;"
      MOZBUILD_STATE_PATH: "C:\\Users\\Administrator\\.mozbuild"
      MOZ_MSVCVERSION: "12"
      MOZ_MSVCYEAR: "2013"
      MOZ_TOOLS: "C:\\mozilla-build\\moztools-x64"
      MSVCKEY: "HKLM\\SOFTWARE\\Wow6432Node\\Microsoft\\VisualStudio\\12.0\\Setup\\VC"
      SDKDIR: "C:\\Program Files (x86)\\Windows Kits\\8.1\\"
      SDKMINORVER: "1"
      SDKPRODUCTKEY: "HKLM\\SOFTWARE\\Microsoft\\Windows Kits\\Installed Products"
      SDKROOTKEY: "HKLM\\SOFTWARE\\Microsoft\\Windows Kits\\Installed Roots"
      SDKVER: "8"
      VCDIR: "C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\"
      VCINSTALLDIR: "C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\"
      VisualStudioVersion: "12.0"
      VSINSTALLDIR: "C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\"
      WIN64: "1"
      WIN81SDKKEY: "{5247E16E-BCF8-95AB-1653-B3F8FBF8B3F1}"
      WINCURVERKEY: "HKLM\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion"
      WindowsSdkDir: "C:\\Program Files (x86)\\Windows Kits\\8.1\\"
      WindowsSDK_ExecutablePath_x64: "C:\\Program Files (x86)\\Microsoft SDKs\\Windows\\v8.1A\\bin\\NETFX 4.5.1 Tools\\x64\\"
      WindowsSDK_ExecutablePath_x86: "C:\\Program Files (x86)\\Microsoft SDKs\\Windows\\v8.1A\\bin\\NETFX 4.5.1 Tools\\"
      MACHTYPE: "i686-pc-msys"
      MAKE_MODE: "unix"
      MOZBUILDDIR: "C:\\mozilla-build"
      MOZILLABUILD: "C:\\mozilla-build"
      MOZ_AUTOMATION: "1"
      MOZ_BUILD_DATE: "19770819000000"
      MOZ_CRASHREPORTER_NO_REPORT: "1"
      MSYSTEM: "MINGW32"

    command:
      - "time /t && set"
      - "time /t && hg -R C:\\gecko pull"
      - "time /t && hg clone C:\\gecko src"
      - "time /t && mkdir public\\build"
      - "time /t && set UPLOAD_HOST=localhost"
      - "time /t && set UPLOAD_PATH=%CD%\\public\\build"
      - "time /t && cd src"
      - "time /t && hg pull -r %GECKO_HEAD_REV% -u %GECKO_HEAD_REPOSITORY%"
      - "time /t && set MOZCONFIG=%CD%\\{{mozconfig}}"
      - "time /t && set SRCSRV_ROOT=%GECKO_HEAD_REPOSITORY%"
      - "time /t && C:\\mozilla-build\\msys\\bin\\bash --login %CD%\\mach build"

    artifacts:

      # In the next few days I plan to provide support for directory artifacts,
      # so this explicit list will no longer be needed, and you can specify the
      # following:
      # -
      #   type: "directory"
      #   path: "public\\build"
      #   expires: '{{#from_now}}1 year{{/from_now}}'
      #
      #  This will be done in early October 2015. See
      #  https://bugzilla.mozilla.org/show_bug.cgi?id=1209901

      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.checksums"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.common.tests.zip"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.cppunittest.tests.zip"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.crashreporter-symbols.zip"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.json"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.mochitest.tests.zip"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.mozinfo.json"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.reftest.tests.zip"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.talos.tests.zip"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.txt"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.web-platform.tests.zip"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.xpcshell.tests.zip"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.zip"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\host\\bin\\mar.exe"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\host\\bin\\mbsdiff.exe"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\install\\sea\\firefox-43.0a1.en-US.{{arch}}.installer.exe"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\jsshell-{{arch}}.zip"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\test_packages.json"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\{{arch}}\\xpi\\firefox-43.0a1.en-US.langpack.xpi"
        expires: '{{#from_now}}1 year{{/from_now}}'

  extra:
    treeherderEnv:
      - production
      - staging
    treeherder:
      groupSymbol: "tc"
      groupName: Submitted by taskcluster
      machine:
        # from https://github.com/mozilla/treeherder/blob/9263d8432642c2ca9f68b301250af0ffbec27d83/ui/js/values.js#L3
        platform: {{platform}}

    # Rather then enforcing particular conventions we require that all build
    # tasks provide the "build" extra field to specify where the build and tests
    # files are located.
    locations:
      build: "src/{{object_dir}}/dist/bin/firefox.exe"
      tests: "src/{{object_dir}}/all-tests.json"

Reading through this, you see that, with the exception of the values of a few parameters ({{object_dir}}, {{platform}}, {{arch}}, {{build_type}}, {{mozconfig}}), this file spells out the full set of steps that a Windows build of Firefox Desktop requires on the Worker Type we created above. In other words, you see the full system setup in the Worker Type definition, and the full set of task steps in this Task Definition - so now you know as much as I do about how to build Firefox Desktop on Windows. It all exists in-tree, and is transparent to developers.

So where do these parameters come from? Well, this is just the base config - we define opt and debug builds for the win32 and win64 architectures. These live alongside it under testing/taskcluster/tasks/builds.

Here I will illustrate just one of them, the win32 debug build config:

$inherits:
  from: 'tasks/builds/firefox_windows_base.yml'
  variables:
    build_type: 'debug'
    arch: 'win32'
    platform: 'windowsxp'
    object_dir: 'obj-i686-pc-mingw32'
    mozconfig: 'browser\\config\\mozconfigs\\win32\\debug'
task:
  extra:
    treeherder:
      collection:
        debug: true
  payload:
    env:
      CommandPromptType: "Cross"
      LIB: "C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\LIB;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\ATLMFC\\LIB;C:\\Program Files (x86)\\Windows Kits\\8.1\\lib\\winv6.3\\um\\x86;"
      LIBPATH: "C:\\Windows\\Microsoft.NET\\Framework64\\v4.0.30319;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\LIB;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\ATLMFC\\LIB;C:\\Program Files (x86)\\Windows Kits\\8.1\\References\\CommonConfiguration\\Neutral;C:\\Program Files (x86)\\Microsoft SDKs\\Windows\\v8.1\\ExtensionSDKs\\Microsoft.VCLibs\\12.0\\References\\CommonConfiguration\\neutral;"
      MOZ_MSVCBITS: "32"
      Path: "C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\Common7\\IDE\\CommonExtensions\\Microsoft\\TestWindow;C:\\Program Files (x86)\\MSBuild\\12.0\\bin\\amd64;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\BIN\\amd64_x86;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\BIN\\amd64;C:\\Windows\\Microsoft.NET\\Framework64\\v4.0.30319;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\VCPackages;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\Common7\\IDE;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\Common7\\Tools;C:\\Program Files (x86)\\HTML Help Workshop;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\Team Tools\\Performance Tools\\x64;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\Team Tools\\Performance Tools;C:\\Program Files (x86)\\Windows Kits\\8.1\\bin\\x64;C:\\Program Files (x86)\\Windows Kits\\8.1\\bin\\x86;C:\\Program Files (x86)\\Microsoft SDKs\\Windows\\v8.1A\\bin\\NETFX 4.5.1 Tools\\x64\\;C:\\Windows\\System32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\mozilla-build\\moztools-x64\\bin;C:\\mozilla-build\\7zip;C:\\mozilla-build\\info-zip;C:\\mozilla-build\\kdiff3;C:\\mozilla-build\\mozmake;C:\\mozilla-build\\nsis-3.0b1;C:\\mozilla-build\\nsis-2.46u;C:\\mozilla-build\\python;C:\\mozilla-build\\python\\Scripts;C:\\mozilla-build\\upx391w;C:\\mozilla-build\\wget;C:\\mozilla-build\\yasm"
      Platform: "X86"
      PreferredToolArchitecture: "x64"
      TOOLCHAIN: "64-bit cross-compile"

This file above has defined those parameters, and provided some more task specific config too, which overlays the base config we saw before.

But wait a minute… how do these tasks know to use the win2012r2 worker type we created? The answer to that is that testing/taskcluster/tasks/builds/firefox_windows_base.yml inherits from testing/taskcluster/tasks/windows_build.yml:

# This is the base windows task which contains the common values all windows builds must
# provide.
---
$inherits:
  from: 'tasks/build.yml'

task:
  workerType: win2012r2

Incidentally, this then inherits in turn from the root yaml file for all gecko builds (across all gecko platforms):

# This is the "base" task which contains the common values all builds must
# provide.
---
taskId: {{build_slugid}}

task:
  created: '{{now}}'
  deadline: '{{#from_now}}24 hours{{/from_now}}'
  metadata:
    source: http://todo.com/soon
    owner: mozilla-taskcluster-maintenance@mozilla.com

  tags:
    createdForUser: {{owner}}

  provisionerId: aws-provisioner-v1
  schedulerId: task-graph-scheduler

  routes:
    - 'index.gecko.v1.{{project}}.revision.linux.{{head_rev}}.{{build_name}}.{{build_type}}'
    - 'index.gecko.v1.{{project}}.latest.linux.{{build_name}}.{{build_type}}'
  scopes:
    - 'queue:define-task:aws-provisioner-v1/build-c4-2xlarge'
    - 'queue:create-task:aws-provisioner-v1/build-c4-2xlarge'


  payload:

    # Two hours is long but covers edge cases (and matches bb based infra)
    maxRunTime: 7200

    env:
      # Common environment variables for checking out gecko
      GECKO_BASE_REPOSITORY: '{{base_repository}}'
      GECKO_HEAD_REPOSITORY: '{{head_repository}}'
      GECKO_HEAD_REV: '{{head_rev}}'
      GECKO_HEAD_REF: '{{head_ref}}'
      TOOLTOOL_REPO: 'https://git.mozilla.org/build/tooltool.git'
      TOOLTOOL_REV: 'master'

  extra:
    build_product: '{{build_product}}'
    index:
      rank: {{pushlog_id}}
    treeherder:
      groupSymbol: tc
      groupName: Submitted by taskcluster
      symbol: B

So the complete inheritance chain looks like this:

tasks/build.yml
  tasks/windows_build.yml
    tasks/builds/firefox_windows_base.yml
      tasks/builds/firefox_win32_opt.yml
      tasks/builds/firefox_win32_debug.yml
      tasks/builds/firefox_win64_opt.yml
      tasks/builds/firefox_win64_debug.yml
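
If you are curious how $inherits could work mechanically, here is a simplified sketch, not the actual mach implementation: it ignores the variables block and the {{...}} template substitution, and just overlays a child task's keys on top of its parent's:

import os
import yaml

def load_task(path, tasks_root):
    # load a task YAML file and recursively merge it on top of its parent
    with open(os.path.join(tasks_root, path)) as f:
        task = yaml.safe_load(f)
    inherits = task.pop('$inherits', None)
    if not inherits:
        return task
    parent = load_task(inherits['from'], tasks_root)
    return deep_merge(parent, task)

def deep_merge(base, overlay):
    # child keys win; nested dicts are merged key by key
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# e.g. load_task('tasks/builds/firefox_win32_debug.yml', 'testing/taskcluster')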

Getting the new tasks added to Try pushes

This involved adding win32 and win64 as build platforms in testing/taskcluster/tasks/branches/base_job_flags.yml (previously taskcluster was not running any tasks for these platforms):

---
# List of all possible flags for each category of tests used in the case where
# "all" is specified.
flags:
  aliases:
    mochitests: mochitest

  builds:
    - emulator
    - emulator-jb
    - emulator-kk
    - emulator-x86-kk
    .....
    .....
    .....
    - android-api-11
    - linux64
    - macosx64
    - win32   ########## <---- added here
    - win64   ########## <---- added here

  tests:
    - cppunit
    - crashtest
    - crashtest-ipc
    - gaia-build
    .....
    .....
    .....

And then associating these new task definitions we just created, to these new build platforms. This is done in testing/taskcluster/tasks/branches/try/job_flags.yml:

---
# For complete sample of all build and test jobs,
# see <gecko>/testing/taskcluster/tasks/job_flags.yml

$inherits:
  from: tasks/branches/base_job_flags.yml

# Flags specific to this branch
flags:
  post-build:
    - upload-symbols

builds:
  win32:
    platforms:
      - win32
    types:
      opt:
        task: tasks/builds/firefox_win32_opt.yml
      debug:
        task: tasks/builds/firefox_win32_debug.yml
  win64:
    platforms:
      - win64
    types:
      opt:
        task: tasks/builds/firefox_win64_opt.yml
      debug:
        task: tasks/builds/firefox_win64_debug.yml
  linux64_gecko:
    platforms:
      - b2g
    types:
      opt:
    .....
    .....
    .....

Summary

The above hopefully has given you a taste for what you can do yourself in TaskCluster, and specifically in Gecko, regarding setting up new jobs. By following this guide, you too should be able to schedule Windows jobs in Taskcluster, including try jobs for Gecko projects.

For more information about TaskCluster, see docs.taskcluster.net.

September 30, 2015 02:08 PM

John Ford

Taskcluster Component Loader

Taskcluster is the new platform for building Automation at Mozilla.  One of the coolest design decisions is that it's composed of a bunch of limited scope, interchangeable services that have well defined and enforced apis.  Examples of services are the Queue, Scheduler, Provisioner and Index.  In practice, the server-side components roughly map to a Heroku app.  Each app can have one or more web worker processes and zero or more background workers.

Since we're building our services with the same base libraries we end up having a lot of duplicated glue code.  During a set of meetings in Berlin, Jonas and I were lamenting about how much copied, pasted and modified boilerplate was in our projects.

Between the API definition file and the command line to launch a program invariably sits a bin/server.js file for each service.  This script basically loads up our config system, loads our Azure Entity library, loads a Pulse publisher, a JSON Schema validator and a Taskcluster-base App.  Each background worker has its own bin/something.js which basically has a very similar loop.  Services with unit tests have a test/helper.js file which initializes the various components for testing.  Furthermore, we might have things initialize inside of a given before() or beforeEach().

The problem with having so much boilerplate is twofold.  First, each time we modify one service's boilerplate, we are now adding maintenance complexity and risk because of that subtle difference to the other services.  We'd eventually end up with hundreds of glue files which do roughly the same thing, but accomplish it completely differently depending on which service it's in.  The second problem is that within a single project, we might load the same component ten ways in ten places, including in tests.  Having a single codepath that we can test ensures that we're always initializing the components properly.

During a little downtime between sessions, Jonas and I came up with the idea to have a standard component loading system for taskcluster services.  Being able to rapidly iterate and discuss in person made the design go very smoothly and in the end, we were able to design something we were both happy with in about an hour or so.

The design we took is to have two 'directories' of components.  One is the project wide set of components which has all the logic about how to build the complex things like validators and entities.  These components can optionally have dependencies.  In order to support different values for different environments, we force the main directory to declare which 'virtual dependencies' it requires.  They are declared as a list of strings.  The second level of component directory is where these 'virtual dependencies' have their value.

Both Virtual and Concrete dependencies can either be 'flat' values or objects.  If a dependency is a string, number, function, Promise or an object without a create property, we just give that exact value back as a resolved Promise.  If the component is an object with a create property, we initialize the dependencies specified by the 'requires' list property, and pass those values as properties on an object to the function at the 'create' property.  The value of that function's return is stored as a resolved promise.  Components can only depend on other components' non-flat dependencies.

Using code is a good way to show how this loader works:

// lib/components.js

let loader = require('taskcluster-base').loader;
let fakeEntityLibrary = require('fake');

module.exports = loader({
  fakeEntity: {
    requires: ['connectionString'],
    setup: async deps => {
      let conStr = await deps.connectionString;
      return fakeEntityLibrary.create(conStr);
    },
  },
}, ['connectionString']);
 
In this file, we're building a really simple component directory which only contains a contrived 'fakeEntity'.  This component depends on having a connection string to be fully configured.  Since we want to use this code in production, development and testing, we don't want to bake configuration into this file, so we force whatever uses this directory to give us a way to configure the connection string itself.

// bin/server.js
let config = require('taskcluster-base').config('development');
let loader = require('../lib/components.js');

let load = loader({
  connectionString: config.entity.connectionString,
});

// note: the await below needs to run inside an async function
let configuredFakeEntity = await load('fakeEntity');
 
In this file, we're providing a simple directory that satisfies the 'virtual' dependencies that we know need to be fulfilled before initialization can happen.

Since we're creating a dependency tree, we want to avoid having cyclic dependencies.  I've implemented a cycle checker which ensures that you cannot configure a cyclical dependency.  It doesn't rely on the call stack being exceeded from infinite recursion either!
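
For illustration, a non-recursive cycle check over a 'requires' graph can be done with an explicit stack.  This is just a sketch of the general technique in Python, not the actual loader code:

def find_cycle(graph):
    # graph: dict mapping component name -> list of required component names.
    # Returns the nodes of a cycle if one exists, otherwise None.  Uses an
    # explicit stack instead of recursion, so a deep dependency graph cannot
    # blow the call stack.
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}
    for start in graph:
        if colour[start] != WHITE:
            continue
        colour[start] = GREY
        stack = [(start, iter(graph.get(start, ())))]
        path = [start]
        while stack:
            node, deps = stack[-1]
            advanced = False
            for dep in deps:
                if colour.get(dep, WHITE) == GREY:
                    # dep is on the current path: we found a cycle
                    return path[path.index(dep):] + [dep]
                if colour.get(dep, WHITE) == WHITE:
                    colour[dep] = GREY
                    stack.append((dep, iter(graph.get(dep, ()))))
                    path.append(dep)
                    advanced = True
                    break
            if not advanced:
                colour[node] = BLACK
                stack.pop()
                path.pop()
    return None

# find_cycle({'a': ['b'], 'b': ['c'], 'c': ['a']})  ->  ['a', 'b', 'c', 'a']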

This is far from being the only thing that we figured out improvements for during this chat.  Two other problems that we were able to talk through were splitting out taskcluster-base and having a background worker framework.

Currently, taskcluster-base is a monolithic library.  If you want our Entities at version 0.8.4, you must take our config at 0.8.4 and our rest system at 0.8.4.  This is great because it forces services to all move together.  This is also awful because sometimes we might need a new stats library but can't afford the time to upgrade a bunch of Entities.  It also means that if someone wants to hack on our stats module, they'll need to learn how to get our Entities unit tests to work to get a passing test run on their stats change.

Our plan here is to make taskcluster-base a 'meta-package' which depends on a set of taskcluster components that we support working together.  Each of the libraries (entities, stats, config, api) will be split out into their own packages using git filter-branch to maintain history.  This is just a bit of simple leg work of ensuring that the splitting out goes smoothly.

The other thing we decided on was a standardized background looping framework.  A lot of background workers follow the pattern "do this thing, wait one minute, do this thing again".  Instead of each service implementing this in its own special way for each background worker, what we'd really like is to have a library which does all the looping magic itself.  We can even have nice things like a watch dog timer to ensure that the loop doesn't stick.
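
As a purely illustrative sketch (not the library we ended up writing), the watch dog can be as simple as a timer that kills the process if a single iteration hangs, so a supervisor can restart it:

import os
import threading
import time

def loop_forever(work, interval=60, watchdog_timeout=600):
    # run work() repeatedly, sleeping `interval` seconds between iterations;
    # if one iteration takes longer than `watchdog_timeout` seconds, the
    # watch dog fires and the process exits so it can be restarted.
    while True:
        watchdog = threading.Timer(watchdog_timeout, lambda: os._exit(1))
        watchdog.daemon = True
        watchdog.start()
        try:
            work()
        finally:
            watchdog.cancel()
        time.sleep(interval)

# e.g. loop_forever(lambda: print("do this thing"), interval=60)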

Once the PR has landed for the loader, I'm going to be converting the provisioner to use this new loader.  This is a part of a new effort to make Taskcluster components easy to implement.  Once a bunch of these improvements have landed, I intend to write up a couple blog posts on how you can write your own Taskcluster service.

September 30, 2015 12:56 PM

September 29, 2015

Ehsan Akhgari

My experience adding a new build type using TaskCluster

TaskCluster is Mozilla’s task queuing, scheduling and execution service.  It allows the user to schedule a DAG representing a task graph that describes some tasks and their dependencies, and how to execute them, and it schedules them to run in the needed order on a number of slave machines.

As of a while ago, some of the continuous integration tasks have been running on TaskCluster, and I recently set out to enable static analysis optimized builds on Linux64 on top of TaskCluster.  I had previously added a similar job for debug builds on OS X in buildbot, and I am amazed at how much the experience has improved!  It is truly easy to add a new type of job now as a developer without being familiar with buildbot or anything like that.  I’m writing this post to share my experience on how I did this.

The process of scheduling jobs in TaskCluster starts by a slave downloading a specific revision of a tree, and running the ./mach taskcluster-graph command to generate a task graph definition.  This is what happens in the “gecko-decision” jobs that you can see on TreeHerder.  The mentioned task graph is computed using the task definition information in testing/taskcluster.  All of the definitions are in YAML, and I found the naming of variables relatively easy to understand.  The build definitions are located in testing/taskcluster/tasks/builds and after some poking around, I found linux64_clobber.yml.

If you look closely at that file, a lot of things are clear from the names.  Here are important things that this file defines:

  • $inherits: These files have a single inheritance structure that allows you to refactor the common functionality into “base” definitions.
  • A lot of things have “linux64” in their name.  This gave me a good starting point when I was trying to add a “linux64-st-an” (a made-up name) build by copying the existing definition.
  • payload.image contains the name of the docker image that this build runs.  This is handy to know if you want to run the build locally (yes, you can do that!).
  • It points to builds/releng_base_linux_64_builds.py which contains the actual build definition.

Looking at the build definition file, you will find the steps run in the build, whether the build should trigger unit tests or Talos jobs, the environment variables used during the build, and most importantly the mozconfig and tooltool manifest paths.  (In case you’re not familiar with Tooltool, it lets you upload your own tools to be used during the build time.  This can be new experimental toolchains, custom programs your build needs to run, which is useful for things such as performing actions on the build outputs, etc.)

This basically gave me everything I needed to define my new build type, and I did that in bug 1203390, and these builds are now visible on TreeHerder as “[Tier-2](S)” on Linux64.  This is the gist of what I came up with.


I think this is really powerful since it finally allows you to fully control what happens in a job.  For example, you can use this to create new build/test types on TreeHerder, do try pushes that test changes to the environment a job runs in, do highly custom tasks such as creating code coverage results, which requires a custom build step and custom test steps and uploading of custom artifacts!  Doing this under the old BuildBot system is unheard of.   Even if you went out of your way to learn how to do that, as I understand it, there was a maximum number of build types that we were getting close to which prevented us from adding new job types as needed!  And it was much much harder to iterate on (as I did when I was working on this on the try server bootstrapping a whole new build type!) as your changes to BuildBot configs needed to be manually deployed.

Another thing to note is that I found out all of the above pretty much by myself, and didn’t even have to learn every bit of what I encountered in the files that I copied and repurposed!  This was extremely straightforward.  I’m already on my way to add another build type (using Ted’s bleeding edge Linux to OS X cross compiling support)!  I did hit hurdles along the way but almost none of them were related to TaskCluster, and with the few ones that were, I was shooting myself in the foot and Dustin quickly helped me out.  (Thanks, Dustin!)

Another neat feature of TaskCluster is the inspector tool.  In TreeHerder, you can click on a TaskCluster job, go to Job Details, and click on “Inspect Task”.  You’ll see a page like this.  In that tool you can do a number of neat things.  One is that it shows you a “live.log” file which is the live log of what the slave is doing.  This means that you can see what’s happening in close to real time, without having to wait for the whole job to finish before you can inspect the log.  Another neat feature is the “Run locally” commands that show you how to run the job in a local docker container.  That will allow you to reproduce the exact same environment as the ones we use on the infrastructure.

I highly encourage people to start thinking about the ways they can harness this power.   I look forward to see what we’ll come up with!

September 29, 2015 03:05 PM

August 13, 2015

Jonas Finnemann Jensen

Getting Started with TaskCluster APIs (Interactive Tutorials)

When we started building TaskCluster about a year and a half ago one of the primary goals was to provide a self-serve experience, so people could experiment and automate things without waiting for someone else to deploy new configuration. Greg Arndt (:garndt) recently wrote a blog post demystifying in-tree TaskCluster scheduling. The in-tree configuration allows developers to write new CI tasks to run on TaskCluster, and test these new tasks on try before landing them like any other patch.

This way of developing test and build tasks by adding in-tree configuration in a patch is very powerful, and it allows anyone with try access to experiment with configuration for much of our CI pipeline in a self-serve manner. However, not all tools are best triggered from a post-commit-hook, instead it might be preferable to have direct API access when:

  • Locating existing builds in our task index,
  • Debugging for intermittent issues by running a specific task repeatedly, and
  • Running tools for bisecting commits.

To facilitate tools like this, TaskCluster offers a series of well-documented REST APIs that can be accessed with either permanent or temporary TaskCluster credentials. We also provide client libraries for Javascript (node/browser), Python, Go and Java. However, being that TaskCluster is a loosely coupled set of distributed components it is not always trivial to figure out how to piece together the different APIs and features. To make these things more approachable I’ve started a series of interactive tutorials:

All these tutorials are interactive, featuring a runtime that will transpile your code with babel.js before running it in the browser. The runtime environment also exposes the require function from a browserify bundle containing some of my favorite npm modules, making the example editors a great place to test code snippets using taskcluster or related services.
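
As a taste of what those APIs look like outside the browser, here is roughly what locating an existing build through the task index (one of the use cases above) looks like with the Python client. The index namespace below is illustrative only; consult the index documentation for real routes:

import taskcluster

index = taskcluster.Index()
queue = taskcluster.Queue()

# look up the most recent build indexed under an (illustrative) namespace
task = index.findTask('gecko.v1.mozilla-central.latest.linux.linux64.opt')
task_id = task['taskId']

# ask the queue for the current state of that task
status = queue.status(task_id)
print(task_id, status['status']['state'])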

Happy hacking, and feel free to submit PRs for all my spelling errors at github.com/taskcluster/taskcluster-docs.

August 13, 2015 10:25 PM

June 04, 2015

Ben Hearsum

Buildbot <-> Taskcluster Bridge Now in Production

A few weeks ago I gave a brief overview of the Buildbot <-> Taskcluster Bridge that we've been developing, and Selena provided some additional details about it yesterday. Today I'm happy to announce that it is ready to take on production work. As more and more jobs from our CI infrastructure move to Taskcluster, the Bridge will coordinate between them and jobs that must remain in Buildbot for the time being.

What's next?

The Bridge itself is feature complete until our requirements change (though there's a couple of minor bugs that would be nice to fix), but most of the Buildbot Schedulers still need to be replaced with Task Graphs. Some of this work will be done at the same time as porting specific build or test jobs to run natively in Taskcluster, but it doesn't have to be. I made a proof of concept on how to integrate selected Buildbot builds into the existing "taskcluster-graph" command and disable the Buildbot schedulers that it replaces. With a bit more work this could be extended to schedule all of the Buildbot builds for a branch, which would make porting specific jobs simpler. If you'd like to help out with this, let me know!

June 04, 2015 03:11 PM

June 03, 2015

Selena Deckelmann

TaskCluster migration: about the Buildbot Bridge

Back on May 7, Ben Hearsum gave a short talk about an important piece of technology supporting our transition to TaskCluster, the Buildbot Bridge. A recording is available.

I took some detailed notes to spread the word about how this work is enabling a great deal of important Q3 work like the Release Promotion project. Basically, the bridge allows us to separate out work that Buildbot currently runs in a somewhat monolithic way into TaskGraphs and Tasks that can be scheduled separately and independently. This decoupling is a powerful enabler for future work.

Of course, you might argue that we could perform this decoupling in Buildbot.

However, moving to TaskCluster means adopting a modern, distributed queue-based approach to managing incoming jobs. We will be freed of the performance tradeoffs and careful attention required when using relational databases for queue management (Buildbot uses MySQL for its queues, TaskCluster uses RabbitMQ and Azure). We also will be moving “decision tasks” in-tree, meaning that they will be closer to developer environments and likely easier to manage keeping developer and build system environments in sync.

Here are my notes:

Why have the bridge?

  • Allows a graceful transition
  • We’re in an annoying state where we can’t have dependencies between buildbot builds and taskcluster tasks. For example: we can’t move firefox linux builds into taskcluster without moving everything downstream of those also into taskcluster
  • It’s not practical and sometimes just not possible to move everything at the same time. This lets us reimplement buildbot schedulers as task graphs. Buildbot builds are tasks on the task graphs, enabling us to change each task to be implemented by a Docker worker, a generic worker or anything we want or need at that point.
  • One of the driving forces is the build promotion project: the funsize, anti-virus scanning and binary moving pieces are going to be implemented as taskcluster tasks, but the rest will remain in Buildbot. We need to be able to bounce between the two.

What is the Buildbot Bridge (BBB)

BBB acts as a TC worker and provisioner and delegates all those things to BuildBot. As far as TC is concerned, BBB is doing all this work, not Buildbot itself. TC knows nothing about Buildbot.

There are three services:

  • TC Listener: responds to things happening in TC
  • BuildBot Listener: responds to BB events
  • Reflector: takes care of things that can’t be done in response to events; it reclaims tasks periodically, for example. TC expects running tasks to be reclaimed periodically, and if a Task stops being reclaimed, TC considers that Task dead.

BBB has a small database that associates build requests with TC taskids and runids.
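
The notes do not show the schema, but conceptually this is just a mapping table. A hypothetical minimal sketch (not the real BBB schema) might look like this:

import sqlite3

# hypothetical sketch only; the real BBB database schema is not described
# in the talk notes.
db = sqlite3.connect('bbb.db')
db.execute("""
    CREATE TABLE IF NOT EXISTS task_map (
        buildrequest_id INTEGER PRIMARY KEY,  -- Buildbot build request
        task_id         TEXT NOT NULL,        -- TaskCluster taskId
        run_id          INTEGER NOT NULL      -- TaskCluster runId
    )
""")
db.commit()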

BBB is designed to be multihomed. It is currently deployed but not running on three Buildbot masters. We can lose an AWS region and the bridge will still function. It consumes from Pulse.

The system is dependent on Pulse, SchedulerDB and Self-serve (in addition to a Buildbot master and Taskcluster).

Taskcluster Listener

Reacts to events coming from TC Pulse exchanges.

Creates build requests in response to tasks becoming “pending”. When someone pushes to mozilla-central, BBB inserts BuildRequests into BB SchedulerDB. Pending jobs appear in BB. BBB cancels BuildRequests as well — can happen from timeouts, someone explicitly cancelling in TC.

Buildbot Listener

Responds to events coming from the BB Pulse exchanges.

Claims a Task when its build starts. Attaches BuildBot Properties (buildslave name and other information/metadata) to Tasks as artifacts. It also resolves those Tasks when the builds finish.

Buildbot and TC don’t have a 1:1 mapping of BB statuses and TC resolution. Also needs to coordinate with Treeherder color. A short discussion happened about implementing these colors in an artifact rather than inferring them from return codes or statuses inherent to BB or TC.

Reflector

  • Runs on a timer – every 60 seconds
  • Reclaims tasks: needs to do this every 30-60 minutes (see the sketch after this list)
  • Cancels Tasks when a BuildRequest is cancelled on the BB side (it has to troll through the BB DB to detect this state)
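
To make the reclaiming duty concrete, here is a minimal sketch (not the actual BBB code) of what periodically reclaiming a claimed task against the Taskcluster queue amounts to; the taskId, runId and interval below are placeholders and authentication is omitted:

# Sketch only: keep a claimed task alive by reclaiming it on a timer.
TASK_ID="placeholder-task-id"   # hypothetical taskId
RUN_ID=0
while true; do
    # If these calls stop before the current claim expires,
    # the queue treats the run as dead.
    curl -X POST "https://queue.taskcluster.net/v1/task/${TASK_ID}/runs/${RUN_ID}/reclaim"
    sleep 1200   # well inside the 30-60 minute window mentioned above
done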

Scenarios

  • A successful build!

Task is created. The Task in TC is pending, nothing in BB. TCListener picks up the event and creates a BuildRequest (pending).

BB creates a Build. BBListener receives buildstarted event, claims the Task.

Reflector reclaims the Task while the Build is running.

Build completes successfully. BBListener receives log uploaded event (build finished), reports success in TaskCluster.

  • Build fails initially, succeeds upon retry

(500 from hg – common reason to retry)

Same through Reflector.

The BB build fails and is marked as RETRY. BBListener receives the log uploaded event, reports an exception to Taskcluster and calls rerun on the Task.

BB has already started a new Build. TCListener receives the task-pending event, updates the runid and does not create a new BuildRequest.

The Build completes successfully. The Buildbot Listener receives the log uploaded event and reports success to TaskCluster.

  • Task exceeds deadline before Build starts

The Task is created. TCListener receives the task-pending event and creates a BuildRequest. Nothing happens until the Task goes past its deadline and TaskCluster cancels it. TCListener then receives the task-exception event and cancels the BuildRequest through Self-serve.

QUESTIONS:

  • TC deadline, what is it? In the Queue, a task past its deadline is marked as timeout/deadline exceeded.

  • On TH, if someone requests a rebuild twice, what happens? There is no retry/rerun; we duplicate the subgraph — wherever we retrigger, you get everything below it, so you’d end up with duplicates. Retries and rebuilds are separate: rebuilds are triggered by humans, retries are internal to BB. TC doesn’t have a concept of retries.

  • How do we avoid duplicate reporting? TC will be considered source of truth in the future. Unsure about interim. Maybe TH can ignore duplicates since the builder names will be the same.

  • Replacing the scheduler what does that mean exactly?

    • Mostly moving decision tasks in-tree — practical impact: YAML files get moved into the tree
    • Remove all scheduling from BuildBot and Hg polling

Roll-out plan

  • Connected to the Alder branch currently
  • Replacing some of the Alder schedulers with TaskGraphs
  • All the BB Alder schedulers are disabled, and a push was able to generate a TaskGraph!

Next steps might be release scheduling tasks, rather than merging into central. Someone else might be able to work on other CI tasks in parallel.

June 03, 2015 04:59 PM

June 02, 2015

Selena Deckelmann

TaskCluster migration: a “hello, world” for worker task creator

On June 1, 2015, Morgan and Dustin presented an introduction to configuring and testing TaskCluster worker tasks. The session was recorded. Their notes are also available in an etherpad.

The key tutorial information centered on how to set up jobs, test/run them locally and selecting appropriate worker types for jobs.

This past quarter Morgan has been working on Linux Docker images and TaskCluster workers for Firefox builds. Using that work as an example, Morgan showed how to set up new jobs with Docker images. She also touched on a couple issues that remain, like sharing sensitive or encrypted information on publicly available infrastructure.

A couple really nice things:

  • You can run the whole configuration locally by copy and pasting a shell script that’s output by the TaskCluster tools
  • There are a number of predefined workers you can use, so that you’re not creating everything from scratch

Dustin gave an overview of task graphs using a specific example. Looking through the docs, I think the best source of documentation other than this video is probably the API documentation. The docs could use a little more narrative for context, as Dustin’s short talk about it demonstrated.

The talk closed with an invitation to help write new tasks, with pointers to the Android work Dustin’s been doing.

June 02, 2015 02:57 PM

May 08, 2015

Ben Hearsum

Buildbot <-> Taskcluster Bridge - An Overview

Mozilla has been using Buildbot as its continuous integration system for Firefox and Fennec for many years now. It enabled us to switch from a machine-per-build model to a pool-of-slaves model, and greatly aided us in getting to our current scale. But it's not perfect - and we've known for a few years that we'll need to do an overhaul. Lucky for us, the FirefoxOS Automation team has built up a fantastic piece of infrastructure known as Taskcluster that we're eager to start moving to.

It's not going to be a small task though - it will take a lot more work than taking our existing build scripts and running them in Taskcluster. One reason for this is that many of our jobs trigger other jobs, and Buildbot manages those relationships. This means that if we have a build job that triggers a test job, we can't move one without moving the other. We don't want to be forced into moving entire job chains at once, so we need something to help us transition more slowly. Our solution to this is to make it possible to schedule jobs in Taskcluster while still implementing them in Buildbot. Once the scheduling is in Taskcluster it's possible to move individual jobs to Taskcluster one at a time. The software that makes this possible is the Buildbot Bridge.

The Bridge is responsible for synchronizing job state between Taskcluster and Buildbot. Jobs that are requested through Taskcluster will be created in Buildbot by the Bridge. When those jobs complete, the Bridge will update Taskcluster with their status. Let's look at a simple example to see how the state changes in both systems over the course of a job being submitted and run:

Event | Taskcluster state | Buildbot state
Task is created | Task is pending | --
Bridge receives "task-pending" event, creates BuildRequest | Task is pending | Build is pending
Build starts in Buildbot | Task is pending | Build is running
Bridge receives "build started" event, claims the Task | Task is running | Build is running
Build completes successfully | Task is running | Build is completed
Bridge receives "build finished" event, reports success to Taskcluster | Task is resolved | Build is completed

The details of how this works are a bit more complicated - if you'd like to learn more, I recommend watching the presentation I did about the Bridge architecture, or just having a read through my slides.

May 08, 2015 04:37 PM

March 31, 2015

Rail Aliiev

Taskcluster: First Impression

Good news. We decided to redesign Funsize a little and now it uses Taskcluster!

The nature of Funsize is that we may start hundreds of jobs at the same time, then stop sending new jobs and wait for hours. In other words, the service is very bursty. Elastic Beanstalk is not ideal for this use case. Scaling up and down very fast is hard to configure using EB-only tools. Also, running zero instances is not easy.

I tried using Terraform, Cloud Formation and Auto Scaling, but they were also not well suited. There were too many constraints (e.g. Terraform doesn't support all needed AWS features) and they required considerable bespoke setup/maintenance to auto-scale properly.

The next option was Taskcluster, and I was pleased that its design fitted our requirements very well! I was impressed by the simplicity and flexibility offered.

I have implemented a service which consumes Pulse messages for particular buildbot jobs. For nightly builds, it schedules a task graph with three tasks:

  • generate a partial MAR
  • sign it (at the moment a dummy task)
  • publish to Balrog

All tasks are run inside Docker containers which are published on the docker.com registry (other registries can also be used). The task definition essentially consists of the docker image name and a list of commands it should run (usually a single script inside the docker image). In the same task definition you can specify what artifacts should be published by Taskcluster. The artifacts can be public or private.
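
As a rough illustration (not the exact Funsize task definitions), creating such a task boils down to generating a JSON document and PUTting it to the queue under a taskId you choose yourself; the worker type, image, command and metadata below are placeholders, and request signing is left out:

# Generate a taskId up front (a "slugid": URL-safe base64 of a random UUID).
TASK_ID=$(python -c "import uuid, base64; print(base64.urlsafe_b64encode(uuid.uuid4().bytes).decode().rstrip('='))")

cat > task.json <<'EOF'
{
  "provisionerId": "aws-provisioner",
  "workerType": "funsize-worker",
  "created": "2015-03-31T12:00:00Z",
  "deadline": "2015-03-31T13:00:00Z",
  "payload": {
    "image": "example/funsize-update-generator:0.1",
    "command": ["/bin/bash", "-c", "/runme.sh"],
    "maxRunTime": 3600
  },
  "metadata": {
    "name": "generate partial MAR (example)",
    "description": "illustrative task definition only",
    "owner": "someone@example.com",
    "source": "https://example.com/"
  }
}
EOF

# createTask is a PUT of the definition keyed by the chosen taskId
# (in practice you would use a Taskcluster client that signs the request).
curl -X PUT --data @task.json "https://queue.taskcluster.net/v1/task/${TASK_ID}"

Choosing the taskId up front is also what makes the "fire and forget" style mentioned below possible.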

Things that I really liked

  • Predefined task IDs. This is a great idea! There is no need to talk to the Taskcluster API to get an ID (or multiple IDs for task graphs), nor to parse a response. Fire and forget! The task IDs can be used in different places, like artifact URLs, dependent tasks, etc.
  • Task graphs. This is basically a collection of tasks that can be run in parallel and can depend on each other. This is a nice way to declare your jobs and know them in advance. If needed, a task graph can be extended by its own tasks (decision tasks) dynamically.
  • Simplicity. All you need to do is generate a valid JSON document and submit it to Taskcluster using the HTTP API.
  • User defined docker images. One of the downsides of Buildbot is that you have a predefined list of slaves with predefined environment (OS, installed software, etc). Taskcluster leverages Docker by default to let you use your own images.

Things that could be improved

  • Encrypted variables. I spent 2-3 days fighting with the encrypted variables. My scheduler was written in Python, so I tried to use a half dozen different Python PGP libraries, but for some reason all of them were generating an incompatible OpenPGP format that Taskcluster could not understand. This forced me to rewrite the scheduling part in Node.js using openpgpjs. There is a bug to address this problem globally. Also, using ISO time stamps would have saved me hours of time. :)
  • It would be great to have a generic scheduler that doesn't require third-party Taskcluster consumers to write their own daemons watching for changes (AMQP, VCS, etc.) in order to generate tasks. This would lower the entry barrier for beginners.

Conclusion

There are many other things that can be improved (and I believe they will!) - Taskcluster is still a new project. Regardless of this, it is very flexible, easy to use and develop. I would recommend using it!

Many thanks to garndt, jonasfj and lightsofapollo for their support!

March 31, 2015 12:47 PM

February 23, 2015

James Lal

Taskcluster Release Part 1 : Gecko

It's been awhile since my last blog post about taskcluster and I wanted to give an update...

Taskcluster + Gecko

Taskcluster is running by default on

In Treeherder you will see jobs run by both buildbot and taskcluster. The "TC" jobs are prefixed accordingly so you can tell the difference.

This is the last big step to enabling TC as the default CI for many mozilla projects. Adding new and existing branches is easily achieved with basic config changes.

Why is this a great thing? Just about everything is in the tree.

This means you can easily add new builds/tests and immediately push them to try for testing (see the configs for try).

Adding new tests and builds is easier than ever but the improvements don't stop there. Other key benefits on linux include:

We use docker

Docker enables easy cloning of CI environments.

# Pull tester image
docker pull quay.io/mozilla/tester:0.0.14
# Run tester image shell
docker run -it quay.io/mozilla/tester:0.0.14 /bin/bash
# <copy/paste stuff from task definitions into this>
Tests and builds are faster

Through this entire process we have been optimizing away overhead and using faster machines which means both build (and particularly test) times are faster.

(The wins look big, but more on that in a future blog post.)

What's missing ?
  • Some tests fail due to differences in machines. When we move tests, things fail largely due to timing issues (there are a few cases left here).

  • Retrigger/cancel does not work (yet!). As of the time of writing it has not yet hit production, but will be deployed soon.

  • Results currently show up only on staging treeherder. We will incrementally report these to production treeherder.

February 23, 2015 12:00 AM

February 15, 2015

Rail Aliiev

Funsize hacking

Prometheus

The idea of using a service which can generate partial updates for Firefox has been around for years. We actually used to have a server called Prometheus that was responsible for generating updates for nightly builds and the generation was done as a separate process from actual builds.

Scaling that solution wasn't easy and we switched to build-time update generation. Generating updates as a part of builds helped with load distribution, but lacked flexibility: there is no easy way to generate updates after the build, because the update generation process is directly tied to the build or repack process.

Funsize will solve the problems listed above: it distributes the load and it is flexible.

Last year Anhad started and Mihai continued working on this project. They have done a great job and created a solution that can easily be scaled.

Funsize is split into several pieces:

  • REST API frontend powered by Flask. It's responsible for accepting partial generation requests, forwarding them to the queue and returning generated partials.
  • Celery-based workers to generate partial updates and upload them to S3.
  • SQS or RabbitMQ to coordinate Celery workers.

One of the biggest gains of Funsize is that it uses a global cache to speed up partial generation. For example, after we build an en-US Windows build, we ask Funsize to generate a partial. Then a swarm of L10N repacks (almost a hundred of them per platform) tries to do a similar job, and every single one asks for a partial update. All L10N builds have a lot in common, and xul.dll is one of the biggest files. Since the files are identical, there is no reason not to reuse the previously generated binary patch for that file. Repeat 100 times for multiple files. PROFIT!
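
The caching idea itself is simple; a sketch of it (not the actual Funsize code, and with a made-up cache bucket and layout) could look like this:

# Key the cached binary patch by the hashes of the source and destination files.
SRC_HASH=$(sha512sum old/xul.dll | cut -d' ' -f1)
DST_HASH=$(sha512sum new/xul.dll | cut -d' ' -f1)
CACHE_KEY="s3://funsize-cache/${SRC_HASH}-${DST_HASH}"   # hypothetical bucket/layout

if aws s3 cp "$CACHE_KEY" xul.dll.patch 2>/dev/null; then
    echo "cache hit - reusing the previously generated patch"
else
    # mbsdiff is the binary-diff tool used when building partial MARs
    mbsdiff old/xul.dll new/xul.dll xul.dll.patch
    aws s3 cp xul.dll.patch "$CACHE_KEY"   # populate the cache for the next repack
fi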

The first prototype of Funsize lives at github. If you are interested in hacking, read the docs on how to set up your developer environment. If you don't have an AWS account, it will use a local cache.

Note: this prototype may be redesigned and switch to using TaskCluster. Taskcluster is going to simplify the initial design and reduce dependency on always online infrastructure.

February 15, 2015 04:32 AM

May 27, 2014

James Lal

Gaia + Taskcluster + Treeherder

What is this stuff?

(originally posted on dev-gaia)

For some time now Gaia developers have wanted the ability to scale their tests infinitely, while reporting to a dashboard that both sheriffs and devs can monitor, and yet still maintain control over the test configurations themselves.

Taskcluster & Treeherder let us do this: http://treeherder-dev.allizom.org/ui/#/jobs?repo=gaia-master. Taskcluster (http://docs.taskcluster.net/) drives the tests and, with a small github hook, allows us to configure the jobs from a json file in the tree (this will likely be a yaml file in the end): https://github.com/mozilla-b2g/gaia/blob/master/taskgraph.json

Treeherder is the next generation "TBPL" which allows us to report results to sheriffs from external resources (meaning we can control the tests) for both a "try" interface (like pull requests) and branch landings.

Currently, we are very close to having green runs in treeherder, with only one intermittent and the rest green ...

How is this different than gaia-try?

Taskcluster will eventually replace all buildbot run jobs (starting with linux)... we are currently in the process of moving tests over and getting treeherder ready for production.

Gaia-try is run on top of buildbot and hooks into our github pull requests. Gaia-try gives us a single set of suites that the sheriffs can look at and help keep our tree green. This should be considered "production".

Treeherder/taskcluster are designed to solve the issues with the current buildbot/tbpl implementations:

  • in tree configuration

  • complete control over the test environment with docker (meaning you can have the exact same setup locally as on TBPL!)

  • artifacts for pull requests (think screenshots for failed tests, gaia profiles, etc...)

    • in tree graph capabilities (for example, "smoketest" builds that run smaller test suites, or expressing how tests depend on builds).

How is this different from travis-ci?

  • we can scale on demand on any AWS hardware we like (at very low cost thanks to spot)

  • docker is used to provide a consistent test environment that may be run locally

    • artifacts for pull requests (think screenshots for failed tests, gaia profiles, etc...)
  • logs can be any size (but still mostly "live")

  • reports to TBPL2 (treeherder)

When is this production ready?

taskcluster + treeherder is not ready for production yet... while the tests are running, this is not yet in a state where sheriffs can manage it. Our plan is to continue to add taskcluster test suites (and builds!) for all trees (yes, gecko) and have them run in parallel with the buildbot jobs this month...

I will be posting weekly updates on my blog about taskcluster/treeherder http://lightsofapollo.github.io/ and how it affects gaia (and hopefully your overall happiness)

Where are the docs??

  • http://docs.taskcluster.net/
  • (More coming to gaia-taskcluster and gaia readme as we get closer to production)

WHERE IS THE CODE?

  • https://github.com/taskcluster (overall project)
  • https://github.com/lightsofapollo/gaia-taskcluster (my current gaia integration)
  • https://github.com/mozilla/treeherder-service (treeherder backend)
  • https://github.com/mozilla/treeherder-ui (treeherder frontend)

May 27, 2014 12:00 AM

March 04, 2014

James Lal

Taskcluster - Mozilla's new test infrastructure project

Taskcluster is not one singular entity that runs a script with output in a pretty interface or a github hook listener, but rather a set of decoupled interfaces that enables us to build various test infrastructures while optimizing for cost, performance and reliability. The focus of this post is Linux. I will have more information on how this works for OSX/Windows soon.

Some History

Mozilla has quite a few different code bases, most of which depend on gecko (the heart of Firefox and FirefoxOS). Getting your project hooked up to our current CI infrastructure usually requires a multi-team process that takes days or more. Historically, simply merging projects into gecko was easier than having external repositories that depend on gecko, which our current CI cannot easily support.

It is critical to be able to see in one place (TBPL) that all the projects that depend on gecko are working. Today this process is tightly coupled to our buildbot infrastructure (TBPL and buildbot together make up our current CI). If you really care about your project not breaking when a change lands in gecko, you really only have one option: hosting your testing infrastructure under buildbot (which feeds TBPL).

Where Taskcluster comes in

Treeherder resolves the tight coupling problem by separating the reporting from the test running process. This enables us to re-imagine our workflow and how it's optimized. We can run tests anywhere using any kind of utility/library assuming it gives us the proper hooks (really just logs and some revision information) to plug results into our development workflow.

A high level workflow with taskcluster looks like this:

You submit some code (this can be a patch or a pull request, etc...) to a "scheduler" (I have started on one for gaia) which submits a set of tasks. Each task is run inside a docker container, and the container's image is specified as part of your task. This means anything you can imagine running on linux you can directly specify in your container (no more waiting for vm reimaging, etc...). It also means we directly control the resources that container uses (less variance in tests), AND if something goes wrong you can download the entire environment that test ran on to debug it locally.
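
For example, because the image is named right in the task definition, reproducing a failing task's environment locally can be as simple as the following (the image tag here is only an example):

# Pull the exact image the task declared and open a shell in it to debug.
docker pull quay.io/mozilla/tester:0.0.14
docker run -it quay.io/mozilla/tester:0.0.14 /bin/bash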

As tasks are completed, the taskcluster queue emits events over AMQP (think pulse), so anyone interested in the status of tests, etc. can hook directly into this... This enables us to post results as they happen directly to treeherder.

The initial taskcluster setup provisions AWS spot nodes on demand (we have it capped to a fixed number right now), so during peaks we can burst to an almost unlimited number of nodes. During idle times workers shut themselves down to reduce costs. We have additional plans for different clouds (and physical hardware on OpenStack).

Each component can be easily replaced, and multiple types of workers and provisioners can be added on demand. Jonas Finnemann Jensen has done an awesome job documenting how taskcluster works in the docs at the API level.

What the future looks like

My initial plan is to hook everything up for gaia, the FirefoxOS frontend. This will replace our current travis CI setup.

As pull requests come in we will run tests on taskcluster and report status to both treeherder and github (the beloved github status api). The ability to hook up new types of tests from the tree itself (and to try out new test types from the tree itself) will continue on in the form of a task template (another blog post coming). Developers can see the status of their tests from treeherder.

Code landing in master follows the same practice and results will report into a gaia specific treeherder view.

Most importantly, immediately after treeherder is launched we can run all gaia testing on the same exact infrastructure for both gaia and gecko commits. Jonas Sicking (b2g overload) has some great ideas about locking gecko <-> gaia versions to reduce another kind of failure which occurs when developing against the ever changing landscape of gecko / gaia commits.

When is the future? We have implemented the "core" of taskcluster already and have the ability to run tests. By the end of the month (March) we will have the capability to replace the entire gaia workflow with taskcluster.

Why not X CI solution

Building a brand new CI solution is non-trivial, so why are we doing this?

  • To leverage LXC containers (docker): One of the big problems we hit when trying to debug test failures is the variance between testing locally and remotely. With LXC containers you can download the entire container (the entire environment in which your test runs) and run it with the same cpu/memory/swap/filesystem as it would have remotely.

  • On demand scaling. We have (somewhat predictable) bursts throughout the day and the ability to spin up (and down) on demand is required to keep up with our changing needs throughout the day.

  • Make in tree configuration easy. Pull requests + in tree configuration enable developers to quickly iterate on tests and testing infrastructure

  • Modular extensible components with public facing APIs. Want to run tasks that do things other than test/build, or report to something other than treeherder? We have, or will build, an API for that.

    Hackability is important... The parts you don't want to solve (running aws nodes, keeping them up, pricing them, etc...) are solved for you so you can focus on building the next great mozilla related thing (better bisection tools, etc...).

  • More flexibility to test/deploy optimizations... We have something like a compute year of tests, and 10-30+ minute chunks of testing are normal. We need to iterate on our test infrastructure quickly to try to reduce this where possible with CI changes.

Here are a few potential alternatives below... I list out the pros & cons of each from my perspective (and a short description of each).

Travis [hosted]

TravisCI is an awesome [free] open source testing service that we use for many of our smaller projects.

Travis works really well for the 90% webdev usecase. Gaia does not fit well into that use case and gecko does so even less.

Pros:

  • Dead simple setup.
  • Iterate on test frameworks, etc... on every pull request without any issue.
  • Nice simple UI which reports live logging.
  • Adding tests and configuring tests is trivial.

Cons:

  • Difficult to debug failures locally.
  • No public facing API for creating jobs.
  • No build artifacts on pull requests.
  • Cannot store arbitrarily long logs (this is only an issue for open source IIRC).
  • No on demand scaling.

Buildbot [build on top of it]

We currently use buildbot at scale (thousands of machines) for all gecko testing on multiple platforms. If you are using firefox, it was built by our buildbot setup.

(NOTE: This is a critique of how we currently use buildbot, not of the entire project.) If I am missing something, or you think another CI solution could fit the bill, contact me!

Pros:

  • We have it working at a large scale already.

Cons:

  • Adding tests and configuring tests is fairly difficult and involves long lead times.
  • Difficult to debug failures locally.
  • Configuration files live outside of the tree.
  • Persistent connection master/slave model.
  • It's one monolithic project, which makes it difficult to replace individual components.
  • Slow rollout of new machine requirements & configurations.

Jenkins

We are using Jenkins for our on device testing.

Pros:

  • Easy to configure jobs from the UI (decent ability to do configuration yourself).
  • Configuration (by default) does not live in the tree.
  • Tons of plugins (with varying quality).

Cons:

  • By default difficult to debug failures locally.
  • Persistent connection master/slave model.
  • Configuration files live outside of the tree.

Drone.io [hosted/not hosted]

Drone.io recently open sourced... It's docker based and shows promise. Out of all the options above it looks the closest to what we want for linux testing.

I am going to omit the Pros/Cons here; the basics look good for drone, but it requires some more investigation. Some of the missing things are:

  • A long term plan for supporting multiple operating systems.
  • A public api for scheduling tasks/jobs.
  • On demand scaling.

March 04, 2014 12:00 AM

January 31, 2014

James Lal

Using docker volumes for rapid development of containers

It's fairly obvious how to use docker for shipping an immutable image that is great for deployment. It was less obvious (to me) how to use docker to iterate on the image, run tests in it, etc...

Let's say you have a node project and you're writing some web service thing:

// server.js
var http = require('http');
...
// server_test.js
suite('my tests', function() {
});
# Dockerfile
FROM lightsofapollo/node:0.10.24
ADD . /service
WORKDIR /service
CMD node server.js

Before Volumes

Without using volumes your workflow is like this:

docker build -t image_name .
docker run image_name ./node_modules/.bin/mocha server_test.js
# .. make some changes and repeat...

While this is certainly not awful, it's a lot of extra steps you probably don't want to do...

After Volumes

While iterating, ideally we could just "shell in" to the container and make changes on the fly, then run some tests (like, let's say, vagrant).

You can do this with volumes:

# It's important that you only use the -v flag during development: it
# will override the contents of whatever path you specify, and you should
# also keep in mind you want to run the final tests on the image without
# this volume at the end of your development, to make sure you didn't
# forget to build or something.

# Mount the current directory in your service folder (overriding the ADD
# from the Dockerfile above), then open an interactive shell
docker run -v $PWD:/service -i -t image_name /bin/bash

From here you can hack like normal making changes and running tests on the fly like you would with vagrant or on your host.
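
For example, once inside that shell the edit/test loop (assuming the mocha setup from the project above) is just:

# Inside the container started above; /service is the mounted host directory.
cd /service
./node_modules/.bin/mocha server_test.js
# Edit files on the host, re-run the tests, repeat.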

When you're done!

I usually have a makefile... I would set up the "make test" target something like this to ensure your tests run against the contents of your image rather than against the volume:

.PHONY: test
test:
  docker build -t my_image .
  docker run my_image npm test

.PHONY: push
push: test
  docker push my_image

January 31, 2014 12:00 AM