Planet Taskcluster

February 09, 2017

Dustin Mitchell

TaskCluster-Github Improvements

Repositories on Github can use TaskCluster to automate build, test, and release processes. The service that enables this is called, appropriately enough, taskcluster-github.

This week, Irene Storozhko, Brian Stack, and I gathered in Toronto to land some big improvements to this service.

First, the service now supports “release” events, which means it can trigger tasks when a new release is added to github, such as building and uploading binaries or making release announcements.

Second, we have re-deployed the service as an integration Irene has developed. This makes the set-up process much easier – just go to the integration page and click “Install”. No messing with web hooks, adding users to teams, etc.

The integration gives users a great deal more control over our access to repositories: it can be installed organization-wide, or only for specific repositories. The permissions required are much more restricted than the old arrangement, too. On the backend, the integration also gives us much better access to debugging information that was previously only available to organization administrators.

Finally, Irene has developed a quickstart page to guide new users through setting up a repository to use TaskCluster-Github. With this tool, we hope to see many more Mozilla projects building automation in TaskCluster, even if that’s as simple as running tests.

February 09, 2017 06:54 PM

November 30, 2016

Pete Moore

Task Execution on Windows™

Objectives

Windows 7

November 30, 2016 07:25 PM

August 03, 2016

Selena Deckelmann

TaskCluster 2016Q2 Retrospective

The TaskCluster Platform team worked very hard in Q2 to support the migration off Buildbot, bring new projects into our CI system and look forward with experiments that might enable fully-automated VM deployment on hardware in the future.

We also brought on 5 interns. For a team of 8 engineers and one manager, this was a tremendous team accomplishment. We are also working closely with interns on the Engineering Productivity and Release Engineering teams, resulting in a much higher communication volume than in months past.

We continued our work with RelOps to land Windows builds, and those are available in pushes to Try. This means people can use “one click loaners” for Windows builds as well as Linux (through the Inspect Task link for jobs)! Work on Windows tests is proceeding.

We also created try pushes for Mac OS X tests, and integrated them with the Mac OS X cross-compiled builds. This also meant deep diving into the cross-compiled builds to green them up in Q3 after some compiler changes.

A big part of the work for our team and for RelEng was preparing to implement a new kind of signing process. Aki and Jonas spent a good deal of time on this, as did many other people across PlatformOps. What came out of that work was a detailed specification for TaskCluster changes and for a new service from RelEng. We expect to see prototypes of these ideas by the end of August, and the major blocking changes to the workers and provisioner to be complete then too.

This all leads to being able to ship Linux Nightlies directly from TaskCluster by the end of Q3. We’re optimistic that this is possible, with the knowledge that there are still a few unknowns and a lot has to come together at the right time.

Much of the work on TaskCluster is like building a 747 in-flight. The microservices architecture enables us to ship small changes quickly and without much pre-arranged coordination. As time has gone on, we have consolidated some services (the scheduler is deprecated in favor of the “big graph” scheduling done directly in the queue), separated others (we’ve moved Treeherder-specific services into their own component, and are working to deprecate mozilla-taskcluster in favor of a taskcluster-hg component), and refactored key parts of our systems (intree scheduling last quarter was an important change for usability going forward). This kind of change is starting to slow down as the software and the team adapt and mature.

I can’t wait to see what this team accomplishes in Q3!

Below is the team’s partial list of accomplishments and changes. Please drop by #taskcluster or send an email to our tools-taskcluster@lists.mozilla.org mailing list with questions or comments!

Things we did this quarter:

  • initial investigation and timing data around using sccache for linux builds
  • released update for sccache to allow working in a more modern python environment
  • created taskcluster managed s3 buckets with appropriate policies
  • tested linux builds with patched version of sccache
  • tested docker-worker on packet.net for on hardware testing
  • worked with jmaher on talos testing with docker-worker on releng hardware
  • created livelog plugin for taskcluster-worker (just requires tests now)
  • added reclaim logic to taskcluster-worker
  • converted gecko and gaia in-tree tasks to use new v2 treeherder routes
  • Updated gaia-taskcluster to allow github repos to use new taskcluster-treeherder reporting
  • move docs, schemas, references to https
  • refactor documentation site into tutorial / manual / reference
  • add READMEs to reference docs
  • switch from a * certificate to a SAN certificate for taskcluster.net
  • increase accessibility of AWS provisioner by separating bar-graph stuff from workerType configuration
  • use roles for workerTypes in the AWS provisioner, instead of directly specifying scopes
  • allow non-employees to login with Okta, improve authentication experience
  • named temporary credentials
  • use npm shrinkwrap everywhere
  • enable coalescing
  • reduce the artifact retention time for try jobs (to reduce S3 usage)
  • support retriggering via the treeherder API
  • document azure-entities
  • start using queue dependencies (big-graph-scheduler)
  • worked with NSS team to have tasks scheduled and displayed within treeherder
  • Improve information within docker-worker live logs to include environment information (ip address, instance type, etc)
  • added hg fingerprint verification to decision task
  • Responded and deployed patches to security incidents discovered in q2
  • taskcluster-stats-collector running with signalfx
  • most major services using signalfx and sentry via new monitoring library taskcluster-lib-monitor
  • Experimented with QEMU/KVM and libvirt for powering a taskcluster-worker engine
  • QEMU/KVM engine for taskcluster-worker
  • Implemented Task Group Inspector
  • Organized efforts around front-end tooling
  • Re-wrote and generalized the build process for taskcluster-tools and future front-end sites
  • Created the Migration Dashboard
  • Organized efforts with contractors to redesign and improve the UX of the taskcluster-tools site
  • First Windows tasks in production – NSS builds running on Windows 2012 R2
  • Windows Firefox desktop builds running in production (currently shown on staging treeherder)
  • new features in generic worker (worker type metadata, retaining task users/directories, managing secrets in secrets store, custom drive for user directories, installing as a startup item rather than service, improved syscall integration for logins and executing processes as different users)
  • many firefox desktop build fixes including fixes to python build scripts, mozconfigs, mozharness scripts and configs
  • CI cleanup https://travis-ci.org/taskcluster
  • support for relative definitions in jsonschema2go
  • schema/references cleanup

Paying down technical debt

  • Fixed numerous issues/requests within mozilla-taskcluster
  • properly schedule and retrigger tasks using new task dependency system
  • add more supported repositories
  • Align job state between treeherder and taskcluster better (i.e cancels)
  • Add support for additional platform collection labels (pgo/asan/etc)
  • fixed retriggering of github tasks in treeherder
  • Reduced space usage on workers using docker-worker by removing temporary images
  • fixed issues with gaia decision task that prevented it from running since March 30th.
  • Improved robustness of image creation image
  • Fixed all linter issues for taskcluster-queue
  • finished rolling out shrinkwrap to all of our services
  • began trial of having travis publish our libraries (rolled out to 2 libraries now. talking to npm to fix a bug for a 3rd)
  • turned on greenkeeper everywhere then turned it off again for the most part (it doesn’t work with shrinkwrap, etc)
  • “modernized” (newer node, lib-loader, newest config, directory structure, etc) most of our major services
  • fix a lot of subtle background bugs in tc-gh and improve logging
  • shared eslint and babel configs created and used in most services/libraries
  • instrumented taskcluster-queue with statistics and error reporting
  • fixed issue where task dependency resolver would hang
  • Improved error message rendering on taskcluster-tools
  • Web notifications for one-click-loaner UI on taskcluster-tools
  • Migrated stateless-dns server from tutum.co to docker cloud
  • Moved provisioner off azure storage development account
  • Moved our npm package to a single npm organization

August 03, 2016 07:56 PM

June 30, 2016

Ben Hearsum

Building and Pushing Docker Images with Taskcluster-Github

Earlier this year I spent some time modernizing and improving Balrog's toolchain. One of my goals in doing so was to switch from Travis CI to Taskcluster, both to give us more flexibility in our CI configuration and to help dogfood Taskcluster-Github. One of the most challenging aspects of this was how to build and push our Docker image, and I'm hoping this post will make it easier for other folks who want to do the same in the future.

The Task Definition

Let's start by breaking down the Task definition from Balrog's .taskcluster.yml. Like other Taskcluster-Github jobs, we use the standard taskcluster.docker provisioner and worker.

  - provisionerId: "{{ taskcluster.docker.provisionerId }}"
    workerType: "{{ taskcluster.docker.workerType }}"

Next, we have something a little different. This section grants the Task access to a secret (managed by the Secrets Service). More on this later.

    scopes:
      - secrets:get:repo:github.com/mozilla/balrog:dockerhub

The payload has a few things of note. Because we're going to be building Docker images it makes sense to use Taskcluster's image_builder Docker image as well as enabling the docker-in-docker feature. The taskclusterProxy feature is needed to access the Secrets Service.

    payload:
      maxRunTime: 3600
      image: "taskcluster/image_builder:0.1.3"
      features:
        dind: true
        taskclusterProxy: true
      command:
        - "/bin/bash"
        - "-c"
        - "git clone $GITHUB_HEAD_REPO_URL && cd balrog && git checkout $GITHUB_HEAD_BRANCH && scripts/push-dockerimage.sh"

The extra section has some metadata for Taskcluster-Github. Unlike CI tasks, we limit this to only running on pushes (not pull requests) to the master branch of the repository. Because only a few people can push to this branch, only they can trigger Docker builds.

    extra:
      github:
        env: true
        events:
          - push
        branches:
          - master

Finally, we have the metadata, which is just standard Taskcluster stuff.

    metadata:
      name: Balrog Docker Image Creation
      description: Balrog Docker Image Creation
      owner: "{{ event.head.user.email }}"
      source: "{{ event.head.repo.url }}"

Secrets

I mentioned the "Secrets Service" earlier, and it's the key piece that enables us to securely push Docker images. Putting our Dockerhub password in it means access is limited to those who have the right scopes. We store it in a secret with the key "repo:github.com/mozilla/balrog:dockerhub", which means that anything with the "secrets:get:repo:github.com/mozilla/balrog:dockerhub" scope is granted access to it. My own personal Taskcluster account has it, which lets me set or change the password:

We also have a Role called "repo:github.com/mozilla/balrog:branch:master" which has that scope:

You can see from its name that this Role is associated with the Balrog repository's master branch. Because of this, any Task created as a result of a push to that branch of that repository may be assigned the scopes that Role has - like we did above in the "scopes" section of the Task.

Building and Pushing

The last piece of the puzzle here is the actual script that does the building and pushing. Let's look at a few specific parts of it.

To start with, we deal with retrieving the Dockerhub password from the Secrets Service. Because we enabled the taskclusterProxy earlier, "taskcluster" resolves to the hosted Taskcluster services. Had we forgotten to grant the Task the necessary scope, this would return a 403 error.

password_url="taskcluster/secrets/v1/secret/repo:github.com/mozilla/balrog:dockerhub"
dockerhub_password=$(curl ${password_url} | python -c 'import json, sys; a = json.load(sys.stdin); print a["secret"]["dockerhub_password"]')

We build, tag, and push the image, which is very similar to building it locally. If we'd forgotten to enable the dind feature, this would throw errors about not being able to run Docker.

docker build -t mozilla/balrog:${branch_tag} .
docker tag mozilla/balrog:${branch_tag} "mozilla/balrog:${date_tag}"
docker login -e $dockerhub_email -u $dockerhub_username -p $dockerhub_password
docker push mozilla/balrog:${branch_tag}
docker push mozilla/balrog:${date_tag}

Finally, we attach an artifact to our Task containing the sha256 of the Docker images. This allows consumers of the Docker image to verify that they're getting exactly what we built, and not something that may have been tampered with on Dockerhub or in transit.

sha256=$(docker images --no-trunc mozilla/balrog | grep "${date_tag}" | awk '/^mozilla/ {print $3}')
put_url=$(curl --retry 5 --retry-delay 5 --data "{\"storageType\": \"s3\", \"contentType\": \"text/plain\", \"expires\": \"${artifact_expiry}\"}" ${artifact_url} | python -c 'import json; import sys; print json.load(sys.stdin)["putUrl"]')
curl --retry 5 --retry-delay 5 -X PUT -H "Content-Type: text/plain" --data "${sha256}" "${put_url}"

The Result

Now that you've seen how it's put together, let's have a look at the end result. This is the most recent Balrog Docker build Task. You can see the sha256 artifact created on it:

And of course, the newly built image has shown up on the Balrog Dockerhub repo:

June 30, 2016 07:29 PM

June 27, 2016

Wander Lairson Costa

The taskcluster-worker Mac OSX engine

In this quarter, I worked on implementing the taskcluster-worker Mac OSX engine. Before talking about this specific implementation, let me explain what a worker is and how taskcluster-worker differs from docker-worker, currently the main worker in Taskcluster.

The role of a Taskcluster worker

When a user submits a task graph to Taskcluster, perhaps contrary to expectations (at least if you are used to how OS schedulers usually work), these tasks are submitted to the scheduler first, which is responsible for processing dependencies and enqueuing them. The Taskcluster manual page has a clear picture illustrating this concept.

The provisioner is responsible for looking at the queue to determine how many pending tasks exist and, based on that, it launches worker instances to run these tasks.

Then comes the worker. The worker is responsible for actually executing the task. It claims a task from the queue, runs it, uploads the generated artifacts and submits the status of the finished task, using the Taskcluster APIs.

docker-worker is a worker that runs task commands inside a docker container. The task payload specifies a docker image as well as a command line to run, among other environment parameters. docker-worker pulls the specified docker image and runs the task commands inside it.

taskcluster-worker and the OSX engine

taskcluster-worker is a generic and modularized worker under active development by the Taskcluster team. The worker delegates the task execution to one of the available engines. An engine is a component of taskcluster-worker responsible for running a task under a specific system environment. Other features, like environment variable setting, live logging, artifact uploading, etc., are handled by worker plugins.

I am implementing the Mac OSX engine, which will mainly be used to run Firefox automated tests in the Mac OSX environment. There is a macosx branch in my personal Github taskcluster-worker fork to which I push my commits.

One specific aspect of the engine implementation is the ability to run more than one task at the same time. For this, we need to implement some kind of task isolation. In docker-worker, each task runs in its own docker container, so tasks are isolated by definition. But there is no such thing as a container in the OSX engine. Our earlier tries with chroot failed miserably, due to incompatibilities with the OSX graphics system. Our final solution was to create a new user on the fly and run the task with that user's credentials. This not only provides some task isolation, but also prevents privilege escalation attacks, because tasks run as a different user than the worker.

Instead of dealing with the poorly documented Open Directory Framework, we chose to spawn the dscl command to create and configure users. Tasks usually take a long time to execute and spawn loads of subprocesses, so a few extra spawns of the dscl command won't have any practical performance impact.
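
As a rough illustration of this approach (not the engine's actual code; the record path, UID and error handling below are made up for the example), creating a throwaway task user by shelling out to dscl might look something like this in Go:

// Hypothetical sketch: create a throwaway user with dscl so the task can
// run under that user's credentials. Requires root; a real implementation
// would also create the home directory, pick a free UniqueID, and delete
// the user again once the task finishes.
package osxengine

import (
    "fmt"
    "os/exec"
)

func createTaskUser(name string, uid int) error {
    record := "/Users/" + name // dscl record path, also used as the home directory
    steps := [][]string{
        {"dscl", ".", "-create", record},
        {"dscl", ".", "-create", record, "UserShell", "/bin/bash"},
        {"dscl", ".", "-create", record, "UniqueID", fmt.Sprintf("%d", uid)},
        {"dscl", ".", "-create", record, "PrimaryGroupID", "20"}, // staff group
        {"dscl", ".", "-create", record, "NFSHomeDirectory", record},
    }
    for _, args := range steps {
        if out, err := exec.Command(args[0], args[1:]...).CombinedOutput(); err != nil {
            return fmt.Errorf("%v failed: %v: %s", args, err, out)
        }
    }
    return nil
}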

One final aspect is how we bootstrap task execution. A task boils down to a script that carries out the task's duties. But where does this script come from? It doesn't live on the machine that runs the worker. The OSX engine provides a link field in the task payload through which a task can specify an executable to download and execute.

Running the worker

The OSX engine will primarily be used to execute Firefox tests on Mac OSX, and the environment is expected to have a very specific set of tools and configurations. Because of that, I am testing the code on a loaner machine. To start the worker, it is just a matter of opening a terminal and typing:

$ ./taskcluster-worker work macosx --logging-level debug

The worker connects to the Taskcluster queue, then claims and executes the available tasks. At the time of writing, all tests but the Firefox UI functional tests were green, running on optimized Firefox OSX builds. We intend to land Firefox tests in taskcluster-worker as Tier-2 next quarter, running them in parallel with Buildbot.

June 27, 2016 12:00 AM

May 02, 2016

Maja Frydrychowicz

Not Testing a Firefox Build (Generic Tasks in TaskCluster)

A few months ago I wrote about my tentative setup of a TaskCluster task that was neither a build nor a test. Since then, gps has implemented “generic” in-tree tasks so I adapted my initial work to take advantage of that.

Triggered by file changes

All along I wanted to run some in-tree tests without having them wait around for a Firefox build or any other dependencies they don’t need. So I originally implemented this task as a “build” so that it would get scheduled for every incoming changeset in Mozilla’s repositories.

But forget “builds”, forget “tests” — now there’s a third category of tasks that we’ll call “generic” and it’s exactly what I need.

In base_jobs.yml I say, “hey, here’s a new task called marionette-harness — run it whenever there’s a change under (branch)/testing/marionette/harness”. Of course, I can also just trigger the task with try syntax like try: -p linux64_tc -j marionette-harness -u none -t none.

When the task is triggered, a chain of events follows:

For Tasks that Make Sense in a gecko Source Checkout

As you can see, I made the build.sh script in the desktop-build docker image execute an arbitrary in-tree JOB_SCRIPT, and I created harness-test-linux.sh to run mozharness within a gecko source checkout.

Why not the desktop-test image?

But we can also run arbitrary mozharness scripts thanks to the configuration in the desktop-test docker image! Yes, and all of that configuration is geared toward testing a Firefox binary, which implies downloading tools that my task either doesn’t need or already has access to in the source tree. Now we have a lighter-weight option for executing tests that don’t exercise Firefox.

Why not mach?

In my lazy work-in-progress, I had originally executed the Marionette harness tests via a simple call to mach, yet now I have this crazy chain of shell scripts that leads all the way to mozharness. The mach command didn’t disappear — you can run Marionette harness tests with ./mach python-test .... However, mozharness provides clearer control of Python dependencies, appropriate handling of return codes to report test results to Treeherder, and I can write a job-specific script and configuration.

May 02, 2016 04:00 AM

April 29, 2016

Dustin Mitchell

Loading TaskCluster Docker Images

When TaskCluster builds a push to a Gecko repository, it does so in a docker image defined in that very push. This is pretty cool for developers concerned with the build or test environment: instead of working with releng to deploy a change, now you can experiment with that change in try, get review, and land it like any other change. However, if you want to actually download that docker image, docker pull doesn’t work anymore.

The image reference in the task description looks like this now:

"image": {
    "path": "public/image.tar",
    "taskId": "UDZUwkJWQZidyoEgVfFUKQ",
    "type": "task-image"
},

This is referring to an artifact of the task that built the docker image. If you want to pull that exact image, there’s now an easier way:

./mach taskcluster-load-image --task-id UDZUwkJWQZidyoEgVfFUKQ

will download that docker image:

dustin@dustin-moz-devel ~/p/m-c (central) $ ./mach taskcluster-load-image --task-id UDZUwkJWQZidyoEgVfFUKQ
Task ID: UDZUwkJWQZidyoEgVfFUKQ
Downloading https://queue.taskcluster.net/v1/task/UDZUwkJWQZidyoEgVfFUKQ/artifacts/public/image.tar
######################################################################## 100.0%
Determining image name
Image name: mozilla-central:f7b4831774960411275275ebc0d0e598e566e23dfb325e5c35bf3f358e303ac3
Loading image into docker
Deleting temporary file
Loaded image is named mozilla-central:f7b4831774960411275275ebc0d0e598e566e23dfb325e5c35bf3f358e303ac3
dustin@dustin-moz-devel ~/p/m-c (central) $ docker images
REPOSITORY          TAG                                                                IMAGE ID            CREATED             VIRTUAL SIZE
mozilla-central     f7b4831774960411275275ebc0d0e598e566e23dfb325e5c35bf3f358e303ac3   51e524398d5c        4 weeks ago         1.617 GB

But if you just want to pull the image corresponding to the codebase you have checked out, things are even easier: give the image name (the directory under testing/docker), and the tool will look up the latest build of that image in the TaskCluster index:

dustin@dustin-moz-devel ~/p/m-c (central) $ ./mach taskcluster-load-image desktop-build
Task ID: TjWNTysHRCSfluQjhp2g9Q
Downloading https://queue.taskcluster.net/v1/task/TjWNTysHRCSfluQjhp2g9Q/artifacts/public/image.tar
######################################################################## 100.0%
Determining image name
Image name: mozilla-central:f5e1b476d6a861e35fa6a1536dde2a64daa2cc77a4b71ad685a92096a406b073
Loading image into docker
Deleting temporary file
Loaded image is named mozilla-central:f5e1b476d6a861e35fa6a1536dde2a64daa2cc77a4b71ad685a92096a406b073

April 29, 2016 05:23 PM

April 01, 2016

Wander Lairson Costa

Overcoming browser same origin policy

One of my goals for 2016 Q1 was to write a monitoring dashboard for Taskcluster. It basically pings Taskcluster services to check if they are alive and also acts as a feed aggregator for services Taskcluster depends on. One problem with this approach is the same origin policy, under which web pages are only allowed to make requests to their own domain. Web servers for which cross-domain requests are safe can implement either JSONP or CORS. CORS is the preferred way, so we will focus on it in this post.

Cross-origin resource sharing

CORS is a mechanism that allows the web server to tell the browser that it is safe to carry out a cross-domain request. It consists of a set of HTTP headers detailing the conditions under which the request may be accomplished. The main response header is Access-Control-Allow-Origin, which contains either a list of allowed domains or a *, indicating any domain can make a cross-origin request to this server. In a CORS request, only a small set of headers is exposed to the response object. The server can tell the browser to expose additional headers through the Access-Control-Expose-Headers response header.

But what if the web server doesn't implement CORS? The only solution is to provide a proxy that will make the actual request and add the CORS headers.
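
To make that concrete, here is a minimal sketch of the idea, written in Go purely for illustration (the real cors-proxy described below is a Node service with a whitelist and more careful error handling): the proxy endpoint fetches the remote URL server-side and relays the response with the CORS headers added.

// Minimal sketch of a CORS-adding proxy, for illustration only.
package corsproxy

import (
    "encoding/json"
    "io"
    "net/http"
)

func handleRequest(w http.ResponseWriter, r *http.Request) {
    // The caller posts a JSON body describing the remote request,
    // e.g. {"url": "https://queue.taskcluster.net/v1/ping"}.
    var body struct {
        URL string `json:"url"`
    }
    if err := json.NewDecoder(r.Body).Decode(&body); err != nil {
        http.Error(w, "invalid request body", http.StatusBadRequest)
        return
    }

    // Perform the request server-side, where the same origin policy does not apply.
    resp, err := http.Get(body.URL)
    if err != nil {
        http.Error(w, err.Error(), http.StatusBadGateway)
        return
    }
    defer resp.Body.Close()

    // Relay the response with CORS headers so the browser accepts it.
    w.Header().Set("Access-Control-Allow-Origin", "*")
    w.Header().Set("Access-Control-Expose-Headers", "Content-Type, Content-Length")
    w.WriteHeader(resp.StatusCode)
    io.Copy(w, resp.Body)
}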

cors-proxy

To allow the monitoring dashboard to make requests for status information from remote services that do not implement CORS, we created the cors-proxy. It exports a /request endpoint that allows you to make requests to any remote host. cors-proxy forwards the request to the remote URL and sends the response back with the appropriate CORS headers set.

Let's see an example:

$.ajax({
  url: 'http://cors-proxy.taskcluster.net/request',
  method: 'POST',
  contentType: 'application/json',
  data: {
    url: 'https://queue.taskcluster.net/v1/ping',
  }
}).done(function(res) {
  console.log(res);
});

The information about the remote request is sent in the proxy request body. All parameter fields are shown on the project page.

Before you think about using the hosted server to proxy your own requests: cors-proxy only honors requests from a whitelist, so only certain subdomains under the Taskcluster domain can use it.

April 01, 2016 12:00 AM

March 30, 2016

Pete Moore

Walkthrough installing Cygwin SSH Daemon on AWS EC2 instances

One of the challenges we face at Mozilla is supporting Windows in an organisational environment which is predominantly *nix oriented. Furthermore, historically our build and test infrastructure has only provided a very limited ssh daemon, with an antiquated shell, and outdated unix tools.

With the move to hosting Windows environments in AWS EC2, the opportunity arose to review our current SSH daemon, and see if we couldn’t do something a little bit better.

When creating Windows environments in EC2, it is possible to launch a “vanilla” Windows instance, from an AMI created by Amazon. This instance is based on a standard installation of a given version of Windows, with a couple of AWS EC2 tools preinstalled.

One of the features of the preinstalled tools is that they allow you to specify powershell and/or batch script snippets inside the instance User Data, which will be executed upon launch.

This makes it quite trivial to customise a Windows environment, by providing all of the customisation steps as a PowerShell snippet in the instance User Data.

In this Walkthrough, we will set up a Windows 2012 R2 Windows machine, with the cygwin ssh daemon preinstalled. In order to follow this walkthrough, you will need an AWS account, and the ability to spawn an instance.

Install AWS CLI

Although all of these steps can be performed via the web console, typically we would want to automate them. Therefore in this walkthrough, I’m using the AWS CLI to perform all of the actions, to make it easier should you want to script any of the setup.

Windows installation

Download and run the 64-bit or 32-bit Windows installer.

Mac and Linux installation

Requires Python 2.6.5 or higher.

Install using pip.

$ pip install awscli

Further help

See the AWS CLI guide if you get stuck.

Configuring AWS credentials

If this is your first time running the AWS CLI tool, configure your credentials with:

$ aws configure

See the AWS credentials configuration guide if you need more help.

Locate latest Windows Server 2012 R2 AMI (64bit)

The following command line will find you the latest Windows 2012 R2 stock image, provided by AWS, in your default region.

$ AMI="$(aws ec2 describe-images --owners self amazon --filters \
"Name=platform,Values=windows" \
"Name=name,Values=Windows_Server-2012-R2_RTM-English-64Bit-Base*" \
--query 'Images[*].{A:CreationDate,B:ImageId}' --output text \
| sort -u | tail -1 | cut -f2)"

Now we can see what the current AMI is, in our default region, with:

$ echo "Windows AMI: ${AMI}"
Windows AMI: ami-1719f677

Note, the actual AMI provided by AWS changes from week to week, and from region to region, so don’t be surprised if you get a different result to the one above.

Create a Security Group

We need our instance to be in a security group that allows us to SSH onto it.

First create a security group:

$ SECURITY_GROUP="$(aws ec2 create-security-group --group-name ssh-only \
--description "SSH only" --output text)"

And then update it to only allow inbound SSH traffic:

$ [ -n "${SECURITY_GROUP}" ] && aws ec2 authorize-security-group-ingress \
--group-id "${SECURITY_GROUP}" \
--ip-permissions '[{"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
"IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]'

Create a unique Client Token

We should create a unique client token that will allow us to make idempotent requests, should there be any failures. We will also use this as our “name” for the instance until we get the real instance name back.

$ TOKEN="$(date +%s)"

Create a dedicated Key Pair

We’ll need to specify a key pair in order to retrieve the Windows Password. Let’s create a dedicated one just for this instance.

$ aws ec2 create-key-pair --key-name "${TOKEN}" --query 'KeyMaterial' \
--output text > "${TOKEN}.pem" && chmod 400 "${TOKEN}.pem"

Create custom post-installation script

Typically, you’ll want to customise the cygwin environment, for example:

  • Changing the bash prompt
  • Setting vim options
  • Adding ssh authorized keys
  • ….

Let’s do this in a post installation bash script, which we can download as part of the installation.

In order to be able to authenticate with our new key, we’ll need to get the public part. Note, we could generate separate keys for ssh’ing to our machine, but we might as well reuse the key we just created.

$ PUB_KEY="$(ssh-keygen -y -f "${TOKEN}.pem")"

Create User Data

The AWS Windows Guide advises us that Windows PowerShell commands can be executed if supplied as part of the EC2 User Data. We’ll use this userdata to install cygwin and the ssh daemon from scratch.

Create a file userdata to store the User Data:

$ cat > userdata << 'EOF'
<powershell>

# needed for making http requests
$client = New-Object system.net.WebClient

# download cygwin
$client.DownloadFile("https://www.cygwin.com/setup-x86_64.exe", `
"C:\cygwin-setup-x86_64.exe")

# install cygwin
# complete package list: https://cygwin.com/packages/package_list.html
Start-Process "C:\cygwin-setup-x86_64.exe" -ArgumentList ("--quiet-mode " +
"--wait --root C:\cygwin --site http://cygwin.mirror.constant.com " +
"--packages openssh,vim,curl,tar,wget,zip,unzip,diffutils,bzr") -wait `
-NoNewWindow -PassThru -RedirectStandardOutput "C:\cygwin_install.log" `
-RedirectStandardError "C:\cygwin_install.err"

# open up firewall for ssh daemon
New-NetFirewallRule -DisplayName "Allow SSH inbound" -Direction Inbound `
-LocalPort 22 -Protocol TCP -Action Allow

# workaround for https://www.cygwin.com/ml/cygwin/2015-10/msg00036.html
# see:
#   1) https://www.cygwin.com/ml/cygwin/2015-10/msg00038.html
#   2) https://goo.gl/EWzeVV
$env:LOGONSERVER = "\\" + $env:COMPUTERNAME

# configure sshd
Start-Process "C:\cygwin\bin\bash.exe" -ArgumentList "--login
-c `"ssh-host-config -y -c 'ntsec mintty' -u 'cygwinsshd' \
-w 'qwe123QWE!@#'`"" -wait -NoNewWindow -PassThru -RedirectStandardOutput `
"C:\cygrunsrv.log" -RedirectStandardError "C:\cygrunsrv.err"

# start sshd
Start-Process "net" -ArgumentList "start sshd" -wait -NoNewWindow -PassThru `
-RedirectStandardOutput "C:\net_start_sshd.log" `
-RedirectStandardError "C:\net_start_sshd.err"

# download bash setup script
$client.DownloadFile(
"https://raw.githubusercontent.com/petemoore/myscrapbook/master/setup.sh",
"C:\cygwin\home\Administrator\setup.sh")

# run bash setup script
Start-Process "C:\cygwin\bin\bash.exe" -ArgumentList `
"--login -c 'chmod a+x setup.sh; ./setup.sh'" -wait -NoNewWindow -PassThru `
-RedirectStandardOutput "C:\Administrator_cygwin_setup.log" `
-RedirectStandardError "C:\Administrator_cygwin_setup.err"

# add SSH key
Add-Content "C:\cygwin\home\Administrator\.ssh\authorized_keys" "%{SSH-PUB-KEY}%"
</powershell>
EOF

Fix SSH key

We need to replace the SSH public key placeholder we just referenced in userdata with the actual public key:

$ USERDATA="$(cat userdata | sed "s_%{SSH-PUB-KEY}%_${PUB_KEY}_g")"

Launch new instance

We’re now finally ready to launch the instance. We can do this with the following commands:

$ {
echo "Please be patient, this can take a long time."
INSTANCE_ID="$(aws ec2 run-instances --image-id "${AMI}" --key-name "${TOKEN}" \
--security-groups 'ssh-only' --user-data "${USERDATA}" \
--instance-type c4.2xlarge --block-device-mappings \
DeviceName=/dev/sda1,Ebs='{VolumeSize=75,DeleteOnTermination=true,VolumeType=gp2}' \
--instance-initiated-shutdown-behavior terminate --client-token "${TOKEN}" \
--output text --query 'Instances[*].InstanceId')"
PUBLIC_IP="$(aws ec2 describe-instances --instance-id "${INSTANCE_ID}" --query \
'Reservations[*].Instances[*].NetworkInterfaces[*].Association.PublicIp' \
--output text)"
unset PASSWORD
until [ -n "$PASSWORD" ]; do
    PASSWORD="$(aws ec2 get-password-data --instance-id "${INSTANCE_ID}" \
    --priv-launch-key "${TOKEN}.pem" --output text \
    --query PasswordData)"
    sleep 10
    echo -n "."
done
echo
echo "SSH onto your new instance (${INSTANCE_ID}) with:"
echo "    ssh -i '${TOKEN}.pem' Administrator@${PUBLIC_IP}"
echo
echo "Note, the Administrator password is \"${PASSWORD}\", but it"
echo "should not be needed when connecting with the ssh key."
echo
}

You should get some output similar to this:

Please be patient, this can take a long time.
................
SSH onto your new instance (i-0fe79e45ffb2c34db) with:
    ssh -i '1459795270.pem' Administrator@54.200.218.155

Note, the Administrator password is "PItDM)Ph*U", but it
should not be needed when connecting with the ssh key.

March 30, 2016 11:33 AM

March 24, 2016

Greg Arndt

Birth of a new worker

One of the first projects I worked on when I joined Mozilla was to learn Node.js and help with a crazy thing called docker-worker. This worker has become the de facto platform for Linux tasks for over a year now. It has come a long way since we first began working on it, but over time it has shown its age and as we bring on other workers into the TaskCluster ecosystem the time has come to reevaluate our direction for the workers.

In this past year docker-worker went into production and a Windows based worker was being developed. Hard work is happening from multiple teams to start converting our existing Buildbot related Windows jobs to TaskCluster using this Windows based worker.

While work continued on both the Linux docker and Windows based workers, it became clear that they started to fall out of parity with each other and made it confusing to those wanting to use either. Supporting two distinct workers was going to become a challenge as well.

Both of these workers will continue to run tasks for the foreseeable future but work has begun on a new worker, taskcluster-worker. This new worker will adhere to a set of goals based on the knowledge we now have of running production workers across platforms and involve the entire team throughout the process.

Cross-platform

taskcluster-worker must have the ability to be run on multiple operating systems and should use a language which makes this possible and easier. The team has already worked on other services within TaskCluster that use Go, including the generic Windows worker, so moving towards Go felt like a natural choice.

Shared Functionality Abstraction

One of the things that’s clear when writing workers is that there is a considerable amount of shared functionality (host management, task claim/resolution, configuration management) between all of them. This functionality will be incorporated into the worker in an engine/plugin agnostic way so that engine/plugin writers can focus on that particular piece rather than reimplementing the same functionality each time for a unique worker experience.

Multiple Engine Support

Engines should be swappable within the worker and loaded based on configuration settings. These engines will define the configuration and task payload data that is necessary to run a targeted task.

In future posts we’ll go into how to write an engine, but for now understand that engines are given a contract with the worker that they must adhere to. This interface is documented in the engines package of the worker and defines the methods that all engines must either implement or raise an error explaining that the feature is not supported.

These engines will provide methods that are used to build a sandboxed task execution environment (think docker container or isolated process), execute this sandbox, process results, and allow plugins to hook into this environment and manipulate it.

Engines could be as simple as taking a string in the payload and writing “Hello, {string}” to the task log, or spinning up a virtual machine environment to run the task in. It’s entirely up to the engine on what inputs it accepts and how a task is executed.
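
To give a flavour of what that contract looks like, here is a heavily simplified, hypothetical sketch; the real interface lives in the engines package of taskcluster-worker and differs in names and detail.

// Hypothetical, simplified sketch of an engine contract. The real
// interface in the taskcluster-worker engines package is richer
// (payload schemas, feature probing, per-engine options, etc.).
package engines

type Engine interface {
    // NewSandboxBuilder starts constructing an isolated execution
    // environment (a docker container, a dedicated user session, a VM)
    // from the engine-specific task payload.
    NewSandboxBuilder(payload map[string]interface{}) (SandboxBuilder, error)
}

type SandboxBuilder interface {
    // Hooks like this let plugins manipulate the environment before it
    // starts; engines that don't support a feature return an error.
    SetEnvironmentVariable(name, value string) error
    // StartSandbox launches the environment and begins running the task.
    StartSandbox() (Sandbox, error)
}

type Sandbox interface {
    // WaitForResult blocks until the task finishes.
    WaitForResult() (ResultSet, error)
    Abort() error
}

type ResultSet interface {
    // Success reports whether the task command exited successfully.
    Success() bool
}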

Plugin Architecture

While the worker will support only one running engine at a time, it will support 0 or more plugins per task execution. These plugins can be written independently of the worker and incorporated in at build time.

At each stage of the task life cycle, a method will be called on every loaded task plugin and allow the plugin to provide additional services and hook into the task execution environment.

Some examples of plugins would be a stats collector parsing task logs, an artifact handler that will parse the task payload and upload each specified artifact upon task completion, or provide https access to the log of the running task (live logging).

These plugins can be written in a general way to work across engines allowing them to be reused for all engines without duplicating functionality and logic.
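
A hypothetical sketch of what such a life-cycle hook could look like (again, the real plugin interface differs in names and detail):

// Hypothetical sketch of a task plugin: the worker calls each loaded
// plugin at every stage of the task life cycle.
package plugins

type TaskPlugin interface {
    // Prepare runs before the sandbox starts, e.g. to inject environment
    // variables declared in the task payload.
    Prepare(sandboxBuilder interface{}) error
    // Started runs once the task is executing, e.g. to attach a live log.
    Started(sandbox interface{}) error
    // Finished runs after the task completes, e.g. to upload the
    // artifacts listed in the payload.
    Finished(success bool, resultSet interface{}) error
}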

What’s Next?

While it’s sad to see docker-worker deprecated in the future, it has served its purpose and I’m excited to see this new direction for taskcluster-worker. Active development will continue this year on Linux, Windows, and OS X so that we have multiple platform support while migrating tasks.

Take a look at the taskcluster-worker repo and the growing documentation on godoc.

Come by and say hi in #taskcluster on the Mozilla IRC network.

March 24, 2016 12:00 AM

March 11, 2016

Selena Deckelmann

[workweek] tc-worker workweek recap

Sprint recap

We spent this week sprinting on the tc-worker, engines and plugins. We merged 19 pull requests and had many productive discussions!

tc-worker core

We implemented the task loop! This basic loop should start when the worker is invoked. It spins up a task claimer and manager responsible for claiming tasks, up to its available capacity, and running them to completion. You can find details in this commit. We’re still working on some high level documentation.
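
As a rough sketch of the idea (not the actual implementation, which also handles reclaiming, cancellation and graceful shutdown; claimTask and runTask below are stand-ins):

// Simplified sketch of a capacity-limited task loop.
package worker

import "time"

func runTaskLoop(capacity int, claimTask func() (interface{}, bool), runTask func(interface{})) {
    slots := make(chan struct{}, capacity) // counting semaphore: one slot per running task
    for {
        slots <- struct{}{} // blocks while the worker is at capacity
        task, ok := claimTask()
        if !ok {
            <-slots
            time.Sleep(30 * time.Second) // nothing pending; poll again later
            continue
        }
        go func() {
            defer func() { <-slots }() // free the slot when the task completes
            runTask(task)
        }()
    }
}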

We did some cleanups to make it easier to download and get started with builds. We fixed up packages related to generating go types from json schemas, and the types now conform to the linting rules.

We also implemented the webhookserver. The package provides implementations of the WebHookServer interface which allows attachment and detachment of web-hooks to an internet exposed server. This will support both the livelog and interactive features. Work is detailed in PR 37.

engine: hello, world

Greg created a proof of concept and pushed a successful task to emit a hello, world artifact. Greg will be writing up something to describe this process next week.

plugin: environment variables

Wander landed this plugin this week to support environment variable setting. The work is described in PR 39.

plugin: artifact uploads

This plugin will support artifact uploads for all engines to S3 and is based on generic-worker code. This work is started in PR 55.

TaskCluster design principles

We discussed as a team the ideas behind the design of TaskCluster. The umbrella principle we try to stick to is: Getting Things Built. We felt it was important to say that first because it helps us remember that we’re here to provide features to users, not just design systems. The four key design principles were distilled to:

  • Self-service
  • Robustness
  • Enable rapid change
  • Community friendliness

One surprising connection (to me) we made was that our privacy and security features are driven by community friendliness.

We plan to add our ideas about this to a TaskCluster “about” page.

TaskCluster code review

We discussed our process for code review, and how we’d like to do them in the future. We covered issues around when to do architecture reviews and how to get “pre-reviews” for ideas done with colleagues who will be doing our reviews. We made an outline of ideas and will be giving them a permanent home on our docs site.

Q2 Planning

We made a first pass at our 2016q2 goals. The main theme is to add OS X engine support to taskcluster-worker, continue work on refactoring intree config and build out our monitoring system beyond InfluxDB. Further refinements to our plan will come in a couple weeks, as we close out Q1 and get a better understanding of work related to the Buildbot to TaskCluster migration.

March 11, 2016 11:48 PM

March 08, 2016

Selena Deckelmann

Tier-1 status for Linux 64 Debug build jobs on March 14, 2016

I sent this to dev-planning, dev-platform, sheriffs and tools-taskcluster today. I added a little more context for a non-Mozilla audience.

The time has come! We are planning to switch to Tier-1 on Treeherder for TaskCluster Linux 64 Debug build jobs on March 14. At the same time, we will hide the Buildbot build jobs, but continue running them. This means that these jobs will become what Sheriffs use to determine the health of patches and our trees.

On March 21, we plan to switch the Linux 64 Debug tests to Tier-1 and hide the related Buildbot test jobs.

After about 30 days, we plan to disable and remove all Buildbot jobs related to Linux Debug.

Background:

We’ve been running Linux 64 Debug builds and tests using TaskCluster side-by-side with Buildbot jobs since February 18th. Some of the project work that was done to green up the tests is documented here.

The new tests are running in Docker-ized environments, and the Docker images we use are defined in-tree and publicly accessible.

This work was the culmination of many months of effort, with Joel Maher, Dustin Mitchell and Armen Zambrano primarily focused on test migration this quarter. Thank you to everyone who responded to NEEDINFOs, emails and pings on IRC to help with untangling busted test runs.

On performance, we’re taking a 14% hit across all the new test jobs vs. the old jobs in Buildbot. We ran two large-scale tests to help determine where slowness might still be lurking, and were able to find and fix many issues. There are a handful of jobs remaining that seem significantly slower, while others are significantly faster. We decided that it was more important to deprecate the old jobs and start exclusively maintaining the new jobs now, rather than wait to resolve the remaining performance issues. Over time we hope to address issues with the owners of the affected test suites.

March 08, 2016 10:20 PM

March 07, 2016

Selena Deckelmann

[portland] taskcluster-worker Hello, World

The TaskCluster Platform team is in Portland this week, hacking on the taskcluster-worker.

Today, we all sync’d up on the current state of our worker, and what we’re going to hack on this week. We started with the current docs.

The reason why we’re investing so much time in the worker is two fold:

  • The worker code previously lived in two code bases – docker-worker and generic-worker. We need to unify these code bases so that multiple engineers can work on it, and to help us maintain feature parity.
  • We need to get a worker that supports Windows into production. For now, we’re using the generic-worker, but we’d like to switch over to taskcluster-worker in late Q2 or early Q3. This timeline lines up with when we expect the Windows migration from Buildbot to happen.

One of the things I asked this team to do was come up with some demos of the new worker. The first demo today, from Greg Arndt, was to simply output a log and upload it.

The rest of the team is getting their Go environments set up to run tests and get hacking on crucial plugins, like our environment variable handling and additional artifact uploading logic we need for our production workers.

We’re also taking the opportunity to sync up with our Windows environment guru. Our goal for Buildbot to TaskCluster migration this quarter is focused on Linux builds and tests. Next quarter, we’ll be finishing Linux and, I hope, landing Windows builds in TaskCluster. To do that, we have a lot of details to sort out with how we’ll build Windows AMIs and deploy them. It’s a very different model because we don’t have the same options with Docker as we have on Linux.

March 07, 2016 11:51 PM

Dustin Mitchell

TaskCluster Login: Credential Management

In my last post about TaskCluster Login, I described improvements to allow any Mozillian to sign in to TaskCluster with an appropriate access level.

The next step, now in place, is to allow everyone to manage their own credentials, and those of the projects they work on.

New Features

First, credentials now have names, which helps us humans to tell them apart. For example, my temporary credential is named mozilla-ldap/dmitchell@mozilla.com. When I sign in, the tools site helpfully shows the name of my credential in the upper-right corner.

Next, everyone can create clients, as long as they begin with your credential name. For example, Armen can create a client named mozilla-ldap/armenzg@mozilla.com/mozci-testing for testing MozCI. Before today, doing so required pinging someone in #taskcluster and asking nicely. These clients are automatically disabled when the owner’s privileges change (e.g., by leaving Mozilla or changing groups).

Finally, using some nice namespaces, individual teams can now manage everything related to their project. For example, a person in the releng LDAP group automatically has the scope project:releng:*, which governs Release Engineering tools such as Buildbot Bridge. She also controls clientIds beginning with permanent/releng/, which are credentials used by Release Engineering services. A number of other per-project namespaces are included, such as secrets, hooks, and index routes.

Questions and Future

There’s still work to do, as mentioned in the last post. For example, when credentials expire, the tools page doesn’t show any indication until you try to perform some operation and get an error. I would also like to add support for sharing TaskCluster credentials with other sites – for example, wouldn’t it be great if you logged into RelengAPI via TaskCluster?

As with any change, I’m sure there will be rough edges and issues I haven’t anticipated. Please file any bugs in the TaskCluster :: Login component, or ping me (dustin) in IRC.

Cleaning House

With this change, all clients should have nice long names, either associated with a person or with a team. However, we have a plethora of clients that do not fit this pattern. These fall into three categories:

Many of these have slugids for names – strings that are as ugly as the name suggests!

To clean all of this up, we will be scheduling the permacreds to expire on March 31 and contacting each owner to suggest simply signing in (using temporary credentials) or creating a properly-named client to replace the permacred. We will be replacing credentials in the last two categories with credentials named project/taskcluster/*.

March 07, 2016 10:29 AM

March 01, 2016

Jonas Finnemann Jensen

One-Click Loaners with TaskCluster

Last summer Edgar Chen (air.mozilla.org) built an interactive shell for TaskCluster Linux workers, so developers can get an SSH-like session into a task container from their browser. We’ve slowly been improving this, and prior to Mozlando I added support for opening a VNC-like session connecting to an X-session inside a task container. I’ll admit I was mostly motivated by the prospect of giving an impressive demo, and the implementation details are likely to change as we improve it further. Consequently, we haven’t got many guides on how to use these features in their current state.

However, with people asking for TaskCluster “loaners” on IRC, I figure now is a good time to explain how these interactive features can be used to provide a loaner-on-demand flow for TaskCluster workers. At least on Linux, but hopefully we can do a similar thing on other platforms too. Before we dive in, I want to note that all of our Linux tasks run under docker with one container per task. Hence, you can pull down the docker image and play with it locally; the process and caveats, such as setting up loopback video and audio devices, are beyond the scope of this post. But feel free to ask on IRC (#taskcluster); I’m sure Greg Arndt has all the details, and some of them are already present in the “Run Locally” script displayed in the task-inspector.

Quick Start

If you can’t wait to play, here are the bullet points:

  1. You’ll need commit level 1 access (and an LDAP login)
  2. Go to treeherder.mozilla.org and pick a task that runs on TaskCluster (I tried “[TC] Linux64 reftest-3”; build tasks don’t have X.org)
  3. Under “Job details” click the “Inspect Task” (this will open the task-inspector)
  4. In the top right corner in the task-inspector click “Login” (this opens login.taskcluster.net on a new tab)
    1. “Sign-in with LDAP” or  “Sign-in with Okta” (Okta only works for employees)
    2. Click the “Grant Access” button (to grant tools.taskcluster.net access)
  5. In the task-inspector under the “Task” tab, scroll down and click the “One-Click Loaner” button
  6. Click again to confirm and create a one-click loaner task (this takes you to a “Waiting for Loaner” page)
    1. Just wait… 30s to 5 min (you can open the task-inspector for your loaner task to see the live log, if you are impatient)
    2. Eventually you should see two big buttons to open an interactive shell or display
  7. You should now have an interactive terminal (and display) into a running task container.

Warning: These loaners run on EC2 spot nodes, so they may disappear at any time. Use them for quickly trying something, not for writing patches.

Given all these steps, in particular the “Click again” in step (6), I recognize that it might take more than one click to get a “One-Click Loaner”. But we are just getting started, and all of this should be considered a moving target. The instructions above can also be found on MDN, where we will try to keep them up to date.

Implementation Details

To support interactive shell sessions the worker has an end-point that accepts websocket connections. For each new websocket the worker spawns a sh or bash inside the task container and pipes stdin, stdout and stderr over the websocket. In the browser we then have the websocket reading from and writing to hterm (from the chromium project), giving us a nice terminal emulator in the browser. There are still a few issues with the TTY emulation in docker, but it works reasonably for small things.
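
A rough sketch of that mechanism, written in Go with the gorilla/websocket package purely for illustration (the production worker differs, and also handles stderr, TTY allocation, resizing and cleanup):

// Illustrative sketch: upgrade the HTTP request to a websocket, spawn a
// shell, and shovel bytes between the shell's stdio and the websocket.
package interactive

import (
    "net/http"
    "os/exec"

    "github.com/gorilla/websocket"
)

var upgrader = websocket.Upgrader{}

func shellHandler(w http.ResponseWriter, r *http.Request) {
    conn, err := upgrader.Upgrade(w, r, nil)
    if err != nil {
        return
    }
    defer conn.Close()

    cmd := exec.Command("bash", "--login")
    stdin, _ := cmd.StdinPipe()
    stdout, _ := cmd.StdoutPipe()
    if err := cmd.Start(); err != nil {
        return
    }

    // websocket -> shell stdin
    go func() {
        for {
            _, data, err := conn.ReadMessage()
            if err != nil {
                stdin.Close() // closing stdin lets the shell exit
                return
            }
            stdin.Write(data)
        }
    }()

    // shell stdout -> websocket
    buf := make([]byte, 4096)
    for {
        n, err := stdout.Read(buf)
        if n > 0 {
            conn.WriteMessage(websocket.BinaryMessage, buf[:n])
        }
        if err != nil {
            break
        }
    }
    cmd.Wait()
}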

[Screenshot: interactive shell session in the browser]

For interactive display sessions (VNC-like sessions in the browser) the worker has an end-point which accepts both websocket connections and ordinary GET requests for listing displays. For each GET request the worker runs a small statically linked binary that lists all the X-sessions inside the task container; the result is then transformed to JSON and returned in the response. Once the user has picked a display, a websocket connection is opened with the display identifier in the query string. On the worker the websocket is piped to a statically linked instance of x11vnc running inside the task container. In the browser we then use noVNC to give the user an interactive remote display right in the browser.

[Screenshot: noVNC remote display in the browser]

As with the shell, there are also a few quirks with the interactive display: some graphical artifacts and other “interesting” issues. When streaming a TCP connection over a websocket we might not be handling buffering all too well, which I suspect introduces additional latency and possibly bugs. I hope these things will get better in future iterations of the worker, which is currently undergoing an experimental rewrite from node to go.

Future Work

As mentioned in the “Quick Start” section, all of this is still a bit of a moving target. Access to any loaner is effectively granted to anyone with commit level 1 or to any employee. So your friends can technically hijack the interactive task you created. Obviously, we have to make that more fine-grained. At the moment, the “one-click loaner” button is also very specific to our Linux worker. As we add more platforms we will have to extend support and find a way to abstract the platform-dependent aspects. So it’s very likely that this will break on occasion.

We also recently introduced a hack defining the environment variable TASKCLUSTER_INTERACTIVE when a loaner task is created. It’s a quick hack that we might refactor later, but for now it enables Armen Zambrano to customize how the docker image used for tests runs in loaner-mode. In bug 1250904 there is on-going work to ensure that a loaner will set up the test environment, but not start running tests until a user connects and types the right command. I’m sure there are many other things we can do to make the task environment more useful in loaner-mode, but this is certainly a good start.

Anyways, much of this is still quick hacks, with rough edges that need to be resolved. So don’t be surprised if it breaks while we improve stability and attempt to add support for multiple platforms. With a bit of time and resources I’m fairly confident that the “one-click loaner” flow could become the preferred method for debugging issues specific to the test environment.

March 01, 2016 06:02 AM

February 24, 2016

John Ford

cloud-mirror – Platform Engineering Operations Project of the Month

Hello from Platform Engineering Operations! Once a month we highlight one of our projects to help the Mozilla community discover a useful tool or an interesting contribution opportunity. This month's project is our cloud-mirror.

The cloud-mirror is something that we've written to reduce the cost and time of inter-region S3 transfers. Cloud-mirror was designed for use in the Taskcluster system, but it is possible to run it independently. Taskcluster, which is the new automation environment for Mozilla, can support passing artifacts between dependent tasks. An example of this is that when we do a build, we want to make the binaries available to the test machines. We originally hosted all of our artifacts in a single AWS region. This meant that every time a test was done in a region outside of the main region, we would incur an inter-region transfer for each test run. This is expensive and slow compared to in-region transfers.

We decided that a better idea would be to transfer the data from the main region to the other regions the first time it was requested in that region and then have all subsequent requests be inside of the region. This means that for the small overhead of an extra in-region copy of the file, we lose the cost and time overhead of doing inter-region transfers every single time.

Here's an example. We use us-west-2 as our main region for storing artifacts. A test machine in eu-central-1 requires "firefox-50.tar.bz2" for use in a test. The test machine in eu-central-1 will ask cloud mirror for this file. Since this is the first test to request this artifact in eu-central-1, cloud mirror will first copy "firefox-50.tar.bz2" into eu-central-1 then redirect to the copy of that file in eu-central-1. The second test machine in eu-central-1 will then ask for a copy of "firefox-50.tar.bz2" and because it's already in the region, the cloud mirror will immediately redirect to the eu-central-1 copy.

We expire artifacts from the destination regions so that we don't incur too high storage costs. We also use a redis cache configured to expire keys which have been used least recently first. Cloud mirror is written with Node 5 and uses Redis for storage. We use the upstream aws-sdk library for doing our S3 operations.
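
As a sketch of the per-request decision (in Go purely for illustration; the real service is written in Node, tracks pending and completed copies in Redis, and streams the S3 copy asynchronously, so copyToRegion and regionalURL below are hypothetical helpers):

// Illustrative sketch only: redirect to an in-region copy, creating it on
// first request. A real service needs Redis/locking instead of a plain map.
package cloudmirror

import "net/http"

func redirectHandler(copied map[string]bool,
    copyToRegion func(artifactURL, region string) error,
    regionalURL func(artifactURL, region string) string) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        artifact := r.URL.Query().Get("url")
        region := r.URL.Query().Get("region")
        key := region + ":" + artifact
        if !copied[key] {
            // First request from this region: copy the artifact into the
            // region, then remember that the regional copy exists.
            if err := copyToRegion(artifact, region); err != nil {
                // Fall back to the origin copy if the transfer fails.
                http.Redirect(w, r, artifact, http.StatusFound)
                return
            }
            copied[key] = true
        }
        // All subsequent requests redirect straight to the in-region copy.
        http.Redirect(w, r, regionalURL(artifact, region), http.StatusFound)
    }
}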

We're in the process of deploying this system to replace our original implementation called 's3-copy-proxy'. This earlier version was a much simpler take on the same idea, and it's what we've been using in production. One of the main reasons for the rewrite was to abstract the core concepts so that anyone can write a backend for their storage type, as well as to support more AWS regions and move towards a completely HTTPS based chain.

If this is a project that's interesting to you, we have lots of ways that you could contribute! Here are some:
  • Switch polling for pending copy operations to use Redis's pub/sub features
  • Write an Azure or GCE storage backend
  • Modify the API to determine which cloud storage pool a request should be redirected to, instead of having to encode that into the route
  • Write a localhost storage backend for testing that serves content on 127.0.0.1
If you have any ideas or find some bugs in this system, please open an issue at https://github.com/taskcluster/cloud-mirror/issues. For the time being, you will need to have an AWS account to run our integration tests (npm test). We would love to have a storage backend that allows running the non-service-specific portions of the system without any extra permissions.
If you're interested in contributing, please ping me (jhford) in #taskcluster on irc.mozilla.org.

For more information about all Platform Ops projects, visit our wiki. If you're interested in helping out, http://ateam-bootcamp.readthedocs.org/en/latest/guide/index.html has resources for getting started.

February 24, 2016 03:13 PM

February 22, 2016

Dustin Mitchell

TaskCluster Login: Now With LDAP!

TaskCluster has a sophisticated access-control mechanism based on “scopes” that governs every API call. A push to try requires 42 scopes!

As a guiding principle, and a convenience to users, the TaskCluster team has tried to align users’ scopes with their commit privileges. That is, if you can make some API call via a push to try, you should be able to make that same API call directly. Typically, users want to copy a task, modify it, and re-run it via https://tools.taskcluster.net. The TaskCluster-Login service supported logins via either Okta (Mozilla, Inc.’s single-signon provider) or Mozillians. However, Mozillians does not track commit privileges, and Okta is only available to Mozilla employees, so non-employee contributors were left without a means to log in with the full set of scopes they deserved.

Well, no more. As of last week, the login service supports authentication with an LDAP username and password for those who cannot access Okta. The practical result is, if you have permission to push to try (SCM level 1 or higher), but no Okta account, you can now access TaskCluster and do anything your try pushes can.

Over the coming month we will be deploying a number of additional improvements to the TaskCluster login experience:

  • Create your own credentials (clientId and accessToken) with limited scopes
    • Temporary credentials for a one-off project
    • Permanent credentials for a command-line tool
  • Better credential management on the tools page
    • See your clientId, what scopes you have, and when the scopes expire
    • Switch between different sets of credentials
    • Grant another site some of your scopes (via an OAuth-like flow)

February 22, 2016 01:30 PM

February 16, 2016

Maja Frydrychowicz

First Experiment with TaskCluster

TaskCluster is a new-ish continuous integration system made at Mozilla. It manages the scheduling and execution of tasks based on a graph of their dependencies. It’s a general CI tool, and could be used for any kind of job, not just Mozilla things.

However, the example I describe here refers to a Mozilla-centric use case of TaskCluster [1]: tasks are run per check-in on the branches of Mozilla’s Mercurial repository and then results are posted to Treeherder. For now, the tasks can be configured to run in Docker images (Linux), but other platforms are in the works [2].

So, I want to schedule a task! I need to add a new task to the task graph that’s created for each revision submitted to hg.mozilla.org. (This is part of my work on deploying a suite of tests for the Marionette Python test runner, i.e. testing the test harness itself.)

The rest of this post describes what I learned while making this work-in-progress.

There are builds and there are tests

mozilla-taskcluster operates based on the info under testing/taskcluster/tasks in Mozilla’s source tree, where there are yaml files that describe tasks. Specific tasks can inherit common configuration options from base yaml files.

The yaml files are organized into two main categories of tasks: builds and tests. This is just a convention in mozilla-taskcluster about how to group task configurations; TC itself doesn’t actually know or care whether a task is a build or a test.

The task I’m creating doesn’t quite fit into either category: it runs harness tests that just exercise the Python runner code in marionette_client, so I only need a source checkout, not a Firefox build. I’d like these tests to run quickly without having to wait around for a build. Another example of such a task is the recently-created ESLint task.

Scheduling a task

Just adding a yaml file that describes your new task under testing/taskcluster/tasks isn’t enough to get it scheduled: you must also add it to the list of tasks in base_jobs.yml, and define an identifier for your task in base_job_flags.yml. This identifier is used in base_jobs.yml, and also by people who want to run your task when pushing to try.

How does scheduling work? First a decision task generates a task graph, which describes all the tasks and their relationships. More precisely, it looks at base_jobs.yml and other yaml files in testing/taskcluster/tasks and spits out a json artifact, graph.json [3]. Then, graph.json gets sent to TC’s createTask endpoint, which takes care of the actual scheduling.
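For a sense of what that call looks like, here is a rough sketch of submitting a single task through the TaskCluster Python client. The task definition is heavily trimmed, and the provisionerId, workerType, payload and metadata values below are placeholders rather than what the decision task actually emits.

import datetime
import taskcluster  # the TaskCluster Python client

def iso(dt):
    return dt.strftime("%Y-%m-%dT%H:%M:%S.000Z")

queue = taskcluster.Queue({"credentials": {"clientId": "...", "accessToken": "..."}})
task_id = taskcluster.slugId()
now = datetime.datetime.utcnow()

queue.createTask(task_id, {
    "provisionerId": "aws-provisioner-v1",   # placeholder values for illustration
    "workerType": "b2gtest",
    "created": iso(now),
    "deadline": iso(now + datetime.timedelta(days=1)),
    "payload": {
        "image": "ubuntu:14.04",
        "command": ["/bin/bash", "-c", "echo hello"],
        "maxRunTime": 600,
    },
    "metadata": {
        "name": "[TC] example task",
        "description": "minimal createTask illustration",
        "owner": "someone@example.com",
        "source": "https://example.com/",
    },
})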

In the excerpt below, you can see a task definition with a requires field and you can recognize a lot of fields that are in common with the ‘task’ section of the yaml files under testing/taskcluster/tasks/.

{
"tasks": [
    {
      "requires": [
        // id of a build task that this task depends on
        "fZ42HVdDQ-KFFycr9PxptA"  
      ], 
      "task": {
        "taskId": "c2VD_eCgQyeUDVOjsmQZSg"
        "extra": {
          "treeherder": {
              "groupName": "Reftest", 
              "groupSymbol": "tc-R", 
          }, 
        }, 
        "metadata": {
          "description": "Reftest test run 1", 
          "name": "[TC] Reftest", 
        //...
  ]
}

For now at least, a major assumption in the task-graph creation process seems to be that test tasks can depend on build tasks and build tasks don’t really [4] depend on anything. So:

  • If you want your tasks to run for every push to a Mozilla hg branch, add it to the list of builds in base_jobs.yml.
  • If you want your task to run after certain build tasks succeed, add it to the list of tests in base_jobs.yml and specify which build tasks it depends on.
  • Other than the above, I don’t see any way to specify a dependency between task A and task B in testing/taskcluster/tasks.

So, I added marionette-harness under builds. Recall, my task isn’t a build task, but it doesn’t depend on a build, so it’s not a test, so I’ll treat it like a build.

# in base_job_flags.yml
builds:
  # ...
  - marionette-harness

# in base_jobs.yml
builds:
  # ...
  marionette-harness:
    platforms:
      - Linux64
    types:
      opt:
        task: tasks/tests/harness_marionette.yml

This will allow me to trigger my task with the following try syntax: try: -b o -p marionette-harness. Cool.

Make your task do stuff

Now I have to add some stuff to tasks/tests/harness_marionette.yml. Many of my choices here are based on the work done for the ESLint task. I created a base task called harness_test.yml by mostly copying bits and pieces from the basic build task, build.yml and making a few small changes. The actual task, harness_marionette.yml inherits from harness_test.yml and defines specifics like Treeherder symbols and the command to run.

The command

The heart of the task is in task.payload.command. You could chain a bunch of shell commands together directly in this field of the yaml file, but it’s better not to. Instead, it’s common to call a TaskCluster-friendly shell script that’s available in your task’s environment. For example, the desktop-test docker image has a script called test.sh through which you can call the mozharness script for your tests. There’s a similar build.sh script on desktop-build. Both of these scripts depend on environment variables set elsewhere in your task definition, or in the Docker image used by your task. The environment might also provide utilities like tc-vcs, which is used for checking out source code.

# in harness_marionette.yml
payload:
  command:
    - bash
    - -cx
    - >
        tc-vcs checkout ./gecko {{base_repository}} {{head_repository}} {{head_rev}} {{head_ref}} &&
        cd gecko &&
        ./mach marionette-harness-test

My task’s payload.command should be moved into a custom shell script, but for now it just chains together the source checkout and a call to mach. It’s not terrible of me to use mach in this case because I expect my task to work in a build environment, but most tests would likely call mozharness.

Configuring the task’s environment

Where should the task run? What resources should it have access to? This was probably the hardest piece for me to figure out.

docker-worker

My task will run in a docker image using a docker-worker [5]. The image, called desktop-build, is defined in-tree under testing/docker. There are many other images defined there, but I only considered desktop-build versus desktop-test. I opted for desktop-build because desktop-test seems to contain mozharness-related stuff that I don’t need for now.

# harness_test.yml
image:
   type: 'task-image'
   path: 'public/image.tar'
   taskId: '{{#task_id_for_image}}desktop-build{{/task_id_for_image}}'

The image is stored as an artifact of another TC task, which makes it a ‘task-image’. Which artifact? The default is public/image.tar. Which task do I find the image in? The magic incantation '{{#task_id_for_image}}desktop-build{{/task_id_for_image}}' somehow [6] obtains the correct ID, and if I look at a particular run of my task, the above snippet does indeed get populated with an actual taskId.

"image": {
  "path": "public/image.tar",
  // Mystery task that makes a desktop-build image for us. Thanks, mystery task!
  "taskId": "aqt_YdmkTvugYB5b-OvvJw", 
  "type": "task-image"
}

Snooping around in the handy Task Inspector, I found that the magical mystery task is defined in image.yml and runs build_image.sh. Fun. It’s also quite convenient to define and test your own custom image.

Other details that I mostly ignored

# in harness_test.yml
scopes:
  # Nearly all of our build tasks use tc-vcs
  - 'docker-worker:cache:level-{{level}}-{{project}}-tc-vcs'
cache:
   # The taskcluster-vcs tooling stores the large clone caches in this
   # directory and will reuse them for new requests this saves about 20s~
   # and is the most generic cache possible.
   level-{{level}}-{{project}}-tc-vcs: '/home/worker/.tc-vcs'
  • Routes allow your task to be looked up in the task index. This isn’t necessary in my case so I just omitted routes altogether.
  • Scopes are permissions for your tasks, and I just copied the scope that is used for checking out source code.
  • workerType is a configuration for managing the workers that run tasks. To me, this was a choice between b2gtest and b2gbuild, which aren’t specific to b2g anyway. b2gtest is more lightweight, I hear, which suits my harness-test task fine.
  • I had to include a few dummy values under extra in harness_test.yml, like build_name, just because they are expected in build tasks. I don’t use these values for anything, but my task fails to run if I don’t include them.

Yay for trial and error

  • If you have syntax errors in your yaml, the Decision task will fail. If this happens during a try push, look under Job Details > Inspect Task to find useful error messages.
  • Iterating on your task is pretty easy. Aside from pushing to try, you can run tasks locally using vagrant and you can build a task graph locally as well with mach taskcluster-graph.

Resources

Blog posts from other TaskCluster users at Mozilla:

There is lots of great documentation at docs.taskcluster.net, but these sections were especially useful to me:

Acknowledgements

Thanks to dustin, pmoore and others for corrections and feedback.


  1. This is accomplished in part thanks to mozilla-taskcluster, a service that links Mozilla’s hg repo to TaskCluster and creates each decision task. More at TaskCluster at Mozilla 

  2. Run tasks on any platform thanks to generic worker 

  3. To look at a graph.json artifact, go to Treeherder, click a green ‘D’ job, then Job details > Inspect Task, where you should find a list of artifacts. 

  4. It’s not really true that build tasks don’t depend on anything. Any task that uses a task-image depends on the task that creates the image. I’m sorry for saying ‘task’ five times in every sentence, by the way. 

  5. …as opposed to a generic worker

  6. {{#task_id_for_image}} is an example of a predefined variable that we can use in our TC yaml files. Where do they come from? How do they get populated? I don’t know. 

February 16, 2016 05:00 AM

December 14, 2015

Dustin Mitchell

Taskcluster Security Exercise

During Mozlando last week, I organized a TaskCluster “security game”. The goals of this exercise were:

  • Learn to think like an attacker
  • Develop ideas for monitoring for, preventing, and reacting to attacks
  • Share awareness of the security considerations around TaskCluster

The format was fairly simple: participants were given a number of “tasks” and a set of credentials with a relatively low access level (SCM level 1, or permission to push to try). I added some ground rules to prevent mayhem and to keep the difficulty level reasonable. Several members of the Infosec team participated, along with most of the TaskCluster team and a few Release Engineering folks.

Rules

Most of us have “administrator” credentials which would allow us to accomplish any of these tasks easily. Those credentials are off-limits for the duration of the exercise: no heroku access, no github pushes, no use of your AWS credentials. Only public, read-only access to taskcluster/* Docker Hub repos is allowed, although you are free to push to personal repos, public or private.

What you do have is the client-id red-team with an access key that will be provided on the day of the exercise. It has scope assume:moz-tree:level:1, which is try-level access. If you manage to reveal other credentials during the course of the exercise, you are of course free to use them.

You are permitted to push to try (gaia or gecko) under your own LDAP account. Pushes to sheriffed trees are not allowed.

Do not perform any actions intended to or reasonably likely to cause a denial of service for other TaskCluster users. If something breaks accidentally, we will end the exercise and set about fixing it.

We can consider removing some of these restrictions next time, to model rogue-insider, credential-disclosure, or DoS scenarios.

Tasks

  • Make an API request with clientId red-team-target.

  • Display the relengapi token used by the relengapi proxy in a task log.

  • Submit a task that adds an artifact to a different task.

  • Claim, “execute” by logging “OWNED!”, and resolve a task with a provisionerId/workerType corresponding to existing infrastructure (docker-worker, generic-worker, funsize, signing, buildbot bridge, etc.)

  • From a task, create a process that sends traffic to and continues to do so _after_ the task has ended.

  • From a task, cause another host within the AWS VPC to execute arbitrary code.

  • Harvest a secret value from a private docker image.

  • Via a task, start a shell server on the docker-worker host (outside of the container) and connect to it.

  • Create a “malicious” binary (not necessarily Firefox) and sign it with a real Mozilla signing key.

Results

I won’t go into detail here, but we were able to accomplish a few of these tasks in the 3 hours or so we spent trying! Most began by extracting secrets from a private docker image – one of the oldest and most-discouraged ways of using secrets within TaskCluster.

Next Time

I’d like to run an exercise like this at every coincidental work-week (so, every 6 months). We wrote down some ideas for next time.

First, we need to provide better training in advance for people not already familiar with TaskCluster – Infosec in particular, as they bring a great deal of security and penetration-testing experience to the table. Even for an expert, three hours is not a lot of time to craft and carry out a complicated attack. Next time, we could spread the exercise over the entire week, with the ground rules and targets announced on Monday, some shorter hacking sessions organized during the week, and a summation scheduled for Friday. This would allow ample time for study of the TaskCluster implementation, and for long-running attacks (e.g., trying to exploit a race condition) to execute.

We practice security-in-depth, so some of the vulnerabilities we found could not be further exploited due to an additional layer of security. Next time, we may hypothesize that one of those layers is already broken. For example, we may hypothesize that there is a bug in Docker allowing read access to the host’s filesystem, and emulate this by mounting the host’s / at /host in docker images for a particular workerType. What attacks might this enable, and how could we protect against them?

Finally, some members of the Infosec team are interested in running this kind of exercise much more frequently, for other services. Imagine spending a day breaking into your new web service with the pros – it might be easier than you think!

December 14, 2015 12:00 PM

December 09, 2015

Greg Arndt

Build Images on Push

When looking at a docker task definition it’s not shocking to find some similarities between the definition and some of the inputs for starting a container with the docker remote api.

At the core of our docker based worker is a wrapper around the docker engine that allows tasks to be executed in an isolated reproducible environment. Tasks just need to specify an image to use, the command to run, and any possible environment variables. The worker will execute those in much the same way as running docker run [...].
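As a loose analogy (this is not docker-worker code), you can think of the worker as translating the task payload into a docker run invocation. The helper below and the example payload are hypothetical; real payloads carry more fields, such as maxRunTime and artifacts.

import subprocess

def run_task_like_docker_run(payload):
    cmd = ["docker", "run", "--rm"]
    for name, value in payload.get("env", {}).items():
        cmd += ["-e", "%s=%s" % (name, value)]
    cmd += [payload["image"]]           # assumes a plain image name, not a task-image
    cmd += payload.get("command", [])
    return subprocess.run(cmd, check=True)

run_task_like_docker_run({
    "image": "ubuntu:14.04",
    "command": ["echo", "hello from a task"],
    "env": {"GECKO_HEAD_REV": "tip"},
})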

TaskCluster has benefited by the hard work the Docker team has put towards their products, especially the docker registry which hosts most of our images that we use in production. In the past 30 days 1.4 million tasks have been completed in this environment which has resulted in over 2.8 million containers being started and around 270,000 images pulled from the registry.

Of the images that run within TaskCluster, most are defined in our gecko repository.

Changing these images requires editing in-tree image configurations, bumping version numbers in a specific file, running a convenience wrapper around docker build, and pushing to the docker registry. These steps must be completed before pushing to any branch that triggers tasks that use those images. On more than one occasion, image tags have been overwritten, resulting in unintended images being used, or images have not been pushed before commits that required those versions landed, resulting in failed tasks.

For me, one of the more disruptive disadvantages of this workflow is that it requires pushing the images to a registry. Anyone who has pushed an image knows that even on a decent upstream connection, this takes time. It is even more painful when you want to test these images on Try and must push multiple revisions.

Looking at the current state of handling images, it was clear that the workflow needed to become simpler.

When reviewing our current state of handling docker images, there were two problems we needed to solve: simplifying the workflow of editing and pushing images, and not relying on an external service to host images for production jobs.

Task Artifacts

All tasks have the ability to upload public or private files (artifacts) that can be referenced after the task has completed.

These files are uploaded to s3 and access control is provided by our TaskCluster auth component. Using a system like this for production tasks has been battle tested and seemed like a great fit for storing docker images.

Docker Image Tarball

Docker has the ability to save docker images and metadata as tarballs, which can then be transported to other hosts and loaded. The one downside to moving images around like this is that it does not take advantage of the layer caching system of docker. Using this within AWS in CI is an acceptable trade-off for the simplicity that this solution provides.
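A minimal sketch of that idea, assuming docker and AWS credentials are available locally: save the image with docker save and push the tarball to S3. The image tag, bucket and key below are placeholders, and within TaskCluster the upload actually goes through the queue’s artifact APIs rather than straight to S3 like this.

import subprocess
import boto3

IMAGE = "desktop-build:latest"   # hypothetical local image tag
TARBALL = "image.tar"

# Serialize the image (and its metadata) to a tarball...
subprocess.run(["docker", "save", "-o", TARBALL, IMAGE], check=True)

# ...and upload it somewhere consumers can fetch it from.
boto3.client("s3").upload_file(TARBALL, "my-artifact-bucket", "public/image.tar")

# A consumer would later download the tarball and run: docker load -i image.tar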

Task Image Artifacts

Marrying these two solutions seemed like a great fit for solving some of our problems. Storing the image tarballs within s3 as a task artifact is a proven concept, and relying on s3 as an external resource has been an acceptable practice for production tasks.

Images can now be built using the dind feature of our docker based workers by enabling the flag task.payload.feature.dind. When this feature is enabled and used by an image that contains the docker client, images can be built and saved without needing a docker daemon or privileged capabilities within the task container itself. These images can then be put wherever one would like, including saving it as an artifact.

Using this image for another task requires specifying the task ID of the image along with the artifact path to find the image.

Here is one such example:

payload: {
    image: {
        type: 'task-image',
        taskId: '1234',
        path: 'public/image.tar'
    }
}

Example

The workflow for on-push image building will typically be:

  • edit image context
  • commit changes and push to vcs
  • Decision task triggers image building task if any task requires that image

Note: A task must specify the in-tree image that it uses and this image should be treated as a task artifact. Consult the in-tree documentation about this feature for more information.

Looking at one of our existing dockerfiles for a production image, suppose we identify a package that needs to be installed, but as a developer I would like to test this out on try before using it in production.

Using this new workflow, I can take the dockerfile and add anywhere to it the installation of the package like so:

RUN apt-get update && apt-get install -y wget

Once the change has been made, I commit it, push to try with flags that will trigger the task that requires that image and watch the image being built automatically.

You can see this job on treeherder as a “taskcluster-images” job. In the snippet below from treeherder, you can see that a “taskcluster-images” job was scheduled (‘I’ symbol). Here the b2g desktop tasks depend on the image building task to complete and were scheduled once that task completed.

[Treeherder screenshot: a “taskcluster-images” (‘I’) job with the dependent b2g desktop tasks scheduled after it]

I hope everyone that has ever needed to build an in-tree image gets a chance to try this out. Come chat in #taskcluster on the Mozilla IRC network if you have questions or want to try this out.

December 09, 2015 12:00 AM

October 12, 2015

John Ford

Splitting out taskcluster-base into component libraries

The intended audience of this post is people who either work with taskcluster-base now or are interested in implementing taskcluster services in the future.

Taskcluster serverside components are currently built using the suite of libraries in the taskcluster-base npm package. This package is many things: config parsing, data persistence, statistics, json schema validators, pulse publishers, a rest api framework and some other useful tools. Having these all in one single package means that each time a contributor wants to hack on one part of our platform, she'll have to figure out how to install and run all of our dependencies. This is annoying when it means waiting for a libxml.so library build, but just about impossible for contributors who aren't on the Taskcluster platform team. You need Azure, Influx and AWS accounts to be able to run the full test suite. You also might experience confusing errors in a part of the library you're not even touching.

Additionally, we are starting to get to the point where some services must upgrade one part of taskcluster-base without using other parts. This is generally frowned upon, but sometimes we just need to put a bandaid on a broken system that's being turned off soon. We deal with this currently by exporting base.Entity and base.LegacyEntity. I'd much rather we just export a single base.Entity and have people who need to keep using the old Entity library use taskcluster-lib-legacyentity directly.

We're working on fixing this! The structure of taskcluster-base is really primed and ready to be split up since it's already a bunch of independent libraries that just so happen to be collocated. The new component loader that landed was the first library to be included in taskcluster-base this way and I converted our configs and stats libraries last week.

The naming convention that we've settled on is that taskcluster libraries will be prefixed with taskcluster-lib-X. This means we have taskcluster-lib-config and taskcluster-lib-stats. We'll continue to name services as taskcluster-Y, like taskcluster-auth or taskcluster-confabulator. The best way to get the current supported set of taskcluster libraries is still going to be to install the taskcluster-base npm module.

Some of our libraries are quite large and have a lot of history in them. I didn't really want to just create a new repository, copy in the files we care about and destroy the history. Instead, I wrote a simple and ugly tool (https://github.com/jhford/taskcluster-base-split) which does the pedestrian tasks involved in this split up by filtering out irrelevant history for each project, moving files around and doing some preliminary cleanup work on the new library.

This tooling gets us 90% of the way to a split out repository, but as always, a human is required to take it the last step of the way. Imports need to be fixed, dependencies must be verified and tests need to be fixed. I'm also taking this opportunity to implement babel-transpiling support in as many libraries as I can. We use babel everywhere in our application code, so it'll be nice to have it available in our platform libraries as well. I'm using the babel-runtime package instead of requiring the direct use of babel. The code produced by our babel setup is tested using the node 0.12 binary without any wrappers at all.

Having different libraries will introduce the risk of our projects having version number hell. We're still going to have a taskcluster-base npm package. This package will simply be a package.json file which specifies the supported versions of the taskcluster-lib-* packages we ship as a release and an index.js file which imports and re-exports the libraries that we provide. If we have two libraries that have codependent changes, we can land new versions in those repositories and use taskcluster-base as the synchronizing mechanism.

A couple of open questions that I'd love to get input on are how we should share package.json snippets and babel configurations. We mostly have a solution for eslint, but we'd love to be able to share as much as possible in our .babelrc configuration files. If you have a good idea for how we can do that, please get in touch!

One of the goals in doing this is to make taskcluster components easier to write. We'd love to see components written by other teams use our framework since we know it's tested to work well with Taskcluster. It also makes it easier for the taskcluster team to advise on design and maintenance concerns.

Once a few key changes have landed, I will write a series of blog posts explaining how core taskcluster services are structured.

October 12, 2015 01:18 PM

October 09, 2015

Wander Lairson Costa

In tree tasks configuration

This post is about our plans for representing Taskcluster tasks inside the gecko tree. Jonas, Dustin and I had a discussion in Berlin about this; here I summarize what we have so far. We currently store tasks in yaml files and translate them to json format using the mach command. The syntax we have now is not the most flexible one: it is hard to parameterize a task and very difficult to represent task relationships.

Let us illustrate the shortcomings with two problems we currently have. Both apply to B2G.

B2G (as in Android) has three different build variants: user, userdebug and eng. Each one has slightly different task configurations. As there is no flexible way to parameterize tasks, we end up with one different task file for each build variant.

When doing nightly builds, we must send update data to the OTA server. We have plans to run a build task, then run the test tasks on this build, and if all tests pass, we run a task responsible to update the OTA server. The point is that today we have no way to represent this relationship inside the task files.

For the first problem, Jonas has a prototype for json parameterization. There were discussions at the Berlin work week about whether we should stick with yaml files or use Python files for task configuration. We want to keep the syntax declarative, which favors yaml; storing configurations in Python files brings much more expressiveness and flexibility, but it can result in the same configuration hell we have with Buildbot.

The second problem is more complex, and we still haven't reached a final design. The first question is how we describe task dependencies: top-down, i.e., we specify which task(s) should run after a completed task, or bottom-up, where a task specifies which tasks it depends on. In general, we all agreed to go with a top-down syntax, since most scenarios beg for a top-down approach. The other question is whether we should put the description of task relationships inside the task files or in a separate configuration file. We would like to represent task dependencies inside the task files; the problem then is how to determine the root task for the task graph. One suggestion is to have a task file called root.yml which only contains root tasks.

October 09, 2015 12:00 AM

October 05, 2015

Selena Deckelmann

[berlin] TaskCluster Platform: A Year of Development

Back in September, the TaskCluster Platform team held a workweek in Berlin to discuss upcoming feature development, focus on platform stability and monitoring and plan for the coming quarter’s work related to Release Engineering and supporting Firefox Release. These posts are documenting the many discussions we had there.

Jonas kicked off our workweek with a brief look back on the previous year of development.

Prototype to Production

In the last year, TaskCluster went from an idea with a few tasks running to running all of FirefoxOS aka B2G continuous integration, which is about 40 tasks per minute in the current environment.

Architecture-wise, not a lot of major changes were made. We went from CloudAMQP to Pulse (in-house RabbitMQ). And shortly, Pulse itself will be moving its backend to CloudAMQP! We introduced task statuses, and then simplified them.

On the implementation side, however, a lot changed. We added many features and addressed a ton of docker worker bugs. We killed Postgres and added Azure Table Storage. We rewrote the provisioner almost entirely, and moved to ES6. We learned a lot about babel-node.

We introduced the first alternative to the Docker worker, the Generic worker. For the first time, we had Release Engineering create a worker: the Buildbot Bridge.

We have several new users of TaskCluster! Brian Anderson from Rust created a system for testing all Cargo packages for breakage against release versions. We’ve had a number of external contributors create builds for FirefoxOS devices. We’ve had a few Github-based projects jump on taskcluster-github.

Features that go beyond BuildBot

One of the goals of creating TaskCluster was to not just get feature parity, but go beyond and support exciting, transformative features to make developer use of the CI system easier and fun.

Some of the features include:

Features coming in the near future to support Release

Release is a special use case that we need to support in order to take on the Firefox production workload. The focus of development work in Q4 and beyond includes:

  • Secrets handling to support Release and ops workflows. In Q4, we should see secrets.taskcluster.net go into production and UI for roles-based management.
  • Scheduling support for coalescing, SETA and cache locality. In Q4, we’re focusing on an external data solution to support coalescing and SETA.
  • Private data hosting. In Q4, we’ll be using a roles-based solution to support these.

October 05, 2015 06:38 PM

TaskCluster Platform: 2015Q3 Retrospective

Welcome to TaskCluster Platform’s 2015Q3 Retrospective! I’ve been managing this team this quarter and thought it would be nice to look back on what we’ve done. This report covers what we did for our quarterly goals. I’ve linked to “Publications” at the bottom of this page, and we have a TaskCluster Mozilla Wiki page that’s worth checking out.

High level accomplishments

  • Dramatically improved stability of TaskCluster Platform for Sheriffs by fixing TreeHerder ingestion logic and regexes, adding better logging and fixing bugs in our taskcluster-vcs and mozilla-taskcluster components
  • Created and Deployed CI builds on three major platforms:
    • Added Linux64 (CentOS), Mac OS X cross-compiled builds as Tier2 CI builds
    • Completed and documented prototype Windows 2012 builds in AWS and their task configuration
  • Deployed auth.taskcluster.net, enabling better security, better support for self-service authorization and easier contributions from outside our team
  • Added region biasing based on cost and availability of spot instances to our AWS provisioner
  • Managed the workload of two interns, and significantly mentored a third
  • Onboarded Selena as a new manager
  • Held a workweek to focus attention on bringing our environment into production support of Release Engineering

Goals, Bugs and Collaborators

We laid out our Q3 goals in this etherpad. Our chosen themes this quarter were:

  • Improve operational excellence — focus on sheriff concerns, data collection,
  • Facilitate self-serve consumption — refactoring auth and supporting roles for scopes, and
  • Exploit opportunities to differentiate from other platforms — support for interactive sessions, docker images as artifacts, github integration and more blogging/docs.

We had 139 Resolved FIXED bugs in TaskCluster product.

Link to graph of resolved bugs

We also resolved 7 bugs in FirefoxOS, TreeHerder and RelEng products/components.

We received significant contributions from other teams: Morgan (mrrrgn) designed, created and deployed taskcluster-github; Ted deployed Mac OS X cross compiled builds; Dustin reworked the Linux TC builds to use CentOS, and resolved 11 bugs related to TaskCluster and Linux builds.

An additional 9 people contributed code to core TaskCluster, in-tree build scripts and task definitions: aus, rwood, rail, mshal, gerard-majax, mihneadb@gmail.com, htsai, cmanchester, and echen.

The Big Picture: TaskCluster integration into Platform Operations

Moving from B2G to Platform was a big shift. The team had already made a goal of enabling Firefox Release builds, but it wasn’t entirely clear how to accomplish that. We spent a lot of this quarter learning things from RelEng and prioritizing. The whole team spent the majority of our time supporting others use of TaskCluster through training and support, developing task configurations and resolving infrastructure problems. At the same time, we shipped docker-worker features, provisioner biasing and a new authorization system. One tricky infra issue that John and Jonas worked on early in the quarter was a strange AWS Provisioner failure that came down to an obscure missing dependency. We had a few git-related tree closures that Greg worked closely on and ultimately committed fixes to taskcluster-vcs to help resolve. Everyone spent a lot of time responding to bugs filed by the sheriffs and requests for help on IRC.

It’s hard to overstate how important the Sheriff relationship and TreeHerder work was. A couple teams had the impression that TaskCluster itself was unstable. Fixing this was a joint effort across TreeHerder, Sheriffs and TaskCluster teams.

When we finished, useful errors were finally being reported by tasks and starring became much more specific and actionable. We may have received a partial compliment on this from philor. The extent of artifact upload retries, for example, was made much clearer and we’ve prioritized fixing this in early Q4.

Both Greg and Jonas spent many weeks meeting with Ed and Cam, designing systems, fixing issues in TaskCluster components and contributing code back to TreeHerder. These meetings also led to Jonas and Cam collaborating more on API and data design, and this work is ongoing.

We had our own “intern” who was hired on as a contractor for the summer, Edgar Chen. He did some work with the docker-worker, implementing Interactive Sessions, and did analysis on our provisioner/worker efficiency. We made him give a short, sweet presentation on the interactive sessions. Edgar is now at CMU for his sophomore year and has referred at least one friend back to Mozilla to apply for an internship next summer.

Pete completed a Windows 2012 prototype build of Firefox that’s available from Try, with documentation and a completely automated process for creating AMIs. He hasn’t created a narrated video with dueling, British-English accented robot voices for this build yet.

We also invested a great deal of time in the RelEng interns. Jonas and Greg worked with Anhad on getting him productive with TaskCluster. When Anthony arrived, we also onboarded him. Jonas worked closely to get him working on a new project, hooks.taskcluster.net. To take these two bits of work from RelEng on, I pushed TaskCluster’s roadmap for generic-worker features back a quarter and Jonas pushed his stretch goal of getting the big graph scheduler into production to Q4.

We worked a great deal with other teams this quarter on taskcluster-github, supporting new Firefox and B2G builds, RRAs for the workers and generally telling Mozilla about TaskCluster.

Finally, we spent a significant amount of time interviewing, and then creating a more formal interview process that includes a coding challenge and structured-interview type questions. This is still in flux, but the first two portions are being used and refined currently. Jonas, Greg and Pete spent many hours interviewing candidates.

Berlin Work Week

TaskCluster Platform Team in Berlin

Toward the end of the quarter, we held a workweek in Berlin to focus our next round of work on critical RelEng and Release-specific features as well as production monitoring planning. Dustin surprised us with delightful laser cut acrylic versions of the TaskCluster logo for the team! All team members reported that they benefited from being in one room to discuss key designs, get immediate code review, and demonstrate work in progress.

We came out of this with 20+ detailed documents from our conversations, greater alignment on the priorities for Platform Operations and a plan for trainings and tutorials to give at Orlando. Dustin followed this up with a series of ‘TC Topics’ Vidyo sessions targeted mostly at RelEng.

Our Q4 roadmap is focused on key RelEng features to support Release.

Publications

Our team published a few blog posts and videos this quarter:

October 05, 2015 05:39 PM

Wander Lairson Costa

Running phone builds on Taskcluster

In this post I am going to talk about my work on phone builds inside the Taskcluster infrastructure. Mozilla is gradually moving from Buildbot to Taskcluster. Here I am going to give a survival guide for Firefox OS phone builds.

Submitting tasks

A task is nothing more than a json file containing the description of the job to execute. But you don't need to handle the json directly: all tasks are written in YAML, which is then processed by the mach command. The in-tree tasks are located at testing/taskcluster/tasks and the build tasks are inside the builds/ directory.

My favorite command to try out a task is the mach taskcluster-build command. It allows you to process a single task and output the json formatted task ready for Taskcluster submission.

$ ./mach taskcluster-build \
    --head-repository=http://hg.mozilla.org/mozilla-central \
    --head-rev=tip \
    --owner=foobar@mozilla.com \
    tasks/builds/b2g_desktop_opt.yml

Although we specify a Mercurial repository, Taskcluster also accepts git repositories interchangeably.

This command will print out the task to the console output. To run the task, you can copy the generated task and paste it in the task creator tool. Then just click on Create Task to schedule it to run. Remember that you need Taskcluster Credentials to run Taskcluster tasks. If you have taskcluster-cli installed, you can then pipe the mach output to taskcluster run-task.

The tasks are effectively executed inside a docker image.

Mozharness

Mozharness is what we use to actually build stuff. The mozharness architecture, despite its code size, is quite simple. Under the scripts directory you find the harness scripts. We are specifically interested in the b2g_build.py script. As the script name says, it is responsible for B2G builds. The B2G harness configuration files are located at the b2g/config directory. Not surprisingly, all files starting with "taskcluster" are for Taskcluster-related builds.

Here are the most common configurations:

default_vcs
This is the default vcs used to clone repositories when no other is given. tc_vcs (http://tc-vcs.readthedocs.org/en/latest/) allows mozharness to clone either git or mercurial repositories transparently, with repository caching support.
default_actions
The actions to execute. They must be present and in the same order as in the build class all_actions attribute.
balrog_credentials_file
The credentials to send update data to the OTA server.
nightly_build
True if this is a nightly build.
upload
Upload info. Not used for Taskcluster.
repo_remote_mappings
Maps external repositories to the mozilla domain (https://git.mozilla.org).
env
Environment variables for commands executed inside mozharness.

The listed actions map to Python methods inside the build class, with - replaced by _. For example, the action checkout-sources maps to the method checkout_sources. That's where the mozharness simplicity comes from: everything boils down to a sequence of method calls; that's it, no secret.
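To illustrate the pattern (this is just a sketch of the idea, not mozharness source), the dispatch boils down to something like:

class ExampleBuildScript:
    # In mozharness the action list comes from the config's default_actions
    # and the class's all_actions; the actions here are made up for illustration.
    def __init__(self, actions):
        self.actions = actions

    def run(self):
        for action in self.actions:
            # "checkout-sources" -> self.checkout_sources, and so on
            getattr(self, action.replace("-", "_"))()

    def checkout_sources(self):
        print("checking out sources")

    def build(self):
        print("building")

ExampleBuildScript(["checkout-sources", "build"]).run()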

For example, here is how you run mozharness to build a flame image:

python <gecko-dir>/testing/mozharness/scripts/b2g_build.py \
  --config b2g/taskcluster-phone.py \
  --disable-mock \
  --variant=user \
  --work-dir=B2G \
  --gaia-languages-file locales/languages_all.json \
  --log-level=debug \
  --target=flame-kk \
  --b2g-config-dir=flame-kk \
  --repo=http://hg.mozilla.org/mozilla-central \

Remember you need your flame connected to the machine so the build system can extract the blobs.

In general you don't need to worry about mozharness command line because it is wrapped by the build scripts.

Hacking Taskcluster B2G builds

All Taskcluster tasks run inside a docker container. Desktop and emulator B2G builds run inside the builder docker image. Phone builds are more complex, because:

  1. Mozilla is not allowed to publicly redistribute phone binaries.

  2. Phone build tasks need to access the Balrog server to send OTA update data.

  3. Phone build tasks need to upload symbols to the crash reporter.

Due to (1), only users authenticated with a @mozilla account are allowed to download phone binaries (this works the same way as private builds). And because of (1), (2) and (3), the phone-builder docker image is secret, so only authorized users can submit tasks to it.

If you need to create a build task for a new phone, most of the time you will start from an existing task (Flame and Aries tasks are preferred) and then make your customizations. You might need to add new features to the build scripts, which currently are not the most flexible scripts around.

If you need to customize mozharness, make sure your changes are Python 2.6 compatible, because mozharness is used to run Buildbot builds too, and the Buildbot machines run Python 2.6. The best way to minimize risk of breaking stuff is to submit your patches to try with "-p all -b do" flags.

Need help? Ask at the #taskcluster channel.

October 05, 2015 12:00 AM

October 01, 2015

Greg Arndt

Monitoring TaskCluster infrastructure

Responding to incidents for TaskCluster has been a mix of good logging, helpful people, and black magic. As the platform matures, the monitoring must grow with it.

Throughout the year, the TaskCluster team has been integrating more logging and metrics into the platform by aggregating logs into Papertrail and storing metrics in InfluxDB. This data has proven to be invaluable and has allowed us to make informed choices as to capacity and the health of TaskCluster. However, as great as these metrics and logs are, one large drawback is that they mainly have been only useful when someone is actively looking at them. Logging alerts are also only as good as the messages that are logged and do not detect the absence of a message (at least when using a service like Papertrail).

Recognizing the gap in our operations processes around TaskCluster, the team brainstormed some ideas while at the Berlin work week and decided on a phased approach over the next few quarters. Our primary goal is to increase the stability of the platform and also be the first responders to issues rather than our end-users.

So, where we’re at:

  • Metrics from TaskCluster Platform as well as some TaskCluster Services are recorded in InfluxDB and graphed on a Grafana instance.
  • Logging from various services is redirected to Papertrail
  • Alerts are configured within Papertrail for certain scenarios
  • A high-level status page reports on the TaskCluster services deployed to Heroku

In the coming quarters, we will be looking at adding additional operational monitoring in a few phases.

Phase 1 (Q4/2015) - Metrics Alerting

We have done a lot of work to get useful metrics into InfluxDB and it has proven to be valuable with capacity planning and platform troubleshooting. However, these metrics do not help us detect issues before they happen unless we have some form of monitoring in place to focus our attention.

Some of the things to watch out for are:

  • Decision tasks not running
  • Pending backlogs growing at an unusual rate
  • Services not reporting aliveness checks
  • API call response times/statuses

To monitor these queries and get alerted, a combination of services will be used. First, a service will be implemented that will query Influx and send pulse messages when abnormalities are detected. Once the pulse messages are being published, a service can receive those and act upon them. In this phase that will be handled by a bot that can post informational messages within a channel.
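As a rough sketch of that first phase, and under the assumption that the relevant metrics live in an InfluxDB measurement (the measurement name, field and threshold below are invented), such a check might look like this; the real service would publish a pulse message rather than print.

from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="taskcluster")
result = client.query(
    "SELECT count(value) FROM pending_tasks WHERE time > now() - 10m")
points = list(result.get_points())
pending = points[0]["count"] if points else 0

if pending > 5000:  # arbitrary threshold for illustration
    # Here the real service would publish to Pulse; we just log the anomaly.
    print("ALERT: pending backlog growing at an unusual rate: %d" % pending)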

Phase 2 (Q4/2015) - Services Monitoring

TaskCluster currently has a status page that is useful for getting a high-level overview of the health of the various platform services we deploy in Heroku. Unfortunately this does not give a clear picture of where a problem might reside.

In Q4 work will be done to enhance this status page to include an overall TaskCluster status based on some heuristics of all TaskCluster services, as well as services they depend on.

Not only will this give a clearer picture of the overall health, but also will allow one to see the individual services that might be degrading TaskCluster as a whole.

The hope with these changes is to alert people to issues with the various services, as well as to empower them to know where they should direct their efforts to resolve them.

Phase 3 (TBD) - Log Alerting

This phase is an extension to things we have already been doing with Papertrail. Some of our components follow the convention of prefixing log statements with “[alert-operator]” for events that are exceptional and should be reported. In this phase the way that we log these should make use of a standard logging library used across all components. The events that we find useful should be configured and documented for those to discover the various events we are concerned about.

Also in this phase alternative logging vendors should be evaluated. One of the downsides of Papertrail currently is that it cannot parse the actual messages we are sending and allow us to alert based on information within the message (such as alert when a value logged is > N).
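For a flavour of what value-based alerting could look like once the messages are parseable (the log format and the threshold here are invented; only the “[alert-operator]” prefix is an existing convention):

import re

THRESHOLD = 300  # e.g. seconds of lag; made up for illustration
line = "[alert-operator] queue lag exceeded: lag=412"

match = re.search(r"\[alert-operator\].*?lag=(\d+)", line)
if match and int(match.group(1)) > THRESHOLD:
    print("alert:", line)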

Phase 4 (TBD) - Task Alerting

Our tasks can get very complex with the series of operations they perform. Within a task, beyond just building a product, the task is responsible for pulling down source code, configuring the environment, and updating various other services. Sometimes it can be useful for these tasks to alert in certain situations. One case that comes to mind is complications communicating with different VCS systems, which could alert us to a larger, growing problem. Enabling tasks to provide a stream of this information, aggregated somewhere with alerts based on it, can be very informative and powerful.

Phase 5 (TBD) - Self healing TaskCluster

There are times when components of TaskCluster need to be cycled, destroyed, or scaled. An interesting idea is for TaskCluster itself, or some kind of system monitoring TaskCluster, to be able to detect these situations and respond to them. This work still needs some thought and requirements drawn up, but the idea could solve some of the headaches that result from Heroku apps that do not auto restart, or apps that need to be scaled temporarily based on demand.

October 01, 2015 12:00 AM

September 30, 2015

Pete Moore

Building Firefox for Windows™ on Try using TaskCluster

Firefox on Windows screenshot

Try them out for yourself!

Here are the try builds we have created. They were built from the official in-tree mozconfigs that we use for the builds running in Buildbot.

Set up your own Windows™ Try tasks

We are porting over all of Mozilla’s CI tasks to TaskCluster, including Windows™ builds and tests.

Currently Windows™ and OS X tasks still run on our legacy Buildbot infrastructure. This is about to change.

In this post, I am going to talk you through how I set up Firefox Desktop builds in TaskCluster on Try. In future, the TaskCluster builds should replace the existing Buildbot builds, even for releases. Getting them running on Try was the first in a long line of many steps.

Spoiler alert: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fc4b30cc56fb

Using the right Worker

In TaskCluster, Linux tasks run in a docker container. This doesn’t work on Windows, so we needed a different strategy.

TaskCluster defines the role of a Worker as a component that is able to claim tasks from the Queue, execute them, publish artifacts, and report status back to the Queue.

For Linux, we have the Docker Worker. This is the component that takes care of executing Linux tasks inside a docker container. Since everything takes place in a container, consecutive tasks cannot interfere with each other, and you are guaranteed a clean environment.

This year I have been working on the Generic Worker. This takes care of running TaskCluster tasks on other platforms.

For Windows, we have a different isolation strategy: since we cannot yet easily run inside a container, the Generic Worker will create a new Windows user for each task it runs.

This user will have its own home directory, and will not have privileged access to the host OS. This means it should not be able to make any persistent changes to the host OS that will outlive the lifetime of the task. The user is only able to affect HKEY_CURRENT_USER registry settings and write to its home folder, both of which are purged after task completion.

In other words, although not running in a container, the Generic Worker offers isolation to TaskCluster tasks by virtue of running each task as a different, custom created OS user with limited privileges.

Creating a Worker Type

TaskCluster considers a Worker Type as an entity which belongs to a Provisioner, and represents a host environment and hardware context for running one or more Workers. This is the Worker Type that I set up:

{
  "workerType": "win2012r2",
  "minCapacity": 0,
  "maxCapacity": 4,
  "scalingRatio": 0,
  "minPrice": 0.5,
  "maxPrice": 2,
  "canUseOndemand": false,
  "canUseSpot": true,
  "instanceTypes": [
    {
      "instanceType": "m3.2xlarge",
      "capacity": 1,
      "utility": 1,
      "secrets": {},
      "scopes": [],
      "userData": {},
      "launchSpec": {}
    }
  ],
  "regions": [
    {
      "region": "us-west-2",
      "secrets": {},
      "scopes": [],
      "userData": {},
      "launchSpec": {
        "ImageId": "ami-db657feb"
      }
    }
  ],
  "lastModified": "2015-09-30T10:15:30.349Z",
  "userData": {},
  "launchSpec": {
    "SecurityGroups": [
      "rdp-only"
    ]
  },
  "secrets": {},
  "scopes": [
    "*"
  ]
}

Not everybody has permission to create worker types - but then again, you only really need to do this if you are:

  • using Windows (or anything else non-linux)
  • not able to use an existing worker type

If you would like to create a new Worker Type, please contact the taskcluster team on irc.mozilla.org in #taskcluster channel.

The Worker Type above boils down to some AWS hardware specs, and an ImageId ami-db657feb. But where did this come from?

Generating the AMI for the Worker Type

It is a Windows 2012 R2 AMI, and it was generated with this code checked in to the try branch. This is not automatically run, but is checked in for reference purposes.

Here is the code. The first is a script that creates the AMI:

#!/bin/bash -exv

# cd into directory containing script...
cd "$(dirname "${0}")"

# generate a random slugid for aws client token...
# you need either go installed (https://golang.org/) and $GOPATH configured to run this,
# or alternatively download the 'slug' binary; see
# http://taskcluster.github.io/slugid-go/#installing-command-line-tool
go get github.com/taskcluster/slugid-go/slug
SLUGID=$("${GOPATH}/bin/slug")

# aws cli docs lie, they say userdata must be base64 encoded, but cli encodes for you, so just cat it...
USER_DATA="$(cat aws_userdata)"

# create base ami, and apply user-data
# filter output, to get INSTANCE_ID
# N.B.: ami-4dbcb67d referenced below is *the* Windows 2012 Server R2 ami offered by Amazon in us-west-2 - it is nothing we have made
# note, you'll need aws tool installed, access to the taskcluster AWS account, and your own private key file
INSTANCE_ID="$(aws --region us-west-2 ec2 run-instances --image-id ami-4dbcb67d --key-name pmoore-oregan-us-west-2 --security-groups "RDP only" --user-data "${USER_DATA}" --instance-type c4.2xlarge --block-device-mappings DeviceName=/dev/sda1,Ebs='{VolumeSize=75,DeleteOnTermination=true,VolumeType=gp2}' --instance-initiated-shutdown-behavior terminate --client-token "${SLUGID}" | sed -n 's/^ *"InstanceId": "\(.*\)", */\1/p')"

# sleep an hour, the installs take forever...
sleep 3600

# now capture the AMI - feel free to change the tags
IMAGE_ID="$(aws --region us-west-2 ec2 create-image --instance-id "${INSTANCE_ID}" --name "win2012r2 mozillabuild pmoore version ${SLUGID}" --description "firefox desktop builds on windows - taskcluster worker - version ${SLUGID}" | sed -n 's/^ *"ImageId": *"\(.*\)" *$/\1/p')"

# TODO: now update worker type...
# You must update the existing win2012r2 worker type with the new ami id generated ($IMAGE_ID var above)
# At the moment this is a manual step! It can be automated following the docs:
# http://docs.taskcluster.net/aws-provisioner/api-docs/#workerType
# http://docs.taskcluster.net/aws-provisioner/api-docs/#updateWorkerType

echo "Worker type ami to be used: '${IMAGE_ID}' - don't forget to update https://tools.taskcluster.net/aws-provisioner/#win2012r2/edit"' !!!'

This script works by exploiting the fact that when you spawn a Windows instance in AWS, using one of the AMIs that Amazon provides, you can include a Powershell snippet for additional setup. This gets executed automatically when you spawn the instance.

So we simply spawn an instance, passing through this powershell snippet, and then wait. A LONG time (an hour). And then we snapshot the image, and we have our new AMI. Simple!

Here is the Powershell snippet that it uses:

<powershell>

# needed for making http requests
$client = New-Object system.net.WebClient
$shell = new-object -com shell.application

# utility function to download a zip file and extract it
function Expand-ZIPFile($file, $destination, $url)
{
    $client.DownloadFile($url, $file)
    $zip = $shell.NameSpace($file)
    foreach($item in $zip.items())
    {
        $shell.Namespace($destination).copyhere($item)
    }
}

# allow powershell scripts to run
Set-ExecutionPolicy Unrestricted -Force -Scope Process

# install chocolatey package manager
Invoke-Expression ($client.DownloadString('https://chocolatey.org/install.ps1'))

# download mozilla-build installer
$client.DownloadFile("https://api.pub.build.mozilla.org/tooltool/sha512/03b4ca2bebede21a29f739165030bfc7058a461ffe38113452e976193e382d3ba6df8a48ac843b70429e23481e6327f43c86ffd88e4ce16263d072ef7e14e692", "C:\MozillaBuildSetup-2.0.0.exe")

# run mozilla-build installer in silent (/S) mode
$p = Start-Process "C:\MozillaBuildSetup-2.0.0.exe" -ArgumentList "/S" -wait -NoNewWindow -PassThru -RedirectStandardOutput "C:\MozillaBuild-2.0.0_install.log" -RedirectStandardError "C:\MozillaBuild-2.0.0_install.err"

# install Windows SDK 8.1
choco install -y windows-sdk-8.1

# install Visual Studio community edition 2013
choco install -y visualstudiocommunity2013
# $client.DownloadFile("https://go.microsoft.com/fwlink/?LinkId=532495&clcid=0x409", "C:\vs_community.exe")

# install June 2010 DirectX SDK for compatibility with Win XP
$client.DownloadFile("http://download.microsoft.com/download/A/E/7/AE743F1F-632B-4809-87A9-AA1BB3458E31/DXSDK_Jun10.exe", "C:\DXSDK_Jun10.exe")

# prerequisite for June 2010 DirectX SDK is to install ".NET Framework 3.5 (includes .NET 2.0 and 3.0)"
Install-WindowsFeature NET-Framework-Core -Restart

# now run DirectX SDK installer
$p = Start-Process "C:\DXSDK_Jun10.exe" -ArgumentList "/U" -wait -NoNewWindow -PassThru -RedirectStandardOutput C:\directx_sdk_install.log -RedirectStandardError C:\directx_sdk_install.err

# install PSTools
md "C:\PSTools"
Expand-ZIPFile -File "C:\PSTools\PSTools.zip" -Destination "C:\PSTools" -Url "https://download.sysinternals.com/files/PSTools.zip"

# install nssm
Expand-ZIPFile -File "C:\nssm-2.24.zip" -Destination "C:\" -Url "http://www.nssm.cc/release/nssm-2.24.zip"

# download generic-worker
md "C:\generic-worker"
$client.DownloadFile("https://github.com/taskcluster/generic-worker/releases/download/v1.0.12/generic-worker-windows-amd64.exe", "C:\generic-worker\generic-worker.exe")

# enable DEBUG logs for generic-worker install
$env:DEBUG = "*"

# install generic-worker
$p = Start-Process "C:\generic-worker\generic-worker.exe" -ArgumentList "install --config C:\\generic-worker\\generic-worker.config" -wait -NoNewWindow -PassThru -RedirectStandardOutput C:\generic-worker\install.log -RedirectStandardError C:\generic-worker\install.err

# add extra config needed
$config = [System.Convert]::FromBase64String("UEsDBAoAAAAAAA2hN0cIOIW2JwAAACcAAAAJAAAAZ2FwaS5kYXRhQUl6YVN5RC1zLW1YTDRtQnpGN0tNUmtoVENJYkcyUktuUkdYekpjUEsDBAoAAAAAACehN0cVjoCGIAAAACAAAAAVAAAAY3Jhc2gtc3RhdHMtYXBpLnRva2VuODhmZjU3ZDcxMmFlNDVkYmJlNDU3NDQ1NWZjYmNjM2VQSwMECgAAAAAANKE3RxYFa6ViAAAAYgAAABQAAABnb29nbGUtb2F1dGgtYXBpLmtleTE0NzkzNTM0MzU4Mi1qZmwwZTBwc2M3a2gxbXV0MW5mdGI3ZGUwZjFoMHJvMC5hcHBzLmdvb2dsZXVzZXJjb250ZW50LmNvbSBLdEhDRkNjMDlsdEN5SkNqQ3dIN1pKd0cKUEsDBAoAAAAAAEShN0ctdLepZAAAAGQAAAAYAAAAZ29vZ2xlLW9hdXRoLWFwaS5rZXlfYmFr77u/MTQ3OTM1MzQzNTgyLWpmbDBlMHBzYzdraDFtdXQxbmZ0YjdkZTBmMWgwcm8wLmFwcHMuZ29vZ2xldXNlcmNvbnRlbnQuY29tIEt0SENGQ2MwOWx0Q3lKQ2pDd0g3Wkp3R1BLAwQKAAAAAABYoTdHJ3EEFiQAAAAkAAAADwAAAG1vemlsbGEtYXBpLmtleTNiNGQyN2RkLTcwM2QtNDA5NC04Mzk4LTRkZTJjNzYzNTA1YVBLAwQKAAAAAABkoTdHMi/H2yQAAAAkAAAAHgAAAG1vemlsbGEtZGVza3RvcC1nZW9sb2MtYXBpLmtleTdlNDBmNjhjLTc5MzgtNGM1ZC05Zjk1LWU2MTY0N2MyMTNlYlBLAwQKAAAAAABxoTdHJ3EEFiQAAAAkAAAAHQAAAG1vemlsbGEtZmVubmVjLWdlb2xvYy1hcGkua2V5M2I0ZDI3ZGQtNzAzZC00MDk0LTgzOTgtNGRlMmM3NjM1MDVhUEsDBBQAAAAIAHyhN0fa715hagAAAHMAAAANAAAAcmVsZW5nYXBpLnRva0ut9MpIck/O9M/08gyt8jT0y/Sy1Eut9CpINvYFCVZGhnhm+jh7Faa4Z4P4Br4QvkFqhCOIX56ca5CZFqiXU5VoWeaSm20S6eblE+rpXJDiFxoRVBphnFFZUmrpkphd7m4aVWXsFxQeCABQSwECHgMKAAAAAAANoTdHCDiFticAAAAnAAAACQAAAAAAAAABAAAApIEAAAAAZ2FwaS5kYXRhUEsBAh4DCgAAAAAAJ6E3RxWOgIYgAAAAIAAAABUAAAAAAAAAAQAAAKSBTgAAAGNyYXNoLXN0YXRzLWFwaS50b2tlblBLAQIeAwoAAAAAADShN0cWBWulYgAAAGIAAAAUAAAAAAAAAAEAAACkgaEAAABnb29nbGUtb2F1dGgtYXBpLmtleVBLAQIeAwoAAAAAAEShN0ctdLepZAAAAGQAAAAYAAAAAAAAAAEAAACkgTUBAABnb29nbGUtb2F1dGgtYXBpLmtleV9iYWtQSwECHgMKAAAAAABYoTdHJ3EEFiQAAAAkAAAADwAAAAAAAAABAAAApIHPAQAAbW96aWxsYS1hcGkua2V5UEsBAh4DCgAAAAAAZKE3RzIvx9skAAAAJAAAAB4AAAAAAAAAAQAAAKSBIAIAAG1vemlsbGEtZGVza3RvcC1nZW9sb2MtYXBpLmtleVBLAQIeAwoAAAAAAHGhN0cncQQWJAAAACQAAAAdAAAAAAAAAAEAAACkgYACAABtb3ppbGxhLWZlbm5lYy1nZW9sb2MtYXBpLmtleVBLAQIeAxQAAAAIAHyhN0fa715hagAAAHMAAAANAAAAAAAAAAEAAACkgd8CAAByZWxlbmdhcGkudG9rUEsFBgAAAAAIAAgAEQIAAHQDAAAAAA==")
md "C:\builds"
Set-Content -Path "C:\builds\config.zip" -Value $config -Encoding Byte
$zip = $shell.NameSpace("C:\builds\config.zip")
foreach($item in $zip.items())
{
    $shell.Namespace("C:\builds").copyhere($item)
}
rm "C:\builds\config.zip"

# initial clone of mozilla-central
$p = Start-Process "C:\mozilla-build\python\python.exe" -ArgumentList "C:\mozilla-build\python\Scripts\hg clone -u null https://hg.mozilla.org/mozilla-central C:\gecko" -wait -NoNewWindow -PassThru -RedirectStandardOutput "C:\hg_initial_clone.log" -RedirectStandardError "C:\hg_initial_clone.err"

</powershell>

Hopefully this Powershell script is quite self-explanatory. It installs the required build tool chains for building Firefox Desktop, and then installs the parts it needs for running the Generic Worker on this instance. It sets up some additional config that is needed by the build process, and then takes an initial clone of mozilla-central, as an optimisation, so that future jobs only need to pull changes since the image was created.

The caching strategy is to have a clone of mozilla-central live under C:\gecko, which is updated with an hg pull from mozilla-central each time a job runs. Then when a task needs to pull from try, it is only ever a few commits behind, and should pull updates very quickly.

Defining Tasks

Once we have our AMI created, and we’ve published our Worker Type, we need to submit tasks to get the Provisioner to spawn instances in AWS, and execute our tasks.

The next piece of the puzzle is working out how to get these jobs added to Try. Again, luckily for us, this is just a matter of in-tree config.

For this, most of the magic exists in testing/taskcluster/tasks/builds/firefox_windows_base.yml:

$inherits:
  from: 'tasks/windows_build.yml'
  variables:
    build_product: 'firefox'

task:
  metadata:
    name: "[TC] Firefox {{arch}} ({{build_type}})"
    description: Firefox {{arch}} {{build_type}}

  payload:
    env:
      ExtensionSdkDir: "C:\\Program Files (x86)\\Microsoft SDKs\\Windows\\v8.1\\ExtensionSDKs"
      Framework40Version: "v4.0"
      FrameworkDir: "C:\\Windows\\Microsoft.NET\\Framework64"
      FrameworkDIR64: "C:\\Windows\\Microsoft.NET\\Framework64"
      FrameworkVersion: "v4.0.30319"
      FrameworkVersion64: "v4.0.30319"
      FSHARPINSTALLDIR: "C:\\Program Files (x86)\\Microsoft SDKs\\F#\\3.1\\Framework\\v4.0\\"
      INCLUDE: "C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\INCLUDE;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\ATLMFC\\INCLUDE;C:\\Program Files (x86)\\Windows Kits\\8.1\\include\\shared;C:\\Program Files (x86)\\Windows Kits\\8.1\\include\\um;C:\\Program Files (x86)\\Windows Kits\\8.1\\include\\winrt;"
      MOZBUILD_STATE_PATH: "C:\\Users\\Administrator\\.mozbuild"
      MOZ_MSVCVERSION: "12"
      MOZ_MSVCYEAR: "2013"
      MOZ_TOOLS: "C:\\mozilla-build\\moztools-x64"
      MSVCKEY: "HKLM\\SOFTWARE\\Wow6432Node\\Microsoft\\VisualStudio\\12.0\\Setup\\VC"
      SDKDIR: "C:\\Program Files (x86)\\Windows Kits\\8.1\\"
      SDKMINORVER: "1"
      SDKPRODUCTKEY: "HKLM\\SOFTWARE\\Microsoft\\Windows Kits\\Installed Products"
      SDKROOTKEY: "HKLM\\SOFTWARE\\Microsoft\\Windows Kits\\Installed Roots"
      SDKVER: "8"
      VCDIR: "C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\"
      VCINSTALLDIR: "C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\"
      VisualStudioVersion: "12.0"
      VSINSTALLDIR: "C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\"
      WIN64: "1"
      WIN81SDKKEY: "{5247E16E-BCF8-95AB-1653-B3F8FBF8B3F1}"
      WINCURVERKEY: "HKLM\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion"
      WindowsSdkDir: "C:\\Program Files (x86)\\Windows Kits\\8.1\\"
      WindowsSDK_ExecutablePath_x64: "C:\\Program Files (x86)\\Microsoft SDKs\\Windows\\v8.1A\\bin\\NETFX 4.5.1 Tools\\x64\\"
      WindowsSDK_ExecutablePath_x86: "C:\\Program Files (x86)\\Microsoft SDKs\\Windows\\v8.1A\\bin\\NETFX 4.5.1 Tools\\"
      MACHTYPE: "i686-pc-msys"
      MAKE_MODE: "unix"
      MOZBUILDDIR: "C:\\mozilla-build"
      MOZILLABUILD: "C:\\mozilla-build"
      MOZ_AUTOMATION: "1"
      MOZ_BUILD_DATE: "19770819000000"
      MOZ_CRASHREPORTER_NO_REPORT: "1"
      MSYSTEM: "MINGW32"

    command:
      - "time /t && set"
      - "time /t && hg -R C:\\gecko pull"
      - "time /t && hg clone C:\\gecko src"
      - "time /t && mkdir public\\build"
      - "time /t && set UPLOAD_HOST=localhost"
      - "time /t && set UPLOAD_PATH=%CD%\\public\\build"
      - "time /t && cd src"
      - "time /t && hg pull -r %GECKO_HEAD_REV% -u %GECKO_HEAD_REPOSITORY%"
      - "time /t && set MOZCONFIG=%CD%\\{{mozconfig}}"
      - "time /t && set SRCSRV_ROOT=%GECKO_HEAD_REPOSITORY%"
      - "time /t && C:\\mozilla-build\\msys\\bin\\bash --login %CD%\\mach build"

    artifacts:

      # In the next few days I plan to provide support for directory artifacts,
      # so this explicit list will no longer be needed, and you can specify the
      # following:
      # -
      #   type: "directory"
      #   path: "public\\build"
      #   expires: '{{#from_now}}1 year{{/from_now}}'
      #
      #  This will be done in early October 2015. See
      #  https://bugzilla.mozilla.org/show_bug.cgi?id=1209901

      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.checksums"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.common.tests.zip"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.cppunittest.tests.zip"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.crashreporter-symbols.zip"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.json"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.mochitest.tests.zip"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.mozinfo.json"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.reftest.tests.zip"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.talos.tests.zip"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.txt"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.web-platform.tests.zip"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.xpcshell.tests.zip"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\firefox-43.0a1.en-US.{{arch}}.zip"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\host\\bin\\mar.exe"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\host\\bin\\mbsdiff.exe"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\install\\sea\\firefox-43.0a1.en-US.{{arch}}.installer.exe"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\jsshell-{{arch}}.zip"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\test_packages.json"
        expires: '{{#from_now}}1 year{{/from_now}}'
      -
        type: "file"
        path: "public\\build\\{{arch}}\\xpi\\firefox-43.0a1.en-US.langpack.xpi"
        expires: '{{#from_now}}1 year{{/from_now}}'

  extra:
    treeherderEnv:
      - production
      - staging
    treeherder:
      groupSymbol: "tc"
      groupName: Submitted by taskcluster
      machine:
        # from https://github.com/mozilla/treeherder/blob/9263d8432642c2ca9f68b301250af0ffbec27d83/ui/js/values.js#L3
        platform: {{platform}}

    # Rather then enforcing particular conventions we require that all build
    # tasks provide the "build" extra field to specify where the build and tests
    # files are located.
    locations:
      build: "src/{{object_dir}}/dist/bin/firefox.exe"
      tests: "src/{{object_dir}}/all-tests.json"

Reading through this, you can see (with the exception of a few parameters: {{object_dir}}, {{platform}}, {{arch}}, {{build_type}}, {{mozconfig}}) the full set of steps that a Windows build of Firefox Desktop requires on the Worker Type we created above. In other words, the full system setup is in the Worker Type definition, and the full set of task steps is in this Task Definition - so now you know as much as I do about how to build Firefox Desktop on Windows. It all exists in-tree, and is transparent to developers.

So where do these parameters come from? Well, this is just the base config - we define opt and debug builds for win32 and win64 architectures, with a separate config for each under testing/taskcluster/tasks/builds.

Here I will illustrate just one of them, the win32 debug build config:

$inherits:
  from: 'tasks/builds/firefox_windows_base.yml'
  variables:
    build_type: 'debug'
    arch: 'win32'
    platform: 'windowsxp'
    object_dir: 'obj-i686-pc-mingw32'
    mozconfig: 'browser\\config\\mozconfigs\\win32\\debug'
task:
  extra:
    treeherder:
      collection:
        debug: true
  payload:
    env:
      CommandPromptType: "Cross"
      LIB: "C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\LIB;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\ATLMFC\\LIB;C:\\Program Files (x86)\\Windows Kits\\8.1\\lib\\winv6.3\\um\\x86;"
      LIBPATH: "C:\\Windows\\Microsoft.NET\\Framework64\\v4.0.30319;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\LIB;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\ATLMFC\\LIB;C:\\Program Files (x86)\\Windows Kits\\8.1\\References\\CommonConfiguration\\Neutral;C:\\Program Files (x86)\\Microsoft SDKs\\Windows\\v8.1\\ExtensionSDKs\\Microsoft.VCLibs\\12.0\\References\\CommonConfiguration\\neutral;"
      MOZ_MSVCBITS: "32"
      Path: "C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\Common7\\IDE\\CommonExtensions\\Microsoft\\TestWindow;C:\\Program Files (x86)\\MSBuild\\12.0\\bin\\amd64;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\BIN\\amd64_x86;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\BIN\\amd64;C:\\Windows\\Microsoft.NET\\Framework64\\v4.0.30319;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\VCPackages;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\Common7\\IDE;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\Common7\\Tools;C:\\Program Files (x86)\\HTML Help Workshop;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\Team Tools\\Performance Tools\\x64;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\Team Tools\\Performance Tools;C:\\Program Files (x86)\\Windows Kits\\8.1\\bin\\x64;C:\\Program Files (x86)\\Windows Kits\\8.1\\bin\\x86;C:\\Program Files (x86)\\Microsoft SDKs\\Windows\\v8.1A\\bin\\NETFX 4.5.1 Tools\\x64\\;C:\\Windows\\System32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\mozilla-build\\moztools-x64\\bin;C:\\mozilla-build\\7zip;C:\\mozilla-build\\info-zip;C:\\mozilla-build\\kdiff3;C:\\mozilla-build\\mozmake;C:\\mozilla-build\\nsis-3.0b1;C:\\mozilla-build\\nsis-2.46u;C:\\mozilla-build\\python;C:\\mozilla-build\\python\\Scripts;C:\\mozilla-build\\upx391w;C:\\mozilla-build\\wget;C:\\mozilla-build\\yasm"
      Platform: "X86"
      PreferredToolArchitecture: "x64"
      TOOLCHAIN: "64-bit cross-compile"

This file above has defined those parameters, and provided some more task specific config too, which overlays the base config we saw before.

But wait a minute… how do these tasks know to use the win2012r2 worker type we created? The answer to that is that testing/taskcluster/tasks/builds/firefox_windows_base.yml inherits from testing/taskcluster/tasks/windows_build.yml:

# This is the base windows task which contains the common values all windows builds must
# provide.
---
$inherits:
  from: 'tasks/build.yml'

task:
  workerType: win2012r2

Incidentally, this then inherits in turn from the root yaml file for all gecko builds (across all gecko platforms):

# This is the "base" task which contains the common values all builds must
# provide.
---
taskId: {{build_slugid}}

task:
  created: '{{now}}'
  deadline: '{{#from_now}}24 hours{{/from_now}}'
  metadata:
    source: http://todo.com/soon
    owner: mozilla-taskcluster-maintenance@mozilla.com

  tags:
    createdForUser: {{owner}}

  provisionerId: aws-provisioner-v1
  schedulerId: task-graph-scheduler

  routes:
    - 'index.gecko.v1.{{project}}.revision.linux.{{head_rev}}.{{build_name}}.{{build_type}}'
    - 'index.gecko.v1.{{project}}.latest.linux.{{build_name}}.{{build_type}}'
  scopes:
    - 'queue:define-task:aws-provisioner-v1/build-c4-2xlarge'
    - 'queue:create-task:aws-provisioner-v1/build-c4-2xlarge'


  payload:

    # Two hours is long but covers edge cases (and matches bb based infra)
    maxRunTime: 7200

    env:
      # Common environment variables for checking out gecko
      GECKO_BASE_REPOSITORY: '{{base_repository}}'
      GECKO_HEAD_REPOSITORY: '{{head_repository}}'
      GECKO_HEAD_REV: '{{head_rev}}'
      GECKO_HEAD_REF: '{{head_ref}}'
      TOOLTOOL_REPO: 'https://git.mozilla.org/build/tooltool.git'
      TOOLTOOL_REV: 'master'

  extra:
    build_product: '{{build_product}}'
    index:
      rank: {{pushlog_id}}
    treeherder:
      groupSymbol: tc
      groupName: Submitted by taskcluster
      symbol: B

So the complete inheritence chain looks like this:

tasks/build.yml
  tasks/windows_build.yml
    tasks/builds/firefox_windows_base.yml
      tasks/builds/firefox_win32_opt.yml
      tasks/builds/firefox_win32_debug.yml
      tasks/builds/firefox_win64_opt.yml
      tasks/builds/firefox_win64_debug.yml

Getting the new tasks added to Try pushes

This involved adding win32 and win64 as build platforms in testing/taskcluster/tasks/branches/base_job_flags.yml (previously taskcluster was not running any tasks for these platforms):

---
# List of all possible flags for each category of tests used in the case where
# "all" is specified.
flags:
  aliases:
    mochitests: mochitest

  builds:
    - emulator
    - emulator-jb
    - emulator-kk
    - emulator-x86-kk
    .....
    .....
    .....
    - android-api-11
    - linux64
    - macosx64
    - win32   ########## <---- added here
    - win64   ########## <---- added here

  tests:
    - cppunit
    - crashtest
    - crashtest-ipc
    - gaia-build
    .....
    .....
    .....

We then associate the new task definitions we just created with these new build platforms. This is done in testing/taskcluster/tasks/branches/try/job_flags.yml:

---
# For complete sample of all build and test jobs,
# see <gecko>/testing/taskcluster/tasks/job_flags.yml

$inherits:
  from: tasks/branches/base_job_flags.yml

# Flags specific to this branch
flags:
  post-build:
    - upload-symbols

builds:
  win32:
    platforms:
      - win32
    types:
      opt:
        task: tasks/builds/firefox_win32_opt.yml
      debug:
        task: tasks/builds/firefox_win32_debug.yml
  win64:
    platforms:
      - win64
    types:
      opt:
        task: tasks/builds/firefox_win64_opt.yml
      debug:
        task: tasks/builds/firefox_win64_debug.yml
  linux64_gecko:
    platforms:
      - b2g
    types:
      opt:
    .....
    .....
    .....

Summary

The above hopefully has given you a taste for what you can do yourself in TaskCluster, and specifically in Gecko, regarding setting up new jobs. By following this guide, you too should be able to schedule Windows jobs in Taskcluster, including try jobs for Gecko projects.

For more information about TaskCluster, see docs.taskcluster.net.

September 30, 2015 02:08 PM

John Ford

Taskcluster Component Loader

Taskcluster is the new platform for building Automation at Mozilla.  One of the coolest design decisions is that it's composed of a bunch of limited scope, interchangeable services that have well defined and enforced apis.  Examples of services are the Queue, Scheduler, Provisioner and Index.  In practice, the server-side components roughly map to a Heroku app.  Each app can have one or more web worker processes and zero or more background workers.

Since we're building our services with the same base libraries we end up having a lot of duplicated glue code.  During a set of meetings in Berlin, Jonas and I were lamenting about how much copied, pasted and modified boilerplate was in our projects.

Between the API definition file and the command line to launch a program invariably sits a bin/server.js file for each service.  This script basically loads up our config system, loads our Azure Entity library, loads a Pulse publisher, a JSON Schema validator and a Taskcluster-base App.  Each background worker has its own bin/something.js which basically has a very similar loop.  Services with unit tests have a test/helper.js file which initializes the various components for testing.  Furthermore, we might have things initialize inside of a given before() or beforeEach().

The problem with having so much boilerplate is twofold.  First, each time we modify one service's boilerplate, we are now adding maintenance complexity and risk because of that subtle difference to the other services.  We'd eventually end up with hundreds of glue files which do roughly the same thing, but accomplish it completely differently depending on which service it's in.  The second problem is that within a single project, we might load the same component ten ways in ten places, including in tests.  Having a single codepath that we can test ensures that we're always initializing the components properly.

During a little downtime between sessions, Jonas and I came up with the idea to have a standard component loading system for taskcluster services.  Being able to rapidly iterate and discuss in person made the design go very smoothly and in the end, we were able to design something we were both happy with in about an hour or so.

The design we took is to have two 'directories' of components.  One is the project wide set of components which has all the logic about how to build the complex things like validators and entities.  These components can optionally have dependencies.  In order to support different values for different environments, we force the main directory to declare which 'virtual dependencies' it requires.  They are declared as a list of strings.  The second level of component directory is where these 'virtual dependencies' have their value.

Both Virtual and Concrete dependencies can either be 'flat' values or objects.  If a dependency is a string, number, function, Promise or an object without a create property, we just give that exact value back as a resolved Promise.  If the component is an object with a create property, we initialize the dependencies specified by the 'requires' list property, pass those values as properties on an object to the function at the 'create' property, and store that function's return value as a resolved Promise.  Only non-flat components can themselves depend on other components.

Using code is a good way to show how this loader works:

// lib/components.js

let loader = require('taskcluster-base').loader;
let fakeEntityLibrary = require('fake');

module.exports = loader({
  fakeEntity: {
    requires: ['connectionString'],
    setup: async deps => {
      let conStr = await deps.connectionString;
      return fakeEntityLibrary.create(conStr);
    },
  },
}, ['connectionString']);
 
In this file, we're building a really simple component directory which only contains a contrived 'fakeEntity'.  This component depends on having a connection string in order to be fully configured.  Since we want to use this code in production, development and testing, we don't want to bake configuration into this file, so we force whatever uses this directory to give us a way to configure the connection string.

// bin/server.js
let config = require('taskcluster-base').config('development');
let loader = require('../lib/components.js');

let load = loader({
  connectionString: config.entity.connectionString,
});

let configuredFakeEntity = await load('fakeEntity');
 
In this file, we're providing a simple directory that satisfies the 'virtual' dependencies which we know need to be fulfilled before initialization can happen.

Since we're creating a dependency tree, we want to avoid having cyclic dependencies.  I've implemented a cycle checker which ensures that you cannot configure a cyclical dependency.  It doesn't rely on the call stack being exceeded from infinite recursion either!
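
To give a flavour of what such a check can look like, here is a minimal sketch (my own illustration, not the actual taskcluster-base code) that walks the 'requires' graph with an explicit stack instead of recursion:

// Sketch: detect cycles in a component directory by walking 'requires' edges
// with an explicit stack, so deep graphs never blow the call stack.
function assertNoCycles(components) {
  let deps = name => {
    let component = components[name];
    return (component && component.requires) ? component.requires.slice() : [];
  };
  let visiting = new Set();   // names on the path we are currently exploring
  let done = new Set();       // names already proven to be cycle-free
  for (let name of Object.keys(components)) {
    let stack = [[name, deps(name)]];
    visiting.add(name);
    while (stack.length > 0) {
      let [current, remaining] = stack[stack.length - 1];
      let next = remaining.shift();
      if (!next) {                       // all dependencies of `current` checked
        stack.pop();
        visiting.delete(current);
        done.add(current);
        continue;
      }
      if (done.has(next)) continue;      // already known to be safe
      if (visiting.has(next)) {
        throw new Error('Cyclic dependency involving: ' + next);
      }
      visiting.add(next);
      stack.push([next, deps(next)]);
    }
  }
}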

This is far from being the only thing that we figured out improvements for during this chat.  Two other problems that we were able to talk through were splitting out taskcluster-base and having a background worker framework.

Currently, taskcluster-base is a monolithic library.  If you want our Entities at version 0.8.4, you must take our config at 0.8.4 and our rest system at 0.8.4.  This is great because it forces services to move all together.  This is also awful because sometimes we might need a new stats library but can't afford the time to upgrade a bunch of Entities.  It also means that if someone wants to hack on our stats module, they'll need to learn how to get our Entities unit tests to work to get a passing test run on their stats change.

Our plan here is to make taskcluster-base a 'meta-package' which depends on a set of taskcluster components that we support working together.  Each of the libraries (entities, stats, config, api) will be split out into their own packages using git filter-branch to maintain history.  This is just a bit of simple legwork to ensure that the split goes smoothly.

The other thing we decided on was a standardized background looping framework.  A lot of background workers follow the pattern "do this thing, wait one minute, do this thing again".  Instead of each service implementing this in its own special way for each background worker, what we'd really like is to have a library which does all the looping magic itself.  We can even have nice things like a watchdog timer to ensure that the loop doesn't get stuck.
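
Something along these lines is what we have in mind - purely a sketch of the idea, not an existing taskcluster library:

// Sketch: run `task`, wait `every` ms, run it again - and bail out of the
// process entirely if a single iteration exceeds the `watchdog` timeout.
function loop({task, every, watchdog}) {
  let iterate = async () => {
    let timer = setTimeout(() => {
      console.error('iteration exceeded the watchdog timeout, exiting');
      process.exit(1);
    }, watchdog);
    try {
      await task();
    } catch (err) {
      console.error('iteration failed:', err);
    } finally {
      clearTimeout(timer);
      setTimeout(iterate, every);
    }
  };
  iterate();
}

// e.g. poll something once a minute, giving up if an iteration takes > 5 minutes
loop({task: async () => { /* do the thing */ }, every: 60 * 1000, watchdog: 5 * 60 * 1000});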

Once the PR has landed for the loader, I'm going to be converting the provisioner to use this new loader.  This is a part of a new effort to make Taskcluster components easy to implement.  Once a bunch of these improvements have landed, I intend to write up a couple blog posts on how you can write your own Taskcluster service.

September 30, 2015 12:56 PM

September 29, 2015

Ehsan Akhgari

My experience adding a new build type using TaskCluster

TaskCluster is Mozilla’s task queuing, scheduling and execution service.  It allows the user to schedule a DAG representing a task graph that describes some tasks and their dependencies, and how to execute them, and it schedules them to run in the needed order on a number of slave machines.

As of a while ago, some of the continuous integration tasks have been running on TaskCluster, and I recently set out to enable static analysis optimized builds on Linux64 on top of TaskCluster.  I had previously added a similar job for debug builds on OS X in buildbot, and I am amazed at how much the experience has improved!  It is truly easy to add a new type of job now as a developer without being familiar with buildbot or anything like that.  I’m writing this post to share my experience on how I did this.

The process of scheduling jobs in TaskCluster starts with a slave downloading a specific revision of a tree, and running the ./mach taskcluster-graph command to generate a task graph definition.  This is what happens in the “gecko-decision” jobs that you can see on TreeHerder.  The mentioned task graph is computed using the task definition information in testing/taskcluster.  All of the definitions are in YAML, and I found the naming of variables relatively easy to understand.  The build definitions are located in testing/taskcluster/tasks/builds and after some poking around, I found linux64_clobber.yml.

If you look closely at that file, a lot of things are clear from the names.  Here are important things that this file defines:

  • $inherits: These files have a single inheritance structure that allows you to refactor the common functionality into “base” definitions.
  • A lot of things have “linux64” in their name.  This gave me a good starting point when I was trying to add a “linux64-st-an” (a made-up name) build by copying the existing definition.
  • payload.image contains the name of the docker image that this build runs.  This is handy to know if you want to run the build locally (yes, you can do that!).
  • It points to builds/releng_base_linux_64_builds.py which contains the actual build definition.

Looking at the build definition file, you will find the steps run in the build, whether the build should trigger unit tests or Talos jobs, the environment variables used during the build, and most importantly the mozconfig and tooltool manifest paths.  (In case you’re not familiar with Tooltool, it lets you upload your own tools to be used during the build time.  This can be new experimental toolchains, custom programs your build needs to run, which is useful for things such as performing actions on the build outputs, etc.)

This basically gave me everything I needed to define my new build type, and I did that in bug 1203390, and these builds are now visible on TreeHerder as “[Tier-2](S)” on Linux64.  This is the gist of what I came up with.

I think this is really powerful since it finally allows you to fully control what happens in a job.  For example, you can use this to create new build/test types on TreeHerder, do try pushes that test changes to the environment a job runs in, do highly custom tasks such as creating code coverage results, which requires a custom build step and custom test steps and uploading of custom artifacts!  Doing this under the old BuildBot system is unheard of.   Even if you went out of your way to learn how to do that, as I understand it, there was a maximum number of build types that we were getting close to which prevented us from adding new job types as needed!  And it was much much harder to iterate on (as I did when I was working on this on the try server bootstrapping a whole new build type!) as your changes to BuildBot configs needed to be manually deployed.

Another thing to note is that I found out all of the above pretty much by myself, and didn’t even have to learn every bit of what I encountered in the files that I copied and repurposed!  This was extremely straightforward.  I’m already on my way to add another build type (using Ted’s bleeding edge Linux to OS X cross compiling support)!  I did hit hurdles along the way but almost none of them were related to TaskCluster, and with the few ones that were, I was shooting myself in the foot and Dustin quickly helped me out.  (Thanks, Dustin!)

Another neat feature of TaskCluster is the inspector tool. In TreeHerder, you can click on a TaskCluster job, go to Job Details, and click on “Inspect Task”. You’ll see a page like this. In that tool you can do a number of neat things. One is that it shows you a “live.log” file which is the live log of what the slave is doing. This means that you can see what’s happening in close to real time, without having to wait for the whole job to finish before you can inspect the log. Another neat feature is the “Run locally” commands that show you how to run the job in a local docker container. That will allow you to reproduce the exact same environment as the ones we use on the infrastructure.

I highly encourage people to start thinking about the ways they can harness this power.  I look forward to seeing what we’ll come up with!

September 29, 2015 03:05 PM

August 13, 2015

Jonas Finnemann Jensen

Getting Started with TaskCluster APIs (Interactive Tutorials)

When we started building TaskCluster about a year and a half ago one of the primary goals was to provide a self-serve experience, so people could experiment and automate things without waiting for someone else to deploy new configuration. Greg Arndt (:garndt) recently wrote a blog post demystifying in-tree TaskCluster scheduling. The in-tree configuration allows developers to write new CI tasks to run on TaskCluster, and test these new tasks on try before landing them like any other patch.

This way of developing test and build tasks by adding in-tree configuration in a patch is very powerful, and it allows anyone with try access to experiment with configuration for much of our CI pipeline in a self-serve manner. However, not all tools are best triggered from a post-commit hook; instead, it might be preferable to have direct API access when:

  • Locating existing builds in our task index,
  • Debugging for intermittent issues by running a specific task repeatedly, and
  • Running tools for bisecting commits.

To facilitate tools like this, TaskCluster offers a series of well-documented REST APIs that can be accessed with either permanent or temporary TaskCluster credentials. We also provide client libraries for Javascript (node/browser), Python, Go and Java. However, since TaskCluster is a loosely coupled set of distributed components, it is not always trivial to figure out how to piece together the different APIs and features. To make these things more approachable I’ve started a series of interactive tutorials:

All these tutorials are interactive, featuring a runtime that will transpile your code with babel.js before running it in the browser. The runtime environment also exposes the require function from a browserify bundle containing some of my favorite npm modules, making the example editors a great place to test code snippets using taskcluster or related services.
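
As a flavour of the kind of snippet involved, here is a minimal sketch using the JavaScript client to find an indexed build and ask the queue about it - the index.findTask and queue.status calls and the example namespace are my assumptions based on the client and the gecko index routes, not something lifted from the tutorials:

let taskcluster = require('taskcluster-client');

let index = new taskcluster.Index();   // both endpoints used here are public, no credentials needed
let queue = new taskcluster.Queue();

async function latestBuildStatus(namespace) {
  let {taskId} = await index.findTask(namespace);   // look the task up in the index
  let {status} = await queue.status(taskId);        // then ask the queue what state it is in
  return {taskId, state: status.state};
}

latestBuildStatus('gecko.v1.mozilla-central.latest.linux.linux64.opt')
  .then(result => console.log(result))
  .catch(err => console.error(err));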

Happy hacking, and feel free to submit PRs for all my spelling errors at github.com/taskcluster/taskcluster-docs.

August 13, 2015 10:25 PM

August 05, 2015

Greg Arndt

Demystifying in-tree TaskCluster scheduling

Since earlier this year Firefox OS tasks have been defined in-tree and scheduled within TaskCluster. Things are in progress for porting Android and Firefox Desktop builds as well.

There are a few interactions that need to take place when scheduling tasks and reporting them to treeherder. These interactions typically are handled by an integration component named mozilla-taskcluster.

mozilla-taskcluster

tl;dr mozilla-taskcluster makes sure those nicely colored letters appear for each taskcluster task scheduled for each push on treeherder.

mozilla-taskcluster monitors the push log every few seconds for changes for a given set of gecko repositories and will create a task graph when new pushes are detected. The initial task within this graph is typically referred to as the decision task. Its responsibility is to decide what tasks should be added to the task graph for a given branch/project/repository (the names are used interchangeably in many places) using some in-tree logic.

mozilla-taskcluster is responsible for creating the resultset within Treeherder, creating the task graph with decision task, and also responsible for posting job collections to Treeherder when tasks complete.

mach taskcluster-graph

The heart of deciding what tasks will be included in the graph for a push is the ‘mach taskcluster-graph’ target. This target, when called, will read in-tree branch-specific configurations and determine what task definition files to parse and compose into a json blob that will be used to extend the taskcluster graph.

The decision for what jobs to include is based on whether it was a Try push, or a push to any other branch. For Try pushes, the commit message will be parsed and used for determining which tasks to run.

It’s worth noting that this target only prints out json. It’s the responsibility of the consumer of this to extend the task graph or use it to create an entirely new graph.

In TaskCluster, once the json is created, the worker used to complete this task has features in place to automatically extend the original task graph with the contents of this json blob as long as the original task graph has the scopes encompassing all scopes used within those additional tasks.

In-tree branch configurations (job_flags.yml)

The in-tree scheduling for a given branch is specified in a job_flags.yml located at <gecko>/testing/taskcluster/tasks/branches/<branch>/job_flags.yml. This is what the mach target will use for determining what should be scheduled (along with some logic within the mach target itself).

These configurations are composed of keys that define the build/tests that are enabled for that given branch as well as their relationships.

Taking a look at a snippet of sample branch config, you can see that there are some familiar keys under builds and tests. These might remind you of try flags…and that’s because they are! But you might ask yourself why we are using try flags for a branch that is not Try. Simple, it’s a (kind of) well understood syntax for specifying builds and tests that should be run, so we treat every branch configuration the same and reuse Try flags within the configurations. Commit messages for Try pushes are parsed by mach taskcluster-graph, and all other branches are defaulted to using the try message try: -b do -p all -u all.

After parsing either the try commit message, or the default ‘all’ message, all other logic is the same for composing the task graph json.
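
As a rough illustration of that parsing step (a toy sketch only - the real parser is the in-tree Python mach target and handles far more than this):

// Toy version of try-message parsing: pull the -b/-p/-u selections out of a
// commit message like "try: -b do -p win32,win64 -u gaia-build".
function parseTryMessage(message) {
  let opts = {build_types: [], platforms: [], tests: []};
  let args = message.replace(/^try:\s*/, '').split(/\s+/);
  for (let i = 0; i < args.length; i++) {
    switch (args[i]) {
      case '-b': opts.build_types = args[++i].split('');  break;  // 'do' -> ['d', 'o']
      case '-p': opts.platforms   = args[++i].split(','); break;
      case '-u': opts.tests       = args[++i].split(','); break;
    }
  }
  return opts;
}

// The default used for non-try branches:
parseTryMessage('try: -b do -p all -u all');
// => { build_types: ['d', 'o'], platforms: ['all'], tests: ['all'] }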

Example configuration:

---
builds:
  linux64_gecko:
    platforms:
      - b2g
    types:
      opt:
        task: tasks/builds/b2g_desktop_opt.yml
      debug:
        task: tasks/builds/b2g_desktop_debug.yml
  linux64-mulet:
    platforms:
      - Mulet Linux
    types:
      opt:
        task: tasks/builds/mulet_linux.yml
tests:
  gaia-build:
    allowed_build_tasks:
      tasks/builds/b2g_desktop_opt.yml:
        task: tasks/tests/b2g_build_test.yml
      tasks/builds/mulet_linux.yml:
        task: tasks/tests/mulet_build_test.yml

The flags under builds and tests can be specified individually in a try commit message, such as:

try -b o -p linux64_gecko -u gaia-build

or included if ‘all’ is used. Tasks that are included in ‘all’ are specified in the base_job_flags.yml file.

Builds

builds.platforms

This value is used when restricting test suites to a given platform. For example, this will cause the gaia-build tests to only run for the Mulet Linux build, and not b2g desktop:

try: -b do -p all -u gaia-build[Mulet Linux]

builds.<build flag>.types

All builds have at least an ‘opt’ build as that is what will be used by default. opt.task defines where to find the task definition for that particular build. For ‘try’ this will be used for try: -b [d|o|do]. All other branches will use ‘-b do’.

Tests

Tests are broken up into their ‘try’ flags and define not only the task definition to use (tests.<flag>.allowed_build_tasks.<build task file>.task) but also what builds that test flag applies to (tests.<flag>.allowed_build_tasks.<build task file>)

August 05, 2015 12:00 AM

July 02, 2015

Dustin Mitchell

Ad-hoc Tasks in TaskCluster

You may have heard of TaskCluster, and you may have heard that it’s the bees’ knees. All of that is true, but I won’t reiterate here.

Ad-Hoc Tasks

We have a number of things we need to build from time to time that aren’t part of the normal CI process. Typically, these have been built on someone’s laptop and uploaded as-is, perhaps with some copy-paste into a wiki page or the like. This can lead to some unhappy surprises when the dependencies change due to differences in the build environment, or when the re-creation instructions aren’t quite accurate.

Well, TaskCluster can help!

First, you can run arbitrary things in TaskCluster very easily. Just head over to the Task Creator and click “Create Task”. The default echoes “Hello World”, but it’s easy to see how to proceed from there. So if a task is simple enough to be embedded in a shell one-liner, you’re already done. Just paste the task description into an in-tree comment or the relevant bug, and the next person to replicate your work can just re-run that task.
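
If you would rather script it than paste into the web UI, the same one-liner can be submitted with the taskcluster-client library. The sketch below is mine, not from the workflow described here, and the workerType and docker image are placeholders - you need credentials with scopes covering whichever worker type you pick:

let taskcluster = require('taskcluster-client');

async function runOneLiner() {
  let queue = new taskcluster.Queue({
    credentials: {clientId: '...', accessToken: '...'},   // your TaskCluster credentials
  });
  let taskId = taskcluster.slugid();                      // generate the taskId client-side
  await queue.createTask(taskId, {
    provisionerId: 'aws-provisioner-v1',
    workerType: 'ad-hoc-worker',                          // placeholder worker type
    created: new Date().toJSON(),
    deadline: taskcluster.fromNow('2 hours'),
    metadata: {
      name: 'Ad-hoc one-liner',
      description: 'Echoes hello world',
      owner: 'you@mozilla.com',
      source: 'https://example.com/link-to-the-bug-or-comment',
    },
    payload: {
      image: 'ubuntu:14.04',                              // placeholder docker image
      command: ['/bin/bash', '-c', 'echo "hello world"'],
      maxRunTime: 600,
    },
  });
  console.log('submitted task', taskId);
}

runOneLiner().catch(err => console.error(err));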

But most tasks are a little more complicated. Consider:

None of these are especially difficult, but none are as simple as a one-liner.

Example

For these cases, we have a means to run arbitrary in-tree scripts. It starts by adding a script under testing/taskcluster/scripts/misc to do what you need done. For example, I’ve written this script to repackage the Ubuntu build of OpenJDK for use in ToolTool. Note that this script drops its results in ~/artifacts.

Then, push the commit containing that script somewhere public, like your user repo, and submit a docker-worker task. In this case, the payload looks like this:

{
  "image": "quay.io/djmitche/desktop-build:0.0.19",
  "command": [
    "/bin/bash",
    "-c",
    "cd /home/worker/ && ./bin/checkout-sources.sh && ./workspace/build/src/testing/taskcluster/scripts/misc/repackage-jdk.sh"
  ],
  "env": {
    "GECKO_HEAD_REPOSITORY": "https://bitbucket.org/djmitche/mozilla-central",
    "GECKO_HEAD_REV": "be2867e357f7",
    "VERSION": "7u79-2.5.5-0ubuntu0.14.04.2"
  },
  "artifacts": {
    "public": {
      "type": "directory",
      "path": "/home/worker/artifacts",
      "expires": "2015-07-02T14:58:41.058Z"
    }
  },
  "maxRunTime": 600
}

Running this is as simple as pasting it into the Task Creator.

The image given here is the current docker image used for desktop builds. The command is also similar to what’s used for desktop builds – it checks out the tree, then runs the script. I provide arguments as environment variables – the gecko repository and version (pointing to the user repo) and the OpenJDK version to package.

The “artifacts” portion is how we get the files out of the task. It specifies the in-container directory containing the files we want to make available. Anything in that directory on completion of the task will be available for download in the task inspector (or via automated means, but for ad-hoc tasks like this the UI is easiest).

The task description is fairly generic, but it’s still useful to include the payload in the bug where you run the script for future archaeologists to find.

Summary

So there you have it. TaskCluster is useful not only for performing massive numbers of continuous-integration tasks, but for running one-off tasks in a reproducible, inspectable, secure fashion.

More!

Ted has noted that checkout-sources.sh is pretty heavy-weight: it checks out gecko, mozharness, and build/tools! For many scripts, we can probably do much better with a simpler, single-script bootstrap.

July 02, 2015 12:00 PM

June 04, 2015

Ben Hearsum

Buildbot <-> Taskcluster Bridge Now in Production

A few weeks ago I gave a brief overview of the Buildbot <-> Taskcluster Bridge that we've been developing, and Selena provided some additional details about it yesterday. Today I'm happy to announce that it is ready to take on production work. As more and more jobs from our CI infrastructure move to Taskcluster, the Bridge will coordinate between them and jobs that must remain in Buildbot for the time being.

What's next?

The Bridge itself is feature complete until our requirements change (though there's a couple of minor bugs that would be nice to fix), but most of the Buildbot Schedulers still need to be replaced with Task Graphs. Some of this work will be done at the same time as porting specific build or test jobs to run natively in Taskcluster, but it doesn't have to be. I made a proof of concept on how to integrate selected Buildbot builds into the existing "taskcluster-graph" command and disable the Buildbot schedulers that it replaces. With a bit more work this could be extended to schedule all of the Buildbot builds for a branch, which would make porting specific jobs simpler. If you'd like to help out with this, let me know!

June 04, 2015 03:11 PM

June 03, 2015

Selena Deckelmann

TaskCluster migration: about the Buildbot Bridge

Back on May 7, Ben Hearsum gave a short talk about an important piece of technology supporting our transition to TaskCluster, the Buildbot Bridge. A recording is available.

I took some detailed notes to spread the word about how this work is enabling a great deal of important Q3 work like the Release Promotion project. Basically, the bridge allows us to separate out work that Buildbot currently runs in a somewhat monolithic way into TaskGraphs and Tasks that can be scheduled separately and independently. This decoupling is a powerful enabler for future work.

Of course, you might argue that we could perform this decoupling in Buildbot.

However, moving to TaskCluster means adopting a modern, distributed queue-based approach to managing incoming jobs. We will be freed of the performance tradeoffs and careful attention required when using relational databases for queue management (Buildbot uses MySQL for its queues, TaskCluster uses RabbitMQ and Azure). We also will be moving “decision tasks” in-tree, meaning that they will be closer to developer environments and likely easier to manage keeping developer and build system environments in sync.

Here are my notes:

Why have the bridge?

  • Allows a graceful transition
  • We’re in an annoying state where we can’t have dependencies between buildbot builds and taskcluster tasks. For example: we can’t move firefox linux builds into taskcluster without moving everything downstream of those also into taskcluster
  • It’s not practical and sometimes just not possible to move everything at the same time. This lets us reimplement buildbot schedulers as task graphs. Buildbot builds become tasks on the task graphs, enabling us to change each task to be implemented by a Docker worker, a generic worker or anything we want or need at that point.
  • One of the driving forces is the build promotion project – the funsize and anti-virus scanning and binary moving – this is going to be implemented in taskcluster tasks but the rest will be in Buildbot. We need to be able to bounce between the two.

What is the Buildbot Bridge (BBB)

BBB acts as a TC worker and provisioner and delegates all those things to BuildBot. As far as TC is concerned, BBB is doing all this work, not Buildbot itself. TC knows nothing about Buildbot.

There are three services:

  • TC Listener: responds to things happening in TC
  • BuildBot Listener: responds to BB events
  • Reflector: takes care of things that can’t be done in response to events — it reclaims tasks periodically, for example. TC expects Tasks to reclaim tasks. If a Task stops reclaiming, TC considers that Task dead.

BBB has a small database that associates build requests with TC taskids and runids.

BBB is designed to be multihomed. It is currently deployed but not running on three Buildbot masters. We can lose an AWS region and the bridge will still function. It consumes from Pulse.

The system is dependent on Pulse, SchedulerDB and Self-serve (in addition to a Buildbot master and Taskcluster).

Taskcluster Listener

Reacts to events coming from TC Pulse exchanges.

Creates build requests in response to tasks becoming “pending”. When someone pushes to mozilla-central, BBB inserts BuildRequests into BB SchedulerDB. Pending jobs appear in BB. BBB cancels BuildRequests as well — can happen from timeouts, someone explicitly cancelling in TC.

Buildbot Listener

Responds to events coming from the BB Pulse exchanges.

Claims a Task when builds start. Attaches BuildBot Properties to Tasks as artifacts. Has a buildslave name, information/metadata. It resolves those Tasks.

Buildbot and TC don’t have a 1:1 mapping of BB statuses and TC resolution. Also needs to coordinate with Treeherder color. A short discussion happened about implementing these colors in an artifact rather than inferring them from return codes or statuses inherent to BB or TC.

Reflector

  • Runs on a timer – every 60 seconds
  • Reclaims tasks: need to do this every 30-60 minutes
  • Cancels Tasks when a BuildRequest is cancelled on the BB side (have to troll through BB DB to detect this state if it is cancelled on the buildbot side)

Scenarios

  • A successful build!

Task is created. Task in TC is pending, nothing in BB. TCListener picks up the event and creates a BuildRequest (pending).

BB creates a Build. BBListener receives buildstarted event, claims the Task.

Reflector reclaims the Task while the Build is running.

Build completes successfully. BBListener receives log uploaded event (build finished), reports success in TaskCluster.

  • Build fails initially, succeeds upon retry

(500 from hg – common reason to retry)

Same through Reflector.

BB fails, marked as RETRY. BBListener receives log uploaded event, reports exception to Taskcluster and calls rerun Task.

BB has already started a new Build. TCListener receives task-pending event, updates runid, does not create a new BuildRequest.

Build completes successfully. Buildbot Listener receives log uploaded event, reports success to TaskCluster.

  • Task exceeds deadline before Build starts

Task is created. TCListener receives task-pending event, creates a BuildRequest. Nothing happens. Task goes past deadline, TaskCluster cancels it. TCListener receives task-exception event, cancels the BuildRequest through Self-serve.

QUESTIONS:

  • TC deadline, what is it? Queue: a task past a deadline is marked as timeout/deadline exceeded

On TH, if someone requests a rebuild twice, what happens? There is no retry/rerun; we duplicate the subgraph — wherever we retrigger, you get everything below it. You’d end up with duplicates. Retries and rebuilds are separate: rebuilds are triggered by humans, retries are internal to BB. TC doesn’t have a concept of retries.

  • How do we avoid duplicate reporting? TC will be considered source of truth in the future. Unsure about interim. Maybe TH can ignore duplicates since the builder names will be the same.

  • Replacing the scheduler what does that mean exactly?

    • Mostly moving decision tasks in-tree — practical impact: YAML files get moved into the tree
    • Remove all scheduling from BuildBot and Hg polling

Roll-out plan

  • Connected to the Alder branch currently
  • Replacing some of the Alder schedulers with TaskGraphs
  • All the BB Alder schedulers are disabled, and we were able to get a push to generate a TaskGraph!

Next steps might be release scheduling tasks, rather than merging into central. Someone else might be able to work on other CI tasks in parallel.

June 03, 2015 04:59 PM

June 02, 2015

Selena Deckelmann

TaskCluster migration: a “hello, world” for worker task creator

On June 1, 2015, Morgan and Dustin presented an introduction to configuring and testing TaskCluster worker tasks. The session was recorded. Their notes are also available in an etherpad.

The key tutorial information centered on how to set up jobs, test/run them locally and selecting appropriate worker types for jobs.

This past quarter Morgan has been working on Linux Docker images and TaskCluster workers for Firefox builds. Using that work as an example, Morgan showed how to set up new jobs with Docker images. She also touched on a couple issues that remain, like sharing sensitive or encrypted information on publicly available infrastructure.

A couple really nice things:

  • You can run the whole configuration locally by copy and pasting a shell script that’s output by the TaskCluster tools
  • There are a number of predefined workers you can use, so that you’re not creating everything from scratch

Dustin gave an overview of task graphs using a specific example. Looking through the docs, I think the best source of documentation other than this video is probably the API documentation. The docs could use a little more narrative for context, as Dustin’s short talk about it demonstrated.

The talk closed with an invitation to help write new tasks, with pointers to the Android work Dustin’s been doing.

June 02, 2015 02:57 PM

May 08, 2015

Ben Hearsum

Buildbot <-> Taskcluster Bridge - An Overview

Mozilla has been using Buildbot as its continuous integration system for Firefox and Fennec for many years now. It enabled us to switch from a machine-per-build model to a pool-of-slaves model, and greatly aided us in getting to our current scale. But it's not perfect - and we've known for a few years that we'll need to do an overhaul. Lucky for us, the FirefoxOS Automation team has built up a fantastic piece of infrastructure known as Taskcluster that we're eager to start moving to.

It's not going to be a small task though - it will take a lot more work than taking our existing build scripts and running them in Taskcluster. One reason for this is that many of our jobs trigger other jobs, and Buildbot manages those relationships. This means that if we have a build job that triggers a test job, we can't move one without moving the other. We don't want to be forced into moving entire job chains at once, so we need something to help us transition more slowly. Our solution to this is to make it possible to schedule jobs in Taskcluster while still implementing them in Buildbot. Once the scheduling is in Taskcluster it's possible to move individual jobs to Taskcluster one at a time. The software that makes this possible is the Buildbot Bridge.

The Bridge is responsible for synchronizing job state between Taskcluster and Buildbot. Jobs that are requested through Taskcluster will be created in Buildbot by the Bridge. When those jobs complete, the Bridge will update Taskcluster with their status. Let's look at a simple example to see how the state changes in both systems over the course of a job being submitted and run:

Event | Taskcluster state | Buildbot state
Task is created | Task is pending | --
Bridge receives "task-pending" event, creates BuildRequest | Task is pending | Build is pending
Build starts in Buildbot | Task is pending | Build is running
Bridge receives "build started" event, claims the Task | Task is running | Build is running
Build completes successfully | Task is running | Build is completed
Bridge receives "build finished" event, reports success to Taskcluster | Task is resolved | Build is completed

The details of how this works are a bit more complicated - if you'd like to learn more about that I recommend watching the presentation I did about the Bridge architecture, or just have a read through my slides.

May 08, 2015 04:37 PM

March 31, 2015

Rail Aliiev

Taskcluster: First Impression

Good news. We decided to redesign Funsize a little and now it uses Taskcluster!

The nature of Funsize is that we may start hundreds of jobs at the same time, then stop sending new jobs and wait for hours. In other words, the service is very bursty. Elastic Beanstalk is not ideal for this use case. Scaling up and down very fast is hard to configure using EB-only tools. Also, running zero instances is not easy.

I tried using Terraform, Cloud Formation and Auto Scaling, but they were also not well suited. There were too many constraints (e.g. Terraform doesn't support all needed AWS features) and they required considerable bespoke setup/maintenance to auto-scale properly.

The next option was Taskcluster, and I was pleased that its design fitted our requirements very well! I was impressed by the simplicity and flexibility offered.

I have implemented a service which consumes Pulse messages for particular buildbot jobs. For nightly builds, it schedules a task graph with three tasks:

  • generate a partial MAR
  • sign it (at the moment a dummy task)
  • publish to Balrog

All tasks are run inside Docker containers whose images are published on the docker.com registry (other registries can also be used). The task definition essentially consists of the docker image name and a list of commands it should run (usually this is a single script inside the docker image). In the same task definition you can specify what artifacts should be published by Taskcluster. The artifacts can be public or private.
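
To give a flavour of what this looks like in practice, here is a rough Python sketch of a single task definition submitted over the HTTP API. It is a simplified illustration rather than the actual Funsize scheduler: the provisioner/worker names, docker image, script, paths and endpoint are assumptions, and a real submission also needs authentication.

import datetime
import uuid

import requests

# Pre-generated task ID ("fire and forget") - the real service uses slug IDs,
# but a UUID illustrates the idea.
task_id = str(uuid.uuid4())

now = datetime.datetime.utcnow()
task = {
    "provisionerId": "aws-provisioner",   # assumed
    "workerType": "funsize",              # assumed
    "created": now.isoformat() + "Z",
    "deadline": (now + datetime.timedelta(hours=1)).isoformat() + "Z",
    "payload": {
        "image": "example/funsize-update-generator",   # hypothetical docker image
        "command": ["/runme.sh"],                      # hypothetical script
        "artifacts": {
            "public/target.partial.mar": {
                "type": "file",
                "path": "/home/worker/target.partial.mar",   # hypothetical path
            },
        },
    },
    "metadata": {
        "name": "Partial MAR generation",
        "description": "Generate a partial update",
        "owner": "someone@example.com",
        "source": "https://example.com/funsize",
    },
}

# Submit the task; the URL is illustrative.
response = requests.put("https://queue.taskcluster.net/v1/task/" + task_id, json=task)
response.raise_for_status()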

Things that I really liked

  • Predefined task IDs. This is a great idea! There is no need to talk to the Taskcluster APIs to get the ID (or multiple IDs for task graphs), nor to parse the response. Fire and forget! The task IDs can be used in different places, like artifact URLs, dependent tasks, etc.
  • Task graphs. This is basically a collection of tasks that can be run in parallel and can depend on each other. This is a nice way to declare your jobs and know them in advance. If needed, a task graph can be extended dynamically by its own tasks (decision tasks).
  • Simplicity. All you need is to generate a valid JSON document and submit it to Taskcluster using the HTTP API.
  • User defined docker images. One of the downsides of Buildbot is that you have a predefined list of slaves with a predefined environment (OS, installed software, etc). Taskcluster leverages Docker by default to let you use your own images.

Things that could be improved

  • Encrypted variables. I spent 2-3 days fighting with the encrypted variables. My scheduler was written in Python, so I tried to use a half dozen different Python PGP libraries, but for some reason all of them were generating an incompatible OpenPGP format that Taskcluster could not understand. This forced me to rewrite the scheduling part in Node.js using openpgpjs. There is a bug to address this problem globally. Also, using ISO time stamps would have saved me hours of time. :)
  • It would be great to have a generic scheduler that doesn't require third party Taskcluster consumers writing their own daemons watching for changes (AMQP, VCS, etc) to generate tasks. This would lower the entry barrier for beginners.

Conclusion

There are many other things that can be improved (and I believe they will!) - Taskcluster is still a new project. Regardless of this, it is very flexible, easy to use and develop. I would recommend using it!

Many thanks to garndt, jonasfj and lightsofapollo for their support!

March 31, 2015 12:47 PM

February 23, 2015

James Lal

Taskcluster Release Part 1 : Gecko

It's been awhile since my last blog post about taskcluster and I wanted to give an update...

Taskcluster + Gecko

Taskcluster is running by default on

In Treeherder you will see jobs run by both buildbot and taskcluster. The "TC" jobs are prefixed accordingly so you can tell the difference.

This is the last big step to enabling TC as the default CI for many Mozilla projects. Adding new and existing branches is easily achieved with basic config changes.

Why is this a great thing? Just about everything is in the tree.

This means you can easily add new builds/tests and immediately push them to try for testing (see the configs for try).

Adding new tests and builds is easier than ever but the improvements don't stop there. Other key benefits on linux include:

We use docker

Docker enables easy cloning of CI environments.

# Pull tester image
docker pull quay.io/mozilla/tester:0.0.14
# Run tester image shell
docker run -it quay.io/mozilla/tester:0.0.14 /bin/bash
# <copy/paste stuff from task definitions into this>
Tests and builds are faster

Through this entire process we have been optimizing away overhead and using faster machines which means both build (and particularly test) times are faster.

(The wins look big, but more on that in a future blog post.)

What's missing?
  • Some tests fail due to differences in machines. When we move tests, failures are largely due to timing issues (there are a few cases left here).

  • Retrigger/cancel does not work (yet!). As of the time of writing it has not hit production, but it will be deployed soon.

  • Results currently show up only on staging treeherder. We will incrementally report these to production treeherder.

February 23, 2015 12:00 AM

February 15, 2015

Rail Aliiev

Funsize hacking

Prometheus

The idea of using a service which can generate partial updates for Firefox has been around for years. We actually used to have a server called Prometheus that was responsible for generating updates for nightly builds and the generation was done as a separate process from actual builds.

Scaling that solution wasn't easy and we switched to build-time update generation. Generating updates as a part of builds helped with load distribution, but lacked flexibility: there is no easy way to generate updates after the build, because the update generation process is directly tied to the build or repack process.

Funsize will solve the problems listed above: it will distribute load and be flexible.

Last year Anhad started and Mihai continued working on this project. They have done a great job and created a solution that can easily be scaled.

Funsize is split into several pieces:

  • REST API frontend powered by Flask (sketched below). It's responsible for accepting partial generation requests, forwarding them to the queue and returning generated partials.
  • Celery-based workers to generate partial updates and upload them to S3.
  • SQS or RabbitMQ to coordinate Celery workers.
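
Below is a minimal sketch of how those pieces might fit together, assuming Flask for the API and Celery for the workers; the route, the task body and the broker URL are illustrative assumptions rather than the real Funsize code.

from celery import Celery
from flask import Flask, jsonify, request

# Hypothetical broker URL; the real service can use SQS or RabbitMQ.
celery_app = Celery("funsize", broker="amqp://guest@localhost//")
app = Flask(__name__)

@celery_app.task
def generate_partial(from_mar_url, to_mar_url):
    # Download both MARs, generate the binary diff, upload the result to S3.
    # Omitted here; this is only a structural sketch.
    pass

@app.route("/partial", methods=["POST"])
def request_partial():
    payload = request.get_json()
    result = generate_partial.delay(payload["from_mar"], payload["to_mar"])
    # Return the Celery task id so the caller can poll for the generated partial.
    return jsonify({"task_id": result.id}), 202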

One of the biggest gains of Funsize is that it uses a global cache to speed up partial generation. For example, after we build an en-US Windows build, we ask Funsize to generate a partial. Then a swarm of L10N repacks (almost a hundred of them per platform) tries to do a similar job. Every single one asks for a partial update. All L10N builds have something in common, and xul.dll is one of the biggest files. Since the files are identical there is no reason to not reuse the previously generated binary patch for that file. Repeat 100 times for multiple files. PROFIT!
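
The caching idea can be sketched in a few lines of Python. This is not the real Funsize implementation: the cache object and its get/put interface are hypothetical stand-ins (the real service can use S3 or a local cache), but it shows why identical files across repacks collapse to a single patch generation.

import hashlib

def cache_key(from_path, to_path):
    # Key a patch by the hashes of the "from" and "to" versions of a file, so
    # identical files across L10N repacks map to the same cache entry.
    def sha512(path):
        digest = hashlib.sha512()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()
    return sha512(from_path) + "-" + sha512(to_path)

def get_or_generate_patch(cache, from_path, to_path, generate):
    key = cache_key(from_path, to_path)
    patch = cache.get(key)                    # hypothetical cache lookup
    if patch is None:
        patch = generate(from_path, to_path)  # the expensive binary-diff step
        cache.put(key, patch)
    return patch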

The first prototype of Funsize lives at github. If you are interested in hacking, read the docs on how to set up your developer environment. If you don't have an AWS account, it will use a local cache.

Note: this prototype may be redesigned and switch to using TaskCluster. Taskcluster is going to simplify the initial design and reduce dependency on always online infrastructure.

February 15, 2015 04:32 AM

May 27, 2014

James Lal

Gaia + Taskcluster + Treeherder

What is this stuff?

(originally posted on dev-gaia)

For some time now Gaia developers have wanted the ability to scale their tests infinitely, while reporting to a dashboard that both sheriffs and devs can monitor, and yet still maintain control over the test configurations themselves.

Taskcluster & Treeherder let us do this: http://treeherder-dev.allizom.org/ui/#/jobs?repo=gaia-master Taskcluster http://docs.taskcluster.net/ drives the tests and, with a small github hook, allows us to configure the jobs from a json file in the tree (this will likely be a yaml file in the end) https://github.com/mozilla-b2g/gaia/blob/master/taskgraph.json

Treeherder is the next generation "TBPL" which allows us to report results to sheriffs from external resources (meaning we can control the tests) for both a "try" interface (like pull requests) and branch landings.

Currently, we are very close to having green runs in treeherder, with only one intermittent and the rest green ...

How is this different from gaia-try?

Taskcluster will eventually replace all buildbot-run jobs (starting with linux)... we are currently in the process of moving tests over and getting treeherder ready for production.

Gaia-try is run on top of buildbot and hooks into our github pull requests. Gaia-try gives us a single set of suites that the sheriffs can look at and help keep our tree green. This should be considered "production".

Treeherder/taskcluster are designed to solve the issues with the current buildbot/tbpl implementations:

  • in tree configuration

  • complete control over the test environment with docker (meaning you can have the exact same setup locally as on TBPL!)

  • artifacts for pull requests (think screenshots for failed tests, gaia profiles, etc...)

  • in tree graph capabilities (for example, "smoketest" builds that run smaller test suites, or expressing how tests depend on builds).

How is this different from travis-ci?

  • we can scale on demand on any AWS hardware we like (at very low cost thanks to spot)

  • docker is used to provide a consistent test environment that may be run locally

  • artifacts for pull requests (think screenshots for failed tests, gaia profiles, etc...)
  • logs can be any size (but still mostly "live")

  • reports to TBPL2 (treeherder)

When is this production ready?

taskcluster + treeherder is not ready for production yet... while the tests are running, it is not in a state where sheriffs can manage it (yet!). Our plan is to continue to add taskcluster test suites (and builds!) for all trees (yes, gecko) and have them run in parallel with the buildbot jobs this month...

I will be posting weekly updates on my blog about taskcluster/treeherder http://lightsofapollo.github.io/ and how it affects gaia (and hopefully your overall happiness)

Where are the docs??

  • http://docs.taskcluster.net/
  • (More coming to gaia-taskcluster and gaia readme as we get closer to production)

WHERE IS THE CODE?

  • https://github.com/taskcluster (overall project)
  • https://github.com/lightsofapollo/gaia-taskcluster (my current gaia integration)
  • https://github.com/mozilla/treeherder-service (treeherder backend)
  • https://github.com/mozilla/treeherder-ui (treeherder frontend)

May 27, 2014 12:00 AM

March 04, 2014

James Lal

Taskcluster - Mozilla's new test infrastructure project

Taskcluster is not one singular entity that runs a script with output in a pretty interface or a github hook listener, but rather a set of decoupled interfaces that enables us to build various test infrastructures while optimizing for cost, performance and reliability. The focus of this post is Linux. I will have more information on how this works for OSX/Windows soon.

Some History

Mozilla has quite a few different code bases, most depend on gecko (the heart of Firefox and FirefoxOS). Getting your project hooked up to our current CI infrastructure usually requires a multi-team process that takes days or more. Historically, simply merging projects into gecko was easier than having external repositories that depend on gecko, which our current CI cannot easily support.

It is critical to be able to see in one place (TBPL) that all the projects that depend on gecko are working. Today this process is tightly coupled to TBPL and our buildbot infrastructure (which together make up our current CI). If you really care about your project not breaking when a change lands in gecko, you really only have one option: hosting your testing infrastructure under buildbot (which feeds TBPL).

Where Taskcluster comes in

Treeherder resolves the tight coupling problem by separating the reporting from the test running process. This enables us to re-imagine our workflow and how it's optimized. We can run tests anywhere using any kind of utility/library assuming it gives us the proper hooks (really just logs and some revision information) to plug results into our development workflow.

A high level workflow with taskcluster looks like this:

You submit some code (a patch, a pull request, etc...) to a "scheduler" (I have started on one for gaia), which submits a set of tasks. Each task is run inside a docker container, and the container's image is specified as part of your task. This means anything you can imagine running on linux you can directly specify in your container (no more waiting for VM reimaging, etc...). It also means we directly control the resources that container uses (less variance in tests), AND if something goes wrong you can download the entire environment the test ran on to debug it locally.

As tasks are completed, the taskcluster queue emits events over AMQP (think pulse), so anyone interested in the status of tests, etc. can hook directly into this... This enables us to post results directly to treeherder as they happen.
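
As a rough illustration of what hooking into those events might look like, here is a hedged Python sketch using pika (1.x); the connection URL, exchange name and routing key are placeholders, not a tested configuration.

import json

import pika

# Placeholder credentials and host.
params = pika.URLParameters("amqps://user:password@pulse.example.org:5671/%2F")
connection = pika.BlockingConnection(params)
channel = connection.channel()

# Bind a throwaway queue to a task-completed exchange (name assumed).
result = channel.queue_declare(queue="", exclusive=True)
queue_name = result.method.queue
channel.queue_bind(
    exchange="exchange/taskcluster-queue/v1/task-completed",
    queue=queue_name,
    routing_key="#",
)

def on_message(ch, method, properties, body):
    message = json.loads(body)
    # Forward the task status to treeherder, a dashboard, IRC, etc.
    print("task finished:", message.get("status"))

channel.basic_consume(queue=queue_name, on_message_callback=on_message, auto_ack=True)
channel.start_consuming()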

The initial taskcluster provisioner spins up AWS spot nodes on demand (we have it capped at a fixed number right now), so during peaks we can burst to an almost unlimited number of nodes. During idle times workers shut themselves down to reduce costs. We have additional plans for different clouds (and physical hardware on OpenStack).

Each component can be easily replaced, and multiple types of workers and provisioners can be added on demand. Jonas Finnemann Jensen has done an awesome job documenting how taskcluster works at the API level in the docs.

What the future looks like

My initial plan is to hook everything up for gaia the FirefoxOS frontend. This will replace our current travis CI setup.

As pull requests come in we will run tests on taskcluster and report status to both treeherder and github (the beloved github status api). The ability to hook up new types of tests from the tree itself (and to try out new types from the tree itself) will continue on in the form of a task template (another blog post coming). Developers can see the status of their tests from treeherder.
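
As a sketch of what reporting back to a pull request involves, the github status api takes one POST per commit; the context label and token handling below are illustrative, not the actual gaia-taskcluster integration.

import requests

def report_commit_status(owner, repo, sha, state, target_url, token):
    # state is one of: pending, success, failure, error
    response = requests.post(
        "https://api.github.com/repos/{}/{}/statuses/{}".format(owner, repo, sha),
        json={
            "state": state,
            "target_url": target_url,   # e.g. a link to the treeherder results
            "context": "taskcluster",   # assumed context label
        },
        headers={"Authorization": "token " + token},
    )
    response.raise_for_status()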

Code landing in master follows the same practice and results will report into a gaia specific treeherder view.

Most importantly, immediately after treeherder is launched we can run all gaia testing on the exact same infrastructure for both gaia and gecko commits. Jonas Sicking (b2g overload) has some great ideas about locking gecko <-> gaia versions to reduce another kind of failure, which occurs when developing against the ever-changing landscape of gecko / gaia commits.

When is the future? We have implemented the "core" of taskcluster already and have the ability to run tests. By the end of the month (March) we will have the capability to replace the entire gaia workflow with taskcluster.

Why not X CI solution

Building a brand new CI solution is non-trivial, so why are we doing this?

  • To leverage LXC containers (docker): One of the big problems we hit when trying to debug test failures is the variance between testing locally and remotely. With LXC containers you can download the entire container (the entire environment your test runs in) and run it with the same cpu/memory/swap/filesystem as it would run remotely.

  • On demand scaling. We have (somewhat predictable) bursts throughout the day, and the ability to spin up (and down) on demand is required to keep up with our changing needs.

  • Make in tree configuration easy. Pull requests + in tree configuration enable developers to quickly iterate on tests and testing infrastructure

  • Modular extensible components with public facing APIs. Want to run tasks that do things other than test/build, or report to something other than treeherder? We have (or will build) an API for that.

    Hackability is important... The parts you don't want to solve (running aws nodes, keeping them up, pricing them, etc...) are solved for you so you can focus on building the next great mozilla-related thing (better bisection tools, etc...).

  • More flexibility to test/deploy optimizations... We have something like a compute year of tests, and 10-30+ minute chunks of testing are normal. We need to iterate on our test infrastructure quickly to try to reduce this where possible with CI changes.

Here are a few potential alternatives below... I list out the pros & cons of each from my perspective (and a short description of each).

Travis [hosted]

TravisCI is an awesome [free] open source testing service that we use for many of our smaller projects.

Travis works really well for the 90% webdev use case. Gaia does not fit well into that use case, and gecko fits even less well.

Pros:

  • Dead simple setup.
  • Iterate on test frameworks, etc... on every pull request without any issue.
  • Nice simple UI which reports live logging.
  • Adding tests and configuring tests is trivial.

Cons:

  • Difficult to debug failures locally.
  • No public facing API for creating jobs.
  • No build artifacts on pull requests.
  • Cannot store arbitrarily long logs (this is only an issue for open source IIRC).
  • No on-demand scaling.

Buildbot [build on top of it]

We currently use buildbot at scale (thousands of machines) for all gecko testing on multiple platforms. If you are using Firefox, it was built by our buildbot setup.

(NOTE: This is a critique of how we currently use buildbot, not the entire project.) If I am missing something, or you think a CI solution could fit the bill, contact me!

Pros:

  • We have it working at a large scale already.

Cons:

  • Adding tests and configuring tests is fairly difficult and involves long lead times.
  • Difficult to debug failures locally.
  • Configuration files live outside of the tree.
  • Persistent connection master/slave model.
  • It's one monolithic project whose components are difficult to replace.
  • Slow rollout of new machine requirements & configurations.

Jenkins

We are using Jenkins for our on device testing.

Pros:

  • Easy to configure jobs from the UI (decent ability to do configuration yourself).
  • Configuration (by default) does not live in the tree.
  • Tons of plugins (with varying quality).

Cons:

  • By default difficult to debug failures locally.
  • Persistent connection master/slave model.
  • Configuration files live outside of the tree.

Drone.io [hosted/not hosted]

Drone.io recently open sourced... It's docker based and shows promise. Out of all the options above it looks the closest to what we want for linux testing.

I am going to omit the Pros/Cons here; the basics look good for Drone, but it requires some more investigation. Some missing things are:

  • A long term plan for supporting multiple operating systems.
  • A public api for scheduling tasks/jobs.
  • On demand scaling.

March 04, 2014 12:00 AM

January 31, 2014

James Lal

Using docker volumes for rapid development of containers

It's fairly obvious how to use docker for shipping an immutable image that is great for deployment. It was less obvious (to me) how to use docker to iterate on the image, run tests in it, etc...

Let's say you have a node project and you're writing some web service thing:

// server.js
var http = require('http');
...

// server_test.js
suite('my tests', function() {
});

# Dockerfile
FROM lightsofapollo/node:0.10.24
ADD . /service
WORKDIR /service
CMD node server.js

Before Volumes

Without using volumes your workflow is like this:

docker build -t image_name .
docker run image_name ./node_modules/.bin/mocha server_test.js
# .. make some changes and repeat...

While this is certainly not awful, it's a lot of extra steps you probably don't want to do...

After Volumes

While iterating, ideally we could just "shell in" to the container, make changes on the fly, and then run some tests (like, let's say, vagrant).

You can do this with volumes:

# It's important that you only use the -v option during development: it
# will override the contents of whatever you specify, and you should also
# keep in mind you want to run the final tests on the image without this
# volume at the end of your development to make sure you didn't forget to
# build or something.

# Mount the current directory in your service folder (overriding the ADD
# above), then open an interactive shell
docker run -v $PWD:/service -i -t image_name /bin/bash

From here you can hack like normal, making changes and running tests on the fly like you would with vagrant or on your host.

When you're done!

I usually have a makefile... I would set up the "make test" target something like this to ensure your tests run on the contents of your image rather than using the volume:

.PHONY: test
test:
  docker build -t my_image .
  docker run my_image npm test

.PHONY: push
push: test
  docker push my_image

January 31, 2014 12:00 AM