May 14, 2012
Automation and Testing : Overhaul of Talos Configuration
Last week I pushed a fix to
bug 704654
that fixes a number of issues, conceptual and user-facing, with how
Talos handles configuration. I've had an idea on how I wanted to do
this for a few months now, but it has always been tabled. But with my
(joking, sorry) pledge to Bob Moss to fix all bugs in
Talos
by the end of the quarter, and a free weekend on my hands, I decided to
tackle the problem in one go instead of killing the prerequisite bugs
as I usually do.
My goals:
- remove the need to edit several different configuration files to change a
configuration basis. Most .config edits needed to happen in
5 places (formerly 6). This is not only prone to human error (which
I and others have been guilty of many times), it is
a discouragement to change default configuration.
- consistent and declarative serialization/deserialization. Serialization in
PerfConfigurator was mostly awful, scanning through line by line and
looking for particular strings in (basically) an if-else tree, often
depending on particular whitespace or other subtle (and
undocumented) formatting issues. While the .config files conform to
YAML, we don't make use of this for de/serialization. In addition,
while in run_tests.py we allow command line overrides for the YAML
items, we do not post-process them as we would in
PerfConfigurator.
- consistent error checking. Currently some of our config-checking is
in PerfConfigurator and some is done in run_tests. This opens the
possibility that either one may miss cases that the other would
catch. If you call run_tests.py with a .yml file, you will
not get the checking done for the combination of command line items
and the .yml configuration that is done in PerfConfigurator. Since
we process a lot of command line items into resulting configuration,
this can lead to interesting results (e.g. while --activeTests is a
command line item for run_tests.py, it is not used, anywhere).
In general, configuration should be checked in one place before any
program logic takes place. While this patch doesn't completely
address this issue, it is a big step forward and should pave the way
for future improvement.
- configuration should be declarative. You should get what you expect
from configuration, not inconsistent results. If you edit a (e.g.)
.yml file with the existing Talos, you have no real way to know if
the keys you add or edit are going to be used by run_tests.py (and
what format they should be in, etc.) Having a basis for
configuration gives a single place to denote what is expected (and
thereby what isn't allowed) and the form that it is supposed to be
in. It is also nice to have all configuration in a single place
instead of having to look at a bunch of config files for the basis
as well as all over the code to see what is expected and how it is
processed.
- allow running directly from run_tests.py. For particular
(e.g. production) systems, it may be advisable to use tuned
(.yml) configuration files to have highly customized runs (note that
we don't do this and use (remote)PerfConfigurator in all cases for
reasons that may be inferred from the above). However, for a typical
developer, there is little reason to run
PerfConfigurator -e `which firefox` -a ts --develop -o ts.yml && talos -n -d ts.yml
for a particular run. Instead, the entirety of this may be invoked
with this patch as
talos -n -d -e `which firefox` -a ts --develop -o ts.yml
in a one-step process. (Note that we're still dumping to ts.yml
though one wouldn't have to if the result is intended as ephemeral).
I hear people prefer blog posts with pictures, so with no reason here
is a bunch of cute foxes:
I've moved the basis of the Talos configuration to
PerfConfigurator.py
instead of some combination of .config files, PerfConfigurator.py, and
run_tests.py.
This gets rid of the duplication between the various config files as
well as the command line options. In fact, there isn't much left of
the configuration files.
I don't like configuration to live in code, and so empathize with
those who look at this cautiously from that point of view. However,
PerfConfigurator following my rework isn't so much configuration, but
a configuration basis. Given the goals above, some piece of code has
to validate a given configuration, has to know what data is in a
configuration, and has to provide whatever command line options are
used to front-end the configuration. The previous incarnation of
Talos and PerfConfigurator had a significant amount of code to this
end, but it was both spread out and incomplete. So I don't think
putting it all in one place is a big conceptual change. Having a
piece of code that knows the allowable form of configuration gives
great power and having the code all in one place just makes it more
human-readable.
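To make the idea of a "configuration basis" a bit more concrete, here is a minimal sketch of what I mean. It is not the actual PerfConfigurator code: the key names, the long option names, and the defaults are all made up for illustration (only -e, -a, --develop and -o echo the command lines above). The point is just that one declarative table can drive the command line front-end, the defaults, and the validation, so that a configuration coming from a .yml file and one coming from the command line get checked the same way.

```python
import argparse

# Hypothetical configuration basis. The keys, long option names and defaults
# are illustrative assumptions, not the real PerfConfigurator options.
BASIS = {
    "browser_path": {"flags": ["-e", "--executable"], "type": str,
                     "required": True, "help": "path to the browser executable"},
    "activeTests": {"flags": ["-a", "--activeTests"], "type": str,
                    "required": True, "help": "tests to run"},
    "develop": {"flags": ["--develop"], "type": bool, "default": False,
                "help": "use developer-friendly defaults"},
    "output": {"flags": ["-o", "--output"], "type": str, "default": None,
               "help": "file to dump the resulting configuration to"},
}


def build_parser(basis=BASIS):
    """Generate the command line front-end from the basis."""
    parser = argparse.ArgumentParser()
    for key, spec in basis.items():
        # Parser defaults are None so we can tell "not given" from "given".
        if spec["type"] is bool:
            parser.add_argument(*spec["flags"], dest=key, action="store_true",
                                default=None, help=spec["help"])
        else:
            parser.add_argument(*spec["flags"], dest=key, type=spec["type"],
                                default=None, help=spec["help"])
    return parser


def validate(config, basis=BASIS):
    """Return a list of problems found in a configuration dict."""
    errors = ["unknown key: %s" % key for key in config if key not in basis]
    for key, spec in basis.items():
        value = config.get(key, spec.get("default"))
        if value is None:
            if spec.get("required"):
                errors.append("missing required key: %s" % key)
        elif not isinstance(value, spec["type"]):
            errors.append("%s should be of type %s" % (key, spec["type"].__name__))
    return errors


def configure(file_config, argv, basis=BASIS):
    """Merge defaults, file configuration and command line overrides, then
    validate the result once, before any program logic runs."""
    config = {key: spec.get("default") for key, spec in basis.items()}
    config.update(file_config)
    overrides = vars(build_parser(basis).parse_args(argv))
    config.update({k: v for k, v in overrides.items() if v is not None})
    errors = validate(config, basis)
    if errors:
        raise ValueError("; ".join(errors))
    return config
```

With something like this, configure({"browser_path": "/usr/bin/firefox"}, ["-a", "ts"]) yields a checked configuration, while a dictionary with an unexpected key is rejected in exactly one place.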
The unofficial history of Talos configuration, as I understand it,
goes something like this: Initially, there was one configuration
file. You copied it, edited it by hand, and ran your tests on it. At
some point, this became cumbersome, and PerfConfigurator was created
to automatically fill in values from a set of command-line choices,
and in addition allow the values to be marked up a bit. The road was
already paved for some part of configuration basis living in code
versus in the .config file. Then, as the need to run tests in
different configurations grew, .config files flourished to this
end. I'd like to think of the changes for
bug 704654 as
the next logical step in Talos's configuration evolution.
Longer term, we'd like to remove even more of Talos's configuration and
replace .yml files with command line options. The complexity of
configuration will be managed by
mozharness .
May 14, 2012 09:03 AM
May 07, 2012
So yesterday we had a small get-together at my place, which gave me the opportunity to try something I’d been meaning to do for a while: build my own retroscope.
The idea is pretty simple: have a webcam record bits and pieces of a social event, then play them back on-the-spot a few minutes/hours later. I first heard about the concept from reading Nat Friedman’s blog entry from 2005 — if you read that, you see that he just hooked up a video camera to his TiVo. 7 years in the future, laptop webcams are ubiquitous and we have the awesome HTML5 <video> tag, so I figured it would be easy to knock up something interesting in short order with zero custom hardware.
Having only remembered that I wanted to do this about 30 minutes before people were scheduled to start arriving, I didn’t have much time to do anything really perfect. I settled on using this little snippet from stackoverflow to generate short (5 second) movies on my laptop, then used scp to copy them over and display a montage of them in an auto-refreshing webpage on my “television” (which is a Mac-Mini connected to a large computer monitor). Despite being a total hack job, the end result generated much amusement. I think this is a bit different from what Nat originally did (it sounds from his blog like his retroscope played back longer segments), but I think the end result is actually a bit more fun.

Perhaps unfortunately, but probably ultimately for the best, only a few snippets from the actual night got stored away. One example is this gem:
(yes, that handsome fellow with the Pernot is me)
I thought it might be fun to release the slightly-cleaned up results of this experiment as opensource for others to play with, so I created a small project for it on github. Unlike the original version, no complicated scp scheme is required — I just reused Joel Maher’s most excellent mozhttpd library from mozbase to run a web server in the same process as the capture logic. All you need to do is run the server on a Linux machine with a webcam and connect to it with a web browser from any other machine on your local network.
https://github.com/wlach/retroscope
Enjoy!
May 07, 2012 03:08 PM
May 04, 2012
Ok, this is somewhat mundane, but I’ve already had to do it twice (and helped someone do something similar on #mobile), so I figured I might as well blog about it for posterity.
For various automation tasks (notably the Eideticker dashboard and the cross-browser startup tests), we need to be able to launch an Android browser on the command line (via adb shell or our own custom SUTAgent). This is a bit of a black art, but you can find references on how to do this on stackoverflow and other places. The magic incantation is:
am start -a android.intent.action.VIEW -n <application/intent> -d <url>
So, for example, to launch Fennec, you’d run this on the Android command prompt:
am start -a android.intent.action.VIEW -n org.mozilla.fennec/.App -d http://mygreatsite.info
Ok, easy enough, but what if we want to launch a new browser that we just downloaded (e.g. Google Chrome)? Where do we get the application and intent names?
The short answer is that you need to reach into the apk and dig.
There’s probably many ways of doing this, but here’s what I do (which has the distinct advantage of not needing to compile, download or run weird java applications):
1. Copy the apk onto your machine (the apk should be in /data/app: if you have a rooted phone, you should be able to copy that off to your machine).
2. Extract AndroidManifest.xml from the apk (it’s just a .zip) and run axml2xml.pl on it.
3. Examine the resultant xml file and look for the <manifest> tag. It should have a property called package, which is the package name. For example:
We can see pretty clearly that the application name in this case is com.android.chrome (you can also get this by running ps when using the application)
4. Finally, look for an <intent-filter> tag containing an <action> tag whose android-name property is android.intent.action.VIEW. Scan up for the enclosing <activity> tag and read its android-name property. This is the activity name. For example:
Likewise here we see that the activity name we want is .Main (which Android explicitly expands out to com.android.chrome.Main)
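If you'd rather not eyeball the XML, steps 3 and 4 can be sketched in a few lines of Python. This is only a sketch under some assumptions: the sample manifest below is a hypothetical, heavily trimmed one I wrote to match the Chrome example, and it assumes the decoded file uses the standard android: XML namespace for the name attribute (your decoder's output may spell the attribute differently, in which case adjust NAME).

```python
import xml.etree.ElementTree as ET

# Assumption: the decoded manifest uses the standard Android XML namespace.
ANDROID_NS = "http://schemas.android.com/apk/res/android"
NAME = "{%s}name" % ANDROID_NS

# Hypothetical, trimmed-down manifest for illustration only.
SAMPLE_MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.android.chrome">
  <application>
    <activity android:name=".Main">
      <intent-filter>
        <action android:name="android.intent.action.VIEW"/>
      </intent-filter>
    </activity>
  </application>
</manifest>
"""


def find_view_component(manifest_xml):
    """Return (package, activity) for the first activity handling VIEW."""
    root = ET.fromstring(manifest_xml.strip())
    package = root.get("package")
    for activity in root.iter("activity"):
        for action in activity.iter("action"):
            if action.get(NAME) == "android.intent.action.VIEW":
                return package, activity.get(NAME)
    return package, None


def am_start_command(package, activity, url):
    """Build the 'am start' argument list for the given component and URL."""
    return ["am", "start", "-a", "android.intent.action.VIEW",
            "-n", "%s/%s" % (package, activity), "-d", url]
```

Joining the result of am_start_command(*find_view_component(SAMPLE_MANIFEST), "http://mygreatsite.info") with spaces gives the same kind of command line as the Fennec example above.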
Armed with this information, you should now have enough information to launch the application. Furthering the example above, here’s how to start Chrome on Android via adb’s shell:
am start -a android.intent.action.VIEW -n com.android.chrome/.Main -d http://mygreatsite.info
Hope this helps someone, somewhere.
May 04, 2012 08:39 PM
At Mozilla, we have many different testing frameworks, each of which fills a different niche (although there is definitely some degree of overlap among them). For testing WebAPIs in B2G, some of these existing frameworks can be utilized, depending on the API. For example, mozSettings and mozContacts can be tested using mochitests, since there isn’t much, if anything, that’s device-specific to them. (We’re not currently running mochitests on B2G devices, but will be soon.)
But there are many other WebAPIs which are not testable using any of our standard frameworks, because tests for them need to interact with hardware in interesting ways, and most of our frameworks are designed to operate entirely within a gecko context, and thus have no ability to directly access hardware.
Malini Das and I have been working on a new framework called Marionette which can help. Marionette is a remote test driver, so it can remotely execute test steps within a gecko process while retaining the ability to interact with the outside world, including devices running B2G. When this is combined with the B2G emulator’s ability to query and set hardware state, we have a solution for testing a number of WebAPIs that would be difficult or impossible to test otherwise.
To illustrate how this works, I’m going to walk through the entire process of writing WebAPI tests for mozBattery and mozTelephony, to be run on B2G emulators. We already have such tests running in continuous integration, reporting to autolog. If developers add new Marionette WebAPI tests, they will be run and reported here as well. Eventually, they will likely be migrated over to TBPL.
Building the emulator
These tests will be run on the emulator, so you’ll have to build the B2G Ice Cream Sandwich emulator first, if you don’t have one already. You’ll need to do this on Linux, preferably Ubuntu. Make sure to install the build prerequisites before you begin, if you haven’t built B2G before.
git clone https://github.com/andreasgal/B2G
cd B2G
make sync (get a cup of coffee, this takes quite a while)
make config-qemu-ics (get another cup of coffee)
make gonk (get another drink, but I think you've had enough coffee by now)
make
You should now have an emulator, which you can launch using:
./emu-ics.sh
After you’ve verified the emulator is working, close it again.
Running a Marionette sanity test
Now we’ll run a single Marionette test to verify that everything is working as expected. First, ensure that you have Python 2.7 on your system. Then, install some prerequisites:
pip install (or easy_install) manifestdestiny
pip install (or easy_install) mozhttpd
pip install (or easy_install) mozprocess
Now, from the directory where you cloned the B2G repo:
cd gecko/testing/marionette/client/marionette
python runtests.py --emulator --homedir /path/to/B2G/repo \
tests/unit/test_simpletest_sanity.py
If everything has gone well, you should see something like the following:
TEST-START test_simpletest_sanity.py
test_is (test_simpletest_sanity.SimpletestSanityTest) ... ok
test_isnot (test_simpletest_sanity.SimpletestSanityTest) ... ok
test_ok (test_simpletest_sanity.SimpletestSanityTest) ... ok
----------------------------------------------------------------------
Ran 3 tests in 2.952s
OK
SUMMARY
-------
passed: 3
failed: 0
todo: 0
Writing a battery test
The B2G emulator allows you to arbitrarily set the battery level and charging state, by telnetting into the emulator’s console port and issuing certain commands. Marionette has an EmulatorBattery class which abstracts these operations, and allows you to interact with the emulator’s battery using a very simple API.
A simple example is given in the EmulatorBattery documentation on MDN. Save this example to a file named test_battery_example.py, and run this command:
python runtests.py --emulator --homedir /path/to/B2G/repo /path/to/test_battery_example.py
Marionette should launch an emulator and run the test; when it’s done you should see:
TEST-START test_battery_example.py
test_level (test_battery_example.TestBatteryLevel) ... ok
----------------------------------------------------------------------
Ran 1 test in 0.391s
OK
SUMMARY
-------
passed: 1
failed: 0
todo: 0
How it works
This test, like all Marionette Python tests, is written using Python’s unittest framework, which provides the assert methods used in the test. Other methods used by the test are provided by the Marionette and EmulatorBattery classes.
When the test executes this line:
self.marionette.emulator.battery.level = 0.25
the EmulatorBattery class telnets into the emulator and sets the battery’s level. We then read the level back (which invokes another telnet command) to verify that the emulator’s battery state was updated as expected. And finally, we execute a snippet of JavaScript inside gecko:
moz_level = self.marionette.execute_script("return navigator.mozBattery.level;")
and verify that it returns the same battery level as the emulator is reporting directly.
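The shape of such a test (set the hardware state, read it back, then compare with what gecko reports) can be tried out without an emulator. The sketch below is not the MDN example: FakeBattery, FakeEmulator and FakeMarionette are hypothetical stand-ins I made up so the file runs with plain unittest, and they only mimic the two calls quoted above.

```python
import unittest


class FakeBattery(object):
    """Hypothetical stand-in for the emulator battery. The real class talks
    to the emulator's console; this one just stores the value."""

    def __init__(self):
        self.level = 1.0


class FakeEmulator(object):
    """Hypothetical stand-in exposing a battery attribute."""

    def __init__(self):
        self.battery = FakeBattery()


class FakeMarionette(object):
    """Hypothetical stand-in for a Marionette session."""

    def __init__(self):
        self.emulator = FakeEmulator()

    def execute_script(self, script):
        # Pretend gecko reports whatever the emulated hardware says.
        if "navigator.mozBattery.level" in script:
            return self.emulator.battery.level
        raise NotImplementedError(script)


class TestBatteryLevelSketch(unittest.TestCase):

    def setUp(self):
        # A real Marionette test is handed self.marionette by the harness.
        self.marionette = FakeMarionette()

    def test_level(self):
        # 1. Set the hardware state.
        self.marionette.emulator.battery.level = 0.25
        # 2. Read it back from the (fake) emulator.
        self.assertEqual(self.marionette.emulator.battery.level, 0.25)
        # 3. Ask (fake) gecko and compare with the emulator's own report.
        moz_level = self.marionette.execute_script(
            "return navigator.mozBattery.level;")
        self.assertEqual(moz_level, self.marionette.emulator.battery.level)
```

Against a real emulator the stand-ins go away and the harness provides self.marionette; the three steps in test_level stay the same.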
More tests with hardware interaction
In addition to battery interaction, the B2G emulator allows you to query and set the state of other properties normally set by hardware, like GPS location, network status, and various sensors. Tests for all these could be written in a similar way. It probably makes sense to make classes for these similar to EmulatorBattery which abstract the details of getting and setting the state of the underlying hardware. I would encourage WebAPI developers to add as many WebAPI tests as possible; if you would like us to add convenience classes, please ping us on IRC (jgriffin and mdas, on #ateam or #b2g) or file a bug under Testing:Marionette.
Multi-emulator tests
There are some WebAPIs which cannot be completely tested using a single device or emulator, like telephony and SMS. Marionette can help with these too, as Marionette can be used to manipulate two emulator instances which are capable of communicating with each other.
In any tests run with the --emulator switch, Marionette launches an emulator before running the tests, and this emulator is associated with an instance of the Marionette class available to the test as self.marionette. Tests can invoke a second emulator instance using self.get_new_emulator(), and these emulator instances can call and text each other using their port numbers as their phone numbers.
To illustrate how this works, Malini has written an example test in which one emulator is used to dial another, and the caller’s number is verified on the receiver. See this example at https://developer.mozilla.org/en/Marionette/Marionette_Python_Tests/Emulator_Integrated_Tests#Manage_Multiple_Emulators.
If you save this example to test_dial_example.py and run the command:
python runtests.py --emulator --homedir /path/to/B2G/repo /path/to/test_dial_example.py
you should see Marionette launch one emulator, and then after it starts execution of the test, you should see a second emulator instance launch. After the test is done, you should see a successful report, similar to the one shown for the battery test.
We currently have a few tests for mozTelephony, but many more could be added, and new tests should be added for SMS/MMS as well.
Adding new tests to the B2G continuous integration
When new tests are ready to be added to the CI, they should be checked into gecko under their dom component, e.g., dom/telephony/test/marionette. They should be added to the manifest.ini file in the same directory, and then for new manifest.ini files, the path to the .ini file should be added to the master manifest at http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/tests/unit-tests.ini. After this is done, they should be picked up by the B2G CI, after the gecko fork of B2G is updated, where they will be reported along with the other tests to autolog.
Caveats, provisos, and miscellanea
B2G builds go to sleep after 60 seconds of inactivity. In the emulator, this “sleep” will completely lock up Marionette if it occurs while a test is running. This is very inconvenient while testing. See bug 739476. Until some better mechanism of handling this is available, I usually edit gecko/b2g/apps/b2g.js to increase the value of the power.screen.timeout pref before building, to prevent the emulator from going to sleep.
The current test failures in autolog are being tracked as bug 751403 and bug 751406.
Network access in the emulator currently doesn’t seem to work (see https://github.com/andreasgal/B2G/issues/287). This prevents some parts of Gaia from working correctly but doesn’t interfere with the above style of WebAPI tests, none of which rely on Gaia or network access.
Building the emulator is very time-consuming, mostly due to the time required to sync all the various repos needed by B2G. We hope to be able to post emulator builds for download soon, after a few details are worked out.
More reading
What is Marionette
Marionette Python tests
Marionette Emulator tests
the Marionette class
the Emulator class
Please contribute tests
There are many WebAPIs which are less tested than they could be. Please help us expand test coverage by contributing tests in areas similar to those described above. If you need help, contact :jgriffin or :mdas on IRC, or file a bug under Testing:Marionette.

May 04, 2012 05:31 PM
April 25, 2012
Automation and Testing : Considering a Page-Centric Talos
Currently, the canonical unit of Talos tests is a page set.
However, a page-centric point of view offers several intrinsic
advantages on top of being, in my opinion, more conceptually coherent.
A page-centric point of view allows easy adding and updating of
pages. Currently, making a new page set is a big deal. Since we
average over all pages in a page set to obtain a quality metric,
adding a new page (or removing a page) will change this number and
the entire baseline for comparison has to be recentered. If we
made the page the canonical unit of testing, then adding or
removing a page doesn't involve a recentering as each page has a
quality metric associated with it.
Taking an average over all pages to get a quality metric, as we do,
gives a higher weight to pages that take (e.g.) longer to load. For
instance, consider the output for tsvg:
|i|pagename|runs|
|0;gearflowers.svg;79;65;68;68;67
|1;composite-scale.svg;46;35;44;41;42
|2;composite-scale-opacity.svg;21;22;24;22;20
|3;composite-scale-rotate.svg;23;21;21;20;19
|4;composite-scale-rotate-opacity.svg;19;24;19;19;23
|5;hixie-001.xml;45643;14976;17807;14971;17235
|6;hixie-002.xml;51257;15193;21693;14969;14974
|7;hixie-003.xml;5016;37375;5021;5024;5008
|8;hixie-004.xml;5052;5053;5054;5054;5053
|9;hixie-005.xml;4618;4533;4611;4532;4554
|10;hixie-006.xml;5059;5107;9741;5107;5089
|11;hixie-007.xml;1629;1651;1648;1652;1649
A performance loss (or gain) in e.g. gearflowers.svg is likely not
to be noticed in this pageset as it is several orders of magnitude
lower than (e.g.) hixie-002.xml, so a small percentage-wise noise
in the latter could easily hide a legitimate regression in the former.
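To put a number on this, here is a small sketch using the tsvg runs listed above. For simplicity it uses a plain mean over all page averages as the page set metric (an assumption for illustration, not the exact production formula) and simulates a 10% slowdown in gearflowers.svg only.

```python
from statistics import mean

# Runs copied from the tsvg output above.
TSVG_RUNS = {
    "gearflowers.svg": [79, 65, 68, 68, 67],
    "composite-scale.svg": [46, 35, 44, 41, 42],
    "composite-scale-opacity.svg": [21, 22, 24, 22, 20],
    "composite-scale-rotate.svg": [23, 21, 21, 20, 19],
    "composite-scale-rotate-opacity.svg": [19, 24, 19, 19, 23],
    "hixie-001.xml": [45643, 14976, 17807, 14971, 17235],
    "hixie-002.xml": [51257, 15193, 21693, 14969, 14974],
    "hixie-003.xml": [5016, 37375, 5021, 5024, 5008],
    "hixie-004.xml": [5052, 5053, 5054, 5054, 5053],
    "hixie-005.xml": [4618, 4533, 4611, 4532, 4554],
    "hixie-006.xml": [5059, 5107, 9741, 5107, 5089],
    "hixie-007.xml": [1629, 1651, 1648, 1652, 1649],
}


def page_means(runs):
    """Per-page quality metric: the mean of that page's runs."""
    return {page: mean(values) for page, values in runs.items()}


def pageset_metric(runs):
    """Page set metric, simplified here to a plain mean of the page means."""
    return mean(list(page_means(runs).values()))


def relative_change(before, after):
    return (after - before) / before


# Simulate a 10% regression that only affects gearflowers.svg.
regressed = dict(TSVG_RUNS)
regressed["gearflowers.svg"] = [v * 1.10 for v in TSVG_RUNS["gearflowers.svg"]]

page_change = relative_change(page_means(TSVG_RUNS)["gearflowers.svg"],
                              page_means(regressed)["gearflowers.svg"])
set_change = relative_change(pageset_metric(TSVG_RUNS),
                             pageset_metric(regressed))
print("gearflowers.svg: %+.2f%%, page set: %+.4f%%"
      % (100 * page_change, 100 * set_change))
```

The per-page number moves by the full 10%, while the page set number moves by roughly a hundredth of a percent, far below the run-to-run noise of the hixie pages.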
Having this additional data of what changes regress which pages allows
us to explore how these particular page modifications affect
performance. If we can isolate patterns, we can fix them.
One conceptual disadvantage to a page-centric approach is that
deciding whether a changeset is a net regression or not becomes
harder. Ideally a human (or other expert system) would evaluate all
of the data across pages and decide whether a change is a regression
or not. However, we have many pages and not enough people, so this is
harder to do than to craft a formula for a quality metric.
To obtain an overall quality metric for a push, some sort of averaging
over pages must be done. We currently throw away the highest value
and take the mean of remaining page averages. If we continue with
this approach we throw away the ability to easily add and remove pages
without futzing with the metric. Instead, a method should be sought
whereby adding a new page does not affect a metric.
April 25, 2012 09:33 AM
[ For more information on the Eideticker software I'm referring to, see this entry ]
tl;dr: You can now run the standard eideticker benchmarks easily on any Android phone without any kind of specialized hardware.
So Eideticker is pretty great at comparing relative performance between different browsers and generally measuring things in an absolutely neutral way. Unfortunately it’s a bit of a pain to use it at the moment to catch regressions: the software still has a few bugs and encoding/decoding/analyzing the capture still takes a great deal of time. Not to mention the fact that it currently requires specialized hardware (though that will soon be less of a concern at least inside MoCo, where we have a bunch of Eideticker boxes on order for the Toronto and Mountain View offices).
A few months ago, Chris Lord wrote up some great code to internally measure the amount of checkerboarding going on in Fennec. I’ve thought for a while that it would be a neat idea to hook this up to the Eideticker harness, and today I finally did so. After installing Eideticker, you can now run the benchmark on any machine against an arbitrary fennec build just by typing this from the eideticker root directory:
adb shell setprop log.tag.GeckoLayerRendererProf DEBUG
./bin/get-metric-for-build.py --no-capture --get-internal-checkerboard-stats --num-runs 3 nightly.apk src/tests/scrolling/taskjs.org/index.html
In return, you’ll get some nice clean results like this:
=== Internal Checkerboard Stats (sum of percents, not percentage) ===
[167.34348, 171.871015, 175.3420296]
Just to be sure that the results were comparable, I did a quick set of runs on the Eideticker machine in Mountain View with both internal checkerboard statistics gathering and HDMI capturing enabled.
| Stats file | HDMI capturing |
| 167.34348 | 177.022 |
| 171.87 | 184.46 |
| 175.34 | 184.44 |
While the results aren’t identical (we measure number of frames differently inside Fennec than we do with Eideticker, for one thing), they do seem roughly correlated. So go forth, benchmark and tweak!
P.S. If you’ve been following mobile automation, you might be asking why I don’t just suggest running Talos and Robocop on your workstation. Can’t they do the same sorts of things? The short answer is that yes, they can, but unfortunately they’re much more involved to set up and use at the moment. Arguably they shouldn’t be, and this is something we (Mozilla tools & automation) need to work on. We’ll get there eventually (and help would be welcome!). For now, hacks like this should help with getting out the first release of Fennec by providing a fast, easy to use tool for bisection and analysis.
April 25, 2012 01:47 AM
April 19, 2012
Build times for mozilla-central are a major factor in developer productivity. Faster build times mean more people using try (reducing breakage) and more fine-grained regression ranges (reducing the impact of breakages). As a side benefit, it allows us to avoid buying and maintaining more hardware (or put new hardware to better use). About a half-year ago, we set up a project called BuildFaster to try to bring these times down, setting the ambitious goal of getting build times (from checkin to tests done) down to 2 hours. We didn’t quite succeed, though we did make some major strides. As part of this project, we also developed a dashboard to track our progress and narrow down the major bottlenecks which were keeping up our build times.
Unfortunately, this dashboard went down earlier this year with the rest of Brasstacks and we hadn’t had the chance to bring it back up. I’m pleased to announce that thanks to Jonathan Griffin, it’s finally back online.
While no one is actively working on build performance at the moment (at least to my knowledge), it’s still useful to keep track of build times to make sure that we don’t regress. Anecdotally, it has seemed to me that the time needed to get results from try has been pretty stable over the last while, and this is borne out by the results:

As the cliche goes: no news is good news.
April 19, 2012 10:27 PM
April 06, 2012
[ For more information on the Eideticker software I'm referring to, see this entry ]
Participated in an interesting meeting on checkerboarding in Firefox for Android yesterday. As a reminder, checkerboarding refers to the amount of time you spend waiting to see the full page after you do a swipe on your mobile device, and it’s a big issue right now – so much so that it puts our delivery goal for the new native browser at risk.
It seems like we have a number of strategies for improving performance which will likely solve the problem, but we need to be able to measure improvements to make sure that we’re making progress. This is one of the places where Eideticker could be useful (especially with regards to measuring us against the competition), though there are a few things that we need to add before it’s going to be as useful as it could be. The most urgent, as I understand, is to come up with a suite of tests which accurately represent the set of pages that we’re having issues with. The current main measure of checkerboarding that we’re using with eideticker is taskjs.org which, while an interesting test case in some ways, doesn’t accurately represent the sort of site that the user would normally go to in the wild (and thus be annoyed by).
This is going to take a few days (and a lot of review: I’m definitely no expert when it comes to this stuff) to get right, but I just added two tests for the New York Times which I think are a step in the right direction of being more representative of real-world use cases. Have a look here:
http://wrla.ch/eideticker/dashboard/#/nytimes-scrolling
http://wrla.ch/eideticker/dashboard/#/nytimes-zooming
The results here actually aren’t as bad as I would have expected/remembered. The amount of checkerboarding after a zoom out is a bit annoying (I understand this is a known issue with font caching, or something) but not too terrible. Still, any improvements that show up here will probably apply across a wide variety of sites, as the design patterns on the New York Times site are very common.
(P.S. yes, I know I promised a comparison with Google Chrome for Android last time… rest assured that’s still coming soon!)
April 06, 2012 01:12 AM
April 03, 2012
Just thought I’d mention this because I found it handy.
A while back AaronMT wrote up some clever instructions on taking Android screenshots by dumping the contents of ‘/dev/fb0’ and running ffmpeg on the results. This is useful, but you need to know the resolution of the device you have connected to pass the right arguments to ffmpeg. Wouldn’t it be better if you had just one script that would work for whatever device you had plugged in?
In fact, there is a way to do this using the monkeyrunner utility. Intended mainly as a tool for synthesizing input on Android (more on that some other time), you can also easily get a capture of the Android screen with its python/jython API (assuming you have the Android SDK installed). Here’s a quick script which does the job:
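from com.android.monkeyrunner import MonkeyRunner, MonkeyDevice
import os
import sys

# Expect exactly one argument: the file to write the screenshot to.
if len(sys.argv) != 2:
    print "Usage: %s <output file>" % os.path.basename(sys.argv[0])
    sys.exit(1)

# Block until a device is connected, then grab the screen and save it.
device = MonkeyRunner.waitForConnection()
result = device.takeSnapshot()
result.writeToFile(sys.argv[1], 'png')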
Copy that into a file called capture.py (or whatever), then run it like so:
monkeyrunner capture.py screenshot.png
And you’re off to the races! Nice screenshot, no utilities or non-essential command line arguments required!
(credit to this stackoverflow answer for the idea)
April 03, 2012 09:49 PM
March 22, 2012
[ For more information on the Eideticker software I'm referring to, see this entry ]
Since my first Eideticker dashboard post was so well received, I thought I’d give a quick update on another metric that I just brought online: checkerboarding (a.k.a. the amount of time you spend waiting to see the full page after you do a swipe on your mobile device).

[ link to real thing ]
Unfortunately the news here is not as good as before: as the numbers indicate, the new Native Fennec currently performs substantially worse than the version in Android market. This is a known issue, and is currently being tracked in bug 719447.
Next up: Seeing how we do against Google Chrome for Android.
March 22, 2012 10:07 PM
March 16, 2012
Over the last while, Clint Talbert and I have been working on setting up automatic mobile performance tests using Eideticker (a framework to measure perceived Firefox performance by video capturing automated browser interactions: for more information, see my earlier post).
There’s many reasons why this is interesting, but probably the most important one is that it can measure differences reliably across different types of mobile browsers. Currently I’m testing the old XUL fennec, the Android stock browser, and the latest nightlies.
I’m pleased to announce that the first iteration of the dashboard is available for public consumption, on my site.
http://wrla.ch/eideticker/dashboard/#/canvas

The demo is pretty cheesy (just click on any of the datapoints to see the video capture), but nonetheless does seem to illustrate some interesting differences between the three browsers. The big jump in performance for nightly comes from the landing of the Maple branch, which happened earlier this week. Hopefully this validates some of the work that the mobile/graphics team has been doing over the past while. Exciting times!
March 16, 2012 06:51 PM
March 12, 2012
Over the last year there has been a lot of research into reducing the noise in our talos performance numbers. For example looking at tscroll, we have a fluctuation in the reported numbers of almost 400 (out of 14000). Jan Larres took a look at this problem in his masters thesis, and found a variety of factors that did and didn’t contribute to the noise. We actually have Bug 706912 filed to implement some of his suggestions on how to calculated the posted number. Last fall, Stephen Lewchuk look at the raw data that was collected and found some inconsistencies in the way we were aggregating the data. In short, we have a lot of ground to cover if we want to reduce our numbers.
Over the last couple of months, we have been working on a project called Signal From Noise. This is an attempt to fix the way we collect some numbers and redo the way we aggregate numbers for reporting. We have done a lot of experimenting with the primary focus on tp5. The way we run tp5 is to load each of the 100 pages once, then repeat 10 times. For each page, we would drop the highest value and take the median value of the remaining 9 numbers. This results in an array of 100 data points which get reported to the graph server. We take those 100 data points and average them out to generate the single number for tp5. It is easy to imagine that the small samples and median/average combination will produce a lot of noise.
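As a rough sketch of the aggregation just described (the function names and the sample load times are made up for illustration; this is not the actual pageloader or graph server code):

```python
from statistics import mean, median


def page_value(samples):
    """Drop the single highest sample, then take the median of the rest."""
    if len(samples) < 2:
        raise ValueError("need at least two samples per page")
    remaining = sorted(samples)[:-1]  # discard the highest value
    return median(remaining)


def suite_value(samples_by_page):
    """Average the per-page medians into one number for the whole suite."""
    return mean(page_value(samples) for samples in samples_by_page.values())


if __name__ == "__main__":
    # Hypothetical load times (ms) for two pages, 10 loads each.
    samples_by_page = {
        "page-a": [120, 101, 99, 100, 102, 98, 103, 100, 97, 400],
        "page-b": [250, 210, 205, 207, 209, 206, 208, 204, 211, 203],
    }
    print(suite_value(samples_by_page))
```

Note how the 400 ms and 250 ms samples never influence the final number, and how a regression on a single page is diluted by the average over all pages.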
Going forward, we are looking to change from column major to row major and collect 30 samples instead of 10. This means we focus on one page and load it 30 times, then move to the next page and repeat until all 100 pages have been loaded. The downside is the runtime, as we move from an average of 17 minutes to an average of 39 minutes for the entire tp5 run. Collecting 30 samples will give us a much more meaningful number, but we also found that the first 5-10 iterations contain the most noise. So initially we are looking to throw away the first 10 numbers instead of throwing away the highest number as we originally did. When looking at the raw numbers (not the aggregated number), here are some graphs to highlight the difference:


This is only the first of many changes needed. After rolling this out, we need to evaluate the other test suites as well and ensure we are running adequate cycles to get a valid sample size. We are also working on allowing the database to accept the raw values instead of the single median value per page. Likewise, we are looking to stop doing an average([median(page)]). All of this will allow us to find per-page regressions more easily instead of having them washed out by the other numbers.
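To make the change in load order and sampling concrete, here is a minimal sketch. The function names are hypothetical and the real tp5 harness is more involved; this only illustrates the ordering and the "ignore the first 10 of 30" idea.

```python
def interleaved_order(pages, cycles):
    """Old order: load every page once, then repeat the whole set."""
    return [page for _ in range(cycles) for page in pages]


def page_by_page_order(pages, cycles):
    """New order: load one page `cycles` times, then move to the next page."""
    return [page for page in pages for _ in range(cycles)]


def drop_warmup(samples, ignore_first=10):
    """Discard the first `ignore_first` samples, which tend to be the noisiest."""
    return samples[ignore_first:]


if __name__ == "__main__":
    pages = ["a.html", "b.html"]
    print(interleaved_order(pages, 3))   # a, b, a, b, a, b
    print(page_by_page_order(pages, 3))  # a, a, a, b, b, b
    print(len(drop_warmup(list(range(30)))))  # 20 samples remain
```

Keeping the remaining 20 raw samples per page, rather than collapsing them immediately, is what makes per-page regression detection possible later on.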

March 12, 2012 02:33 PM
March 06, 2012
So alongside my work with
Marionette, I've been working with the DXR folks to help get their builds tested. Since a lot of DXR status updates occur over at the #static IRC channel on irc.mozilla.org, here's a quick rundown of what's going on: Many of you have been asking if DXR is ready to replace MXR, and what the timeline for that is. Well, now there's some good news! Most of the implementation work to get DXR's search features running, and running quickly, is done, and all that's left to get MXR parity is a few UI features and tweaks.
Taras, the manager of the DXR team, is aiming to have these UI changes worked out next month, so expect big things for DXR in the near future. Right now, the Lanedo team is developing the codebase, and I'm running between them and release engineering to get production ...
March 06, 2012 10:07 PM
March 05, 2012
UPDATE: It's been pointed out that the current metric (sum of squares of unresponsive periods, divided by 1000) is used in Talos and has had a fair bit of thought put into it. I was curious what not squaring the results would do, but I wouldn't go with another metric without more careful thought.
UPDATE 2: It has also been pointed out that peptest tests performance, not correctness, and hence should report its results elsewhere (essentially as I've done with the sampled data) and not be a strict pass/fail test. This approach definitely warrants some consideration.
About a week and a half ago,
peptest was deployed to try. To recap, peptest identifies periods of unresponsiveness, where "unresponsiveness" is currently defined as any time the event loop takes more than 50 ms to complete. We have a very small suite of basic tests at the moment, looking for unresponsiveness while opening a blank tab, opening a new window, opening the bookmarks menu, opening and using context menus, and resizing a window.
The results are currently ignored, since we still don't know how useful they will be, but you can see them by going to
https://tbpl.mozilla.org/?tree=Try&noignore=1. They are marked by a "U" (not sure why exactly, but it will change at some point to something more obvious).
At the moment, every platform fails at least one of these tests, and most of the time there are multiple failed tests. This isn't too surprising, since 50 ms is a pretty bold target. However, going forward, we need some sort of baseline result, so that we can identify real regressions. To accomplish this, peptest tests can be configured with a failure threshold. We calculate a metric for each test (see below), and, if a failure threshold is configured, a metric value below this threshold is considered a pass. Hopefully, we can identify a threshold for each test (or, likely, a threshold for each platform-test combination) such that all the tests pass but significant increases in unresponsiveness will trigger failures. At the same time, we will also file bugs on all the tests so we don't forget about the fact that there are still unresponsive periods during their execution that are being hidden by the thresholds. We can lower or eliminate the thresholds if these bugs are partially or fully fixed.
Things, of course, aren't that simple. I gathered and analyzed the peptest logs from try over a four-day period, and there is quite a lot of variance in the results, even on the same platform. With a sufficiently generous threshold, we could get the tests to pass most of the time, but there are occasionally some crazy outliers that no reasonable threshold could contain. However, it is probably okay to have the tests turn orange once in a while. 0 oranges might be an unreasonable target for this project, and intermittent oranges would be a reminder that, sometimes, there are really unacceptable periods of unresponsiveness.
(Btw one test, test_contextMenu.js, appears to only fail on Linux and Linux64, but this is actually a bug in the test--on all the other platforms, it's erroring out before it hits the end. I've since fixed this but haven't collected new data yet.)
I experimented a bit with the test metric, to see if that improved the situation. Right now, as deployed on try, the metric is calculated as the sum of the squares of the unresponsive periods in a single test (an unresponsive period being, by definition, a value above 50). I tried just summing the periods without squaring them, which seemingly increases the variance in some tests and decreases it in others. I also experimented with raising the minimum unresponsive period from 50 ms to 100 ms, since there are strong arguments that 50 ms is pretty unrealistic, at least at this stage.
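The metric variants discussed above can be sketched roughly as follows. This is not peptest's actual code: the function names are hypothetical, and treating the un-squared variant as a plain sum in milliseconds (without the division by 1000) is my assumption.

```python
def unresponsive_metric(event_loop_times_ms, min_period_ms=50, square=True):
    """Compute a responsiveness metric from event-loop completion times (ms).

    Only periods longer than `min_period_ms` count as unresponsive.
    square=True  -> sum of squares of those periods, divided by 1000
    square=False -> plain sum of those periods (assumed: no scaling)
    """
    periods = [t for t in event_loop_times_ms if t > min_period_ms]
    if square:
        return sum(t * t for t in periods) / 1000.0
    return float(sum(periods))


def passes(metric, failure_threshold=0.0):
    """A run passes if it had no unresponsive periods at all, or if a
    failure threshold is configured and the metric is below it."""
    return metric == 0 or metric < failure_threshold


if __name__ == "__main__":
    times = [10, 60, 120]  # hypothetical event-loop times in ms
    print(unresponsive_metric(times))                     # squares, 50 ms minimum
    print(unresponsive_metric(times, square=False))       # plain sum
    print(unresponsive_metric(times, min_period_ms=100))  # 100 ms minimum
```

Squaring penalizes one long stall much more heavily than several short ones, which is why it tends to make outliers stand out.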
I've graphed the failures, along with their mean and standard deviations, at
http://people.mozilla.com/~mcote/peptest/results/. I also plotted passes as 0s (there are certainly lots of unresponsive periods less than 50 ms in those passes, but for all intents and purposes they are 0) in a different colour. There are unique URLs to all combinations of platform, test, and metric. The raw data is also available there (in JSON).
Following is a brief discussion of some of the problems with identifying good failure thresholds.
Some of the simple tests don't have much variance. test_openBlankTab.js, which just measures the responsiveness when opening and closing a blank tab, mostly passes, with just a few outliers. Some slightly more complicated tests, however, have quite a bit of variance. The bookmarks-menu test, test_openBookmarksMenu.js, scrolls through the bookmarks menu and then opens the bookmarks window. The results on snowleopard are particularly egregious:

As you can see, most of the failures are clustered around the mean. The standard deviation encompasses most of them. Changing the metric from the sum of squares of unresponsive periods to just the sum of the periods improves things a little:

There is only one point above a single standard deviation, although two are rather close. Increasing the allowable unresponsive period to 100 ms reduced the standard deviation, but only because a few low points became passes:

So this is one example where we would expect to see at least one orange every few days, even if we set the failure threshold to about 25% higher than the mean.
In other cases we have mostly passes but some really crazy outliers. On snowleopard, test_openWindow.js, which merely opens a new window, has mostly passes, but in this sample there is one run that had unresponsive periods totalling more than 250 ms.

So here, we could leave the failure threshold at 0 ms, although we'd still have oranges every few days. In this case, setting the unresponsive threshold to 100 ms wouldn't make a difference, since the few failures are significantly above 100 ms.
test_openWindow.js on leopard, however, is all over the place when using just the sum of unresponsive periods:

There aren't really any outliers here, just a large spread of values. A reasonable failure threshold here would have to be twice the mean to ensure that oranges only occur occasionally.
In this case, switching to a sum of squares makes the outliers more obvious, although the standard deviation becomes quite large:

And in case it wasn't obvious, the results are completely different on a different OS. Take test_openWindow.js on Windows 7:

Most results are clustered, but there are 5-6 real outliers, depending on how you define an outlier. This test-platform combination looks to have real potential for regular oranges unless an extremely generous failure threshold is defined.
In conclusion, it's going to be kind of tough to define failure thresholds such that most runs pass and that real regressions are identified. There doesn't seem to be a huge difference between using the sum of unresponsive periods versus the sum of their squares, although in some instances the latter makes the outliers more obvious. Raising the minimum acceptable unresponsive period unsurprisingly causes more passes but doesn't really improve the variance in the failures. Regardless, it looks like I will have to go through the sampled results and, for each test, set a failure threshold that encompasses the majority of the failures, but even still there will be intermittent oranges. Comments and suggestions welcome!
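One way to go through the sampled results is to derive a candidate threshold from the mean and standard deviation of the sampled metrics and then count how many of the sampled runs would still go orange. The sketch below is only an illustration of that idea; the function names and the sample values are hypothetical, and the choice of "mean plus some number of standard deviations" is an assumption rather than a settled method.

```python
from statistics import mean, pstdev


def suggest_threshold(metrics, num_stddevs=1.0):
    """Candidate failure threshold: mean + num_stddevs * standard deviation."""
    return mean(metrics) + num_stddevs * pstdev(metrics)


def count_oranges(metrics, threshold):
    """Number of sampled runs that would still fail with this threshold
    (a run passes only if its metric is below the threshold)."""
    return sum(1 for m in metrics if m >= threshold)


if __name__ == "__main__":
    sampled = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical metric values
    threshold = suggest_threshold(sampled)
    print(threshold, count_oranges(sampled, threshold))
```

Running this per platform-test combination would show at a glance which combinations are likely to produce regular oranges even with a generous threshold.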
March 05, 2012 03:51 PM
February 26, 2012
I've released version 2.0 of
flot-axislabels, the
flot plug-in for labelling axes. Flot is a great, easy-to-use JavaScript graphing lib, based on canvas; however, many people (myself included) viewed the lack of support for axis labels to be a big fault. With flot-axislabels, you can get said labels by just loading the script after flot and setting one extra option per axis (or a couple more if you have specific needs).
Version 2.0 (which is actually the first "real" release but has a lot of recent changes) now supports any number of X and Y axes. Previously only 2 X and 2 Y axes were supported (top, bottom, left and right).
Having more than 4 axes on a single plot probably sounds a bit weird, but apparently it is useful when plotting weather conditions:

You can see the live
example and view its source to see how it's done. It's really quite simple.
flot-axislabels continues to support CSS transforms, canvas, and traditional CSS positioning (plus a special mode for IE 8 combining CSS positioning with IE's special rotation functions). In the first two modes, labels for Y axes are rotated to face the plot. Graceful degradation is attempted based on the browser's detected capabilities.
Internally, it no longer pays attention to the name of the axis (yaxis, y2axis, etc.) but rather looks at the 'position' variable, which flot automatically sets if it is not provided. I believe this means that it will only work with flot 0.7, however.
Read the
README, download the
zip, and follow the
project on github.
February 26, 2012 04:17 AM
February 15, 2012
Talos Signal from Noise: Configurable Talos Data Filters
As part of
Signal from Noise
I introduced a patch that changes the way --ignoreFirst works and
adds configurable data filters to
Talos :
While this is a small change in terms of how the code currently works,
it lays the groundwork for a window of possibilities in terms of Talos
statistics. Currently, pageloader calculates the "median" (ignoring
the high value), the mean, the max, and the min, and outputs these
along with the raw run data. Pageloader is for loading pages and
taking measurements, not really for doing statistics. So it would be
nice to move this upstream: first to Talos, then to graphserver proper.
Being able to specify data filters with --filter from the command
line and filter: in the .yml configuration file allows the
test-runner to change the "interesting number" by which we measure
performance metrics on the fly. While there are currently only a few
filters available, it is easy to add more metrics as we need them.
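To illustrate the idea of chainable data filters, here is a minimal sketch. It is not the Talos implementation: the registry, the "name:argument" spec syntax, and the function names are assumptions made for the example.

```python
from statistics import mean, median


def ignore_first(values, count=1):
    """Drop the first `count` values."""
    return values[int(count):]


def ignore_max(values):
    """Drop a single occurrence of the highest value."""
    values = list(values)
    values.remove(max(values))
    return values


# Hypothetical registry mapping filter names to functions.
FILTERS = {
    "ignore_first": ignore_first,
    "ignore_max": ignore_max,
    "median": median,
    "mean": mean,
}


def apply_filters(values, specs):
    """Apply filters in order; each spec is 'name' or 'name:argument'."""
    result = list(values)
    for spec in specs:
        name, _, arg = spec.partition(":")
        func = FILTERS[name]
        result = func(result, arg) if arg else func(result)
    return result


if __name__ == "__main__":
    runs = [147.0, 1130.0, 1078.0, 92.0, 100.0]  # sample run times
    print(apply_filters(runs, ["ignore_max", "median"]))
    print(apply_filters(runs, ["ignore_first", "median"]))
```

The point of the design is that changing the "interesting number" becomes a matter of changing the list of filter names, not the code that gathers the data.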
In a parallel effort, the
JetPerf
software
consumes Talos
filters
. This is a good example of the expansion of the Talos ecosystem: since Talos is a
critical part of our performance testing infrastructure, we are building
tests and frameworks on top of it. In general, the
A Team is moving towards a
testing ecosystem of reusable parts and sane APIs.
Data filters were added to talos
as an interim measure to make the "interesting number" calculations
more flexible. As we play with different types of statistics, we need
the ability to change configuration without having to jump through too
many hoops and this fulfills this immediate need.
However, in the longer term, Talos and pageloader shouldn't really be
doing statistics at all. They are in the "statistics gathering" camp
where
graphserver is in the
"statistics processing" business. It would also be nice if there was
a piece of software that let you analyze Talos results locally,
ideally using the same statistics processing package that graphserver uses.
This is outlined in
https://bugzilla.mozilla.org/show_bug.cgi?id=721902 .
February 15, 2012 12:44 PM
February 11, 2012
For the last few days I’ve been experimenting with getting a Pandaboard running Android 4.0, continuing the work that Clint Talbert started in the fall to get these boards for use as a replacement for the Tegra in Mozilla’s android automation. The first objective is to get a reproducible build going, after that we’ll try to get some of our custom tools (SUTAgent & friends) installed by default.
So far this has been… interesting. Much as Clint did before, I thought I’d document some of the notes on what I did in the hopes that they’ll be helpful to other people trying to do similar things.
Getting things up and running is a two step process. First, you build the beast. The build part, at least, is more or less straightforward. Just follow the instructions here:
Note that you almost certainly want to build in the “eng” configuration, which is rooted and (apparently) has some extra tools installed.
Installing it is a little more tricky. The way they want you to do this is put the pandaboard into a special mode and copy the stuff you built onto an sdcard. Seem a little funny to you? Yeah, it does to me too. Why not just build an sdcard image directly?
Nonetheless, this is the officially supported way of imaging a pandaboard, so let’s just follow it until we can think of a better way of doing things.
The instructions for doing this on the pandaboard are located in the source tree here:
device/ti/panda/README
These are mostly correct as far as I can tell, but there are a few gotchas. First, you need to run the commands mentioned as root unless you’ve configured USB to be accessible by your user. Second, most of those commands are not in the path by default so you’ll need to specify the full path to e.g. the fastboot utility. The instructions here cover these exception cases: I recommend following them instead.
One thing which neither document mentions is that you really need to make sure your sdcard is wiped completely clean before using fastboot. The “oem format” step only recreates the partition table, it doesn’t delete any corrupted partitions. If you reboot while these are still in place, it will try to bring up your corrupted version of Android, not the fastboot console. I spent quite some time debugging why I couldn’t properly flash the operating system before realizing this. Easiest way to get around this is to dd /dev/zero onto the sdcard before beginning the flashing process.
Also, while not strictly necessary to get something up and running, I highly recommend getting an HDMI monitor as well as a serial<->USB adapter. The former is useful to see if your Android device actually booted up successfully; the latter is useful for debugging boot issues where you don’t get that far (the serial console is always available from boot).
So, after painfully learning about the above caveats, I have managed to get things mostly working. I can see the ICS homescreen on my attached HDMI monitor and interact with it if I attach a USB mouse. The one gotcha is that both ethernet and WIFI networking are totally broken. Plugging in an ethernet cable or connecting to a WIFI network seems to result in the machine randomly rebooting, with the logs saying nothing useful. Both of these things are ostensibly supposed to be working according to the latest I’ve read from Google so I’m not exactly sure what’s going on. Investigations will continue.
February 11, 2012 12:04 AM
January 31, 2012
Talos Signal from Noise: analyzing the data
Recently, a change was pushed as part of the
Signal from Noise
effort in order to make
Talos
statistics better: https://bugzilla.mozilla.org/show_bug.cgi?id=710484
The idea being that the way we are doing things is skewing the data
and not helping with noise.
Currently,
pageloader
calculates the median after throwing out the highest point:
http://hg.mozilla.org/build/pageloader/file/beca399c3a16/chrome/report.js#l114
We introduced --ignoreFirst to instead ignore the first point and
calculate the median of the remaining runs.
However, after introducing the change we noticed that our distribution
had gone bimodal during side by side staging:
Were we doing something other than what we thought we were doing? Were
our calculations wrong? Or was something else going on?
So
jmaher
and I dove in to take a look at the data. jmaher dug up a high-mode
and low-mode case from the TBPL logs corresponding to the push sets
displayed on
graphserver
https://tbpl.mozilla.org/php/getParsedLog.php?id=8982519&tree=Firefox&full=1
high point:
NOISE: __start_tp_report
NOISE: _x_x_mozilla_page_load,109,NaN,NaN
NOISE: _x_x_mozilla_page_load_details,avgmedian|109|average|354.25|minimum|NaN|maximum|NaN|stddev|NaN
NOISE: |i|pagename|median|mean|min|max|runs|
NOISE: |0;big-optimizable-group-opacity-2500.svg;123.5;354.25;92;1130;147;1130;1078;92;100
NOISE: |1;small-group-opacity-2500.svg;109;2333.25;103;9247;103;9012;9247;111;107
NOISE: __end_tp_report
https://tbpl.mozilla.org/php/getParsedLog.php?id=8982267&tree=Firefox&full=1
low point:
NOISE: __start_tp_report
NOISE: _x_x_mozilla_page_load,108,NaN,NaN
NOISE: _x_x_mozilla_page_load_details,avgmedian|108|average|113.00|minimum|NaN|maximum|NaN|stddev|NaN
NOISE: |i|pagename|median|mean|min|max|runs|
NOISE: |0;big-optimizable-group-opacity-2500.svg;119;353.75;91;1132;139;1132;1086;91;99
NOISE: |1;small-group-opacity-2500.svg;108;113;103;9116;103;133;9116;108;108
NOISE: __end_tp_report
From http://pastebin.mozilla.org/1470000 .
Since I can't really read this being a mere human being, I modified
results.py
to parse this data:
+
+if __name__ == '__main__':
+    import sys
+    string_high = """
+|0;big-optimizable-group-opacity-2500.svg;123.5;354.25;92;1130;147;1130;1078;92;100
+|1;small-group-opacity-2500.svg;109;2333.25;103;9247;103;9012;9247;111;107
+"""
+    string_low = """
+|0;big-optimizable-group-opacity-2500.svg;119;353.75;91;1132;139;1132;1086;91;99
+|1;small-group-opacity-2500.svg;108;113;103;9116;103;133;9116;108;108
+"""
+    big = PageloaderResults(string_high)
+    small = PageloaderResults(string_low)
+    import pdb; pdb.set_trace()
This makes some PageloaderResults objects that are explorable with
pdb . While I did this as a
one-off hack, this is something we'll probably generally want as part of
Signal from Noise: https://bugzilla.mozilla.org/show_bug.cgi?id=722915
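For readers without the Talos source at hand, the parsing of these report lines can be sketched independently. The following is a standalone illustration, not the actual PageloaderResults class from results.py; the function names are hypothetical, and the field order is taken from the |i|pagename|median|mean|min|max|runs| header in the log.

```python
def parse_report_line(line):
    """Parse one 'i;pagename;median;mean;min;max;run;run;...' report line."""
    fields = line.strip().split(";")
    median, mean, minimum, maximum = (float(f) for f in fields[2:6])
    return {
        "index": fields[0],
        "page": fields[1],
        "median": median,
        "mean": mean,
        "min": minimum,
        "max": maximum,
        "runs": [float(f) for f in fields[6:]],  # everything after max
    }


def parse_report(text):
    """Parse every non-blank line of a multi-line report string."""
    return [parse_report_line(line)
            for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    report = """
|0;big-optimizable-group-opacity-2500.svg;119;353.75;91;1132;139;1132;1086;91;99
|1;small-group-opacity-2500.svg;108;113;103;9116;103;133;9116;108;108
"""
    for result in parse_report(report):
        print(result["page"], result["runs"])
```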
Then I looked at the data:
(Pdb) pp(small.results)
[{'index': '|0',
'max': 1132.0,
'mean': 353.75,
'median': 119.0,
'min': 91.0,
'page': 'big-optimizable-group-opacity-2500.svg',
'runs': [139.0, 1132.0, 1086.0, 91.0, 99.0]},
{'index': '|1',
'max': 9116.0,
'mean': 113.0,
'median': 108.0,
'min': 103.0,
'page': 'small-group-opacity-2500.svg',
'runs': [103.0, 133.0, 9116.0, 108.0, 108.0]}]
(Pdb) pp(big.results)
[{'index': '|0',
'max': 1130.0,
'mean': 354.25,
'median': 123.5,
'min': 92.0,
'page': 'big-optimizable-group-opacity-2500.svg',
'runs': [147.0, 1130.0, 1078.0, 92.0, 100.0]},
{'index': '|1',
'max': 9247.0,
'mean': 2333.25,
'median': 109.0,
'min': 103.0,
'page': 'small-group-opacity-2500.svg',
'runs': [103.0, 9012.0, 9247.0, 111.0, 107.0]}]
You'll notice a few things from the runs data:
- the runs data is indeed bifurcated. In all cases there is a low value,
around a hundred, and a high value in the thousands
- contrary to the assumption that the first datapoint may be biased and high,
you can't really see any bias, at least compared to the magnitude of
the bifurcation
So how does this compare to the graphserver results?
http://graphs-new.mozilla.org/graph.html#tests=[[170,1,21],[57,1,21]]&sel=1327791635000,1328041307110&displayrange=7&datatype=running
For the old data and the low value of the new data, we see times around
110-120ms. The high value of the new data is around 590ms. Are these
numbers what we'd expect?
Throwing away the high value and taking the median for both data sets gives
a number of the order of 100 or so (the old algorithm). Taking the median
functions as a filter for the bifurcated results towards the majority
population. Since the low population is slightly larger, dropping
the highest number in the way that pageloader does further biases towards it.
It is not surprising we see no bifurcation in the old data.
For the new data, we drop the first run. Coincidentally or not, for the cases
studied the first run was part of the low population, so that tends
towards bifurcation. Taking the median of the remaining data points gives
High case:
- big-optimizable-group-opacity-2500.svg : (1078 + 100) / 2 = 589
- small-group-opacity-2500.svg : (9012 + 111) / 2 = 4561.5
Low case:
- big-optimizable-group-opacity-2500.svg : (99 + 1086) / 2 = 592.5
- small-group-opacity-2500.svg : (133 + 108) / 2 = 120.5
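These numbers can be checked with a few lines of Python. The sketch below uses the runs data from the logs above; the function names are mine and this is not the pageloader code itself, just the two calculations side by side.

```python
from statistics import median

# Runs data copied from the high-mode and low-mode logs above.
RUNS = {
    "high": {
        "big-optimizable-group-opacity-2500.svg": [147, 1130, 1078, 92, 100],
        "small-group-opacity-2500.svg": [103, 9012, 9247, 111, 107],
    },
    "low": {
        "big-optimizable-group-opacity-2500.svg": [139, 1132, 1086, 91, 99],
        "small-group-opacity-2500.svg": [103, 133, 9116, 108, 108],
    },
}


def old_median(runs):
    """Old algorithm: throw out the highest run, take the median of the rest."""
    return median(sorted(runs)[:-1])


def new_median(runs):
    """New algorithm (--ignoreFirst): throw out the first run, then median."""
    return median(runs[1:])


if __name__ == "__main__":
    for case, pages in RUNS.items():
        for page, runs in pages.items():
            print(case, page, old_median(runs), new_median(runs))
```

With four runs left in each case, the median is the mean of the two middle values, so whenever two of the remaining runs are in the high population the result lands halfway between the two populations.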
So why does the high case come out high and the low case come out low? There
is even more magic. Graphserver reports an average by taking the
mean of all the pages but discarding the high result: http://hg.mozilla.org/graphs/file/d93235e751c1/server/pyfomatic/collect.py#l208
(from
http://hg.mozilla.org/graphs/file/d93235e751c1/server/pyfomatic/collect.py#l265
from http://hg.mozilla.org/graphs/file/d93235e751c1/server/collect.cgi
). Since both of the runs exhibit the high value of the bifurcation in
the high case, you report the lower of the two bifurcated values: 589,
from big-optimizable-group-opacity-2500.svg. Since in the low case
only one of the values is bifurcated, you get the low value: 120.5,
from small-group-opacity-2500.svg .
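The effect of this last step on a two-page test can be reproduced with a small sketch. This is not the code in collect.py, only an illustration of "mean of all pages, discarding the highest"; the function name is hypothetical, and keeping a lone value when there is only one page is this sketch's choice, not a claim about what graphserver does.

```python
def average_discarding_high(page_values):
    """Mean of the per-page values after dropping the single highest one."""
    values = sorted(page_values)
    if len(values) > 1:
        values = values[:-1]  # discard the highest page value
    # With a single page, this sketch simply keeps that value (assumption).
    return sum(values) / len(values)


if __name__ == "__main__":
    high_case = [589.0, 4561.5]  # per-page medians, high-mode run
    low_case = [592.5, 120.5]    # per-page medians, low-mode run
    print(average_discarding_high(high_case))  # value from the "big" page
    print(average_discarding_high(low_case))   # value from the "small" page
```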
Okay, mystery solved. We know why graphserver is reporting what data
it is reporting and we also know that our algorithm is doing what we
think it is doing. However, this is the beginning instead of the end
of the problem.
By taking the average and discarding the high value of two data
points, we are doing something weird and wrong. We are effectively
only reporting one of the two pages. Note that for the high and the low
case we are actually viewing data from different pages! This
is misleading and probably outright wrong. We essentially have two
pages just to throw one of them away and then we have no confidence in
what we are looking at. I'm not sure if the code at
http://hg.mozilla.org/graphs/file/d93235e751c1/server/pyfomatic/collect.py#l208
would even work for a single page. Probably not. In general I grow
increasingly skeptical of our amalgamation of results. We need
increasingly to be able to get to and manipulate the raw data. We
certainly need a way of digging into the stats, knowing what we're
looking at, and having confidence in it. In general, talos, pageloader,
and graphserver need to make it both easier to try new
filters and more transparent what is actually happening.
We have been trying to bias towards the low numbers. Looking at the data for
the four tests shows that there are 13 low-state numbers and 7 high-state
numbers. While there are more numbers in the low state, it is not an
overwhelming majority.
This leaves the big elephant in the room: why are these runs
bifurcated? Are we seeing a code path, or is something else happening
on these builders that leads to bifurcated results? While this will
be challenging to investigate, IMHO we should know why this happens.
While our method of throwing out the highest data point, getting the
median, throwing the data to graphserver, then getting the average of
the whole pageset back, has a positive effect of minimizing noise
(which is important), it is also sweeping a lot under the rug. We
need to have confidence that what we're ignoring is okay to ignore. I
don't have that confidence yet.
January 31, 2012 04:42 PM
January 25, 2012
I’ve been spending a bit more time on refining the checkerboarding tests in Eideticker that I talked about last time. Most of my work has been focused on making the results as representative of a real world scenario as possible. To that effect I’ve been:
- Changing the test case from a web site of my own concoction to a more realistic example (the taskjs.org site)
- Using actual Android native events (via MonkeyRunner) to synthesize touch-based scrolling instead of simulating the event in JavaScript (which exercises a completely different codepath).
- Fixing various synchronization issues to make results more repeatable. Before, captures were of wildly variable lengths, which made the numbers extremely suspect. There are probably still a few issues, but far fewer than before.
The end result of this is a framework that gives much more meaningful results. The bad news is that the results that I’m measuring don’t show a very positive picture for where we’re at with the native re-write of Firefox. Even relative to the version of mobile Firefox which is currently on the Android Market, we still have some catching up to do. Here’s some video of the “old” firefox in action:
And here’s the Native fennec (what we’re currently offering in nightly, with some minor modifications by me to change the way the “checkerboard” is drawn for analysis purposes):
The numbers behind this comparison:
| Platform | Percent checkerboarding over run of test |
| Old Fennec | 2% |
| Native Fennec | 57% |
(by the way, this performance regression is filed as bug 719447)
I know there’s lots of great effort going into improving this situation, so I have hope that we’ll be doing much better on this metric in the coming days/weeks. The process for creating these videos/analyses is mostly automated at this point, so my plan is to create a small dashboard (à la arewefastyet.com) to measure these numbers over time on the latest nightlies. Stay tuned!
January 25, 2012 10:18 PM
January 24, 2012
Mozilla Automation and Testing - Jetpack Performance Testing
I have a working proof of concept for
Jetpack
performance testing
(JetPerf):
http://k0s.org/mozilla/hg/jetperf .
JetPerf uses
mozharness
to run
Talos ts
tests with an addon built with the
Jetpack
addon-sdk
to measure differences between performance with and without the addon
installed.
Playing with Jetpack + Talos performance lets us explore statistics in
a bit more straightforward manner than the production Talos numbers.
In the
Signal from Noise
project, which I am also part of, there are a lot of steps to staging
even small changes in how we process Talos data since the system
involved has many moving parts
(
Talos,
pageloader,
graphserver
). By contrast, since JetPerf is a new project, it gives us much more
flexibility to explore the data in ways that we have not hitherto explored.
I made a
mozharness script
to clone the
hg mirror of addon-sdk .
It then builds a
sample addon
and runs Talos with it installed.
Looking at raw numbers wasn't very interesting, so I made a
parser
for Talos's
data format
It was pretty quick to get some
averages
out before and after the addon was installed, but I thought it would
be more useful to display the raw data along with the averages.
These really aren't fair numbers, as currently the stub jetpack I use
prints to a file, but it's at least the start of a methodology.
The reason I'm sharing this isn't just to make a progress report, but
more to present some ideas about thinking about what to do with Talos data.
While this was done for JetPerf, much of this also applies to Signal
from Noise. You run Talos and get some results. What do you do with
them? Currently we just shove them into http://graphs.mozilla.org/
and say that's where you process them, but I think looking at them
locally is not only important but necessary if you're doing
development work. I think a big part of any statistics-heavy project
is to make it easy for all of the stakeholders to explore data,
apply different filters and see how things fit together. While it
takes a statistician to be rigorous about the process, anyone can play
with statistics and it takes a village to really conceptualize what is
being looked at. I hope, to this end, developers will use my software
so that they can understand what it is doing and provide the valuable
feedback I need.
TODO
JetPerf is still very much at a proof of concept stage. Ignoring the
fact that none of it is in production, there are still many
outstanding questions
about basic facts of what we are doing here. But outside of polishing
rough edges, here are some things in the pipeline.
- test more variations of addons; currently we just load a panel and
print something to a file
- test on checkin (CI):
so the main point of JetPerf is to get a better idea of what SDK
changes cause addon performance regressions and hits, to be able to
quantify them. While as stated this is a very open ended project,
one thing that would turn this from a casual exploration into a developer
tool is running the tests on checkin. This will give an update in
real time on whether a checkin breaks performance.
- graphserver: in order to assess Jetpack's performance over time, we
will want to send numbers to some sort of
graphserver .
This will allow us to keep track of the data,
to view it, and apply various operations to it.
I may also spin off the (ad hoc) graphing portion and the Talos log parser
portions into their own modules, as they may be useful outside of just
JetPerf.
January 24, 2012 01:10 PM
January 03, 2012
After my post on measuring checkerboarding in mobile Firefox, Clint Talbert (my fearless manager) suggested I run a before and after test to measure the improvement that just landed as part of bug 709512. After a bit of cleanup, I did so, measuring the delta between my build on December 20th and the latest version of Aurora. The difference is pretty remarkable: at least on the LG G2X that I’ve been using for testing, we’ve gone from checkerboarding between 10-20% of the time to almost not checkerboarding at all (in between two runs of the test with the Aurora build, there is exactly one frame that checkerboards). All credit to Chris Lord for that!
See the video evidence for yourself. Before:
After:
January 03, 2012 08:18 PM
January 01, 2012
Mozilla Automation + Testing - MozBase Continuous Integration
As part of the
A-Team
2011 Q4 goals
I was able to devote a few days to setting up
continuous integration (CI) for
MozBase.
I revived and extended
autobot to support
buildbot 0.8.5, set up
tests
and a simple
test runner
for mozbase, and deployed a test instance to k0s.org. You can see
the waterfall here: http://k0s.org:8010
While buildbot comes with a
gitpoller
the version in
buildbot 0.8.5
(the current in http://pypi.python.org/ ) did not work with
git 1.6.3, the version on k0s.org. Since my box is on an ancient
version of Ubuntu (and is remote and not trivially upgradable), I
brought the generic
autobot poller
from being buildbot 0.8.3 compatible to 0.8.5 compatible
(which, it is worth noting, is not trivial).
Also, while there has been
a patch for an hgpoller
submitted by
Mozilla
developers some four years ago, it has been WONTFIXed, so I
went ahead with a generic polling architecture which (IMHO) seems a
wiser architectural choice. While I sympathize with the architectural
ideology of using a push-based architecture, and believe this is
closer to ideal, polling will always work and does not require access
to the repository servers which is a huge factor when using
https://github.com or even Mozilla hg repositories. (Incidentally,
I found neither this patch nor
http://hg.mozilla.org/build/buildbotcustom/file/tip/changes/hgpoller.py
to work OOTB, so, sadly, I proceeded to roll my own. Also
incidentally, it is not trivial to depend on buildbotcustom using
install_requires due to its lack of a setup.py file.)
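For illustration, the polling idea can be sketched in a few lines of Python. This is not autobot's or buildbot's code: `RevisionPoller` and `fetch_head` are hypothetical names, and `fetch_head` stands in for whatever asks the repository for its latest revision (for example by shelling out to git or hg).

```python
class RevisionPoller(object):
    """Remember the last revision seen and report when it changes."""

    def __init__(self, fetch_head):
        # fetch_head: hypothetical callable returning the current head revision
        self.fetch_head = fetch_head
        self.last_seen = None

    def poll(self):
        """Return the new revision if the head moved since the last poll, else None."""
        head = self.fetch_head()
        # The first poll only records a baseline; it never triggers a build.
        changed = self.last_seen is not None and head != self.last_seen
        self.last_seen = head
        return head if changed else None
```

The point of the sketch is that the poller needs nothing from the repository server beyond read access, which is why polling works against hosts where hooks cannot be installed.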
After debugging the gitpoller I pushed
a test change and was happy to see
that autobot built correctly. Autobot now listens to MozBase changes!
I was unable to finish the (parenthetical)
Q4 goal
of having autobot report to
autolog, so
this remains outstanding work. There is a lot that could be done with
autolog. The basic idea and TODOs are outlined in the
README
(which itself could use some work; it is largely up to date
except the Projects section, though incomplete). I will endeavor to
work on this in my available time or as need escalates, but my
priority for
2012 Q1
will be separating
Talos Signal From Noise
so it is unlikely I will be able to put a lot of time into autobot
(sadly). On the other hand, I am more than willing to help
and advise if anyone
wants any features or to iron out the crinkles. While the
architecture is not completely straightforward, it is a decent
approximation to a
convex hull
over the
problem space
of having simple to write, simple to maintain, simple to debug
continuous integration for small(er) projects. As usual, if anyone
wanted to seek out alternate solutions, that is fine too, but I am
essentially happy with my architecture decisions and technology
choices.
Regardless of whether the CI solution for MozBase is autobot or
(other), it is important to remember that continuous integration is a
safety net and not a first line of defense. It is regrettable that
autobot has no more notifications (yet) than the
waterfall display
and the autobot character lurking in
#ateam (the default
IRC bot
isn't very verbal OOTB and I haven't had time to customize
it). But I think having some (admittedly smokescreen) automated testing
for MozBase is an important step towards the evolution of the software
as well as towards development practices in general.
January 01, 2012 04:51 PM
December 28, 2011
Auto-tools Q4 in reflection: progress on mozbase and talos
Most of my effort this quarter was spent on two related goals:
- developing a sane set of python packages to build test harnesses
on top of. We call this MozBase: https://wiki.mozilla.org/Auto-tools/Projects/MozBase
- Making Talos sane and
porting it to use the MozBase set of packages.
These are illustrated in our goals page:
https://wiki.mozilla.org/Auto-tools/Goals/2011Q4#Mozbase
From one point of view, this isn't exciting work. But I live for this
stuff. I think of software as an ecosystem to be cultivated and I
live to cultivate it. So while, for the most part, I can't point to
any exciting features that I implemented (nor were there planned to
be), in retrospect I am proud of the fruits of my efforts and those of
my team-mates and comrades. A big shout out to BYK and others who
have stepped up to the plate to help the
A-Team with these
super-important efforts.
When I look back I see:
- Talos wasn't a python package. Now it is!
- MozBase didn't even exist or have a repo. Now it does
- MozBase didn't have documentation or tests worth speaking of. Now
it has at least a good start!
- Talos even has a test for installation. We need more tests, but it's a good start!
- There has been a lot of cleanup of Talos towards the end of making it more robust, easier to use, and easier to contribute to.
- The A-Team didn't have any community contributors. Now we do!
This one actually makes me the happiest :)
When I look at the progress, I see Talos evolving towards what I would
call real software (instead of a one-off that has been extended to do
way too much to make it a one-off) that Mozillians can hack on and
extend and make useful changes to. This also sets the stage for
making Talos easier for developers to use locally to test their
changes as well as getting more of our test harnesses to use the
MozBase suite of utilities as well as making it easier to write new
harnesses without reinventing so much of the wheel.
One of our next priorities towards these ends is
Bug 713055 - get Talos on Mozharness in production
This is a huge step towards making buildbot more extensible as well as
having desktop talos be more accessible to developers in a way that
should be identical to the way that it is done in automation.
:aki has done a bunch of work to start
moving our aging buildbot infrastructure towards something more sane.
This is mozharness.
Armen (:armenzg) also updated the
way that talos.zip is sought so that it can be decoupled from
buildbot. This is another big step forward that he details in his blog post:
talos.zip, talos.json and you.
So a huge shout out to
:jmaher and
:wlach for all the Talos help, and
:ahal
and :ctalbert
as well as all the help from those in
release engineering
for making all of this possible. I look forward to getting this all
better in the coming year.
December 28, 2011 12:18 PM
December 27, 2011
One of the longest running performance measurements we have is how long it takes Firefox to start. We do it very simply just to get a raw number (and yes, there have been many improvements made but this is the gist of the automation):
- Start Firefox with a URL ending with a query parameter like “start=”<current time in ms since the epoch>
- The page that the URL points to does a JavaScript “new Date().getTime();” as part of its onload handler and subtracts the value in the “start” query parameter from it
- The page prints the resulting time to the console (because we can do that since we control the browser and the profile)
- Automation reads the console and puts the value in a database
Pretty simple. Applying this to different browsers, you have to nix the “print to console” idea. But, how hard could it be to POST to a web service that stuffs your result in a database? Do that and the rest of it will all “just work”, right?
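The measurement in the steps above boils down to one timestamp and one subtraction. Here is a minimal Python sketch of both halves; `build_start_url` and `startup_time_ms` are hypothetical helper names, and in the real automation the subtraction happens in JavaScript inside the page, not in Python.

```python
import time
from urllib.parse import parse_qs, urlencode, urlparse


def build_start_url(page_url, now_ms=None):
    """Append start=<ms since the epoch> to the URL the browser is launched with."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    separator = '&' if '?' in page_url else '?'
    return page_url + separator + urlencode({'start': now_ms})


def startup_time_ms(url, onload_ms):
    """Mirror what the page does in onload: time at onload minus the start stamp."""
    start_ms = int(parse_qs(urlparse(url).query)['start'][0])
    return onload_ms - start_ms
```

Note that the launching process and the page must share a clock for the difference to mean anything, which holds here because both run on the same device.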
Well, not really. Every browser implements the cross-origin access policy to a different degree, and since we did this on Android, some of them don’t seem to support it at all. Once we found a way around that, we realized that not all the data was making it into the database because the automation would kill the browser before it had a chance to POST its results. So we slowed that down, forcing the automation to wait 20s before closing the browser. Then our database crashed. We had nothing to do with that part, but Murphy’s law states that you can’t have an automation project without at least one bonfire igniting under your chair.
Add to this cross-browser headache that we’re automating this on multiple phones. The older Nexus phones (Nexus One and Nexus S) will not stay connected to a wireless network after reboot (appears fixed with Galaxy-Nexus or with ICS, not sure which). Even if you put these phones on an open network with no contention and they are set to “join automatically”, they will at some point boot into a state with their wireless disabled. We had to write some service code to ensure the wireless remained on and connected to our specific network on boot. Our other phones (a Droid Pro and a Samsung Galaxy S2) have no problem staying connected to the network, but they alternately “freeze”. I’m still trying to debug what this “freeze” actually is, but everything is functioning fine on the phone – network, logcat, process list etc are all normal. However, the phone stops running the automation. It’s interesting that the Nexus phones never encounter this issue and they are all running the same version of the automation code and browsers.
At long last, we have fought through enough of these issues so that we can start to see the results of the data coming into our database (select “2 months” or “all” to see data). Because we are merely firing our “timing” function when the “onload” event happens for the page, we can see the different interoperability issues with measuring this event. We knew it wasn’t perfect, but the results we are seeing on Android make me call into question the usefulness of this as a cross-browser comparison tool at all.
- Opera seems to fire the onload event randomly. I’m not sure what they are doing, but their timing is all over the place. Note that this could be a fluke in the automation as the Samsung/Droid Pro hang usually occurs during the Opera test (which, by chance, is also the first test). However, note that the Opera numbers for the Nexus phones are also wild, and they are not afflicted by this unusual hang.
- Dolphin and the stock Android browser are both webkit based browsers and we have always known that webkit tends to fire this event very early in the page-load sequence. This is reinforced by the fact that the event always happens at roughly the same time regardless of the underlying phone hardware, especially on the stock Android browser.
- Fennec – this automation measures the new native Fennec product. Currently, the system contains results from the beginning of the project to the point at which we moved from the birch tree into the mozilla-central tree. I have another set of jobs to run that will get us the last two weeks from the mozilla-central tree, once the phones finish their jobs from the previous two months. Of the four browsers being measured, the only one changing versions is Fennec; therefore, you can see the effect of our developers’ work as they add features and battle regressions. Native Fennec is still under heavy development, and this is why the Fennec number jumps around as much as it does.

The system is far from perfect. Measuring onload is at best an artificial metric, and not at all indicative of what the user sees. In desktop automation, we don’t even use onload, we use the “mozafterpaint” event notification. For the next stage of the cross-browser test we are going to automate some visual comparison tests to get closer to measuring the metric that really matters: real-life user experience. In the meantime, the onload tests will continue to give us a rough barometer of our regressions and performance, especially against our own historical data. To that end, I am going to undertake the next few improvements to this automation:
- Understand what the hang is on the Galaxy S2 and Droid Pro phones and fix it
- Add more phones to the system so that it doesn’t take so long to run through a set of jobs (we only need these temporarily until the system catches up on old data).
- Experiment with lowering the timeout period between “results uploaded” and killing the browser under test. (This might work better now that we have changed database backends).
- Get a better front end UI for the results. If you’d like to contribute to this, let me know, because this website could sure use your help!

December 27, 2011 10:48 PM
December 23, 2011
Just before I leave for some Christmas vacation, it’s time for another update on the state of Eideticker. Since I last blogged about the software, I’ve been working on the following three areas:
- Coming up with a better algorithm (green screen / red screen) for both determining the area of the capture as well as the start/end of the capture. The harness was already flood filling the area with these colours at the beginning/end of the capture, but now we’re actually using this information. The code’s a little hacky, but it seems to work well enough for the test cases I’ve been using so far.
- As a demonstration, I wrote up a quick test that demonstrates checkerboarding on mobile Fennec, and wrote up a quick bit of analysis code to detect this pattern and give an overall measure of how much this test “checkerboards” (i.e. has regions that are not fully painted when the user scrolls). As I understand it, this is an area that our mobile team is currently working on quite a bit, so it will be interesting to watch the numbers given by this test and see if things improve.
- It’s a minor thing, but you can now view a complete webm movie of the capture right from the web interface.
Here’s a quick demonstration video that shows all the above in action. As before, you might want to watch this full screen:
Happy holidays!
December 23, 2011 04:59 PM
December 21, 2011
I have been asked a few times over the last couple months how to help out at Mozilla, specifically with python. I know there are dozens of teams within Mozilla that have various python related projects. I am on the automation and tools team at Mozilla (known as the A*Team) and we do a lot of python related work. It seems that we are asked to add new and crazy stuff to harnesses or write new and interesting tools (usually a blend of python and javascript).
1) Mozbase. Our spare-time effort is to refactor the test harnesses within Mozilla to share common code where possible; we call it mozbase. I recommend doing a git clone of mozbase and getting it installed on your system: git clone git@github.com:mozilla/mozbase.git
2) Talos. Next is to pick up a test harness. We have been focusing on talos, mostly because you don’t have to pull the entire mozilla-central tree or do hour-long builds, and really because the talos code base is in need of some serious updating. To get talos, you need to clone it: hg clone http://hg.mozilla.org/build/talos
3) Configure Talos. Talos is run in 2 steps right now: a configuration step and an execution step. The configuration step, which requires a path to firefox.exe as well as an active test (I use ts to keep it simple), is pretty easy: “python PerfConfigurator.py --develop -a ts -e <path>/firefox.exe --output mytest.yml”.
4) Run Talos. This step is easy. Make sure you don’t have another instance of firefox.exe running on the computer and then run: “python run_tests.py -d -n mytest.yml”.
5) Take a look at some of these bugs that we have which are related to mozbase and/or talos: http://bit.ly/tZHs3G
While this isn’t exhaustive or a perfect guide for how to work on the perfect bug in an hour or less, these 5 steps should get you setup to work on basic Mozilla code and start fixing bugs! Pop into #ateam on irc.mozilla.org and ask some questions.
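If you end up repeating steps 3 and 4 a lot, they are easy to wrap in a small script. The following is only a sketch: `talos_commands` and `run_talos` are hypothetical helpers, the command lines are the ones quoted in the steps above (assuming the long options are spelled with two hyphens), and `talos_dir` is wherever you cloned talos.

```python
import subprocess


def talos_commands(firefox_path, test='ts', config='mytest.yml'):
    """Build the configuration and execution command lines for one talos run."""
    configure = ['python', 'PerfConfigurator.py', '--develop',
                 '-a', test, '-e', firefox_path, '--output', config]
    run = ['python', 'run_tests.py', '-d', '-n', config]
    return configure, run


def run_talos(firefox_path, talos_dir, test='ts', config='mytest.yml'):
    """Run both steps from the talos checkout, stopping at the first failure."""
    for command in talos_commands(firefox_path, test, config):
        subprocess.check_call(command, cwd=talos_dir)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces, which is common for firefox.exe on Windows.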
Now back to the other PI(e) that I always talk about!

December 21, 2011 04:38 AM
December 20, 2011
Making the Mozilla automation infrastructure run reliably for each checkin on mobile devices has been my primary focus for the last few years. Last year at this time we were just trying to get Android automation up and running; all tools and harnesses had been written and were ready to run. The core buildbot code for running the tests was in place. The problem was that we just had so many failures of the devices (NVidia Tegra development boards) and the tests.
So as the months went on from last December and up through August, we really made little progress. A few tests were fixed, some disabled, some checks in place to make the boards stay online, but really no consistent set of test results.
There were a couple things that fixed our problems:
1) a rock star intern (:jchen) who found some workarounds for OS issues so fennec wouldn’t crash all the time (issues with networking and libc).
2) a weekly meeting started by :blassey to go over all the bugs, status, issues, future work, and other items.
Both of these items are signs that the mobile development team was serious about testing and wanting to see Android unittests become a part of Mozilla. While this seems trivial, it was next to impossible to keep tests running smoothly without support from the entire team.
Enjoy the reliable unit tests on Android!

December 20, 2011 09:03 PM
December 11, 2011
Peptest is an automated test harness for measuring responsiveness (or lack thereof) in Firefox (see my
older post).
Recently, peptest landed in Mozilla Central along with a make target. This means that you can run the tests with:
cd path/to/objdir
make peptest
There's some more
in depth documentation on MDN.
Adding Tests
Being able to run the tests is great and all, but there is one problem: there aren't any tests yet (aside from a handful of example tests)!
Tests need to be added by developers and should correspond to a bug. I'll be making another post about best practices in a bit, but for now you can check out the
test format to get started.
Once you have your test finished, tested and reviewed you can add it to the default list of tests. Tests currently live in
testing/peptest. For example, say I had a couple tests that tested
the responsiveness of Firefox's tabs. I might:
- Create a new directory called testing/peptest/tests/firefox/tabbrowsing
- Place the tests into this directory
- Create a manifest called tabbrowsing.ini and populate it with my tests, e.g.:
[tabbrowsing_open.js]
[tabbrowsing_close.js]
[tabbrowsing_switch.js]
- Add this manifest to the firefox_all.ini manifest:
[include:tabbrowsing/tabbrowsing.ini]
Any tests that are included in the
firefox_all.ini manifest will be run when
make peptest is run.
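With more than a handful of tests, the manifest can be generated instead of typed. This is a hedged sketch, not part of peptest: `manifest_text` and `include_line` are hypothetical helpers that only reproduce the two line formats shown in the example above.

```python
def manifest_text(test_files):
    """One [filename] section per test, in the format shown for tabbrowsing.ini."""
    return ''.join('[%s]\n' % name for name in test_files)


def include_line(manifest_path):
    """The line that pulls a sub-manifest into firefox_all.ini."""
    return '[include:%s]' % manifest_path
```

For the tab browsing example, `manifest_text` would be given the three test file names and `include_line` the relative path tabbrowsing/tabbrowsing.ini.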
Feel free to ping me (ahal) on irc if you have any questions or need help writing tests.
December 11, 2011 03:42 AM
December 09, 2011
So I got some nice feedback on my Eideticker post yesterday on various channels. It seems like some people are interested in hacking on the analysis portion, so I thought I’d give some quick pointers and suggestions of things to look at.
- As I mentioned yesterday, the frame analysis is rather stupid. We need to come up with a better algorithm for disambiguating input noise (small fluctuations in the HDMI signal?) from actual changes in the page. Unfortunately the breadth of things that Eideticker’s meant to analyze makes this a bit difficult. I.e. edge detection probably wouldn’t work for something like Microsoft’s psychedelic browsing demo. I suspect the best route here is to put some work into better understanding the nature of this “noise” and finding a way to filter it out explicitly.
- Our analysis code is still rather slow, and is crying out to be parallelized (either by using multiple cores of the same CPU or a GPU). Burak Yiğit Kaya recommended I look into PyCuda which looks interesting. It looks like there are other possibilities as well though.
- Clipping capture by green screen/red screen. This should be doable by writing some relatively simple code to detect large amounts of green and red and then ignoring previous/current/subsequent frames as appropriate.
- Moar test cases! It was initially suggested to use some of the classic benchmarks, but these only seem to barely work on Fennec (at least with the setup I have). I don’t know if this is fixable or not, but until it is, we might be better off coming up with more reasonable/realistic measures of visual performance.
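To give a flavour of the green screen / red screen clipping item, here is a minimal numpy sketch. It is not Eideticker code: the function names, the colour cut-offs (200 and 60) and the 90% coverage fraction are all assumptions picked for illustration, and frames are assumed to be height x width x 3 uint8 RGB arrays.

```python
import numpy as np

RED, GREEN = 0, 1  # channel indices in an RGB frame


def is_solid_screen(frame, channel, min_fraction=0.9):
    """True if most pixels are strong in `channel` and weak in the other two."""
    others = [c for c in range(3) if c != channel]
    strong = frame[:, :, channel] > 200
    weak = (frame[:, :, others[0]] < 60) & (frame[:, :, others[1]] < 60)
    return float(np.mean(strong & weak)) >= min_fraction


def clip_capture(frames):
    """Keep only the frames after the last green screen and before the next red one."""
    green = [i for i, frame in enumerate(frames) if is_solid_screen(frame, GREEN)]
    start = green[-1] + 1 if green else 0
    red = [i for i in range(start, len(frames)) if is_solid_screen(frames[i], RED)]
    end = red[0] if red else len(frames)
    return frames[start:end]
```

A real capture would probably need some tolerance for the HDMI noise mentioned above, which is why the check uses a coverage fraction rather than demanding that every pixel match.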
You might be able to find other inspiration on the Eideticker project page (note that some of this is out of date).
You obviously need the decklink card to perform captures, but the analysis portion of Eideticker can be used/modified on any machine running Linux (Mac should also work, but is untested). To get up and running, just follow the instructions in README.md, dump a pregenerated capture into the captures/ directory (here’s one of a clock), and off you go! The actual analysis code (such as it is) is currently located in src/videocapture/videocapture/capture.py while the web interface is in https://github.com/mozilla/eideticker/blob/master/src/webapp.
I’m going to be out later today (Friday), but I’m mostly around on IRC M-F 9ish-5ish EST on irc.mozilla.org #ateam as `wlach`. Feel free to pester me with questions!
P.S. I didn’t really cover infrastructure/automation portions above as I suspect people will find that less interesting (especially without a video capture card to test with), but you can look at my newsgroup post from yesterday if you want to see what I’ll likely be up to over the next few weeks.
December 09, 2011 03:47 PM
December 08, 2011
Since I last blogged about Eideticker, I’ve made some good progress. Here’s some highlights:
- Eideticker has a new, much simpler harness and tests are much easier to write. Initially, I was using Talos for this task with the idea that it’s better not to have duplicate code where it’s not really required. Seemed like a fine idea in principle, but in practice Talos’s architecture (which is really oriented around running a large sequence of tests and uploading the results to a central server) was difficult to extend to do what we need to do. At heart, eideticker really only needs to do a few things right now (start up Firefox, start videocapture, load a webpage, stop videocapture) so it’s best to keep things simple.
- I’ve reworked the capture analysis API to use numpy behind the scenes. It’s still not quite as fast as I would like (doing a framediff analysis on a 30 second animation still takes a minute or so on my fast machine), but we’re doing an order of magnitude better than before. numpy also seems to have quite the library of routines for doing the types of matrix algebra useful in image analysis, which should be helpful as the project progresses.
- I added the beginnings of a fancy pants web interface for browsing captures and doing visualizations on them! I’m pretty happy with how this is turning out so far, it’s already been an incredibly useful tool for debugging Eideticker’s analysis system and I think it will be equally useful for understanding Firefox’s behaviour in general.
Here’s an example analysis session, where I examine a ~60 second capture of the fishtank demo from Microsoft, borrowed from Mark Cote’s speedtest library. You might want to view this fullscreen:
A few interesting things to note about this capture:
1. Our frame comparison algorithm is still comparatively dumb: it just computes the norm of the difference in RGB values between two frames. Since there’s a (very tiny) amount of noise in the capture, we have to use a threshold to determine whether two frames are the same or not. For all that, the FPS estimate it comes up with for the fishtank demo seems about right (and unfortunately at 2 fps, it’s not particularly good).
2. I added a green screen / red screen at the start / end of every capture to eliminate race conditions with starting the capture, but haven’t yet actually taken those frames out of the analysis.
3. If you look carefully at the animation, not all of the fish that should be displaying in the demo are. I think this has to do with the new native version of Fennec that I’m using to test (old versions don’t exhibit this property). I filed a bug for this.
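The frame comparison described in the first point can be sketched with numpy as follows. This is an illustration under assumptions, not the code in capture.py: the function names are made up, frames are assumed to be height x width x 3 uint8 RGB arrays, and the threshold has to be tuned by hand against the noise in a real capture.

```python
import numpy as np


def frame_difference(frame_a, frame_b):
    """Euclidean norm of the per-pixel RGB difference between two frames."""
    # Convert to float first so that uint8 subtraction cannot wrap around.
    diff = frame_a.astype(np.float64) - frame_b.astype(np.float64)
    return float(np.linalg.norm(diff.ravel()))


def count_unique_frames(frames, threshold):
    """Count frames that differ from their predecessor by more than the threshold."""
    if not frames:
        return 0
    unique = 1
    for previous, current in zip(frames, frames[1:]):
        if frame_difference(previous, current) > threshold:
            unique += 1
    return unique


def estimate_fps(frames, capture_seconds, threshold):
    """Unique frames per second over the whole capture."""
    return count_unique_frames(frames, threshold) / float(capture_seconds)
```

Because the norm grows with the number of pixels, a threshold chosen for one capture resolution will not carry over to another without rescaling.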
What’s next? Well, as I mentioned last time, the real goal is to create a tool that developers will find useful. To that end, we have plans to set up an Eideticker machine in Mozilla’s Mountain View office that more people can use (either locally or remotely over the VPN). For this to be workable, I need to figure out how to get the full setup working on “demand”. Most of the setup already allows this, with one big exception: the actual Android device that we want to capture video from. The LG G2X that I’m currently using works fine when I have physical access to it, but as far as I can tell it’s not possible to get it outputting proper video of an application unless it’s in an unlocked state (which it obviously isn’t most of the time).
My current thinking is that a Panda Board running a Vanilla version of Android might be a good candidate for a permanently-connected device. It is capable of HDMI output, doesn’t have the unwanted bells and whistles of a physical phone (e.g. a lock screen), and should be much more reliable due to its physical networking. So far I haven’t had much luck getting the video output working with the Decklink capture card, but I’ve only just started trying. Work will continue.
If I can somehow figure that out, and smooth out some of the rough edges with the web interface and capture API, I think the stage will be set for us all to do some pretty interesting stuff! Looking forward to it.
December 08, 2011 05:11 PM
December 05, 2011
I have had the honor of working with Trevor on a few projects during his internship at Mozilla. One of the earlier projects he worked on was TegraPool, a utility to check out a tegra and run tests on it as we do on tinderbox. Trevor doesn’t have a blog set up or a feed to planet, so here is what Trevor has to say:
A few weeks ago, I got nominated as a friend of a tree for helping mobile development get past issues with talos. Being a relatively new hire, I too am aware of how difficult it can be to get a testing environment set up, especially if you haven’t worked with mobile devices at all. Luckily, the first project I worked on this term is especially useful for those who can’t be bothered with setting up talos, or getting a Tegra and setting up the proper config for it.
Tegra Pool is an internal-only site for those who need to debug the issues in an automated testing run (such as on TBPL). It can be found here: TegraPool. It will show you a table of all the devices available, and a couple forms to checkin and checkout.
If you are local and have the entire mobile testing suite set up, then it’s easy. Just put in your LDAP credentials and click “Check Out”. The IP of the Tegra you get will pop up, and not only will you be able to telnet and use the SUTAgent, but the AndroidDeviceBridge(ADB) will use TCP/IP, allowing you to connect with “adb connect <ip>”.
Remoties have a bit more of a problem, as most tests will require the Tegra to contact an “external server”, but we don’t want it to be making requests off-network. Also, many users don’t want to run the tests on their own computer, because they might need to run other things, or might not want to set up the entire testing environment. Luckily TegraPool solves these problems too.
If you are remote, or don’t have the time to set up the entire testing set, you can select the “I want server…” checkbox. To set up everything for you, you will have to point it to a test folder to get the app and zip files. This is best done with going to the build on Try in TinderBoxPushLog and clicking, “go to build directory”, when B is selected, and selecting the try-android-xul directory (or equivalent). Alternatively, ftp…mozilla-central-android (or other folders in the nightly directory) is usually a good folder to use. This will set up a temporary account for you on the TegraPool server (based on your LDAP username).
Once you have checked out a Tegra and received a temporary account, you can now SSH into the machine (The password is a standard “giveMEtegra”). If you look in the home folder, you can see a lot of scripts that will just run. runMochiRemote.sh will run every single mochitest, runTalosRemote.sh will run the quick tpan test, and runRefRemote.sh will run the ref tests. If you connect to the device with ADB before running this, you should be able to pinpoint where issues are occurring.
This directory is a quick product, and should not limit what you can do. You can sftp new fennec.apk files, or modify the .sh files to run the necessary tests (i.e. other talos tests or specific mochitests).
Hopefully, this should let anybody who wants to debug mobile issues have a fast and easy option. Right now there are only 2 Tegra boards and 2 Panda boards running android, but if there is enough usage, more devices will be added. Happy debugging!
-Trevor (tfair on IRC)

December 05, 2011 10:41 PM
The state of Talos this week:
- we need to fix Talos importing of mozbase. We want to get Talos to consume
mozdevice, mozinfo, mozhttpd, mozrunner, mozprocess, and mozprofile
- The current state of things:
- talos includes the files of mozdevice and mozhttpd.py
- we mirror these manually but things get out of sync
- An interim solution is posed in bug 707218: mirror mozdevice,
mozinfo, and mozhttpd to talos for the purpose of creating a
tests.zip file and list them in setup.py for setuptools
installation. This works because these are all simple dependencies,
but will not work for mozprocess, mozrunner, and mozprofile as these
all have dependencies of their own
- In order to use these dependencies (mozprocess, mozrunner,
mozprofile) in production talos, we will need a releng python
package index: Bug 701506 . I will do a mock-up there; whether it
will fulfill releng needs or not is hard to say. We will probably
want to transition to mozharness soon thereafter or at the same
time, but we shouldn't block on any more than we need to. These are
all big changes to our deployment strategy, and for purposes of QA
we will want to make as strategic and specific decisions as possible.
- Once the transition is completely done, we can do away with
talos.zip entirely.
- Additionally, in order to make talos work with setup.py, the
pywin32 package should be listed. However, pywin32 is in general
compiled for specific python and windows versions. See e.g.
https://bugzilla.mozilla.org/show_bug.cgi?id=673132#c8 . BYK is
looking into this, possibly switching the linked dependency based on
the platform you are on.
- slewchuk is looking into Talos data aggregation: bug 707486
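As a rough illustration of switching a dependency based on platform, a setup.py could compute its requirement list along these lines. This is only a sketch: `talos_dependencies` is a hypothetical helper, the package names are the ones listed above, and it does not address the harder problem that pywin32 builds are tied to specific python and windows versions.

```python
import sys


def talos_dependencies(platform=None):
    """Return the requirement list, adding pywin32 only on Windows."""
    platform = sys.platform if platform is None else platform
    dependencies = ['mozdevice', 'mozinfo', 'mozhttpd']
    if platform.startswith('win'):
        dependencies.append('pywin32')
    return dependencies
```

The resulting list would then be passed as the `install_requires` argument to `setup()`.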
This is a rough map of what we want to do. As said, with so many
balls in the air, we will want to block on as little as possible and
make as few really big changes at a time so that we can ensure that
each piece of the puzzle fits together correctly.
December 05, 2011 08:06 PM
November 30, 2011
With ahal's impending return to studentdom and myself coming back from paternity leave, I will be taking over
Peptest development and maintenance. To get myself up to speed, I wrote some tests.
The easiest test to write is one that looks for unresponsiveness while simply loading a page. I've noticed that the site for my favourite blog,
The Daily What, causes some pain in Firefox, what with all the videos and images and so forth. I wrote a very simple test to see if I was onto something:
Components.utils.import('resource://mozmill/driver/mozmill.js');
let c = getBrowserController();
pep.performAction('open_page', function() {
c.open('http://thedailywh.at');
c.waitForPageLoad();
})
Indeed, while there were no very long pauses, there was a string of short ones. Remember, we care about pauses longer than 50 ms, which Peptest identifies for us:
PEP TEST-START | test_dailyWhat.js
PEP WARNING | test_dailyWhat.js | open_page | unresponsive time: 103 ms
PEP WARNING | test_dailyWhat.js | open_page | unresponsive time: 199 ms
PEP WARNING | test_dailyWhat.js | open_page | unresponsive time: 112 ms
PEP WARNING | test_dailyWhat.js | open_page | unresponsive time: 204 ms
PEP WARNING | test_dailyWhat.js | open_page | unresponsive time: 105 ms
PEP WARNING | test_dailyWhat.js | open_page | unresponsive time: 57 ms
PEP WARNING | test_dailyWhat.js | open_page | unresponsive time: 79 ms
PEP WARNING | test_dailyWhat.js | open_page | unresponsive time: 194 ms
PEP WARNING | test_dailyWhat.js | open_page | unresponsive time: 202 ms
PEP WARNING | test_dailyWhat.js | open_page | unresponsive time: 68 ms
PEP WARNING | test_dailyWhat.js | open_page | unresponsive time: 182 ms
PEP WARNING | test_dailyWhat.js | open_page | unresponsive time: 63 ms
PEP WARNING | test_dailyWhat.js | open_page | unresponsive time: 84 ms
PEP WARNING | test_dailyWhat.js | open_page | unresponsive time: 118 ms
PEP WARNING | test_dailyWhat.js | open_page | unresponsive time: 51 ms
PEP WARNING | test_dailyWhat.js | open_page | unresponsive time: 55 ms
PEP WARNING | test_dailyWhat.js | open_page | unresponsive time: 67 ms
PEP WARNING | test_dailyWhat.js | open_page | unresponsive time: 215 ms
PEP TEST-UNEXPECTED-FAIL | test_dailyWhat.js | fail threshold: 0.0 | metric: 322.362
PEP TEST-END | test_dailyWhat.js | finished in: 9426 ms
Not awful, but not great. I filed
bug 706250 to investigate this.
Next, I decided to delve into some
Project Snappy bugs to see what else I could find.
Changing the URL in the above test was all I needed to confirm that the page in
comment 62 of bug 61684 was still an issue. This time I
got about 50 unresponsive periods, the longest being 2.6 s. Ouch.
Bug 430106 is a little more interesting. Someone reported problems switching back to a tab in which a large image was loaded. The simplest way I could replicate this was by loading the example URL in one tab, loading any old page in a second tab, waiting for about 20 seconds, then switching back to the first tab. In peptest form,
Components.utils.import('resource://mozmill/driver/mozmill.js');
let c = getBrowserController();
while (c.window.gBrowser.tabs.length < 2) {
  c.window.gBrowser.addTab();
}
// Load large image in first tab.
c.tabs.selectTabIndex(0);
c.open('http://flickr.com/photos/thomasstache/2429920499/sizes/o/');
c.waitForPageLoad();
// Load any page in second tab.
c.tabs.selectTabIndex(1);
c.open('http://www.mozilla.org');
c.waitForPageLoad();
// Wait for memory to be freed from first tab.
c.sleep(20000);
pep.performAction('switch_tab', function() {
  c.tabs.selectTabIndex(0);
  // Wait for image to repaint.
  c.sleep(2000);
});
When I ran this, I saw a visible delay before the image was repainted. Peptest confirmed this:
PEP TEST-START | test_largeImgTabSwitchLocal.js
PEP WARNING | test_largeImgTabSwitchLocal.js | switch_tab | unresponsive time: 54 ms
PEP WARNING | test_largeImgTabSwitchLocal.js | switch_tab | unresponsive time: 835 ms
PEP TEST-UNEXPECTED-FAIL | test_largeImgTabSwitchLocal.js | fail threshold: 0.0 | metric: 700.141
PEP TEST-END | test_largeImgTabSwitchLocal.js | finished in: 24083 ms
The unresponsiveness appears to be relative to the size of the image, as an image of about twice the dimensions, that is, 3.4 times as many pixels, resulted in a delay about 3.4 times as long.
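As an aside, the metric value in both logs above is consistent with summing the squares of the reported unresponsive times (in ms) and dividing by 1000. That relationship is inferred purely from the numbers in these two runs, not taken from the Peptest source, so treat the formula and the helper names below as assumptions. A minimal Python sketch that pulls the times out of a log and recomputes the figure:

```python
import re

# Matches lines such as:
# PEP WARNING | test_dailyWhat.js | open_page | unresponsive time: 103 ms
WARNING_RE = re.compile(r"PEP WARNING \|.*\| unresponsive time: (\d+) ms")


def unresponsive_times(log_text):
    """Return the unresponsive periods (in ms) reported in a Peptest log."""
    return [int(m.group(1)) for m in WARNING_RE.finditer(log_text)]


def inferred_metric(times_ms):
    """Sum of squared unresponsive times, divided by 1000.

    Assumption: inferred from the sample logs, not from Peptest itself.
    """
    return sum(t * t for t in times_ms) / 1000.0


log = """\
PEP TEST-START | test_largeImgTabSwitchLocal.js
PEP WARNING | test_largeImgTabSwitchLocal.js | switch_tab | unresponsive time: 54 ms
PEP WARNING | test_largeImgTabSwitchLocal.js | switch_tab | unresponsive time: 835 ms
PEP TEST-UNEXPECTED-FAIL | test_largeImgTabSwitchLocal.js | fail threshold: 0.0 | metric: 700.141
"""

times = unresponsive_times(log)
print(times, inferred_metric(times))  # [54, 835] 700.141
```

Squaring means one long pause counts for far more than several short ones adding up to the same total, which would fit the goal of penalising pauses a user actually notices.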
One of
the next steps for Peptest is to add JS-function tracing so we can figure out the exact sources of unresponsiveness. This, however, requires a fix for
bug 580055, which in turn depends on
bug 702740. As soon as patches for those bugs have landed, we'll add support to Peptest.
For more information on Peptest, see
the wiki article and/or check out the code, which has recently been moved to hg.mozilla.org under mozilla-central/testing/peptest.
November 30, 2011 06:38 PM
November 28, 2011
November 21, 2011
I've been developing
Talos
recently. There are many caveats to working on this test harness that
demand a more rigorous process than, say, a webapp. It has a large
amount of necessary platform-specific code. It is deployed in a
complex infrastructure
environment. And it has no
tests.
In order to test Talos, the
A*Team
has an internal staging environment (thanks to the efforts of
anode and
bhearsum and others)
that mirrors the production testing infrastructure environment. Like
production, it requires an HTTP-hosted URL structure containing
pageloader , a pageset
(tp5 ), and other
resources necessary for
buildbot
plus Talos. (We should probably document the directory structure.)
In order to test Talos, you point the
A*Team staging environment
configuration to the HTTP-hosted location of your copy of
this structure of resources. Then you issue a buildbot sendchange
(which can be scripted for ease of use) that corresponds to a set of
Talos tests
that are run on each platform of interest.
We have some simple scripts (e.g. ./chrome.sh or
./dirty.sh) to run sets of tests as we do in production.
Each translates to a variety of buildbot sendchange commands
appropriate for the tests to be run. Green runs mean good.
In order to test my Talos changes, I needed to set up a system whereby
I could translate my changes into a hosted copy of talos, pageloader,
etc. So here is what I did.
Steps:
Replicate http://people.mozilla.org/~jmaher/taloszips/tip/
It would be nice to provide a sane base template for this.
Put the talos zips on a web server:
cd mozilla/web/talos # change to a desired hosted directory
wget -r -l0 --no-parent http://people.mozilla.org/~jmaher/taloszips/tip/
mv people.mozilla.org/~jmaher/taloszips/tip . # the piece you need
rm -rf people.mozilla.org # cleanup unneeded directories
find tip -iname 'index.html*' -delete # remove unneeded index pages
[Example: http://k0s.org/mozilla/talos/tip/]
Clone a copy of Talos:
cd ~/mozilla/src/
virtualenv.py talos-staging
cd talos-staging; mkdir src; cd src
hg clone http://k0s.org/mozilla/hg/talos
echo 'default-push = ssh://k0s.org/mozilla/hg/talos' >> talos/.hg/hgrc
Development process:
Based on
jmaher's update_talos.sh, I
wrote a script to help me turn local changes into an updated hosted copy
of talos.zip. Since I work largely in diffs hosted on
bugzilla or
my mercurial queue
of Talos patches, I wanted a script that would apply a series of
changes to a checkout of
talos .
In addition, I wanted to keep the flexibility of being able to edit
these files on disk.
The script lives at http://k0s.org/mozilla/update_talos.py . I will
endeavor to improve it as testing needs become more apparent. It
sadly loses
jmaher's update_talos.sh
feature to create versioned zips. I thought about hosting a dedicated
talos repository for testing (and still may, if that seems better down
the line), but I usually want to test a specific change and then roll back to
a known state.
The script does the following:
- Cleans up and reclones, optionally
- Applies a series of diffs
- Creates a talos.zip and moves it to the appropriate place on disk.
- Fetches a fresh copy of pageloader.xpi
- Syncs the files with the HTTP server
- Cleans up and reclones, optionally
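To make the "create a talos.zip and move it into place" step concrete, here is a minimal Python sketch. It is not the code from update_talos.py: the function name, the choice to skip the .hg directory, and the assumption that the archive should contain a top-level directory named after the checkout are all illustrative guesses.

```python
import os
import shutil
import tempfile
import zipfile


def make_talos_zip(checkout, dest_dir, name="talos.zip"):
    """Zip a checkout directory and move the archive into dest_dir.

    Hypothetical helper: paths inside the archive are stored relative to
    the checkout's parent, so they start with the checkout's own name.
    """
    fd, tmp_path = tempfile.mkstemp(suffix=".zip")
    os.close(fd)
    parent = os.path.dirname(os.path.abspath(checkout))
    with zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, dirs, files in os.walk(checkout):
            # Do not descend into version-control metadata.
            dirs[:] = [d for d in dirs if d != ".hg"]
            for filename in files:
                path = os.path.join(root, filename)
                archive.write(path, os.path.relpath(path, parent))
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, name)
    shutil.move(tmp_path, dest)
    return dest
```

Building the archive in a temporary file and moving it only once it is complete means the hosted directory never holds a half-written zip while a staging run might be fetching it.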
After the HTTP copy is updated, I can run (e.g.) xperf.sh to
trigger that set of tests in the staging environment and watch the
waterfall to assess the viability of the change.
It would be nice to have something more generic, but the path to good
software is through iteration. Perhaps as more people develop their
own scripts to test Talos in the staging environment we will evolve to
a more generic script to update talos as well as copies or templates
of the URL/directory structure of what is needed as well as the
staging software.
November 21, 2011 10:42 AM
November 18, 2011
Just a quick note that a planet for Mozilla Tools & Automation (the so-called “a team”) is now up, thanks to Reed Loden. With the exception of Jeff Hammel, everyone there was already being syndicated on Planet Mozilla, but this should offer a more focused feed of our doings for those who can’t always keep up with the firehose. Have a look:
http://planet.mozilla.org/ateam
Who should care? Well, we maintain all the major testing frameworks like Mochitest, Reftest, and Talos as well as automated tooling for QA like Mozmill. Our latest work is focused on making sure that Firefox is as robust, responsive, and performant as possible on desktop and mobile. In short, if you’re writing or verifying code from mozilla-central, what we’re doing probably affects you. Please let us know what you think about our projects and whether there’s anything we can do to make your job easier: we’re listening.
Quick bonus note: It’s not immediately obvious (or at least it wasn’t to me), but Mozilla has some fairly finely tuned infrastructure for running planets. If your team or group wants one, it’s definitely better to plug into that instead of rolling your own.
Reed Loden is the maintainer and the source lives in subversion.
November 18, 2011 10:47 PM
If you work someplace, you have meetings. It’s impossible not to. Because the Automation and Tools team works on many different projects simultaneously, it was natural for us to have one big meeting a week to discuss the status of these projects, raise concerns, make announcements etc. This is also the one meeting I’d invite outside contributors to so that they can learn who everyone on the team is and what we’re all doing.
However, week after week as I asked for each project’s status and listened to it, I wondered why on earth would anyone want to come to this? And why were we spending an hour each week boring ourselves to tears when we could be doing something useful like being silly on IRC? So, the A-team and I talked about it, and we decided to do an experiment with the meeting. Here’s what we’ve been doing for November:
- One person spends an hour or so a week collecting the status from everyone on the team.
- That person puts together the wiki page.
- At the meeting on Monday, that person is the emcee and does a five minute run down of the week’s highlights. This is the toughest job. We have a great team, and there are always a lot of highlights.
- After that, we raise any issues that need raising and discuss them, five to ten minutes.
- The emcee gets to pick the emcee for the following week.
- Then we remind people to check the wiki page for the schedule of project-specific meetings that week, and we’re done.
The entire thing takes no more than twenty minutes, and most weeks it takes less than ten. So far, I have to say I’m a fan of the new meeting. I worried that I’d lose my ability to stay abreast of what is happening on our projects, but that hasn’t been the case. In fact, if you compare the wiki pages from before with these new ones, you’ll see that our emcees do an amazing job pulling together the data and communicating the highlights.
The other benefit this gives us is that as we grow into a larger team, it’s harder for all of us to interact. Our rotating emcee gives each person a chance to talk with everyone else on the team and learn something about everyone’s projects.
I don’t know if this would work well for other teams, but it has worked really well for us so far. If you’d like to drop in, here’s the information about our meeting. This week’s emcee is our illustrious maple-bacon-cake-baking, cowboy-boot-wearing intern, Tfair.

November 18, 2011 02:42 AM
November 16, 2011
We've just introduced a change to Mozilla's
Mochitest harness to improve test run times, as per
Bug 367393. This involved the removal of unnecessary MochiKit usage. We found that we were including a minified version of MochiKit, packed.js, in all our tests and within the harness, but we would only use a small portion of this enormous suite. That added an extra load of about 5 minutes per debug test run, so we removed MochiKit from our harness and added replacement functionality to
SimpleTest. Note that the SimpleTest.js file in MXR may not yet be updated, so pull the latest mozilla-central code to see the latest changes!
What this means for you: If you're writing a test that doesn't require MochiKit, please
do not include packed.js in your test. This just adds extra load. If your test does require some part of MochiKit, please check if that functionality ...
November 16, 2011 09:38 AM
November 15, 2011
Mozconfigwrapper is a tool inspired by Doug Hellmann's magnificent
virtualenvwrapper. In a nutshell, mozconfigwrapper
tucks all of your mozconfigs away in a configurable directory (defaults to ~/.mozconfigs) and lets you easily switch, create,
remove, edit and list them. Mozconfigwrapper is Unix only for now.
Mozconfigwrapper is brand new. I still need to add some better error checking and do testing on OSX. So if you have any problems
installing or using it, please let me know or file an issue.
Installation
To install, first make sure you have
pip. Then run the command
sudo pip install mozconfigwrapper
Next open up your ~/.bashrc file and add the line
source /usr/local/bin/mozconfigwrapper.sh
Note that it may have been installed to a different location on your system. You can use the command 'which mozconfigwrapper.sh'
to find it.
Finally run the command
source ~/.bashrc
Mozconfigwrapper is now installed.
Usage
Mozconfigwrapper allows you to create, remove, switch, list and edit mozconfigs.
To build with (activate) a mozconfig named foo, run:
buildwith foo
To create a mozconfig named foo, run:
mkmozconfig foo
To delete a mozconfig named foo, run:
rmmozconfig foo
To see the currently active mozconfig, run:
mozconfig
To list all mozconfigs, run:
mozconfig -l
To edit the currently active mozconfig, run (the $EDITOR variable must be set):
mozconfig -e
Configuration
By default mozconfigs are stored in the ~/.mozconfigs directory, but you can override this by setting the
$BUILDWITH_HOME environment variable.
e.g., add:
export BUILDWITH_HOME=~/my/custom/mozconfig/path
to your ~/.bashrc file.
When you make a new mozconfig, it will be populated with some basic build commands and the name of the mozconfig
will be appended to the end of the OBJDIR instruction. You can modify what gets populated by default by editing
the ~/.mozconfigs/.template file. For example, if I wanted my default configuration to store object directories
in a folder called objdirs and enable debugging and tests, I'd edit the ~/.mozconfigs/.template file to look like:
mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/objdirs/
ac_add_options --enable-application=browser
ac_add_options --enable-debug
ac_add_options --enable-tests
Now if I ran the command 'mkmozconfig foo', foo would be populated with the above and have the word 'foo'
appended to the first line.
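The template behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not mozconfigwrapper's actual code: populate_mozconfig is a hypothetical name, and matching on the MOZ_OBJDIR line is an assumption about how the "OBJDIR instruction" is found.

```python
TEMPLATE = """\
mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/objdirs/
ac_add_options --enable-application=browser
ac_add_options --enable-debug
ac_add_options --enable-tests
"""


def populate_mozconfig(template, name):
    """Append `name` to the MOZ_OBJDIR line; copy every other line as-is."""
    lines = []
    for line in template.splitlines():
        if "MOZ_OBJDIR=" in line:
            line = line + name
        lines.append(line)
    return "\n".join(lines) + "\n"


print(populate_mozconfig(TEMPLATE, "foo"))
```

With the template shown, the first line of the result becomes mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/objdirs/foo, so each mozconfig gets its own object directory.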
November 15, 2011 08:11 PM
Introducing MozBase
Over the years, Mozilla has developed a number of
test harnesses
for automated testing of Firefox and other applications. Most of the
harness code is written in
python due to its utility towards this
type of development. As one would expect, the harnesses arose from
necessity and grew organically. However, as the harnesses grew it
became apparent that there were several generic tasks that the
harnesses shared:
- creating and manipulating a profile
- installing addons into the profile
- invoking (e.g.) Firefox in a desired manner
- process management
- ...a few other things
These pieces have largely been developed in a vacuum (in the early
stages) or copy+pasted from other harnesses (in the later stages).
This has led to duplicated functionality, harness software that is
difficult to maintain and inconsistent (since fixing things in one place
means that they probably need to be fixed in other places), and a system
which was fully understood by no one once it became sufficiently
complex. The harness software could not be reused because it is
tightly coupled to the implementation even when the underlying intent
was generic.
Meet MozBase!
As software grows, it should be cultivated such that its effectiveness
and its knowledge base are maximized. Code should be made reusable
and the architecture evolved towards a representation of intent. This
is the goal of the MozBase effort by the
A-Team :
https://wiki.mozilla.org/Auto-tools/Projects/MozBase
- we want to make high quality components to build test harnesses
- ... and other pieces of software
- ... that might be useful on their own
- we want to replace existing code with these pieces
- ... but cultivate their knowledge base
- we want to develop canonical and reusable python tools
- ... and encourage the community to use them
Developing
MozBase is
one of the
A-Team goals
this quarter. While cultivating software is an ongoing effort, we're
off to a good start. We already have several MozBase python packages:
Our immediate goals are to cultivate these into high-quality tools,
taking lessons from the existing harnesses, and then to port the
harnesses to these tools so that they can be maintained in a unified manner.
Right now, we're working on
Talos both because this
is a good proving ground for these tools and because much of its code
can be replaced with MozBase code easily (for some definition of
"easy").
While MozBase is about software, it is also about having a sane and
maintainable environment to cultivate software in. While modular
packages are great, their utility is in how they may be used together
(as well as with other code) instead of in the craft of an individual
package. So we're tackling these issues too.
Python importing in Mozilla Central: currently (most) python in
mozilla central is not packaged and we manually
futz with pythonpath
and sys.path in several inconsistent and hard to maintain ways.
In order to move towards python packages in any reasonable fashion we
need to make importing easy and unified as well as move towards how
the python world typically does importing. There is
bug 661908
for creating a unified virtualenv in
the $OBJDIR. Work is likely to start on this or a similar effort
soon (either this quarter or Q1 2012).
Mirroring software to Mozilla Central: we have hampered ourselves --
rewritten software and avoided fixing bugs -- by not using third-party
python packages for tools that live in mozilla-central. In addition,
since many of the test harnesses already
live in m-c ,
if we are going to move these to consume mozbase we will need a
strategy to mirror it and other software to the tree. While nothing
has been definitively decided, preliminary discussion has pointed
towards having a script to fetch resources from a variety of locations
and add them to mozilla-central or elsewhere. We're having a meeting
this week to figure out what we really want to do and go from there.
Such is the MozBase effort. I am excited to start moving our code
into a solid maintainable structure, and I hope you are too. If you
are, please check out our
github project or sign in to
#ateam and tell us what you think. We'd love contributors!
November 15, 2011 10:59 AM
November 14, 2011
Malini Das and I have been working on a new test framework called Marionette, in which tests of a Gecko-based product (B2G, Fennec, etc.) are driven remotely, à la Selenium. Marionette has client and server components; the server side is embedded inside Gecko, and the client side runs on a (possibly remote) host PC. The two components communicate using a JSON protocol over a TCP socket. The Marionette JSON protocol is based loosely on the Selenium JSON Wire Protocol; it defines a set of commands that the Marionette server inside Gecko knows how to execute.
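For readers who have not seen the pattern before, here is a generic Python sketch of a JSON request/response exchange over a socket. It does not reproduce Marionette's actual wire format: the newline framing, the "echo" command, and the reply fields are all made up for illustration, and a connected socket pair stands in for the real TCP connection between client and Gecko.

```python
import json
import socket


def send_message(sock, message):
    """Serialize a dict as one line of JSON and send it."""
    sock.sendall(json.dumps(message).encode("utf-8") + b"\n")


def recv_message(sock):
    """Read until a newline arrives, then decode the JSON payload.

    Simplification: assumes only one message is in flight at a time.
    """
    buf = b""
    while not buf.endswith(b"\n"):
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return json.loads(buf.decode("utf-8"))


def handle(request):
    """Toy stand-in for the server side: it knows a single command."""
    if request.get("command") == "echo":
        return {"ok": True, "value": request.get("args")}
    return {"ok": False, "error": "unknown command"}


# A connected socket pair stands in for the client/server TCP link.
client, server = socket.socketpair()
send_message(client, {"command": "echo", "args": ["hello world!"]})
send_message(server, handle(recv_message(server)))
reply = recv_message(client)
print(reply)
client.close()
server.close()
```

The point is only the shape of the exchange: the client always initiates, and the server answers each command it recognises.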
This differs from past approaches to remote automation in that we don’t need any extra software (i.e., a SUTAgent) running on the device, we don’t need special access to the device via something like adb (although we do use adb to manage emulators), nor do tests need to be particularly browser-centric. These differences seem advantageous when thinking about testing B2G.
The first use case to which we might apply Marionette in B2G seems to be WebAPI testing in emulators. There are some WebAPI features that we can’t test well in an automated manner using either desktop builds or real devices, such as WebSMS. But we can write automated tests for these using emulators, since we can manipulate the emulator’s hardware state and emulators know how to “talk” to each other for the purposes of SMS and telephony.
Since Marionette tests are driven from the client side, they’re written in Python. This is what a WebSMS test in Marionette might look like:
from marionette import Marionette
if __name__ == '__main__':
    # launch the emulators that will do the sending and receiving
    sender = Marionette(emulator=True)
    assert(sender.emulator.is_running)
    assert(sender.start_session())
    receiver = Marionette(emulator=True)
    assert(receiver.emulator.is_running)
    assert(receiver.start_session())
    # setup the SMS event listener on the receiver
    receiver.execute_script("""
        var sms_body = "";
        window.addEventListener("smsreceived",
                                function(m) { sms_body = m.body });
    """)
    # send the SMS event on the sender
    message = "hello world!"
    sender.execute_script("navigator.sms.send(%d, '%s');" %
                          (receiver.emulator.port, message))
    # verify the message was received by the receiver
    assert(receiver.execute_script("return sms_body;") == message)
The JavaScript portions of the test could be split into a separate file from the Python, for easier editing and syntax highlighting. Here’s the adjusted Python file:
from marionette import Marionette
if __name__ == '__main__':
    # launch the emulators that will do the sending and receiving and
    # load the JS scripts for each
    sender = Marionette(emulator=True)
    assert(sender.emulator.is_running)
    assert(sender.start_session())
    assert(sender.load_script('test_sms.js'))
    receiver = Marionette(emulator=True)
    assert(receiver.emulator.is_running)
    assert(receiver.start_session())
    assert(receiver.load_script('test_sms.js'))
    # setup the SMS event listener on the receiver
    receiver.execute_script_function("setup_sms_listener")
    # send the SMS event on the sender
    message = "hello world!"
    target = receiver.emulator.port
    sender.execute_script_function("send_sms", [target, message])
    # verify the message was received by the receiver
    assert(receiver.execute_script_function("get_sms_body") == message)
And here’s the JavaScript file:
function send_sms(target, msg) {
  navigator.sms.send(target, msg);
}
var sms_body = "";
function setup_sms_listener() {
  window.addEventListener("smsreceived",
                          function(m) { sms_body = m.body });
}
function get_sms_body() {
  return sms_body;
}
Both of these options are just about usable in Marionette right now. Note that the test is driven, and some of the test logic (like asserts) resides on the client side, in Python. This makes synchronization between multiple emulators straightforward, and provides a natural fit for Python libraries that will be used to interact with the emulator’s battery and other hardware.
What if we wanted JavaScript-only WebAPI tests in emulators, without any Python? Driving a multiple-emulator test from JavaScript running in Gecko introduces some complications, chief among them the necessity of sharing state between the tests, the emulators, and the Python testrunner, all from within the context of the JavaScript test. We can imagine such a test might look like this:
var message = "hello world!";
var device_number = Marionette.get_device_number(Marionette.THIS_DEVICE);
if (device_number == 1) {
  // we're being run in the "sender"
  // wait for the test in the other emulator to be in a ready state
  Marionette.wait_for_state(Marionette.THAT_DEVICE, Marionette.STATE_READY);
  // send the SMS
  navigator.sms.send(Marionette.get_device_port(Marionette.THAT_DEVICE), message);
}
else {
  // we're being run in the "receiver"
  // notify Marionette that this test is asynchronous
  Marionette.test_pending();
  // setup the event listener
  window.addEventListener("smsreceived",
    function (m) {
      // perform the test assertion and notify Marionette
      // that the test is finished
      is(m.body, message, "Wrong message body received");
      Marionette.test_finished();
    }
  );
  // notify Marionette we're in a ready state
  Marionette.set_state(Marionette.STATE_READY);
}
Conceptually, this is more similar to xpcshell tests, but implementing support for this kind of test in Marionette (or inside the existing xpcshell harness) would require substantial additional work. As it currently exists, Marionette is designed with a client-server architecture, in which information flows from the client (the Python part) to the server (inside Gecko) using TCP requests, and then back. Implementing the above JS-only test syntax would require us to implement the approximate reverse, in which requests could be initiated at will from within the JS part of the test, and this would require non-trivial changes to Marionette in several different areas, as well as requiring new code to handle the threading and synchronization that would be required.
Do you think the Python/JS hybrid tests will be sufficient for WebAPI testing in emulators?

November 14, 2011 09:12 PM
jhammel now maintains mozregression
So the secret is out!
http://harthur.wordpress.com/2011/11/01/new-mozregression-owner/
I am going to be maintaining mozregression going forward. I released a
0.6 version to pypi today which hopefully fixes a few setup.py issues.
You can find me at jhammel __at__ mozilla __dot__ com or as jhammel in #ateam.
http://groups.google.com/group/mozilla.tools/t/b1f12f5127761207
November 14, 2011 03:13 PM
Talos is now a python package
The A-Team is working on
creating a set of high-quality python utilities that are consumable,
general purpose, and interoperable in an effort called
MozBase .
A huge part of
this quarter's effort
is to improve Talos
to consume MozBase software and to make it an extensible harness that
may also be consumed.
As one of the first steps towards making Talos consume upstream
MozBase packages, I have
made Talos a python package .
This allows Talos to depend on upstream python packages in an
automated fashion, permit additional setup/install time steps to be
automated, and install in a manner that dotted paths against talos
can be resolved by python import. That is, other packages can now
usefully import talos without depending on a set directory structure.
Unfortunately, since the talos repository was arranged such that all
the python scripts and other data lived in a fairly disorganized
top-level directory, this involved making a talos subdirectory and
moving all files (except the README) into that subdirectory and
carefully ensuring that all data resources were properly installed
alongside the python scripts.
Even more unfortunately, this change led to some confusion that
could have been avoided ahead of time. Talos uses a tests.zip
file that contains both the scripts and the data, and though I would
have liked to do additional cleanup as part of making Talos a python
package, I deliberately held off on changing anything that would
invalidate this methodology. However, unbeknownst to me, there were
other resources that depended on the talos directory structure, and these
got broken with my change. I apologize for that, and will communicate these
changes more widely next time. In the meantime, if you have any tools
that depend on the talos directory structure, know that they will break
next time you update. If you have questions about this, please contact me.
Although the fallout was regrettable, I think this is a necessary and
forward facing change in the light of MozBase,
Mozharness , and general good
python practices. We're now looking at deprecating the tests.zip
methodology and moving towards a
Mozharness script for running Talos
for both desktop testers and production. More on that as things
progress.
November 14, 2011 02:32 PM
November 11, 2011
I’ve been spending the last month or so at Mozilla prototyping a new project called Eideticker which aims to use video capture data and image/frame analysis for performance measurement of Firefox Mobile. It’s still in quite a rough state, but it’s now complete enough that I thought it would be worth spending a bit of time describing both its motivation and how it works.
First, a bit of an introduction. Up to now, our automated performance tools have used entirely synthetic benchmarks (how long til we get the onload event? how many ms since we last hit the main loop?) to gather performance information. As we’ve found out, there’s a lot you can measure with synthetic benchmarks. Tools like Talos have proven themselves by catching performance regressions on a very regular basis.
Still, there’s many things that synthetic benchmarks can’t easily or reliably measure. For example, it’s nice to know that a page has triggered an “onload” event (and the sooner it does that, the better), but what does the browser look like before then? If it’s a complicated or image intensive page, it might take 10 or 15 seconds to load. In this interval, user studies have clearly shown that an application displaying something sooner rather than later is always desirable if it’s not possible to display everything immediately (due to network traffic, CPU constraints, whatever). It’s this area of user-perceived performance that Eideticker aims to help with. Eideticker creates a system to capture live data of what the browser is displaying, then performs image/frame analysis on the result to see how we’re actually doing on these inherently subjective metrics. The above was just one example, others might include:
- Measuring amount of time it takes to actually see the start page from time of launch.
- Measuring amount of time you see the checkerboard pattern after panning the browser.
- Measuring the visual artifacts while loading a complicated page (how long does it take to display something? how long until we get something close to the final expected result? how long until we get the actual final result?)
It turns out that it’s possible to put together a system that does this type of analysis using off-the-shelf components. We’re still very much in the early phase, but initial signs are promising. The initial test system has the following pieces:
- A Linux workstation equipped with a Decklink extreme 3D video capture card
- An Android phone with HDMI output (currently using the LG G2X)
- A version of talos modified to video capture the results of a test.
- A bit of python code to actually analyze the video capture data.
So far, I’ve got the system working end-to-end for two simple cases. The first is the “pageload” case. This lets you capture the results of loading any page within a talos pageset. Here’s a quick example of the movie we generate from a tsvg test:
Here’s another example, a color cycle test (actually the first test case I created, as a throwaway):
After the video is captured, the next step is to analyze it! As described above (and in further detail on the Eideticker wiki page), there’s lots of things we could measure but the easiest thing is probably just to count the number of unique frames and derive a frame rate for the capture based on that (the higher the better, obviously). Based on an initial prototype from Chris Jones, I’ve started work on a python library to do exactly this. Assuming you have an eideticker capture handy, you can run a tool called “analyze.py” on the command line, and it’ll give you its best guess of the # of unique frames:
(eideticker)wlach@eideticker:~/src/eideticker$ bin/analyze.py ./src/talos/talos/captures/capture-2011-11-11T11:23:51.627183.zip
Unique frames: 121/272
(There are currently some rough edges with this: we’re doing frame comparisons based on per-pixel changes, but the video capture data is slightly noisy so sometimes a pixel changes its value even when nothing has actually happened in the browser)
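To illustrate the noise problem, here is a small Python sketch of counting unique frames with a per-pixel tolerance. It is not the algorithm in analyze.py: the function names, the tolerance values, the choice to compare each frame only with the one before it, and the tiny made-up "frames" are all assumptions for the sake of the example.

```python
def frames_differ(a, b, pixel_tolerance=8, max_changed_pixels=0):
    """Return True if two equally sized grayscale frames differ visibly.

    A pixel counts as changed only if its value moves by more than
    pixel_tolerance, which absorbs small amounts of capture noise.
    """
    changed = sum(1 for pa, pb in zip(a, b) if abs(pa - pb) > pixel_tolerance)
    return changed > max_changed_pixels


def count_unique_frames(frames, **kwargs):
    """Count frames that differ from the frame immediately before them."""
    if not frames:
        return 0
    unique = 1  # the first frame is always new
    for previous, current in zip(frames, frames[1:]):
        if frames_differ(previous, current, **kwargs):
            unique += 1
    return unique


# Four tiny 2x2 "frames" as flat lists of 0-255 values (made-up data).
frames = [
    [10, 10, 10, 10],
    [11, 9, 10, 12],     # noise only
    [200, 200, 10, 10],  # real change
    [201, 199, 10, 11],  # noise only
]
print("Unique frames: %d/%d" % (count_unique_frames(frames), len(frames)))
```

With the default tolerance this reports 2 unique frames out of 4; with pixel_tolerance=0 every frame would count as unique, which is the kind of over-counting that noisy capture data produces.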
So that’s what I’ve got working so far. What’s next? Short term, we have some specific high-level goals about where we want to be with the system by the end of the quarter. The big unfinished pieces are getting an end-to-end test involving real user interaction (typing into the URL bar, etc.) going and turning this prototype system into something that’s easy for others to duplicate and is robust enough to be easily extended. Hopefully this will come together fairly quickly now that the basics are in place.
The longer term picture really depends on feedback from the community. Unlike many of the projects we work on in automation & tools, Eideticker is not meant to be something that’s run on every checkin. Rather, it’s intended to be a useful tool that can be run on an as needed basis by developers and QA. We obviously have our own ideas on how something like this might be useful (and what a reasonable user interface might be), but I’ve found in cases like this it’s much better to go to the people who will actually be using this thing. So with that in mind, here’s a call for feedback. I have two very specific questions:
- Is there a specific problem you’ve been working on that a framework like this might be helpful for?
- What do you think of the current workflow model described in the README?
My goal is to make something that people will love, so please do let me know what you think.
Nothing about this project is cast in stone and the last thing I want is to deliver a product that people don’t actually want to use.
Equally, while Eideticker is being written primarily with the goal of making Mobile Firefox better (and in the slightly-less short term, desktop Firefox and Boot to Gecko), much of it is broadly applicable to any user-facing mobile or desktop application. If you think some component of Eideticker might be interesting to your project and want to collaborate, feel free to get in touch.
November 11, 2011 08:57 PM
November 06, 2011
While responsiveness is one of the main goals for Firefox this quarter, we still don't quite have the means to measure and test our progress
towards this goal. The good news is that there are, and have been for some time, several efforts to fix this problem. Back in June, Ted wrote
some event tracing instrumentation that gives us a reasonable
idea of when the browser becomes unresponsive. This event tracer is already being used by some Talos tests which gives us a good general idea of
whether or not Firefox is more or less responsive than it was previously. What it doesn't give us is a method for developers to write their own
tests and determine whether a specific action or feature they are working on is causing unresponsiveness.
Peptest is designed for the missing use case. Namely, it can be used to
automate user interactions in the browser and determine whether those actions are causing unresponsiveness. This may be useful for creating a
suite of responsiveness regression tests, or for developers working on a responsiveness related feature or fix. The Peptest harness is designed
to be lightweight (so as not to interfere with results), simple to run and easy to write tests for.
Tests are nothing but JavaScript files that will be executed in
chrome scope. This means that Peptests are basically browser-chrome tests without any of the assertions (since assertions aren't needed in this
context). However, since many Peptests will likely need to perform some kind of UI automation, the Peptest harness also exposes
Mozmill's driver for convenience. I feel it's important to note that importing Mozmill
is completely optional (though recommended if you need to do any automation). I also feel it's important to note that I did some work to
isolate Mozmill's driver which means that the actual test harness bits of
Mozmill have been completely stripped out. What's left over is surprisingly lightweight and lives in a handful of JS files.
Currently, it is possible to run tests locally on your machine,
though I could potentially add features or change any aspect of the harness. I've also been working on a Mozharness script in
bug 692091 so we can run tests automatically for tinderbox builds.
Finally, I'd like to say: I need feedback! The requirements of this harness have been very vague from the outset. I've been doing my best
to interpret the requirements in a way that makes sense, but I'm still kind of flying blind so to speak. What I mean is, I'm not sure what
developers want and/or need. I'm also not sure how useful what I've thrown together so far will be. So if you have any ideas or general comments,
please ping ahal on irc, or e-mail ahalberstadt@mozilla.com and I'd be very grateful.
November 06, 2011 03:56 AM
November 03, 2011
This quarter I became the proud owner of Talos (well at least for a quarter or two). Over the last few years talos has not had much churn, but this year (2011 proper) we have seen addons, responsiveness, xperf, mozafterpaint and experiments with eideticker. With all of this talos has grown and more people are working on writing patches for it.
So there are plenty of efforts underway to refactor talos to make it easier to expand. This is fine and dandy, but for a developer wanting to help out or reproduce a bug, running talos is next to impossible. We have standalone talos, but that still requires some effort and hacking.
If you are interested in running talos, or if you have some pet peeve that you have encountered while running talos please file a bug, comment on existing bugs, or let us know in #ateam on irc.

November 03, 2011 01:49 PM
October 25, 2011
So as others have been posting about, we’ve been making some headway on the GoFaster project. Unfortunately it seems like we’re still some distance away from reaching our magic number of a 2 hour turnaround for each revision pushed.

It’s a bit hard to see the exact number on the graph (someone should fix that), but we seem to be teetering around an average of 3 hours at this point. Looking at our build charts, it seems like the critical path has shifted in many cases from Windows to MacOS X. Is there something we can do to close the gap there? Or is there a more general fix which would lead to substantial savings? If you have any thoughts, or would like to help out, we’re scheduled to have a short meeting tomorrow.
Anyone is welcome to join, but note that we’re practical, results-oriented people. Crazy ideas are fun, but we’re most interested in proposals that have measurable data behind them and can be implemented in reasonable amounts of time.
October 25, 2011 10:13 PM
October 24, 2011
October 20, 2011
At the beginning of September, I was asked to write yet another automated
test harness for testing user responsiveness. Among other things, the harness
needed to be capable of automating a wide range of user interactions in Firefox (such as
opening context menus, clicking buttons etc). Oh and by the way this needs to
be finished as quickly as possible.
It turns out that machines aren't very good at interacting with user interfaces designed for humans.
Properly...
October 20, 2011 09:00 PM
October 14, 2011
In response to David Boswell’s post on getting involved at Mozilla, I thought I’d relate my own story.
I worked at a company called SimDesk that decided to reuse the Thunderbird and Sunbird code bases and make a great email application–this was long before the Lightning extension came into being. Like any good closed-source company, we stole the code and worked on it in secret until we had a shining example of an “Outlook killer” (well, more or less).
Then we started feeling like we should contribute some of that code back to Mozilla. We had a bunch of very awkward meetings with Dan Mosedale and Mike Shaver as they tried to teach us how to do open source. They kept saying, “just submit a patch”; we kept wondering which lawyers we’d have to get involved to do that.
Eventually, Mike Hovis (an old friend and superior developer) and I started writing those patches. It became clear that our changes wouldn’t apply cleanly to the newly refactored “Lightning” source base. We decided that I’d make it part of my job (20% of my time, as I recall) to make patches for functionality we cared about and get it to the Mozilla calendar team.
I started attending the calendar team’s public meetings, and during one, when they asked if anyone wanted to lead a calendar QA team, I volunteered. I had no idea how to actually do this, but I wanted to try organizing online to see if some of my offline organizing skills would translate. My contribution of time grew. As SimDesk directed me to work on Outlook extensions rather than an Outlook killer, I spent more and more of my time working with my calendar team, writing patches, mentoring, and aiding volunteers as they found their roles as leaders and developers in the calendar project.
And one day, when I could plainly see the writing on the wall, I asked Dan if Mozilla would actually consider a resume from me. After his enthusiastic “yes”, I applied, and the rest is history.
Starting in the calendar project was incredible. It was smaller (of course so was Mozilla in those days–even though it felt huge to me at the time). It was easier to see your impact in such a small space, easier to identify volunteers, and easier to mentor people through the process and watch them become leaders.
Starting in that small area was also fortuitous because there was so much that needed to be done and opportunities were everywhere.
I still think that there are small areas across Mozilla where people can start and have a similar experience. However, I think that Mozilla seems so monolithic these days that it is daunting to even try to find those niches where you can start out as a volunteer. It is up to us on our teams to identify those areas where people can start, publicize them, and help people make that leap from “casually interested party” to “volunteer”. In that vein, I tried articulating the roles that we’d like to see people step up to fill on my team. If you’re interested, you know where to find me.

October 14, 2011 01:55 AM
October 13, 2011
We’re looking at updating our Android support with these PandaBoard cards. We already run with Tegras in our automation, but the Tegra 250s are discontinued, and we can’t update to newer versions of Android with them, so we’re introducing PandaBoards.
Well, PandaBoards come with nothing, not even a power supply. They can be powered off USB, but it’s pretty difficult to get adb working in that state (if you have steps, I’d love to hear them). So, here are the steps to getting something usable working (See the official getting started too):
- Order power cord, specifically the adapter and the cord
- Order 8-16GB SDCard
- Ensure you have a mini USB cord
- Ensure you have a CAT 5 network cable.
Once you have this, you can build or download a build onto your SDCard. Oh yeah, you’ll need an SDCard writer/reader. Most computers have them by default these days, thankfully.
Then, plug it all in, and it should work. I’ve noticed a few oddities:
- Our SUTAgent had some difficulties at first, but now it seems to be working fine. Still debugging this.
- ADB won’t work if the card is plugged in when it boots. I think this is due to the build, as I seem to recall seeing an issue on it earlier. I’ll keep researching and will try some different builds to find something more stable. In the meantime, unplug when you reboot the card, and plug in after the card is up and running. Also, you won’t see the “USB” notification that you usually see in Android. So, don’t expect that.
- There is something going on with the package manager. I installed Fennec, but the pm doesn’t list it, and claims that it is not installed. However, it runs fine, appears in the applications, and can’t be re-installed. I just can’t uninstall it. Still investigating that too, and like the other OS level issues, I’m wondering about this downloaded build.

October 13, 2011 12:16 AM
October 07, 2011
Despite making a dramatic shift from front-end development to back-end stuff since I started at Mozilla a few months ago, I’ve still had occasion to have to do a fair bit of user-facing code, even if an audience of other developers is a bit more limited than what I’ve been used to. Since my mission is to make the rest of Mozilla more productive, it’s worth putting a bit of time and intention into the user interface for my stuff. If I can reduce learning curves or streamline day-to-day workflows, that’s a win for everyone since they can spend that much more time rocking at their jobs (whether that be release engineering, platform work, or whatever). This brings up a point that I’ve had in the back of my mind for a while:
Despite conventional wisdom, developers can design half-decent user interfaces (if they try)!
I used to be certain that a project really needed graphic designers and/or usability experts to provide guidance on UI issues, but my experience over the last few years with iOS/web development has made me reconsider. Sure, pixel pushing and vector art is never going to be a programmer’s strong suit (and there’s certain high-level techniques that take years of study to acquire/understand), but the basic principles behind good UI design are accessible to anyone. There’s really only three core skills:
* An ability to put yourself in the shoes of the user. Who are you designing for, and what are they trying to accomplish? How can I streamline my UI to help them quickly solve the task at hand? This is one of the reasons why I find user stories so helpful.
* An understanding of common vocabulary for describing/designing applications and knowing what is “good”. Unfortunately I haven’t found anything like this for the web, but Apple’s human interface guidelines have some good general advice on this (just ignore the stuff specific to phones/tablet apps if that’s not what you’re doing).
* A willingness to iterate. The best ideas usually aren’t apparent immediately, and may only come out of a back and forth. It’s been my experience that the more constructive dialog there is between people actively involved in the project on user experience issues, the better the end result is likely to be.
For example, one of the things that release engineering has found most useful in the GoFaster Dashboard has been the build charts. Believe it or not, the idea for that view started out as this useless piece of junk (I can say that because I created it). It was only after a good half hour back and forth on irc between myself, jgriffin, and jmaher (all of us backend/tool developers) that we came up with the view that inspired so much good analysis on the project.
All this is not to say that usability experts and graphic designers don’t have special skills that are worthy of respect. Indeed, if you’re a designer and would like to get involved with our work, please join us, we’d love your help. My only point is that on a project where a design resource isn’t available, thinking explicitly about usability is still worthwhile. And even where you have a UX expert on staff, programmers can have useful feedback too. Good UI is everyone’s responsibility!
October 07, 2011 02:48 PM
October 05, 2011
At Mozilla we have made unit testing on Android devices as important as desktop testing. Earlier today I was asked how we measure this and what our definition of success is. The obvious answer is no failures except for code that breaks a test, but in reality we allow for random failures and infrastructure failures. Our current goal is 5%.
So what are these acceptable failures, and what does 5% really mean? Failures can happen when we have tests which fail randomly, usually poorly written tests or tests which were written a long time ago and hacked to work in today’s environment. This doesn’t mean any test that fails is itself the problem; it could be a previous test that changes a Firefox preference by accident. For Android testing, this currently means the browser failed to launch and load the test webpage properly or it crashed in the middle of the test. Other failures are the device losing connectivity, our host machine having hiccups, the network going down, sdcard failures, and many other problems. With our current state of testing this mostly falls into the category of losing connectivity to the device. Infrastructure problems are indicated as Red or Purple, and test-related problems as Orange.
I took a look at the last 10 runs on mozilla-central (where we build Firefox nightlies from) and built this little graph:

Firefox Android Failures
Here you can see that our tests are causing 6.67% of the failures and 12.33% of the time we can expect a failure on Android.
We have another branch called mozilla-inbound (we merge this into mozilla-central regularly) where most of the latest changes get checked in. I did the same thing here:

mozilla-inbound Android Failures
Here you can see that our tests are causing 7.77% of the failures and 9.89% of the time we can expect a failure on Android.
This is only a small sample of the tests, but it should give you a good idea of where we are.
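One plausible way to turn the job colours into percentages like the ones above is sketched below. The grouping of Red/Purple as infrastructure and Orange as test failures comes from the post; the function name, the "green" label for passing jobs, and the sample data are made up for illustration and are not the numbers behind the graphs. Whether the 5% goal applies to test failures alone or to all failures is not stated, so the sketch just reports each rate.

```python
from collections import Counter

INFRA_COLORS = {"red", "purple"}   # infrastructure problems
TEST_COLORS = {"orange"}           # test-related problems


def failure_rates(job_colors):
    """Return test, infrastructure and overall failure rates as percentages."""
    counts = Counter(job_colors)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no jobs to summarize")
    test_failures = sum(counts[c] for c in TEST_COLORS)
    infra_failures = sum(counts[c] for c in INFRA_COLORS)
    return {
        "test": 100.0 * test_failures / total,
        "infrastructure": 100.0 * infra_failures / total,
        "overall": 100.0 * (test_failures + infra_failures) / total,
    }


# Made-up sample of 20 jobs, NOT the data from the graphs above.
sample = ["green"] * 17 + ["orange"] * 2 + ["red"]
rates = failure_rates(sample)
```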

October 05, 2011 07:35 PM
October 03, 2011
September 29, 2011
I recently implemented some improvements in the A-Team's Automated Speed Tests as per some requests I got back when I first announced them in July. Not everything's done, but I think this is a good point to advertise what's been changed thus far.
Firstly, I ditched the awful BIRT reports in favour of a custom web app that is faster, easier to use, and more flexible. You can restrict the date range (default is the last four weeks) and switch between tests and machines. The graph is also more responsive when turning on and off particular browsers (just click on the name in the legend). All the same data is there, but it's less cluttered and, well, less ugly!
By the way, BIRT appears to have a security hole in that it will insert the value of some GET parameters directly into the page without sanitizing them! So beware of that if you want to use BIRT for some reason.
Secondly, more tests! The first is MazeSolver, one that Firefox is particularly bad at. The second is test262, a JavaScript conformance test that has unfortunately made the name "Speed Tests" a bit of a lie.
A couple interesting observations I've made recently:
- Nightly recently got better at Santa's Workshop. I ran the test myself to see, and Nightly maintains a higher number of elves for longer, but eventually it goes back down to one. So still a ways to go, but the median FPS is higher, at least. Nightly still also doesn't display all the colours properly.
- Seems that Nightly 10.0a1 has one less pass in test262 than 9.0a1.
If you're wondering, the two Windows machines are running different hardware; Win7 1 is a 32-bit machine, and Win7 2 is a 64-bit machine, although I only switched it to use the 64-bit nightlies today. Email me if you want more particulars on the hardware.
Still more to come, including
- more tests!
- more browser strains!
- more platforms!
And, as always, please let me know if there's more I can do to make the framework, tests, or data more useful.
September 29, 2011 09:31 PM
Last week I created a python webserver as a patch for make talos-remote. This ended up being fraught with performance issues, so I have started looking into it. I based it off of the profileserver.py that we have in mozilla-central, and while it worked I was finding my tp4 tests were timing out.
It turns out we were using a synchronous webserver, which is easy to fix with a ThreadingMixIn, just like the chromium perf.py script:
import BaseHTTPServer  # Python 2 module names
from SocketServer import ThreadingMixIn
class MyThreadedWebServer(ThreadingMixIn, BaseHTTPServer.HTTPServer):
    pass
Now the test was finishing, but very very slowly (20+ minutes vs <3 minutes). After doing a CTRL+C on the webserver, I saw a lot of requests hanging on log_message and gethostbyaddr() calls. So I ended up overriding the address_string and log_message methods and things worked.
import SimpleHTTPServer  # Python 2 module name

class MozRequestHandler(SimpleHTTPServer.SimpleHTTPRequestHandler):
    # I found on my local network that calls to this were timing out
    def address_string(self):
        return "a.b.c.d"

    # This produces a LOT of noise
    def log_message(self, format, *args):
        pass
Now tp4m runs as fast as using apache on my host machine.
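For anyone trying the same trick on Python 3, where BaseHTTPServer and SimpleHTTPServer were merged into http.server and SocketServer became socketserver, a rough equivalent might look like the sketch below. The class names and the make_server helper are hypothetical, and this is not the talos patch itself. (Python 3.7 and later also ship a ready-made http.server.ThreadingHTTPServer.)

```python
import http.server
import socketserver


class QuietRequestHandler(http.server.SimpleHTTPRequestHandler):
    def address_string(self):
        # Return the raw client IP instead of attempting a reverse DNS lookup.
        return self.client_address[0]

    def log_message(self, format, *args):
        # Drop per-request logging; it is very noisy during a page-load run.
        pass


class ThreadedWebServer(socketserver.ThreadingMixIn, http.server.HTTPServer):
    # Let the process exit on Ctrl+C even if worker threads are still busy.
    daemon_threads = True


def make_server(host="127.0.0.1", port=0):
    """Create (but do not start) a threaded server; port 0 picks a free port."""
    return ThreadedWebServer((host, port), QuietRequestHandler)
```

Calling `make_server().serve_forever()` would then serve the current directory, handling each request in its own thread.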

September 29, 2011 02:54 PM
September 21, 2011
Today we rolled out changes to talos such that tests that use the pageloader (chrome, nochrome, tp) will have the option to report the page load times after we receive a MozAfterPaint event instead of a Load event.
Currently this is only active on Mozilla-Central as we will run the numbers side by side to ensure we get a solid new baseline number. In addition we upgraded the version of flash we are using and this seems to cause a small increase in the numbers as well.
We will run these side by side for a week and then we will turn off the non paint versions. This will go branch by branch until we have no more side by side tests running. If you look at the talos names, the original tests are marked as old_{testname} (e.g. old_tp, or old_chrome), and on the graph server the new tests are called {testname}_paint (e.g. tp_paint, tdhtml_paint, etc…)

September 21, 2011 06:39 PM
September 19, 2011
September 15, 2011
September 06, 2011
For the GoFaster project, releng and the A-team have been working on various tasks which we hope will result in getting the total commit to all-tests-done time down to 2 hours for the main branches (try excluded). This total turnaround time was 6-8 hours a couple of months ago when we began this project.
We’ve recently made some improvements that seriously reduce the total machine time required to run all tests for a given commit. These include hiding the mochitest results table, removing packed.js from mochitest, and streamlining individual slow tests (see bug 674738, bug 676412, and bug 670229). Together these have brought the total machine time for tests down from about 40 hours to around 25 hours per commit, a big win.
However, the total turnaround times are still much slower than our goal:

We already knew that PGO builds are slow, and jhford is working on turning on-demand builds into non-PGO builds and making PGO builds every four hours (bug 658313). However, we needed a way to dig deeper into the data to see what our other pain points are.
Will Lachance made some awesome build charts which help us visualize what’s going on in these buildbot jobs. Clicking any commit will show a chart that displays all the relevant buildbot jobs in relative clock time; this makes it easier to see where the bottlenecks are.
Build times
Display the build chart for just about any commit (e58e98a89827 for instance), and you’ll see the problem right away: just about every commit includes builds that far exceed 2 hours. These aren’t always opt builds, and they sometimes occur even on our ‘fast’ OS: linux. Check out 5d9989c3bff6, which has a linux64 opt build that takes 214 min, compared to the linux32 opt build that takes 61 minutes. 198c7de0699d has an OSX 10.5 debug build that takes 171 minutes, but the 10.6 debug build takes only 82 minutes. Clearly, we can’t hit our 2-hour goal with builds that take 2+ hours. What’s going on?
It’s necessary to spend a little time digging through build logs to find out. It turns out there are multiple factors.
- We already know that PGO builds are slow, particularly on Windows. Once bug 658313 lands, we expect the overall situation to improve dramatically.
- On some builds, the ‘update’ step includes a full ‘hg clone’ of mozilla-central, while others use ‘hg pull -u’. Below is a graph of update times; the average time for an update that includes ‘hg clone’ is 12.9 min, for those that use ‘hg pull’ the average is 0.6 min. Each full clone is costing us an average of 12 minutes.

- On some build slaves, we do a full build (with no obj dir from a previous build), on others we do an incremental build. Below is a graph showing incremental vs full compile times for opt and debug builds. On average, full compiles are taking 17 minutes longer than incremental ones.

- We have a mix of slow and fast slaves. This can easily be seen in the below graph of linux compile times. On linux and linux64 builds, full compiles with moz2-linux(64)-* slaves are slow (those > 75 min), while those made with linux(64)-ix-* slaves are fast (those < 75 min). 32-bit mac builds show a similar split, with those on moz2-darwin9* slaves slow, and those on bm-xserve* slaves fast. Hardware doesn’t appear to create a significant difference for windows and 64-bit mac builds.

- On macosx64 machines, the ‘alive test’ step takes an average of 6 min (vs 1 min on other os’s).
- The ‘checking clobber times’ step often takes just a couple of seconds, however when this step actually results in some clobbering being done, it can take up to 21 minutes (average: 6 min).
When all these factors coincide, we can get builds (which include compile, update, and other steps) that exceed 4 hours. This suggests doing away with on-demand PGO builds may not in itself get us to our 2-hour goal.
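To see how the averages in the list above stack up, here is a rough back-of-the-envelope sketch. It assumes the factors are independent and simply additive, which the data does not establish, and it leaves out the slow-slave effect because no average is quoted for it. The constant and function names are made up for illustration.

```python
# Average figures quoted in the list above, in minutes.
HG_CLONE = 12.9              # 'update' step with a full hg clone
HG_PULL = 0.6                # 'update' step with hg pull -u
FULL_COMPILE_PENALTY = 17.0  # full compile vs. incremental compile
ALIVE_TEST_MACOSX64 = 6.0    # 'alive test' step on macosx64
ALIVE_TEST_OTHER = 1.0       # 'alive test' step on other OSes
CLOBBER_AVERAGE = 6.0        # 'checking clobber times' when clobbering happens


def extra_minutes(clobbered, macosx64):
    """Average overhead relative to an unclobbered build on another OS.

    Assumes a clobbered slave needs a full clone and a full compile,
    and that the individual overheads simply add up.
    """
    extra = 0.0
    if clobbered:
        extra += (HG_CLONE - HG_PULL) + FULL_COMPILE_PENALTY + CLOBBER_AVERAGE
    if macosx64:
        extra += ALIVE_TEST_MACOSX64 - ALIVE_TEST_OTHER
    return extra
```

Under these assumptions a clobbered build carries roughly 35 minutes of average overhead before any slow-slave or PGO cost is counted, and about 40 minutes on macosx64.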
From this data, two of the more obvious ways to improve our build times might be:
- Investigate retiring slow linux and 32-bit mac build slaves.
- Investigate ways to reduce clobbering. Clobbering itself takes time (see bullet #6 above), but also indirectly costs time through increased update and compile times. Currently, about 51% of our builds are operating on clobbered slaves, requiring full hg clones and full compiles. If this number could be reduced, we might see a significant reduction in our average turnaround times.
Test times
According to Will’s build charts, the E2E time for tests is often within our 30-minute target range. The exception is mochitest-other on debug builds, which often takes from 60 to 90 minutes. We could improve this situation somewhat by splitting mochitest-browser-chrome (the longest-running chunk of mochitest-other) into its own test job.
Additionally, wait times for test slaves running android and win 7 tests are sometimes non-trivial; see e.g. the details for commit 97216ae0fc04. We should try to understand why this happens; the graph of test wait times doesn’t show a clear trend, other than highlighting the fact that wait times for windows and android are usually worse than the other os’s.


September 06, 2011 11:50 PM
September 01, 2011
A bit quiet here for the last few months. What’s been happening?
1. I got married and had a wonderful honeymoon in France.
2. I started a fantastic new job with Mozilla’s tools & automation group. Currently working on bringing down the build/test times for Firefox (part of a project called GoFaster), which has been really interesting.
3. I moved into a fantastic new apartment in an old victorian building near Vendôme metro.
In short, life has been treating me really well! More updates soon.
September 01, 2011 04:07 PM
August 12, 2011
August 09, 2011
I’m sure anyone who has ever submitted a patch to a Mozilla tree is familiar with this drill:
- hg push
- check TBPL, wait
- check TBPL again, wait some more
- go to Starbucks for a caramel macchiato, install a new OS on your laptop, review all the patches in your queue, plan next winter’s tropical vacation, check TBPL, and….
- wait some more
Recently, the total end-to-end time from submit to all-tests-done has been around 6-8 hours, depending on load. That’s too long, and RelEng and the A-Team think we can do something about it. For the past couple of months we’ve been working on the GoFaster project; our goal is to get that turnaround time down to 2 hours. We have a list of tasks, and recently one of these landed with some significant improvements.
Cameron McCormack wrote a patch which hides the mochitest results table when MOZ_HIDE_RESULTS_TABLE=1 (see: bug 479352). The initial version of this patch caused frequent hangs during mochitest-1/5. We didn’t discover the reason behind this, but I updated the patch to hide the result table in a different way, and the hang vanished. I pushed this change to mozilla-central, and Cameron made a table displaying before and after durations for all the test runs.
The results? That one change saves about 13 hours of machine time per checkin. The entire suite of unit tests which prior to that change took about 40 machine-hours to run now takes 27. Wow!
What kind of improvement in the end-to-end time does that translate into? I’m not sure. Sam Liu, an A-Team intern, has been working on a dashboard to help track this, but it’s currently using canned (stale) data. RelEng is working on exposing live data to be consumed by the dashboard, and when that’s ready we should be able to easily track the effect of changes like this in the overall time.
Meanwhile, check out the project’s wiki page or attend one of our meetings. If you have thoughts on ways we can improve our total turnaround time, we’d love to hear from you.

August 09, 2011 05:21 PM
July 12, 2011
It's hard to find a discussion of the speed of modern browsers that doesn't mention Microsoft's Test Drive speed demos. It's a common occurrence to find hundreds of fish swimming around a graphics developer's monitor. Continuing our mission to make developers' lives easier, the Mozilla A-Team has put together a framework to automatically run a few of these tests and put the results online. They're a bit ugly and slow, but some day I'll get around to cleaning them up.
We have set up a small framework that executes 5 speed tests twice daily against all the major browsers: IE, Safari, Chrome, Opera, and Firefox. Since we're particularly concerned with the latter, we run both the latest released version of Firefox and the latest Nightly.
For most tests, we sample the FPS every 5 seconds, since there is often a ramping-up time as objects are created and such. We then plot the median FPS for the test for each browser to make comparison easy. The results of any particular test run are also available through links in the graph and table for those curious about how the browser performs at various points during the test run.
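As a small illustration of why the median suits this kind of data, consider the sketch below. The function name, the optional ramp-up cut-off, and the sample values are hypothetical and are not taken from the framework or its results.

```python
import statistics


def median_fps(samples, ramp_up_samples=0):
    """Median of the FPS samples, optionally skipping the first few.

    ramp_up_samples is a hypothetical knob for discarding readings taken
    while objects are still being created.
    """
    steady = samples[ramp_up_samples:]
    if not steady:
        raise ValueError("no samples left after discarding ramp-up")
    return statistics.median(steady)


# Made-up readings taken every 5 seconds; the first two are ramp-up.
readings = [12, 35, 58, 60, 60, 59, 60]
```

With these readings the mean is dragged down to about 49 FPS by the two ramp-up samples, while the median stays at 59, close to the steady-state rate.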
One test, Psychedelic Browsing, uses a different metric, namely, the RPMs of a spinning patterned wheel. This is sampled only once, at the end of the test.
Disclaimer: I won't get into technical issues here, but suffice it to say that automating one browser is a little tricky; automating 5 browsers from different vendors is very tricky. One way we've reduced the number of variables is by limiting network access to prevent automatic, potentially performance-affecting browser and OS upgrades. This requires some periodic manual maintenance to update everything. We also reboot the machine after every test run. But mistakes happen and bugs crop up, so there are gaps in some of the graphs where one or more browsers were unable to start up or load the test suite, and some swings in results where the browser or machine was perturbed by some force (unfortunately this happened recently, which is why the last Firefox 4 results are all over the place after running stably for weeks). But the main method we employ to deal with all this is by running the suite twice every day, even though all browsers (except Nightly) change much less frequently. So, as in any scientific endeavour, ignore the outlying points and focus on the trends.
Here are a few things I've noticed, some obvious, some less so.
- Different browsers excel at different tests, though, not surprisingly, IE does well on all of them. Firefox is good or excellent at 4 of the 5 tests, but it's much worse than Chrome and IE at Santa's Workshop (see roc's post about this).
- Some browsers max out (60 FPS) on some tests. These tests would have to be modified for a true comparison. However some tests report FPSs above 60, which means they must be using some sort of "virtual" frame rate, since no monitor can display that much. More investigation needs to be done to see if this is a valid statistic for comparison.
- Nightly generally outperforms Firefox 4 except where they have maxed out. This is especially noticeable in SpeedReading, where 4 was only at about 32-33 FPS, but Nightly and Firefox 5 are at 60 FPS.
- Some browser/test combinations are quite stable, with almost all results being the same, and some vary up and down. For instance, most browsers have stable results for Mr Potato Gun, but IE varies by 20-30 FPS.
- OS and browser updates definitely affect performance. Recently the network was left fully connected, and Firefox, Opera, and potentially Windows downloaded updates during test runs. This dropped performance noticeably.
As usual, feel free to make suggestions. Specifically, if there are particularly useful tests out there, I am more than willing to add them to the suite.
July 12, 2011 05:22 PM
May 31, 2011
Before I started interning at Mozilla back in May 2010, I really didn't know what to expect. How does a non-profit company with an open source product operate? After working at giant corporations like IBM and McAfee I couldn't fathom what this experience would be like.
Although I've always been somewhat of a Firefox fanboy, I also had my worries. You may remember that at that time, Chrome had been out for a...
May 31, 2011 08:15 PM
May 20, 2011
May 06, 2011
The addons performance testing system has been up and live for a few weeks now. With so many more eyes than mine on the system I’ve seen a bunch of bug filings - which is awesome. With each bug fixed the Talos system works better both for addons and for general Firefox performance testing.
Here’s what’s already fixed and rolled out:
Here’s what’s fixed but waiting on deployment (which will probably happen early next week):
There are still more bug fixes in the works. Next on my list is Bug 648225 - Performance of platform-dependent add-ons is not tested, which will improve the testing system’s simulation of the real world.
In terms of future plans, we are going forward with a second quarter goal of completing an on-demand addon testing service. Basically, this would allow an addon author to request that their addon be tested at any time, instead of waiting for our weekly tests of the 100 most downloaded addons. This will give us coverage of more addons along with a means to double or triple check results. Did your addon perform poorly? Retest! Are you suspicious of the results? Retest! Did the addon fail to download or install? Retest! If you want to follow along, the bugs that will lead to this system are:
Once the on-demand system is in place, we’ll be working to introduce a greater variety of tests. The ts (test startup) was an easy test to begin with, but it can be of limited meaning for a lot of addons. I’d like us to cover far more of the available Talos tests, concentrating on the tp (test pageload) tests. Tp is interesting because it uses a set of collected, local web pages (100 culled from the Alexa top 300 list of worldwide top used sites) that are then cycled through ten times. With a given addon installed and active (for some meaning of ‘active’ which will be different for different addons) this will give a greater idea of how real world page load time is impacted. As a side benefit of Tp, Talos will monitor the memory footprint and CPU usage during this test. By comparing an addon run to a no-addon run we’ll be able to observe memory and CPU usage differences.
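The addon versus no-addon comparison described above boils down to a per-metric percentage difference. Here is a minimal sketch of that arithmetic; the function name, the metric names, and every number are hypothetical and do not come from Talos or from any real test run.

```python
def addon_overhead(baseline, with_addon):
    """Percent change of each metric relative to the no-addon baseline."""
    return {
        metric: 100.0 * (with_addon[metric] - baseline[metric]) / baseline[metric]
        for metric in baseline
    }

# Made-up numbers: page load time (ms), memory footprint (MB), CPU usage (%).
no_addon = {"tp_ms": 400.0, "memory_mb": 200.0, "cpu_pct": 20.0}
some_addon = {"tp_ms": 440.0, "memory_mb": 230.0, "cpu_pct": 21.0}

for metric, change in addon_overhead(no_addon, some_addon).items():
    print(f"{metric}: {change:+.1f}%")
```

Reporting a relative change rather than raw values makes results from different machines and platforms easier to put side by side, although run-to-run noise still has to be accounted for.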
I believe that we are also going to need to put effort into providing some testing hooks/prefs for Talos to use, as in Bug 459965 - Add standardized support for first-run pages to install.rdf. Talos doesn’t react well to first-run pages and it would be great to have the means to disable them with a single pref, instead of a customized pref per addon - especially since users normally see the first-run page only once, right after installation, and never again. I believe that there are probably some other settings like this that would standardize creating a Talos testing environment, and thus make addon testing more applicable to the type of bulk testing that Talos does.
Testing addons has been a whole new area of Talos testing, and it has its own unique set of challenges. I’ve spent most of my time at Mozilla concentrating on automated browser performance testing and the addons world is still quite new to me. Each addon affects the browser in its own way; while I’ve grown accustomed to standardizing tests across browser versions, platforms and machines, this is definitely a new horizon. The Talos tests are just one way to look at performance impact, not the final word. They may never be appropriate for some types of addons, but I most definitely want to work with addon developers to try and get the best coverage that we can.
May 06, 2011 11:36 PM
tl;dr From work on the War on Orange, I spun off three flot plugins: flot-axislabels, flot-hiddengraphs, and flot-tickrotor. Use 'em how you will & feel free to gimme feedback.
I'm going to step away from telling you about the A-Team's projects for a few minutes and talk about our by-products. Yup, software by-products. Think virtual horse-glue or electronic fertilizer. Well actually those comparisons aren't very good. Anyway, what I mean is that sometimes I overcome my natural laziness and package up bits of the work that I do that I think would be of particular benefit to others.
The War on Orange is all about statistics, and statistics are boooooring. Pictures, however, make the whole stats thing somewhat bearable, so the War on Orange makes extensive use of graphs. We decided on flot, a popular program for doing a whole buncha different plot types. Of course flot can't do everything, so it has support for plugins so you can add functionality without too much effort.
The first thing that we noticed was absent from flot was axis labels. We have graphs that show daily orange counts alongside the "orange factor" (oranges per test run), and we were using a second axis since the two stats are orders of magnitude apart. Not sure why axis labels weren't available out of the box, since they seem to be a pretty fundamental part of a graph, but luckily someone had already started on a plugin. Alas, it only provided labels for primary axes. But the plugin structure was all there, along with some interesting hacks. I've had a github account for a little while but didn't really use it, so it was very exciting to get down to some hardcore forking action. A while later, I had secondary-axis labels going in my flot-axislabels fork:

In the spirit of github, I submitted a pull request so the original author could incorporate my work into the original plugin, but I guess he had lost interest and never accepted it. So as far as I know, my improved flot-axislabels plugin is still the most fully featured one out there, although it does have a bit of weird behaviour sometimes as a side effect of the hacks needed to fit the labels in. Btw I accept pull requests...
The War on Orange has been going on for some time, and all the information we were trying to cram into our graphs started making them feel cramped. Experience modifying flot-axislabels gave me the courage to create my own plugin to solve this problem: flot-hiddengraphs. This plugin allows you to hide and show the various graphs on one plot via the legend:


I made some interesting discoveries while working on that plugin, including the fact that mouseenter and mouseleave don't seem to always fire. Maybe if I weren't so lazy I'd fix it to use mousemove. Oh and it's still a bit ugly and I dunno why I have this fascination with links in square brackets. Did I mention I accept pull requests?
Well now that this plugin thing was old hat, I had to get creative to continue to ensure my life as a software developer was still painful. We've got a graph that can have quite a lot of columns (whether this is the right kind of display for this data is another matter). While conducting a different war, the Battle to Understand BIRT (aka BIRT Y U NO LIKE ME?), I stumbled on a nice control that allows you to rotate tick labels, so you can fit more, and longer, labels in. I started with the same hack as flot-axislabels to allocate some space for the labels... but how much space? Well I've never worked in graphics (to which my UIs will attest), and ten years is ample time for formal education to abandon me, so I couldn't even think of the word trigonometry at first. But Google knows all, and a short while later I was all Math.sin() and Math.cos() and Math.PI. Felt good to know that a few more university dollars paid off. So now the universe has flot-tickrotor (making up for a string of boring project names).
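For the curious, the "how much space?" question comes down to the bounding box of a rotated rectangle. The sketch below shows that geometry in Python rather than JavaScript, and it is not the plugin's actual code: the function name and the idea of measuring a label as a plain width-by-height box are assumptions made for illustration.

```python
import math

def rotated_label_box(width, height, angle_degrees):
    """Bounding box (width, height) of a width x height label rotated by angle_degrees."""
    theta = math.radians(angle_degrees)
    # Each side of the rotated rectangle contributes its projection onto each axis.
    box_width = abs(width * math.cos(theta)) + abs(height * math.sin(theta))
    box_height = abs(width * math.sin(theta)) + abs(height * math.cos(theta))
    return box_width, box_height

# A hypothetical 100x10 pixel label rotated by 45 degrees.
print(rotated_label_box(100, 10, 45))
```

The box height is the vertical room the margin has to leave under the axis, and the box width hints at how far a label spills sideways past its tick.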

Now's the part in which I tell you what sucks about it: I had some problems with the allocation of space (seemingly the hardest part of these plugins) for long labels slanted down and to the right, and 'cause I'm lazy and think down-left-slanting labels look better anyway, I left it out. Automatically scaling fonts would be teh aw3som3 as well. Pull requests: I accept 'em.
So yeah, please use them and tell people about them and complain to me when they don't work and then send me pull requests when I tell you I'm too lazy to fix them.
If you're still reading, maybe you care about some of the interesting bits (read: sublime (in the Schopenhauerean sense) hacks) in flot plugin development. The main one I had to contend with was, as I've mentioned above, the allocation of space for new or bigger elements. As the code comments in the original flot-axislabels state,
This is kind of a hack. There are no hooks in Flot between
the creation and measuring of the ticks (setTicks, measureTickLabels
in setupGrid() ) and the drawing of the ticks and plot box
(insertAxisLabels in setupGrid() ).
Therefore, we use a trick where we run the draw routine twice:
the first time to get the tick measurements, so that we can change
them, and then have it draw it again.
What that comes down to, I figured out after a while, is that there's no way to tell flot "hey make the graph itself smaller 'cause I got stuff to put in the margin", since you don't know how big the graph is going to be until the plot is drawn. So a plugin that wants those margins to be bigger needs to do some calculations based on the standard size, set the label-dimension options appropriately, then trigger the draw event a second time. Now you've got spacier margins and can insert your elements. Note that this actually seems to be invisible to the user; I guess the first and second draw events happen before anything is actually displayed.
Unfortunately, this approach can screw over other plugins that also want to put stuff in the margins. flot-axislabels is actually okay because it just allocates a bit more space and doesn't replace anything. But flot-tickrotor replaces the labels entirely... oh wait, maybe I can fix it if I display the labels in the first draw and then just calculate how much bigger the labels will have to be and ugh man this stuff is tedious. Anyway for now, if you use both, make sure tickrotor is loaded first. Because you're sick of hearing it, I'll make up a French version: j'accepte les demandes de tire. Oh hey there's a French github... demandes de "pull"? Pah, how unoriginal.
May 06, 2011 08:32 PM
April 23, 2011
April 21, 2011
Over the last year and a half I have been editing the talos harness for various bug fixes, but just recently I have needed to dive in and add new tests and pagesets to talos for Firefox and Fennec. Here are some of the things I didn’t realize or have inconveniently forgotten about what goes on behind the scenes.
- tp4 is really 4 tests: tp4, tp4_nochrome, tp4_shutdown, tp4_shutdown_nochrome. This is because in the .config file, we have “shutdown: true” which adds _shutdown to the test name, and running with --noChrome adds _nochrome to the test name. The same goes for any test that is run with the shutdown=true and nochrome options.
- when adding new tests, we need to add the test information to the graph server (staging and production). This is done in the hg.mozilla.org/graphs repository by adding to data.sql.
- when adding new pagesets (as I did for tp4 mobile), we need to provide a .zip of the pages and the pageloader manifest to release engineering, as well as modifying the .config file in talos to point to the new manifest file. See bug 648307.
- Also when adding new pages, we need to add sql for each page we load. This is also in the graphs repository, but in pages_table.sql.
- When editing the graph server, you need to file a bug with IT to update the live servers and attach a sql file (not a diff). Some examples: bug 649774 and bug 650879
- after you have the graph servers updated, staging run green, review done, then you can check in the patch for talos
- For new tests, you also have to create a buildbot config patch to add the testname to the list of tests that are run for talos
- the last step is to file a release engineering bug to update talos on the production servers. This is done by creating a .zip of talos, posting it on a ftp site somewhere and providing a link to it in the bug.
- one last thing is to make sure the bug to update talos has an owner and is looked at, otherwise it can sit for weeks with no action!!!
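The naming rule from the first bullet can be sketched in a few lines of Python. This is only an illustration of how the suffixes combine, not the code talos uses; the function names and the boolean parameters are made up.

```python
from itertools import product

def talos_test_name(base, shutdown=False, no_chrome=False):
    """Build the reported test name from the base name and the two options."""
    name = base
    if shutdown:
        name += "_shutdown"   # "shutdown: true" in the .config file
    if no_chrome:
        name += "_nochrome"   # run with the noChrome option
    return name

def all_variants(base):
    """Every name a single test can be reported under."""
    return [talos_test_name(base, shutdown, no_chrome)
            for shutdown, no_chrome in product((False, True), repeat=2)]

print(all_variants("tp4"))
```

Each of those names needs its own entry on the graph server, which is why one new test can mean several additions to data.sql.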
This is my experience from getting ts_paint, tpaint, and tp4m (mobile only) tests added to Talos over the last couple months.

April 21, 2011 04:25 PM
April 16, 2011
April 08, 2011
Firefox is known for its extensibility. In fact, over 2.4 *billion* add-ons have been downloaded to date,
meaning there are a lot of people using a lot of add-ons. While having 20+ add-ons can undoubtedly personalize your
browsing experience, it can also be a pain in the arse to manually install them every time you set up a new
Firefox profile. As a developer working on Firefox related automation tools, this is twice as
true since I create a...
April 08, 2011 07:00 AM
March 21, 2011
The A-Team is embarking on a new initiative, and we need your help! After all, the A-Team's customer is Mozilla at large, and we like to keep our customers happy.
The project this time is Autolog. It's intended to be a generic tbpl-like results viewer for all the various projects that have test suites but aren't part of mozilla-central and the related branches.
We've already got a good start on the back-end: we're using an Elastic Search database to store results, and we're serving them up, and accepting new results, via a RESTful interface.
But now's the hard part: the UI! As we mentioned, the original concept was a tbpl-like interface, something clean and easy to scan. But tbpl is tied tightly to tinderbox, so it isn't easy to extend. We've spent some time staring at the code, and it looks like some extensive modifications would be in order, and they wouldn't necessarily make future extensions any easier.
Then we were told about an alternative to tbpl, asuth's ArbPL. This was designed to be extensible and has some neat features: it tells you what area has been changed (e.g. "Accessibility: Tests", "Layout: C++ Code"), it displays some details of failed tests automatically so you don't have to click on the failures first, and, for Mozmill, it has some very pretty stack traces and other information (example from the Thunderbird tree).
To be brutally honest, though, the A-Team is biased towards tbpl's look, both because it's the current standard and because it's cleaner. ArbPL has some very nice features, though, so it might be worth the effort to implement a tbpl-like interface, as time consuming as that might be.
But in the end, it's YOU, the customer, that is important to us. So let's hear it: do you like tbpl? What do you think of ArbPL? Is one much better than the other? Are there aspects you like of one and wish were in the other?
For the linkophobic, here are contemporaneous screenshots from tbpl and ArbPL (click to embiggen).

tbpl

ArbPL
March 21, 2011 10:13 PM
March 15, 2011
Dave Fugate of Microsoft announced an update to test262.ecmascript.org, the test suite for ECMAScript 5. I thought I would check it out. Internet Explorer 9 was tested on 32-bit Windows Vista, while the others were all tested on Mac OS X 10.5.
| Browser | Tests To Run | Total Tests Ran | Pass | Fail | Failed To Load |
| Internet Explorer 9 | 10456 | 10456 | 10439 | 17 | 0 |
| Firefox 4.0 | 10456 | 10456 | 10155 | 301 | 0 |
| Chrome 10 | 10456 | 10456 | 9959 | 497 | 0 |
| Safari 5.0.4 | 10456 | 10456 | 9156 | 1300 | 72 |
| Opera 11 | 10456 | 10456 | 6905 | 3551 | 66 |
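To compare the browsers at a glance, the raw counts can be turned into pass rates. A small sketch follows; the figures are copied from the table above, while the function name and the one-decimal formatting are just choices made for this example.

```python
# (browser, total tests ran, pass, fail), copied from the table above.
results = [
    ("Internet Explorer 9", 10456, 10439, 17),
    ("Firefox 4.0", 10456, 10155, 301),
    ("Chrome 10", 10456, 9959, 497),
    ("Safari 5.0.4", 10456, 9156, 1300),
    ("Opera 11", 10456, 6905, 3551),
]

def pass_rate(total, passed):
    """Percentage of the tests that ran which passed."""
    return 100.0 * passed / total

for browser, total, passed, failed in results:
    print(f"{browser}: {pass_rate(total, passed):.1f}% passed ({failed} failures)")
```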
The test suite looks very cool. Kudos to everyone involved in creating test262!
March 15, 2011 09:04 PM
March 07, 2011
GrafxBot has been updated to include the mochitest version of the WebGL Conformance Tests. When you run GrafxBot tests using the new version, it will run the usual reftests first, followed by the new WebGL tests. Both sets of test results are posted to the database at the end of test.
The WebGL tests may be skipped for a couple of reasons: they’ll be skipped if you have a Mac running an OS X version older than 10.6, or if WebGL isn’t enabled in Firefox on your machine, which could happen if you don’t have supported hardware or drivers. GrafxBot doesn’t try to force-enable either WebGL or accelerated layers.
Partially to support these tests, GrafxBot now reports some additional details about Firefox’s acceleration status, similar to what you see in about:support:
| webgl results | 132 pass / 7 fail |
| webgl renderer | Google Inc. -- ANGLE -- OpenGL ES 2.0 (ANGLE 0.0.0.541) |
| acceleration mode | 2/2 Direct3D 10 |
| d2d enabled | true |
| directwrite enabled | true: 6.1.7600.20830, font cache n/a |
I encourage users to download and run the new version; I’d like to get some feedback before I update it on AMO, to make sure users aren’t running into problems with the new tests.
The new version of GrafxBot can be downloaded here.

March 07, 2011 10:36 PM