Planet Mozilla Research

August 06, 2014

Joshua Cranmer

Why email is hard, part 7: email security and trust

This post is part 7 of an intermittent series exploring the difficulties of writing an email client. Part 1 describes a brief history of the infrastructure. Part 2 discusses internationalization. Part 3 discusses MIME. Part 4 discusses email addresses. Part 5 discusses the more general problem of email headers. Part 6 discusses how email security works in practice. This part discusses the problem of trust.

At a technical level, S/MIME and PGP (or at least PGP/MIME) use cryptography essentially identically. Yet the two are treated as radically different models of email security because they diverge on the most important question of public key cryptography: how do you trust the identity of a public key? Trust is critical, as it is the only way to stop an active, man-in-the-middle (MITM) attack. MITM attacks are actually easier to pull off in email, since all email messages effectively have to pass through both the sender's and the recipients' email servers [1], allowing attackers to be able to pull off permanent, long-lasting MITM attacks [2].

S/MIME uses the same trust model that SSL uses, based on X.509 certificates and certificate authorities. X.509 certificates effectively work by providing a certificate that says who you are which is signed by another authority. In the original concept (as you might guess from the name "X.509"), the trusted authority was your telecom provider, and the certificates were furthermore intended to be a part of the global X.500 directory—a natural extension of the OSI internet model. The OSI model of the internet never gained traction, and the trusted telecom providers were replaced with trusted root CAs.

PGP, by contrast, uses a trust model that's generally known as the Web of Trust. Every user has a PGP key (containing their identity and their public key), and users can sign others' public keys. Trust generally flows from these signatures: if you trust a user, you know the keys that they sign are correct. The name "Web of Trust" comes from the vision that trust flows along the paths of signatures, building a tight web of trust.

And now for the controversial part of the post, the comparisons and critiques of these trust models. A disclaimer: I am not a security expert, although I am a programmer who revels in dreaming up arcane edge cases. I also don't use PGP at all, and use S/MIME to a very limited extent for some Mozilla work [3], although I did try a few abortive attempts to dogfood it in the past. I've attempted to replace personal experience with comprehensive research [4], but most existing critiques and comparisons of these two trust models are about 10-15 years old and predate several changes to CA certificate practices.

A basic tenet of development that I have found is that the average user is fairly ignorant. At the same time, a lot of the defense of trust models, both CAs and Web of Trust, tends to hinge on configurability. How many people, for example, know how to add or remove a CA root from Firefox, Windows, or Android? Even among the subgroup of Mozilla developers, I suspect the number of people who know how to do so are rather few. Or in the case of PGP, how many people know how to change the maximum path length? Or even understand the security implications of doing so?

Seen in the light of ignorant users, the Web of Trust is a UX disaster. Its entire security model is predicated on having users precisely specify how much they trust other people to trust others (ultimate, full, marginal, none, unknown) and also on having them continually do out-of-band verification procedures and publicly reporting those steps. In 1998, a seminal paper on the usability of a GUI for PGP encryption came to the conclusion that the UI was effectively unusable for users, to the point that only a third of the users were able to send an encrypted email (and even then, only with significant help from the test administrators), and a quarter managed to publicly announce their private keys at some point, which is pretty much the worst thing you can do. They also noted that the complex trust UI was never used by participants, although the failure of many users to get that far makes generalization dangerous [5]. While newer versions of security UI have undoubtedly fixed many of the original issues found (in no small part due to the paper, one of the first to argue that usability is integral, not orthogonal, to security), I have yet to find an actual study on the usability of the trust model itself.

The Web of Trust has other faults. The notion of "marginal" trust it turns out is rather broken: if you marginally trust a user who has two keys who both signed another person's key, that's the same as fully trusting a user with one key who signed that key. There are several proposals for different trust formulas [6], but none of them have caught on in practice to my knowledge.

A hidden fault is associated with its manner of presentation: in sharp contrast to CAs, the Web of Trust appears to not delegate trust, but any practical widespread deployment needs to solve the problem of contacting people who have had no prior contact. Combined with the need to bootstrap new users, this implies that there needs to be some keys that have signed a lot of other keys that are essentially default-trusted—in other words, a CA, a fact sometimes lost on advocates of the Web of Trust.

That said, a valid point in favor of the Web of Trust is that it more easily allows people to distrust CAs if they wish to. While I'm skeptical of its utility to a broader audience, the ability to do so for is crucial for a not-insignificant portion of the population, and it's important enough to be explicitly called out.

X.509 certificates are most commonly discussed in the context of SSL/TLS connections, so I'll discuss them in that context as well, as the implications for S/MIME are mostly the same. Almost all criticism of this trust model essentially boils down to a single complaint: certificate authorities aren't trustworthy. A historical criticism is that the addition of CAs to the main root trust stores was ad-hoc. Since then, however, the main oligopoly of these root stores (Microsoft, Apple, Google, and Mozilla) have made their policies public and clear [7]. The introduction of the CA/Browser Forum in 2005, with a collection of major CAs and the major browsers as members, and several [8] helps in articulating common policies. These policies, simplified immensely, boil down to:

  1. You must verify information (depending on certificate type). This information must be relatively recent.
  2. You must not use weak algorithms in your certificates (e.g., no MD5).
  3. You must not make certificates that are valid for too long.
  4. You must maintain revocation checking services.
  5. You must have fairly stringent physical and digital security practices and intrusion detection mechanisms.
  6. You must be [externally] audited every year that you follow the above rules.
  7. If you screw up, we can kick you out.

I'm not going to claim that this is necessarily the best policy or even that any policy can feasibly stop intrusions from happening. But it's a policy, so CAs must abide by some set of rules.

Another CA criticism is the fear that they may be suborned by national government spy agencies. I find this claim underwhelming, considering that the number of certificates acquired by intrusions that were used in the wild is larger than the number of certificates acquired by national governments that were used in the wild: 1 and 0, respectively. Yet no one complains about the untrustworthiness of CAs due to their ability to be hacked by outsiders. Another attack is that CAs are controlled by profit-seeking corporations, which misses the point because the business of CAs is not selling certificates but selling their access to the root databases. As we will see shortly, jeopardizing that access is a great way for a CA to go out of business.

To understand issues involving CAs in greater detail, there are two CAs that are particularly useful to look at. The first is CACert. CACert is favored by many by its attempt to handle X.509 certificates in a Web of Trust model, so invariably every public discussion about CACert ends up devolving into an attack on other CAs for their perceived capture by national governments or corporate interests. Yet what many of the proponents for inclusion of CACert miss (or dismiss) is the fact that CACert actually failed the required audit, and it is unlikely to ever pass an audit. This shows a central failure of both CAs and Web of Trust: different people have different definitions of "trust," and in the case of CACert, some people are favoring a subjective definition (I trust their owners because they're not evil) when an objective definition fails (in this case, that the root signing key is securely kept).

The other CA of note here is DigiNotar. In July 2011, some hackers managed to acquire a few fraudulent certificates by hacking into DigiNotar's systems. By late August, people had become aware of these certificates being used in practice [9] to intercept communications, mostly in Iran. The use appears to have been caught after Chromium updates failed due to invalid certificate fingerprints. After it became clear that the fraudulent certificates were not limited to a single fake Google certificate, and that DigiNotar had failed to notify potentially affected companies of its breach, DigiNotar was swiftly removed from all of the trust databases. It ended up declaring bankruptcy within two weeks.

DigiNotar indicates several things. One, SSL MITM attacks are not theoretical (I have seen at least two or three security experts advising pre-DigiNotar that SSL MITM attacks are "theoretical" and therefore the wrong target for security mechanisms). Two, keeping the trust of browsers is necessary for commercial operation of CAs. Three, the notion that a CA is "too big to fail" is false: DigiNotar played an important role in the Dutch community as a major CA and the operator of Staat der Nederlanden. Yet when DigiNotar screwed up and lost its trust, it was swiftly kicked out despite this role. I suspect that even Verisign could be kicked out if it manages to screw up badly enough.

This isn't to say that the CA model isn't problematic. But the source of its problems is that delegating trust isn't a feasible model in the first place, a problem that it shares with the Web of Trust as well. Different notions of what "trust" actually means and the uncertainty that gets introduced as chains of trust get longer both make delegating trust weak to both social engineering and technical engineering attacks. There appears to be an increasing consensus that the best way forward is some variant of key pinning, much akin to how SSH works: once you know someone's public key, you complain if that public key appears to change, even if it appears to be "trusted." This does leave people open to attacks on first use, and the question of what to do when you need to legitimately re-key is not easy to solve.

In short, both CAs and the Web of Trust have issues. Whether or not you should prefer S/MIME or PGP ultimately comes down to the very conscious question of how you want to deal with trust—a question without a clear, obvious answer. If I appear to be painting CAs and S/MIME in a positive light and the Web of Trust and PGP in a negative one in this post, it is more because I am trying to focus on the positions less commonly taken to balance perspective on the internet. In my next post, I'll round out the discussion on email security by explaining why email security has seen poor uptake and answering the question as to which email security protocol is most popular. The answer may surprise you!

[1] Strictly speaking, you can bypass the sender's SMTP server. In practice, this is considered a hole in the SMTP system that email providers are trying to plug.
[2] I've had 13 different connections to the internet in the same time as I've had my main email address, not counting all the public wifis that I have used. Whereas an attacker would find it extraordinarily difficult to intercept all of my SSH sessions for a MITM attack, intercepting all of my email sessions is clearly far easier if the attacker were my email provider.
[3] Before you read too much into this personal choice of S/MIME over PGP, it's entirely motivated by a simple concern: S/MIME is built into Thunderbird; PGP is not. As someone who does a lot of Thunderbird development work that could easily break the Enigmail extension locally, needing to use an extension would be disruptive to workflow.
[4] This is not to say that I don't heavily research many of my other posts, but I did go so far for this one as to actually start going through a lot of published journals in an attempt to find information.
[5] It's questionable how well the usability of a trust model UI can be measured in a lab setting, since the observer effect is particularly strong for all metrics of trust.
[6] The web of trust makes a nice graph, and graphs invite lots of interesting mathematical metrics. I've always been partial to eigenvectors of the graph, myself.
[7] Mozilla's policy for addition to NSS is basically the standard policy adopted by all open-source Linux or BSD distributions, seeing as OpenSSL never attempted to produce a root database.
[8] It looks to me that it's the browsers who are more in charge in this forum than the CAs.
[9] To my knowledge, this is the first—and so far only—attempt to actively MITM an SSL connection.

August 06, 2014 03:39 AM

May 27, 2014

Joshua Cranmer

Why email is hard, part 6: today's email security

This post is part 6 of an intermittent series exploring the difficulties of writing an email client. Part 1 describes a brief history of the infrastructure. Part 2 discusses internationalization. Part 3 discusses MIME. Part 4 discusses email addresses. Part 5 discusses the more general problem of email headers. This part discusses how email security works in practice.

Email security is a rather wide-ranging topic, and one that I've wanted to cover for some time, well before several recent events that have made it come up in the wider public knowledge. There is no way I can hope to cover it in a single post (I think it would outpace even the length of my internationalization discussion), and there are definitely parts for which I am underqualified, as I am by no means an expert in cryptography. Instead, I will be discussing this over the course of several posts of which this is but the first; to ease up on the amount of background explanation, I will assume passing familiarity with cryptographic concepts like public keys, hash functions, as well as knowing what SSL and SSH are (though not necessarily how they work). If you don't have that knowledge, ask Wikipedia.

Before discussing how email security works, it is first necessary to ask what email security actually means. Unfortunately, the layman's interpretation is likely going to differ from the actual precise definition. Security is often treated by laymen as a boolean interpretation: something is either secure or insecure. The most prevalent model of security to people is SSL connections: these allow the establishment of a communication channel whose contents are secret to outside observers while also guaranteeing to the client the authenticity of the server. The server often then gets authenticity of the client via a more normal authentication scheme (i.e., the client sends a username and password). Thus there is, at the end, a channel that has both secrecy and authenticity [1]: channels with both of these are considered secure and channels without these are considered insecure [2].

In email, the situation becomes more difficult. Whereas an SSL connection is between a client and a server, the architecture of email is such that email providers must be considered as distinct entities from end users. In addition, messages can be sent from one person to multiple parties. Thus secure email is a more complex undertaking than just porting relevant details of SSL. There are two major cryptographic implementations of secure email [3]: S/MIME and PGP. In terms of implementation, they are basically the same [4], although PGP has an extra mode which wraps general ASCII (known as "ASCII-armor"), which I have been led to believe is less recommended these days. Since I know the S/MIME specifications better, I'll refer specifically to how S/MIME works.

S/MIME defines two main MIME types: multipart/signed, which contains the message text as a subpart followed by data indicating the cryptographic signature, and application/pkcs7-mime, which contains an encrypted MIME part. The important things to note about this delineation are that only the body data is encrypted [5], that it's theoretically possible to encrypt only part of a message's body, and that the signing and encryption constitute different steps. These factors combine to make for a potentially infuriating UI setup.

How does S/MIME tackle the challenges of encrypting email? First, rather than encrypting using recipients' public keys, the message is encrypted with a symmetric key. This symmetric key is then encrypted with each of the recipients' keys and then attached to the message. Second, by only signing or encrypting the body of the message, the transit headers are kept intact for the mail system to retain its ability to route, process, and deliver the message. The body is supposed to be prepared in the "safest" form before transit to avoid intermediate routers munging the contents. Finally, to actually ascertain what the recipients' public keys are, clients typically passively pull the information from signed emails. LDAP, unsurprisingly, contains an entry for a user's public key certificate, which could be useful in large enterprise deployments. There is also work ongoing right now to publish keys via DNS and DANE.

I mentioned before that S/MIME's use can present some interesting UI design decisions. I ended up actually testing some common email clients on how they handled S/MIME messages: Thunderbird, Apple Mail, Outlook [6], and Evolution. In my attempts to create a surreptitious signed part to confuse the UI, Outlook decided that the message had no body at all, and Thunderbird decided to ignore all indication of the existence of said part. Apple Mail managed to claim the message was signed in one of these scenarios, and Evolution took the cake by always agreeing that the message was signed [7]. It didn't even bother questioning the signature if the certificate's identity disagreed with the easily-spoofable From address. I was actually surprised by how well people did in my tests—I expected far more confusion among clients, particularly since the will to maintain S/MIME has clearly been relatively low, judging by poor support for "new" features such as triple-wrapping or header protection.

Another fault of S/MIME's design is that it makes the mistaken belief that composing a signing step and an encryption step is equivalent in strength to a simultaneous sign-and-encrypt. Another page describes this in far better detail than I have room to; note that this flaw is fixed via triple-wrapping (which has relatively poor support). This creates yet more UI burden into how to adequately describe in UI all the various minutiae in differing security guarantees. Considering that users already have a hard time even understanding that just because a message says it's from example@isp.invalid doesn't actually mean it's from example@isp.invalid, trying to develop UI that both adequately expresses the security issues and is understandable to end-users is an extreme challenge.

What we have in S/MIME (and PGP) is a system that allows for strong guarantees, if certain conditions are met, yet is also vulnerable to breaches of security if the message handling subsystems are poorly designed. Hopefully this is a sufficient guide to the technical impacts of secure email in the email world. My next post will discuss the most critical component of secure email: the trust model. After that, I will discuss why secure email has seen poor uptake and other relevant concerns on the future of email security.

[1] This is a bit of a lie: a channel that does secrecy and authentication at different times isn't as secure as one that does them at the same time.
[2] It is worth noting that authenticity is, in many respects, necessary to achieve secrecy.
[3] This, too, is a bit of a lie. More on this in a subsequent post.
[4] I'm very aware that S/MIME and PGP use radically different trust models. Trust models will be covered later.
[5] S/MIME 3.0 did add a provision stating that if the signed/encrypted part is a message/rfc822 part, the headers of that part should override the outer message's headers. However, I am not aware of a major email client that actually handles these kind of messages gracefully.
[6] Actually, I tested Windows Live Mail instead of Outlook, but given the presence of an official MIME-to-Microsoft's-internal-message-format document which seems to agree with what Windows Live Mail was doing, I figure their output would be identical.
[7] On a more careful examination after the fact, it appears that Evolution may have tried to indicate signedness on a part-by-part basis, but the UI was sufficiently confusing that ordinary users are going to be easily confused.

May 27, 2014 12:32 AM

April 05, 2014

Joshua Cranmer

Announcing jsmime 0.2

Previously, I've been developing JSMime as a subdirectory within comm-central. However, after discussions with other developers, I have moved the official repository of record for JSMime to its own repository, now found on GitHub. The repository has been cleaned up and the feature set for version 0.2 has been selected, so that the current tip on JSMime (also the initial version) is version 0.2. This contains the feature set I imported into Thunderbird's source code last night, which is to say support for parsing MIME messages into the MIME tree, as well as support for parsing and encoding email address headers.

Thunderbird doesn't actually use the new code quite yet (as my current tree is stuck on a mozilla-central build error, so I haven't had time to run those patches through a last minute sanity check before requesting review), but the intent is to replace the current C++ implementations of nsIMsgHeaderParser and nsIMimeConverter with JSMime instead. Once those are done, I will be moving forward with my structured header plans which more or less ought to make those interfaces obsolete.

Within JSMime itself, the pieces which I will be working on next will be rounding out the implementation of header parsing and encoding support (I have prototypes for Date headers and the infernal RFC 2231 encoding that Content-Disposition needs), as well as support for building MIME messages from their constituent parts (a feature which would be greatly appreciated in the depths of compose and import in Thunderbird). I also want to implement full IDN and EAI support, but that's hampered by the lack of a JS implementation I can use for IDN (yes, there's punycode.js, but that doesn't do StringPrep). The important task of converting the MIME tree to a list of body parts and attachments is something I do want to work on as well, but I've vacillated on the implementation here several times and I'm not sure I've found one I like yet.

JSMime, as its name implies, tries to work in as pure JS as possible, augmented with several web APIs as necessary (such as TextDecoder for charset decoding). I'm using ES6 as the base here, because it gives me several features I consider invaluable for implementing JavaScript: Promises, Map, generators, let. This means it can run on an unprivileged web page—I test JSMime using Firefox nightlies and the Firefox debugger where necessary. Unfortunately, it only really works in Firefox at the moment because V8 doesn't support many ES6 features yet (such as destructuring, which is annoying but simple enough to work around, or Map iteration, which is completely necessary for the code). I'm not opposed to changing it to make it work on Node.js or Chrome, but I don't realistically have the time to spend doing it myself; if someone else has the time, please feel free to contact me or send patches.

April 05, 2014 05:18 PM

April 03, 2014

Brendan Eich

The Next Mission

Slides for the brief talk that I gave at a Harvard seminar on privacy and user data organized by John Taysom last week.

My talk was really more about the “network problem” than the “protocol problem”. Networks breed first- and second-mover winners and others path-dependent powers, until the next disruption. Users or rather their data get captured.

Privacy is only one concern among several, including how to realize economic value for many-yet-individually-weak users, not just for data-store/service owners or third parties. Can we do better with client-side and private-cloud tiers, zero-knowledge proofs and protocols, or other ideas?

In the end, I asked these four questions:

  1. Can a browser/OS “unionize its users” to gain bargaining power vs. net super-powers?
  2. To create a data commons with “API to me” and aggregated/clustered economics?
  3. Open the walled gardens to put users first?
  4. Still be usable and private-enough for most?

I think the answer is yes, but I’m not sure who will do this work. It is vitally important.

I may get to it, but not working at Mozilla. I’ve resigned as CEO and I’m leaving Mozilla to take a rest, take some trips with my family, look at problems from other angles, and see if the “network problem” has a solution that doesn’t require scaling up to hundreds of millions of users and winning their trust while somehow covering costs. That’s a rare, hard thing, which I’m proud to have done with Firefox at Mozilla.

I encourage all Mozillians to keep going. Firefox OS is even more daunting, and more important. Thanks indeed to all who have supported me, and to all my colleagues over the years, at Mozilla, in standards bodies, and at conferences around the world. I will be less visible online, but still around.

/be

April 03, 2014 11:12 PM

Joshua Cranmer

If you want fast code, don't use assembly

…unless you're an expert at assembly, that is. The title of this post was obviously meant to be an attention-grabber, but it is much truer than you might think: poorly-written assembly code will probably be slower than an optimizing compiler on well-written code (note that you may need to help the compiler along for things like vectorization). Now why is this?

Modern microarchitectures are incredibly complex. A modern x86 processor will be superscalar and use some form of compilation to microcode to do that. Desktop processors will undoubtedly have multiple instruction issues per cycle, forms of register renaming, branch predictors, etc. Minor changes—a misaligned instruction stream, a poor order of instructions, a bad instruction choice—could kill the ability to take advantages of these features. There are very few people who could accurately predict the performance of a given assembly stream (I myself wouldn't attempt it if the architecture can take advantage of ILP), and these people are disproportionately likely to be working on compiler optimizations. So unless you're knowledgeable enough about assembly to help work on a compiler, you probably shouldn't be hand-coding assembly to make code faster.

To give an example to elucidate this point (and the motivation for this blog post in the first place), I was given a link to an implementation of the N-queens problem in assembly. For various reasons, I decided to use this to start building a fine-grained performance measurement system. This system uses a high-resolution monotonic clock on Linux and runs the function 1000 times to warm up caches and counters and then runs the function 1000 more times, measuring each run independently and reporting the average runtime at the end. This is a single execution of the system; 20 executions of the system were used as the baseline for a t-test to determine statistical significance as well as visual estimation of normality of data. Since the runs observed about a constant 1-2 μs of noise, I ran all of my numbers on the 10-queens problem to better separate the data (total runtimes ended up being in the range of 200-300μs at this level). When I say that some versions are faster, the p-values for individual measurements are on the order of 10-20—meaning that there is a 1-in-100,000,000,000,000,000,000 chance that the observed speedups could be produced if the programs take the same amount of time to run.

The initial assembly version of the program took about 288μs to run. The first C++ version I coded, originating from the same genesis algorithm that the author of the assembly version used, ran in 275μs. A recursive program beat out a hand-written assembly block of code... and when I manually converted the recursive program into a single loop, the runtime improved to 249μs. It wasn't until I got rid of all of the assembly in the original code that I could get the program to beat the derecursified code (at 244μs)—so it's not the vectorization that's causing the code to be slow. Intrigued, I started to analyze why the original assembly was so slow.

It turns out that there are three main things that I think cause the slow speed of the original code. The first one is alignment of branches: the assembly code contains no instructions to align basic blocks on particular branches, whereas gcc happily emits these for some basic blocks. I mention this first as it is mere conjecture; I never made an attempt to measure the effects for myself. The other two causes are directly measured from observing runtime changes as I slowly replaced the assembly with code. When I replaced the use of push and pop instructions with a global static array, the runtime improved dramatically. This suggests that the alignment of the stack could be to blame (although the stack is still 8-byte aligned when I checked via gdb), which just goes to show you how much alignments really do matter in code.

The final, and by far most dramatic, effect I saw involves the use of three assembly instructions: bsf (find the index of the lowest bit that is set), btc (clear a specific bit index), and shl (left shift). When I replaced the use of these instructions with a more complicated expression int bit = x & -x and x = x - bit, the program's speed improved dramatically. And the rationale for why the speed improved won't be found in latency tables, although those will tell you that bsf is not a 1-cycle operation. Rather, it's in minutiae that's not immediately obvious.

The original program used the fact that bsf sets the zero flag if the input register is 0 as the condition to do the backtracking; the converted code just checked if the value was 0 (using a simple test instruction). The compare and the jump instructions are basically converted into a single instruction in the processor. In contrast, the bsf does not get to do this; combined with the lower latency of the instruction intrinsically, it means that empty loops take a lot longer to do nothing. The use of an 8-bit shift value is also interesting, as there is a rather severe penalty for using 8-bit registers in Intel processors as far as I can see.

Now, this isn't to say that the compiler will always produce the best code by itself. My final code wasn't above using x86 intrinsics for the vector instructions. Replacing the _mm_andnot_si128 intrinsic with an actual and-not on vectors caused gcc to use other, slower instructions instead of the vmovq to move the result out of the SSE registers for reasons I don't particularly want to track down. The use of the _mm_blend_epi16 and _mm_srli_si128 intrinsics can probably be replaced with __builtin_shuffle instead for more portability, but I was under the misapprehension that this was a clang-only intrinsic when I first played with the code so I never bothered to try that, and this code has passed out of my memory long enough that I don't want to try to mess with it now.

In short, compilers know things about optimizing for modern architectures that many general programmers don't. Compilers may have issues with autovectorization, but the existence of vector intrinsics allow you to force compilers to use vectorization while still giving them leeway to make decisions about instruction scheduling or code alignment which are easy to screw up in hand-written assembly. Also, compilers are liable to get better in the future, whereas hand-written assembly code is unlikely to get faster in the future. So only write assembly code if you really know what you're doing and you know you're better than the compiler.

April 03, 2014 04:52 PM

March 26, 2014

Brendan Eich

Inclusiveness at Mozilla

I am deeply honored and humbled by the CEO role. I’m also grateful for the messages of support. At the same time, I know there are concerns about my commitment to fostering equality and welcome for LGBT individuals at Mozilla. I hope to lay those concerns to rest, first by making a set of commitments to you. More important, I want to lay them to rest by actions and results.

A number of Mozillians, including LGBT individuals and allies, have stepped forward to offer guidance and assistance in this. I cannot thank you enough, and I ask for your ongoing help to make Mozilla a place of equality and welcome for all. Here are my commitments, and here’s what you can expect:

I know some will be skeptical about this, and that words alone will not change anything. I can only ask for your support to have the time to “show, not tell”; and in the meantime express my sorrow at having caused pain.

Mozilla is a movement composed of different people around the world, working productively together on a common mission. This is important to our ability to work and grow around the world.

Many Mozillians and others know me as a colleague or a friend. They know that I take people as they come and work with anyone willing to contribute. At the same time, I don’t ask for trust free of context, or without a solid structure to support accountability. No leader or person who has a privileged position should. I want to be held accountable for what I do as CEO. I fully expect you all to do so.

I am committed to ensuring that Mozilla is, and will remain, a place that includes and supports everyone, regardless of sexual orientation, gender identity, age, race, ethnicity, economic status, or religion.

You will see exemplary behavior from me toward everyone in our community, no matter who they are; and the same toward all those whom we hope will join, and for those who use our products. Mozilla’s inclusive health benefits policies will not regress in any way. And I will not tolerate behavior among community members that violates our Community Participation Guidelines or (for employees) our inclusive and non-discriminatory employment policies.

You’ll also see more from Mozilla under my leadership in the way of efforts to include potential contributors, especially those who lack privilege. This entails several projects, starting with Project Ascend, which is being developed by Lukas Blakk. I intend to demonstrate with meaningful action my commitment to a Mozilla that lives up to its ideals, including that of being an open and inclusive community.

/be

March 26, 2014 06:57 PM

March 24, 2014

Brendan Eich

Mozilla News

A quick note to update everyone on Mozilla news. Our Board of Directors has appointed me CEO of Mozilla, with immediate effect. I’m honored and humbled, and I promise to do everything I can to lead Mozilla to new heights in this role.

I would first like to thank Jay Sullivan for his contributions to Mozilla and to the Web. He has been a passionate force at Mozilla whose leadership, especially during the last year, has been important to our success, in particular with Firefox OS. Jay is helping with the CEO transition and will then leave to pursue new opportunities.

My co-founder and 15-year partner in Mozilla, Mitchell Baker, remains active as Executive Chairwoman of Mozilla. I could not do what I do for Mozilla without Mitchell, and I like to think she feels the same way about me ;-). We have worked together well since she took on management of the tiny mozilla.org staff fragment embedded in Netscape. At that time I was “acting manager” (more like method acting manager :-P). I’ve learned a lot from Mitchell and my other peers at Mozilla about management since then!

Mozilla is about people-power on the Web and Internet — putting individual users, who create as well as consume, above all other agendas. In this light, people-fu trumps my first love, which you might say is math-fu, code-fu or tech-fu (if I may appropriate the second syllable from kung fu). People around the world are our ultimate cause at Mozilla, as well as source of inspiration and ongoing help doing what we do.

Speaking of people a bit more, I’ll take this moment to introduce Li Gong as my incoming COO. Li set up Mozilla China and our Taipei office, and he has been a crucial partner in building up Firefox OS. If you don’t know him yet, you will probably get a chance if you pass through our headquarters, as Li will be moving back to the US to help manage here.

Mozilla remains a global public benefit organization, so I’m sure I will see all of you more as I travel: to all of our offices (I have not yet been to Beijing or Taipei), to the places where we are bringing Firefox OS and the $25 smartphone, and everywhere Mozillians, developers, and others are working to make the Web better for everyone.

/be

March 24, 2014 05:11 PM

March 14, 2014

Joshua Cranmer

Understanding email charsets

Several years ago, I embarked on a project to collect the headers of all the messages I could reach on NNTP, with the original intent of studying the progression of the most common news clients. More recently, I used this dataset to attempt to discover the prevalence of charsets in email messages. In doing so, I identified a critical problem with the dataset: since it only contains headers, there is very little scope for actually understanding the full, sad story of charsets. So I've decided to rectify this problem.

This time, I modified my data-collection scripts to make it much easier to mass-download NNTP messages. The first script effectively lists all the newsgroups, and then all the message IDs in those newsgroups, stuffing the results in a set to remove duplicates (cross-posts). The second script uses Python's nntplib package to attempt to download all of those messages. Of the 32,598,261 messages identified by the first set, I succeeded in obtaining 1,025,586 messages in full or in part. Some messages failed to download due to crashing nntplib (which appears to be unable to handle messages of unbounded length), and I suspect my newsserver connections may have just timed out in the middle of the download at times. Others failed due to expiring before I could download them. All in all, 19,288 messages were not downloaded.

Analysis of the contents of messages were hampered due to a strong desire to find techniques that could mangle messages as little as possible. Prior experience with Python's message-parsing libraries lend me to believe that they are rather poor at handling some of the crap that comes into existence, and the errors in nntplib suggest they haven't fixed them yet. The only message parsing framework I truly trust to give me the level of finess is the JSMime that I'm writing, but that happens to be in the wrong language for this project. After reading some blog posts of Jeffrey Stedfast, though, I decided I would give GMime a try instead of trying to rewrite ad-hoc MIME parser #N.

Ultimately, I wrote a program to investigate the following questions on how messages operate in practice:

While those were the questions I seeked the answers to originally, I did come up with others as I worked on my tool, some in part due to what information I was basically already collecting. The tool I wrote primarily uses GMime to convert the body parts to 8-bit text (no charset conversion), as well as parse the Content-Type headers, which are really annoying to do without writing a full parser. I used ICU to handle charset conversion and detection. RFC 2047 decoding is done largely by hand since I needed very specific information that I couldn't convince GMime to give me. All code that I used is available upon request; the exact dataset is harder to transport, given that it is some 5.6GiB of data.

Other than GMime being built on GObject and exposing a C API, I can't complain much, although I didn't try to use it to do magic. Then again, in my experience (and as this post will probably convince you as well), you really want your MIME library to do charset magic for you, so in doing well for my needs, it's actually not doing well for a larger audience. ICU's C API similarly makes me want to complain. However, I'm now very suspect of the quality of its charset detection code, which is the main reason I used it. Trying to figure out how to get it to handle the charset decoding errors also proved far more annoying than it really should.

Some final background regards the biases I expect to crop up in the dataset. As the approximately 1 million messages were drawn from the python set iterator, I suspect that there's no systematic bias towards or away from specific groups, excepting that the ~11K messages found in the eternal-september.* hierarchy are completely represented. The newsserver I used, Eternal September, has a respectably large set of newsgroups, although it is likely to be biased towards European languages and under-representing East Asians. The less well-connected South America, Africa, or central Asia are going to be almost completely unrepresented. The download process will be biased away towards particularly heinous messages (such as exceedingly long lines), since nntplib itself is failing.

This being news messages, I also expect that use of 8-bit will be far more common than would be the case in regular mail messages. On a related note, the use of 8-bit in headers would be commensurately elevated compared to normal email. What would be far less common is HTML. I also expect that undeclared charsets may be slightly higher.

Charsets

Charset data is mostly collected on the basis of individual body parts within body messages; some messages have more than one. Interestingly enough, the 1,025,587 messages yielded 1,016,765 body parts with some text data, which indicates that either the messages on the server had only headers in the first place or the download process somehow managed to only grab the headers. There were also 393 messages that I identified having parts with different charsets, which only further illustrates how annoying charsets are in messages.

The aliases in charsets are mostly uninteresting in variance, except for the various labels used for US-ASCII (us - ascii, 646, and ANSI_X3.4-1968 are the less-well-known aliases), as well as the list of charsets whose names ICU was incapable of recognizing, given below. Unknown charsets are treated as equivalent to undeclared charsets in further processing, as there were too few to merit separate handling (45 in all).

For the next step, I used ICU to attempt to detect the actual charset of the body parts. ICU's charset detector doesn't support the full gamut of charsets, though, so charset names not claimed to be detected were instead processed by checking if they decoded without error. Before using this detection, I detect if the text is pure ASCII (excluding control characters, to enable charsets like ISO-2022-JP, and +, if the charset we're trying to check is UTF-7). ICU has a mode which ignores all text in things that look like HTML tags, and this mode is set for all HTML body parts.

I don't quite believe ICU's charset detection results, so I've collapsed the results into a simpler table to capture the most salient feature. The correct column indicates the cases where the detected result was the declared charset. The ASCII column captures the fraction which were pure ASCII. The UTF-8 column indicates if ICU reported that the text was UTF-8 (it always seems to try this first). The Wrong C1 column refers to an ISO-8859-1 text being detected as windows-1252 or vice versa, which is set by ICU if it sees or doesn't see an octet in the appropriate range. The other column refers to all other cases, including invalid cases for charsets not supported by ICU.

DeclaredCorrectASCIIUTF-8 Wrong C1OtherTotal
ISO-8859-1230,526225,6678838,1191,035466,230
Undeclared148,0541,11637,626186,796
UTF-875,67437,6001,551114,825
US-ASCII98,238030498,542
ISO-8859-1567,52918,527086,056
windows-125221,4144,3701543,31913029,387
ISO-8859-218,6472,13870712,31923,245
KOI8-R4,61642421,1126,154
GB23121,3075901121,478
Big562260801741,404
windows-125634310045398
IBM437842570341
ISO-8859-1331160317
windows-125113197161290
windows-12506969014101253
ISO-8859-7262600131183
ISO-8859-9127110017155
ISO-2022-JP766903148
macintosh67570124
ISO-8859-16015101116
UTF-7514055
x-mac-croatian0132538
KOI8-U282030
windows-125501800624
ISO-8859-4230023
EUC-KR0301619
ISO-8859-14144018
GB180301430017
ISO-8859-800001616
TIS-620150015
Shift_JIS840113
ISO-8859-391111
ISO-8859-10100010
KSC_56013609
GBK4206
windows-1253030025
ISO-8859-510034
IBM8500404
windows-12570303
ISO-2022-JP-22002
ISO-8859-601001
Total421,751536,3732,22611,52344,8921,016,765

The most obvious thing shown in this table is that the most common charsets remain ISO-8859-1, Windows-1252, US-ASCII, UTF-8, and ISO-8859-15, which is to be expected, given an expected prior bias to European languages in newsgroups. The low prevalence of ISO-2022-JP is surprising to me: it means a lower incidence of Japanese than I would have expected. Either that, or Japanese have switched to UTF-8 en masse, which I consider very unlikely given that Japanese have tended to resist the trend towards UTF-8 the most.

Beyond that, this dataset has caused me to lose trust in the ICU charset detectors. KOI8-R is recorded as being 18% malformed text, with most of that ICU believing to be ISO-8859-1 instead. Judging from the results, it appears that ICU has a bias towards guessing ISO-8859-1, which means I don't believe the numbers in the Other column to be accurate at all. For some reason, I don't appear to have decoders for ISO-8859-16 or x-mac-croatian on my local machine, but running some tests by hand appear to indicate that they are valid and not incorrect.

Somewhere between 0.1% and 1.0% of all messages are subject to mojibake, depending on how much you trust the charset detector. The cases of UTF-8 being misdetected as non-UTF-8 could potentially be explained by having very few non-ASCII sequences (ICU requires four valid sequences before it confidently declares text UTF-8); someone who writes a post in English but has a non-ASCII signature (such as myself) could easily fall into this category. Despite this, however, it does suggest that there is enough mojibake around that users need to be able to override charset decisions.

The undeclared charsets are described, in descending order of popularity, by ISO-8859-1, Windows-1252, KOI8-R, ISO-8859-2, and UTF-8, describing 99% of all non-ASCII undeclared data. ISO-8859-1 and Windows-1252 are probably over-counted here, but the interesting tidbit is that KOI8-R is used half as much undeclared as it is declared, and I suspect it may be undercounted. The practice of using locale-default fallbacks that Thunderbird has been using appears to be the best way forward for now, although UTF-8 is growing enough in popularity that using a specialized detector that decodes as UTF-8 if possible may be worth investigating (3% of all non-ASCII, undeclared messages are UTF-8).

HTML

Unsuprisingly (considering I'm polling newsgroups), very few messages contained any HTML parts at all: there were only 1,032 parts in the total sample size, of which only 552 had non-ASCII characters and were therefore useful for the rest of this analysis. This means that I'm skeptical of generalizing the results of this to email in general, but I'll still summarize the findings.

HTML, unlike plain text, contains a mechanism to explicitly identify the charset of a message. The official algorithm for determining the charset of an HTML file can be described simply as "look for a <meta> tag in the first 1024 bytes. If it can be found, attempt to extract a charset using one of several different techniques depending on what's present or not." Since doing this fully properly is complicated in library-less C++ code, I opted to look first for a <meta[ \t\r\n\f] production, guess the extent of the tag, and try to find a charset= string somewhere in that tag. This appears to be an approach which is more reflective of how this parsing is actually done in email clients than the proper HTML algorithm. One difference is that my regular expressions also support the newer <meta charset="UTF-8"/> construct, although I don't appear to see any use of this.

I found only 332 parts where the HTML declared a charset. Only 22 parts had a case where both a MIME charset and an HTML charset and the two disagreed with each other. I neglected to count how many messages had HTML charsets but no MIME charsets, but random sampling appeared to indicate that this is very rare on the data set (the same order of magnitude or less as those where they disagreed).

As for the question of who wins: of the 552 non-ASCII HTML parts, only 71 messages did not have the MIME type be the valid charset. Then again, 71 messages did not have the HTML type be valid either, which strongly suggests that ICU was detecting the incorrect charset. Judging from manual inspection of such messages, it appears that the MIME charset ought to be preferred if it exists. There are also a large number of HTML charset specifications saying unicode, which ICU treats as UTF-16, which is most certainly wrong.

Headers

In the data set, 1,025,856 header blocks were processed for the following statistics. This is slightly more than the number of messages since the headers of contained message/rfc822 parts were also processed. The good news is that 97% (996,103) headers were completely ASCII. Of the remaining 29,753 headers, 3.6% (1,058) were UTF-8 and 43.6% (12,965) matched the declared charset of the first body part. This leaves 52.9% (15,730) that did not match that charset, however.

Now, NNTP messages can generally be expected to have a higher 8-bit header ratio, so this is probably exaggerating the setup in most email messages. That said, the high incidence is definitely an indicator that even non-EAI-aware clients and servers cannot blindly presume that headers are 7-bit, nor can EAI-aware clients and servers presume that 8-bit headers are UTF-8. The high incidence of mismatching the declared charset suggests that fallback-charset decoding of headers is a necessary step.

RFC 2047 encoded-words is also an interesting statistic to mine. I found 135,951 encoded-words in the data set, which is rather low, considering that messages can be reasonably expected to carry more than one encoded-word. This is likely an artifact of NNTP's tendency towards 8-bit instead of 7-bit communication and understates their presence in regular email.

Counting encoded-words can be difficult, since there is a mechanism to let them continue in multiple pieces. For the purposes of this count, a sequence of such words count as a single word, and I indicate the number of them that had more than one element in a sequence in the Continued column. The 2047 Violation column counts the number of sequences where decoding words individually does not yield the same result as decoding them as a whole, in violation of RFC 2047. The Only ASCII column counts those words containing nothing but ASCII symbols and where the encoding was thus (mostly) pointless. The Invalid column counts the number of sequences that had a decoder error.

CharsetCountContinued2047 ViolationOnly ASCIIInvalid
ISO-8859-156,35515,6104990
UTF-836,56314,2163,3112,7049,765
ISO-8859-1520,6995,695400
ISO-8859-211,2472,66990
windows-12525,1743,075260
KOI8-R3,5231,203120
windows-125676556800
Big551146280171
ISO-8859-71652603
windows-12511573020
GB2312126356051
ISO-2022-JP10285049
ISO-8859-13784500
ISO-8859-9762100
ISO-8859-471200
windows-1250682100
ISO-8859-5662000
US-ASCII3810380
TIS-620363400
KOI8-U251100
ISO-8859-16221022
UTF-7172183
EUC-KR174409
x-mac-croatian103010
Shift_JIS80003
Unknown7207
ISO-2022-KR70000
GB1803061001
windows-12554000
ISO-8859-143000
ISO-8859-32100
GBK20002
ISO-8859-61100
Total135,95143,3603,3613,33810,096

This table somewhat mirrors the distribution of regular charsets, with one major class of differences: charsets that represent non-Latin scripts (particularly Asian scripts) appear to be overdistributed compared to their corresponding use in body parts. The exception to this rule is GB2312 which is far lower than relative rankings would presume—I attribute this to people using GB2312 being more likely to use 8-bit headers instead of RFC 2047 encoding, although I don't have direct evidence.

Clearly continuations are common, which is to be relatively expected. The sad part is how few people bother to try to adhere to the specification here: out of 14,312 continuations in languages that could violate the specification, 23.5% of them violated the specification. The mode-shifting versions (ISO-2022-JP and EUC-KR) are basically all violated, which suggests that no one bothered to check if their encoder "returns to ASCII" at the end of the word (I know Thunderbird's does, but the other ones I checked don't appear to).

The number of invalid UTF-8 decoded words, 26.7%, seems impossibly high to me. A brief check of my code indicates that this is working incorrectly in the face of invalid continuations, which certainly exaggerates the effect but still leaves a value too high for my tastes. Of more note are the elevated counts for the East Asian charsets: Big5, GB2312, and ISO-2022-JP. I am not an expert in charsets, but I belive that Big5 and GB2312 in particular are a family of almost-but-not-quite-identical charsets and it may be that ICU is choosing the wrong candidate of each family for these instances.

There is a surprisingly large number of encoded words that encode only ASCII. When searching specifically for the ones that use the US-ASCII charset, I found that these can be divided into three categories. One set comes from a few people who apparently have an unsanitized whitespace (space and LF were the two I recall seeing) in the display name, producing encoded words like =?us-ascii?Q?=09Edward_Rosten?=. Blame 40tude Dialog here. Another set encodes some basic characters (most commonly = and ?, although a few other interpreted characters popped up). The final set of errors were double-encoded words, such as =?us-ascii?Q?=3D=3FUTF-8=3FQ=3Ff=3DC3=3DBCr=3F=3D?=, which appear to be all generated by an Emacs-based newsreader.

One interesting thing when sifting the results is finding the crap that people produce in their tools. By far the worst single instance of an RFC 2047 encoded-word that I found is this one: Subject: Re: [Kitchen Nightmares] Meow! Gordon Ramsay Is =?ISO-8859-1?B?UEgR lqZ VuIEhlYWQgVH rbGeOIFNob BJc RP2JzZXNzZW?= With My =?ISO-8859-1?B?SHVzYmFuZ JzX0JhbGxzL JfU2F5c19BbXiScw==?= Baking Company Owner (complete with embedded spaces), discovered by crashing my ad-hoc base64 decoder (due to the spaces). The interesting thing is that even after investigating the output encoding, it doesn't look like the text is actually correct ISO-8859-1... or any obvious charset for that matter.

I looked at the unknown charsets by hand. Most of them were actually empty charsets (looked like =??B?Sy4gSC4gdm9uIFLDvGRlbg==?=), and all but one of the outright empty ones were generated by KNode and really UTF-8. The other one was a Windows-1252 generated by a minor newsreader.

Another important aspect of headers is how to handle 8-bit headers. RFC 5322 blindly hopes that headers are pure ASCII, while RFC 6532 dictates that they are UTF-8. Indeed, 97% of headers are ASCII, leaving just 29,753 headers that are not. Of these, only 1,058 (3.6%) are UTF-8 per RFC 6532. Deducing which charset they are is difficult because the large amount of English text for header names and the important control values will greatly skew any charset detector, and there is too little text to give a charset detector confidence. The only metric I could easily apply was testing Thunderbird's heuristic as "the header blocks are the same charset as the message contents"—which only worked 45.2% of the time.

Encodings

While developing an earlier version of my scanning program, I was intrigued to know how often various content transfer encodings were used. I found 1,028,971 parts in all (1,027,474 of which are text parts). The transfer encoding of binary did manage to sneak in, with 57 such parts. Using 8-bit text was very popular, at 381,223 samples, second only to 7-bit at 496,114 samples. Quoted-printable had 144,932 samples and base64 only 6,640 samples. Extremely interesting are the presence of 4 illegal transfer encodings in 5 messages, two of them obvious typos and the others appearing to be a client mangling header continuations into the transfer-encoding.

Conclusions

So, drawing from the body of this data, I would like to make the following conclusions as to using charsets in mail messages:

  1. Have a fallback charset. Undeclared charsets are extremely common, and I'm skeptical that charset detectors are going to get this stuff right, particularly since email can more naturally combine multiple languages than other bodies of text (think signatures). Thunderbird currently uses a locale-dependent fallback charset, which roughly mirrors what Firefox and I think most web browsers do.
  2. Let users override charsets when reading. On a similar token, mojibake text, while not particularly common, is common enough to make declared charsets sometimes unreliable. It's also possible that the fallback charset is wrong, so users may need to override the chosen charset.
  3. Testing is mandatory. In this set of messages, I found base64 encoded words with spaces in them, encoded words without charsets (even UNKNOWN-8BIT), and clearly invalid Content-Transfer-Encodings. Real email messages that are flagrantly in violation of basic spec requirements exist, so you should make sure that your email parser and client can handle the weirdest edge cases.
  4. Non-UTF-8, non-ASCII headers exist. EAI not withstanding, 8-bit headers are a reality. Combined with a predilection for saying ASCII when text is really ASCII, this means that there is often no good in-band information to tell you what charset is correct for headers, so you have to go back to a fallback charset.
  5. US-ASCII really means ASCII. Email clients appear to do a very good job of only emitting US-ASCII as a charset label if it's US-ASCII. The sample size is too small for me to grasp what charset 8-bit characters should imply in US-ASCII.
  6. Know your decoders. ISO-8859-1 actually means Windows-1252 in practice. Big5 and GB1232 are actually small families of charsets with slightly different meanings. ICU notably disagrees with some of these realities, so be sure to include in your tests various charset edge cases so you know that the decoders are correct.
  7. UTF-7 is still relevant. Of the charsets I found not mentioned in the WHATWG encoding spec, IBM437 and x-mac-croatian are in use only due to specific circumstances that limit their generalizable presence. IBM850 is too rare. UTF-7 is common enough that you need to actually worry about it, as abominable and evil a charset it is.
  8. HTML charsets may matter—but MIME matters more. I don't have enough data to say if charsets declared in HTML are needed to do proper decoding. I do have enough to say fairly conclusively that the MIME charset declaration is authoritative if HTML disagrees.
  9. Charsets are not languages. The entire reason x-mac-croatian is used at all can be traced to Thunderbird displaying the charset as "Croatian," despite it being pretty clearly not a preferred charset. Similarly most charsets are often enough ASCII that, say, an instance of GB2312 is a poor indicator of whether or not the message is in English. Anyone trying to filter based on charsets is doing a really, really stupid thing.
  10. RFCs reflect an ideal world, not reality. This is most notable in RFC 2047: the specification may state that encoded words are supposed to be independently decodable, but the evidence is pretty clear that more clients break this rule than uphold it.
  11. Limit the charsets you support. Just because your library lets you emit a hundred charsets doesn't mean that you should let someone try to do it. You should emit US-ASCII or UTF-8 unless you have a really compelling reason not to, and those compelling reasons don't require obscure charsets. Some particularly annoying charsets should never be written: EBCDIC is already basically dead on the web, and I'd like to see UTF-7 die as well.

When I have time, I'm planning on taking some of the more egregious or interesting messages in my dataset and packaging them into a database of emails to help create testsuites on handling messages properly.

March 14, 2014 04:17 AM

March 12, 2014

Brendan Eich

The Web at 25

The World Wide Web is 25 years old today.

The Web is a big deal (as is the Internet on which it is built), I don’t need to tell you! But I did have a few thoughts, solicited by a friend who asked “where [do] you think the future of the Internet will take us in the next 25 years?”

My answer: 25 years is a long time. I expect some big changes (computers inside us monitoring body functions), while other things stay remarkably unchanged (no flying cars).

Even now people remark on how much more personal or intimate a smartphone is than a PC (that image still makes me laugh). Think about this when the Internet includes not just your house and most physical artifacts worth hooking up, but yourself.

In such a world, open systems built on open standards and open source are even more important, for all of these reasons:

We have more work to do. Let’s go.

/be

Other voices:

March 12, 2014 02:20 PM

March 08, 2014

Brendan Eich

MWC 2014, Firefox OS Success, and Yet More Web API Evolution

Just over a week ago, I left Barcelona and Mobile World Congress 2014, where Mozilla had a huge third year with Firefox OS.

mwc14-booth

We announced the $25 Firefox OS smartphone with Spreadtrum Communications, targeting retail channels in emerging markets, and attracting operator interest to boot. This is an upgrade for those channels at about the same price as the feature phones selling there today. (Yes, $25 is the target end-user price.)

tarako

We showed the Firefox OS smartphone portfolio growing upward too, with more and higher-end devices from existing and new OEM partners. Peter Bright’s piece for Ars Technica is excellent and has nice pictures of all the new devices.

We also were pleased to relay the good news about official PhoneGap/Cordova support for Firefox OS.

We were above the fold for the third year in a row in Monday’s MWC daily.

(Check out the whole MWC 2014 photo set on MozillaEU’s Flickr.)

As I’ve noted before, our success in attracting partners is due in part to our ability to innovate and standardize the heretofore-missing APIs needed to build fully-capable smartphones and other devices purely from web standards. To uphold tradition, here is another update to my progress reports from last year and from 2012.


First, and not yet a historical curiosity: the still-open tracking bug asking for “New” Web APIs, filed at the dawn of B2G by Andreas Gal.

Next, links for “Really-New” APIs, most making progress in standards bodies:

Yet more APIs, some new enough that they are not ready for standardization:

Finally, the lists of new APIs in Firefox OS 1.1, 1.2, and 1.3:

This is how the web evolves: by implementors championing and testing extensions, with emerging consensus if at all possible, else in a pref-enabled or certified-app sandbox if there’s no better way. We thank colleagues at W3C and elsewhere who are collaborating with us to uplift the Web to include APIs for all the modern mobile device sensors and features. We invite all parties working on similar systems not yet aligned with the emerging standards to join us.

/be

March 08, 2014 06:58 PM

February 01, 2014

Joshua Cranmer

Why email is hard, part 5: mail headers

This post is part 5 of an intermittent series exploring the difficulties of writing an email client. Part 1 describes a brief history of the infrastructure. Part 2 discusses internationalization. Part 3 discusses MIME. Part 4 discusses email addresses. This post discusses the more general problem of email headers.

Back in my first post, Ludovic kindly posted, in a comment, a link to a talk of someone else's email rant. And the best place to start this post is with a quote from that talk: "If you want to see an email programmer's face turn red, ask him about CFWS." CFWS is an acronym that stands for "comments and folded whitespace," and I can attest that the mere mention of CFWS is enough for me to start ranting. Comments in email headers are spans of text wrapped in parentheses, and the folding of whitespace refers to the ability to continue headers on multiple lines by inserting a newline before (but not in lieu of) a space.

I'll start by pointing out that there is little advantage to adding in free-form data to headers which are not going to be manually read in the vast majority of cases. In practice, I have seen comments used for only three headers on a reliable basis. One of these is the Date header, where a human-readable name of the timezone is sometimes included. The other two are the Received and Authentication-Results headers, where some debugging aids are thrown in. There would be no great loss in omitting any of this information; if information is really important, appending an X- header with that information is still a viable option (that's where most spam filtration notes get added, for example).

For this feature of questionable utility in the first place, the impact it has on parsing message headers is enormous. RFC 822 is specified in a manner that is familiar to anyone who reads language specifications: there is a low-level lexical scanning phase which feeds tokens into a secondary parsing phase. Like programming languages, comments and white space are semantically meaningless [1]. Unlike programming languages, however, comments can be nested—and therefore lexing an email header is not regular [2]. The problems of folding (a necessary evil thanks to the line length limit I keep complaining about) pale in comparison to comments, but it's extra complexity that makes machine-readability more difficult.

Fortunately, RFC 2822 made a drastic change to the specification that greatly limited where CFWS could be inserted into headers. For example, in the Date header, comments are allowed only following the timezone offset (and whitespace in a few specific places); in addressing headers, CFWS is not allowed within the email address itself [3]. One unanticipated downside is that it makes reading the other RFCs that specify mail headers more difficult: any version that predates RFC 2822 uses the syntax assumptions of RFC 822 (in particular, CFWS may occur between any listed tokens), whereas RFC 2822 and its descendants all explicitly enumerate where CFWS may occur.

Beyond the issues with CFWS, though, syntax is still problematic. The separation of distinct lexing and parsing phases means that you almost see what may be a hint of uniformity which turns out to be an ephemeral illusion. For example, the header parameters define in RFC 2045 for Content-Type and Content-Disposition set a tradition of ;-separated param=value attributes, which has been picked up by, say, the DKIM-Signature or Authentication-Results headers. Except a close look indicates that Authenticatin-Results allows two param=value pairs between semicolons. Another side effect was pointed out in my second post: you can't turn a generic 8-bit header into a 7-bit compatible header, since you can't tell without knowing the syntax of the header which parts can be specified as 2047 encoded-words and which ones can't.

There's more to headers than their syntax, though. Email headers are structured as a somewhat-unordered list of headers; this genericity gives rise to a very large number of headers, and that's just the list of official headers. There are unofficial headers whose use is generally agreed upon, such as X-Face, X-No-Archive, or X-Priority; other unofficial headers are used for internal tracking such as Mailman's X-BeenThere or Mozilla's X-Mozilla-Status headers. Choosing how to semantically interpret these headers (or even which headers to interpret!) can therefore be extremely daunting.

Some of the headers are specified in ways that would seem surprising to most users. For example, the venerable From header can represent anywhere between 0 mailboxes [4] to an arbitrarily large number—but most clients assume that only one exists. It's also worth noting that the Sender header is (if present) a better indication of message origin as far as tracing is concerned [5], but its relative rarity likely results in filtering applications not taking it into account. The suite of Resent-* headers also experiences similar issues.

Another impact of email headers is the degree to which they can be trusted. RFC 5322 gives some nice-sounding platitudes to how headers are supposed to be defined, but many of those interpretations turn out to be difficult to verify in practice. For example, Message-IDs are supposed to be globally unique, but they turn out to be extremely lousy UUIDs for emails on a local system, even if you allow for minor differences like adding trace headers [6].

More serious are the spam, phishing, etc. messages that lie as much as possible so as to be seen by end-users. Assuming that a message is hostile, the only header that can be actually guaranteed to be correct is the first Received header, which is added by the final user's mailserver [7]. Every other header, including the Date and From headers most notably, can be a complete and total lie. There's no real way to authenticate the headers or hide them from snoopers—this has critical consequences for both spam detection and email security.

There's more I could say on this topic (especially CFWS), but I don't think it's worth dwelling on. This is more of a preparatory post for the next entry in the series than a full compilation of complaints. Speaking of my next post, I don't think I'll be able to keep up my entirely-unintentional rate of posting one entry this series a month. I've exhausted the topics in email that I am intimately familiar with and thus have to move on to the ones I'm only familiar with.

[1] Some people attempt to be to zealous in following RFCs and ignore the distinction between syntax and semantics, as I complained about in part 4 when discussing the syntax of email addresses.
[2] I mean this in the theoretical sense of the definition. The proof that balanced parentheses is not a regular language is a standard exercise in use of the pumping lemma.
[3] Unless domain literals are involved. But domain literals are their own special category.
[4] Strictly speaking, the 0 value is intended to be used only when the email has been downgraded and the email address cannot be downgraded. Whether or not these will actually occur in practice is an unresolved question.
[5] Semantically speaking, Sender is the person who typed the message up and actually sent it out. From is the person who dictated the message. If the two headers would be the same, then Sender is omitted.
[6] Take a message that's cross-posted to two mailing lists. Each mailing list will generate copies of the message which end up being submitted back into the mail system and will typically avoid touching the Message-ID.
[7] Well, this assumes you trust your email provider. However, your email provider can do far worse to your messages than lie about the Received header…

February 01, 2014 03:57 AM

January 24, 2014

Joshua Cranmer

Charsets and NNTP

Recently, the question of charsets came up within the context of necessary decoder support for Thunderbird. After much hemming and hawing about how to find this out (which included a plea to the IMAP-protocol list for data), I remembered that I actually had this data. Long-time readers of this blog may recall that I did a study several years ago on the usage share of newsreaders. After that, I was motivated to take my data collection to the most extreme way possible. Instead of considering only the "official" Big-8 newsgroups, I looked at all of them on the news server I use (effectively, all but alt.binaries). Instead of relying on pulling the data from the server for the headers I needed, I grabbed all of them—the script literally runs HEAD and saves the results in a database. And instead of a month of results, I grabbed the results for the entire year of 2011. And then I sat on the data.

After recalling Henri Svinonen's pesterings about data, I decided to see the suitability of my dataset for this task. For data management reasons, I only grabbed the data from the second half of the year (about 10 million messages). I know from memory that the quality of Python's message parser (which was used to extract data in the first place) is surprisingly poor, which introduces bias of unknown consequence to my data. Since I only extracted headers, I can't identify charsets for anything which was sent as, say, multipart/alternative (which is more common than you'd think), which introduces further systematic bias. The end result is approximately 9.6M messages that I could extract charsets from and thence do further research.

Discussions revealed one particularly surprising tidbit of information. The most popular charset not accounted for by the Encoding specification was IBM437. Henri Sivonen speculated that the cause was some crufty old NNTP client on Windows using that encoding, so I endeavored to build a correlation database to check that assumption. Using the wonderful magic of d3, I produced a heatmap comparing distributions of charsets among various user agents. Details about the visualization may be found on that page, but it does refute Henri's claim when you dig into the data (it appears to be caused by specific BBS-to-news gateways, and is mostly localized in particular BBS newsgroups).

Also found on that page are some fun discoveries of just what kind of crap people try to pass off as valid headers. Some of those User-Agents are clearly spoofs (Outlook Express and family used the X-Newsreader header, not the User-Agent header). There also appears to be a fair amount of mojibake in headers (one of them appeared to be venerable double mojibake). The charsets also have some interesting labels to them: the "big5\n" and the "(null)" illustrate that some people don't double check their code very well, and not shown are the 5 examples of people who think charset names have spaces in them. A few people appear to have mixed up POSIX locales with charsets as well.

January 24, 2014 12:53 AM

January 11, 2014

Brendan Eich

Trust but Verify

Background

It is becoming increasingly difficult to trust the privacy properties of software and services we rely on to use the Internet. Governments, companies, groups and individuals may be surveilling us without our knowledge. This is particularly troubling when such surveillance is done by governments under statutes that provide limited court oversight and almost no room for public scrutiny.

As a result of laws in the US and elsewhere, prudent users must interact with Internet services knowing that despite how much any cloud-service company wants to protect privacy, at the end of the day most big companies must comply with the law. The government can legally access user data in ways that might violate the privacy expectations of law-abiding users. Worse, the government may force service operators to enable surveillance (something that seems to have happened in the Lavabit case).

Worst of all, the government can do all of this without users ever finding out about it, due to gag orders.

Implications for Browsers

This creates a significant predicament for privacy and security on the Open Web. Every major browser today is distributed by an organization within reach of surveillance laws. As the Lavabit case suggests, the government may request that browser vendors secretly inject surveillance code into the browsers they distribute to users. We have no information that any browser vendor has ever received such a directive. However, if that were to happen, the public would likely not find out due to gag orders.

The unfortunate consequence is that software vendors — including browser vendors — must not be blindly trusted. Not because such vendors don’t want to protect user privacy. Rather, because a law might force vendors to secretly violate their own principles and do things they don’t want to do.

Why Mozilla is different

Mozilla has one critical advantage over all other browser vendors. Our products are truly open source. Internet Explorer is fully closed-source, and while the rendering engines WebKit and Blink (chromium) are open-source, the Safari and Chrome browsers that use them are not fully open-source. Both contain significant fractions of closed-source code.

Mozilla Firefox in contrast is 100% open source [1]. As Anthony Jones from our New Zealand office pointed out the other month, security researchers can use this fact to verify the executable bits contained in the browsers Mozilla is distributing, by building Firefox from source and comparing the built bits with our official distribution.

This will be the most effective on platforms where we already use open-source compilers to produce the executable, to avoid compiler-level attacks as shown in 1984 by Ken Thompson.

Call to Action

To ensure that no one can inject undetected surveillance code into Firefox, security researchers and organizations should:

In the best case, we will establish such a verification system at a global scale, with participants from many different geographic regions and political and strategic interests and affiliations.

Security is never “done” — it is a process, not a final rest-state. No silver bullets. All methods have limits. However, open-source auditability cleanly beats the lack of ability to audit source vs. binary.

Through international collaboration of independent entities we can give users the confidence that Firefox cannot be subverted without the world noticing, and offer a browser that verifiably meets users’ privacy expectations.

See bug 885777 to track our work on verifiable builds.

End-to-End Trust

Beyond this first step, can we use such audited browsers as trust anchors, to authenticate fully-audited open-source Internet services? This seems possible in theory. No one has built such a system to our knowledge, but we welcome precedent citations and experience reports, and encourage researchers to collaborate with us.

Brendan Eich, CTO and SVP Engineering, Mozilla
Andreas Gal, VP Mobile and R&D, Mozilla

[1] Firefox on Linux is the best case, because the C/C++ compiler, runtime libraries, and OS kernel are all free and open source software. Note that even on Linux, certain hardware-vendor-supplied system software, e.g., OpenGL drivers, may be closed source.

January 11, 2014 11:06 PM

December 31, 2013

Brendan Eich

OpenH264 on Github

Sorry, I missed the chance to post a timely follow-up to Cisco’s H.264 Good News: as mentioned on the RTCWeb IETF mailing list, Cisco on the 9th of December released the OpenH264 codec on Github.

Warning: code cleanup in progress (e.g., following RTP correctly in Gecko glue code), do not expect interoperable results in the Firefoxes yet. Please check issues and send PRs. Thanks.

/be

December 31, 2013 03:09 AM

December 18, 2013

Brendan Eich

ORBX.js and related news

ORBX-vs-H264

[UPDATE: see Jim's fair comment below. /be]

I’m pleased to report that OTOY today has announced good news about ORBX.js and the Amazon Web Services ORBX and OctaneCloud AMIs (Amazon Machine Instances, pronounced “AHmees” — who knew?), based on terrific adoption and developer interest:

The deeper meaning here, in my view: a great rift emerged between CPU and GPU in the ’90s, where serial old x86 instruction set compatibility seemed to matter (remember shrink-wrap software?). The need for speed with binary compatibility begot big, power-hungry, superscalar CPUs, while from the SGI diaspora, the GPU went massively parallel.

One consequence of the rift: the rise of ARM on mobile, where binary compatibility did not and does not matter, but power efficiency does.

This rift may yet be healed, and in a way that avoids too much custom hardware (or else we will have to rely on FPGA-on-a-chip).

With enough homogeneity and parallel processing power, always-evolving video codecs, 3D model asset streams, and undreamed-of combinations should be feasible to implement in downloadable, power-efficient, safe code. Perhaps we can even one day kill off some of the video codec patent monsters that are currently burned into silicon.

More to come in the new year; this is just another happy rolling thunder update.

/be

December 18, 2013 04:02 PM

December 04, 2013

Joshua Cranmer

Why email is hard, part 4: Email addresses

This post is part 4 of an intermittent series exploring the difficulties of writing an email client. Part 1 describes a brief history of the infrastructure. Part 2 discusses internationalization. Part 3 discusses MIME. This post discusses the problems with email addresses.

You might be surprised that I find email addresses difficult enough to warrant a post discussing only this single topic. However, this is a surprisingly complex topic, and one which is made much harder by the presence of a very large number of people purporting to know the answer who then proceed to do the wrong thing [0]. To understand why email addresses are complicated, and why people do the wrong thing, I pose the following challenge: write a regular expression that matches all valid email addresses and only valid email addresses. Go ahead, stop reading, and play with it for a few minutes, and then you can compare your answer with the correct answer.

 

 

 

Done yet? So, if you came up with a regular expression, you got the wrong answer. But that's because it's a trick question: I never defined what I meant by a valid email address. Still, if you're hoping for partial credit, you may able to get some by correctly matching one of the purported definitions I give below.

The most obvious definition meant by "valid email address" is text that matches the addr-spec production of RFC 822. No regular expression can match this definition, though—and I am aware of the enormous regular expression that is often purported to solve this problem. This is because comments can be nested, which means you would need to solve the "balanced parentheses" language, which is easily provable to be non-regular [2].

Matching the addr-spec production, though, is the wrong thing to do: the production dictates the possible syntax forms an address may have, when you arguably want a more semantic interpretation. As a case in point, the two email addresses example@test.invalid and example @ test . invalid are both meant to refer to the same thing. When you ignore the actual full grammar of an email address and instead read the prose, particularly of RFC 5322 instead of RFC 822, you'll realize that matching comments and whitespace are entirely the wrong thing to do in the email address.

Here, though, we run into another problem. Email addresses are split into local-parts and the domain, the text before and after the @ character; the format of the local-part is basically either a quoted string (to escape otherwise illegal characters in a local-part), or an unquoted "dot-atom" production. The quoting is meant to be semantically invisible: "example"@test.invalid is the same email address as example@test.invalid. Normally, I would say that the use of quoted strings is an artifact of the encoding form, but given the strong appetite for aggressively "correct" email validators that attempt to blindly match the specification, it seems to me that it is better to keep the local-parts quoted if they need to be quoted. The dot-atom production matches a sequence of atoms (spans of text excluding several special characters like [ or .) separated by . characters, with no intervening spaces or comments allowed anywhere.

RFC 5322 only specifies how to unfold the syntax into a semantic value, and it does not explain how to semantically interpret the values of an email address. For that, we must turn to SMTP's definition in RFC 5321, whose semantic definition clearly imparts requirements on the format of an email address not found in RFC 5322. On domains, RFC 5321 explains that the domain is either a standard domain name [3], or it is a domain literal which is either an IPv4 or an IPv6 address. Examples of the latter two forms are test@[127.0.0.1] and test@[IPv6:::1]. But when it comes to the local-parts, RFC 5321 decides to just give up and admit no interpretation except at the final host, advising only that servers should avoid local-parts that need to be quoted. In the context of email specification, this kind of recommendation is effectively a requirement to not use such email addresses, and (by implication) most client code can avoid supporting these email addresses [4].

The prospect of internationalized domain names and email addresses throws a massive wrench into the state affairs, however. I've talked at length in part 2 about the problems here; the lack of a definitive decision on Unicode normalization means that the future here is extremely uncertain, although RFC 6530 does implicitly advise that servers should accept that some (but not all) clients are going to do NFC or NFKC normalization on email addresses.

At this point, it should be clear that asking for a regular expression to validate email addresses is really asking the wrong question. I did it at the beginning of this post because that is how the question tends to be phrased. The real question that people should be asking is "what characters are valid in an email address?" (and more specifically, the left-hand side of the email address, since the right-hand side is obviously a domain name). The answer is simple: among the ASCII printable characters (Unicode is more difficult), all the characters but those in the following string: " \"\\[]();,@". Indeed, viewing an email address like this is exactly how HTML 5 specifies it in its definition of a format for <input type="email">

Another, much easier, more obvious, and simpler way to validate an email address relies on zero regular expressions and zero references to specifications. Just send an email to the purported address and ask the user to click on a unique link to complete registration. After all, the most common reason to request an email address is to be able to send messages to that email address, so if mail cannot be sent to it, the email address should be considered invalid, even if it is syntactically valid.

Unfortunately, people persist in trying to write buggy email validators. Some are too simple and ignore valid characters (or valid top-level domain names!). Others are too focused on trying to match the RFC addr-spec syntax that, while they will happily accept most or all addr-spec forms, they also result in email addresses which are very likely to weak havoc if you pass to another system to send email; cause various forms of SQL injection, XSS injection, or even shell injection attacks; and which are likely to confuse tools as to what the email address actually is. This can be ameliorated with complicated normalization functions for email addresses, but none of the email validators I've looked at actually do this (which, again, goes to show that they're missing the point).

Which brings me to a second quiz question: are email addresses case-insensitive? If you answered no, well, you're wrong. If you answered yes, you're also wrong. The local-part, as RFC 5321 emphasizes, is not to be interpreted by anyone but the final destination MTA server. A consequence is that it does not specify if they are case-sensitive or case-insensitive, which means that general code should not assume that it is case-insensitive. Domains, of course, are case-insensitive, unless you're talking about internationalized domain names [5]. In practice, though, RFC 5321 admits that servers should make the names case-insensitive. For everyone else who uses email addresses, the effective result of this admission is that email addresses should be stored in their original case but matched case-insensitively (effectively, code should be case-preserving).

Hopefully this gives you a sense of why email addresses are frustrating and much more complicated then they first appear. There are historical artifacts of email addresses I've decided not to address (the roles of ! and % in addresses), but since they only matter to some SMTP implementations, I'll discuss them when I pick up SMTP in a later part (if I ever discuss them). I've avoided discussing some major issues with the specification here, because they are much better handled as part of the issues with email headers in general.

Oh, and if you were expecting regular expression answers to the challenge I gave at the beginning of the post, here are the answers I threw together for my various definitions of "valid email address." I didn't test or even try to compile any of these regular expressions (as you should have gathered, regular expressions are not what you should be using), so caveat emptor.

RFC 822 addr-spec
Impossible. Don't even try.
RFC 5322 non-obsolete addr-spec production
([^\x00-\x20()\[\]:;@\\,.]+(\.[^\x00-\x20()\[\]:;@\\,.]+)*|"(\\.|[^\\"])*")@([^\x00-\x20()\[\]:;@\\,.]+(.[^\x00-\x20()\[\]:;@\\,.]+)*|\[(\\.|[^\\\]])*\])
RFC 5322, unquoted email address
.*@([^\x00-\x20()\[\]:;@\\,.]+(\.[^\x00-\x20()\[\]:;@\\,.]+)*|\[(\\.|[^\\\]])*\])
HTML 5's interpretation
[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*
Effective EAI-aware version
[^\x00-\x20\x80-\x9f]()\[\]:;@\\,]+@[^\x00-\x20\x80-\x9f()\[\]:;@\\,]+, with the caveats that a dot does not begin or end the local-part, nor do two dots appear subsequent, the local part is in NFC or NFKC form, and the domain is a valid domain name.

[1] If you're trying to find guides on valid email addresses, a useful way to eliminate incorrect answers are the following litmus tests. First, if the guide mentions an RFC, but does not mention RFC 5321 (or RFC 2821, in a pinch), you can generally ignore it. If the email address test (not) @ example.com would be valid, then the author has clearly not carefully read and understood the specifications. If the guide mentions RFC 5321, RFC 5322, RFC 6530, and IDN, then the author clearly has taken the time to actually understand the subject matter and their opinion can be trusted.
[2] I'm using "regular" here in the sense of theoretical regular languages. Perl-compatible regular expressions can match non-regular languages (because of backreferences), but even backreferences can't solve the problem here. It appears that newer versions support a construct which can match balanced parentheses, but I'm going to discount that because by the time you're going to start using that feature, you have at least two problems.
[3] Specifically, if you want to get really technical, the domain name is going to be routed via MX records in DNS.
[4] RFC 5321 is the specification for SMTP, and, therefore, it is only truly binding for things that talk SMTP; likewise, RFC 5322 is only binding on people who speak email headers. When I say that systems can pretend that email addresses with domain literals or quoted local-parts don't exist, I'm excluding mail clients and mail servers. If you're writing a website and you need an email address, there is no need to support email addresses which don't exist on the open, public Internet.
[5] My usual approach to seeing internationalization at this point (if you haven't gathered from the lengthy second post of this series) is to assume that the specifications assume magic where case insensitivity is desired.

December 04, 2013 11:24 PM

November 20, 2013

Joshua Cranmer

Why email is hard, part 3: MIME

This post is part 3 of an intermittent series exploring the difficulties of writing an email client. Part 1 describes a brief history of the infrastructure. Part 2 discuses internationalization. This post discusses MIME, the mechanism by which email evolves beyond plain text.

MIME, which stands for Multipurpose Internet Mail Extensions, is primarily dictated by a set of 5 RFCs: RFC 2045, RFC 2046, RFC 2047, RFC 2048, and RFC 2049, although RFC 2048 (which governs registration procedures for new MIME types) was updated with newer versions. RFC 2045 covers the format of related headers, as well as the format of the encodings used to convert 8-bit data into 7-bit for transmission. RFC 2046 describes the basic set of MIME types, most importantly the format of multipart/ types. RFC 2047 was discussed in my part 2 of this series, as it discusses encoding internationalized data in headers. RFC 2049 describes a set of guidelines for how to be conformant when processing MIME; as you might imagine, these are woefully inadequate for modern processing anyways. In practice, it is only the first three documents that matter for building an email client.

There are two main contributions of MIME, which actually makes it a bit hard to know what is meant when people refer to MIME in the abstract. The first contribution, which is of interest mostly to email, is the development of a tree-based representation of email which allows for the inclusion of non-textual parts to messages. This tree is ultimately how attachments and other features are incorporated. The other contribution is the development of a registry of MIME types for different types of file contents. MIME types have promulgated far beyond just the email infrastructure: if you want to describe what kind of file binary blob is, you can refer to it by either a magic header sequence, a file extension, or a MIME type. Searching for terms like MIME libraries will sometimes refer to libraries that actually handle the so-called MIME sniffing process (guessing a MIME type from a file extension or the contents of a file).

MIME types are decomposable into two parts, a media type and a subtype. The type text/plain has a media type of text and a subtype of plain, for example. IANA maintains an official repository of MIME types. There are very few media types, and I would argue that there ought to be fewer. In practice, degradation of unknown MIME types means that there are essentially three "fundamental" types: text/plain (which represents plain, unformatted text and to which unknown text/* types degrade), multipart/mixed (the "default" version of multipart messages; more on this later), and application/octet-stream (which represents unknown, arbitrary binary data). I can understand the separation of the message media type for things which generally follow the basic format of headers+body akin to message/rfc822, although the presence of types like message/partial that don't follow the headers+body format and the requirement to downgrade to application/octet-stream mars usability here. The distinction between image, audio, video and application is petty when you consider that in practice, the distinction isn't going to be able to make clients give better recommendations for how to handle these kinds of content (which really means deciding if it can be displayed inline or if it needs to be handed off to an external client).

Is there a better way to label content types than MIME types? Probably not. X.400 (remember that from my first post?) uses OIDs, in line with the rest of the OSI model, and my limited workings with other systems that use these OIDs is that they are obtuse, effectively opaque identifiers with no inherent semantic meaning. People use file extensions in practice to distinguish between different file types, but not all content types are stored in files (such as multipart/mixed), and the MIME types is a finer granularity to distinguish when needing to guess the type from the start of a file. My only complaints about MIME types are petty and marginal, not about the idea itself.

No, the part of MIME that I have serious complaints with is the MIME tree structure. This allows you to represent emails in arbitrarily complex structures… and onto which the standard view of email as a body with associated attachments is poorly mapped. The heart of this structure is the multipart media type, for which the most important subtypes are mixed, alternative, related, signed, and encrypted. The last two types are meant for cryptographic security definitions [1], and I won't cover them further here. All multipart types have a format where the body consists of parts (each with their own headers) separated by a boundary string. There is space before and after the last parts which consists of semantically-meaningless text sometimes containing a message like "This is a MIME message." meant to be displayed to the now practically-non-existent crowd of people who use clients that don't support MIME.

The simplest type is multipart/mixed, which means that there is no inherent structure to the parts. Attachments to a message use this type: the type of the message is set to multipart/mixed, a body is added as (typically) the first part, and attachments are added as parts with types like image/png (for PNG images). It is also not uncommon to see multipart/mixed types that have a multipart/mixed part within them: some mailing list software attaches footers to messages by wrapping the original message inside a single part of a multipart/mixed message and then appending a text/plain footer.

multipart/related is intended to refer to an HTML page [2] where all of its external resources are included as additional parts. Linking all of these parts together is done by use of a cid: URL scheme. Generating and displaying these messages requires tracking down all URL references in an HTML page, which of course means that email clients that want full support for this feature also need robust HTML (and CSS!) knowledge, and future-proofing is hard. Since the primary body of this type appears first in the tree, it also makes handling this datatype in a streaming manner difficult, since the values to which URLs will be rewritten are not known until after the entire body is parsed.

In contrast, multipart/alternative is used to satisfy the plain-text-or-HTML debate by allowing one to provide a message that is either plain text or HTML [3]. It is also the third-biggest failure of the entire email infrastructure, in my opinion. The natural expectation would be that the parts should be listed in decreasing order of preference, so that streaming clients can reject all the data after it finds the part it will display. Instead, the parts are listed in increasing order of preference, which was done in order to make the plain text part be first in the list, which helps increase readability of MIME messages for those reading email without MIME-aware clients. As a result, streaming clients are unable to progressively display the contents of multipart/alternative until the entire message has been read.

Although multipart/alternative states that all parts must contain the same contents (to varying degrees of degradation), you shouldn't be surprised to learn that this is not exactly the case. There was a period in time when spam filterers looked at only the text/plain side of things, so spammers took to putting "innocuous" messages in the text/plain half and displaying the real spam in the text/html half [4] (this technique appears to have died off a long time ago, though). In another interesting case, I received a bug report with a message containing an image/jpeg and a text/html part within a multipart/alternative [5].

To be fair, the current concept of emails as a body with a set of attachments did not exist when MIME was originally specified. The definition of multipart/parallel plays into this a lot (it means what you think it does: show all of the parts in parallel… somehow). Reading between the lines of the specification also indicates a desire to create interactive emails (via application/postscript, of course). Given that email clients have trouble even displaying HTML properly [6], and the fact that interactivity has the potential to be a walking security hole, it is not hard to see why this functionality fell by the wayside.

The final major challenge that MIME solved was how to fit arbitrary data into a 7-bit format safe for transit. The two encoding schemes they came up with were quoted-printable (which retains most printable characters, but emits non-printable characters in a =XX format, where the Xs are hex characters), and base64 which reencodes every 3 bytes into 4 ASCII characters. Non-encoded data is separated into three categories: 7-bit (which uses only ASCII characters except NUL and bare CR or LF characters), 8-bit (which uses any character but NUL, bare CR, and bare LF), and binary (where everything is possible). A further limitation is placed on all encodings but binary: every line is at most 998 bytes long, not including the terminating CRLF.

A side-effect of these requirements is that all attachments must be considered binary data, even if they are textual formats (like source code), as end-of-line autoconversion is now considered a major misfeature. To make matters even worse, body text for formats with text written in scripts that don't use spaces (such as Japanese or Chinese) can sometimes be prohibited from using 8-bit transfer format due to overly long lines: you can reach the end of a line in as few as 249 characters (UTF-8, non-BMP characters, although Chinese and Japanese typically take three bytes per character). So a single long paragraph can force a message to be entirely encoded in a format with 33% overhead. There have been suggestions for a binary-to-8-bit encoding in the past, but no standardization effort has been made for one [7].

The binary encoding has none of these problems, but no one claims to support it. However, I suspect that violating maximum line length, or adding 8-bit characters to a quoted-printable part, are likely to make it through the mail system, in part because not doing so either increases your security vulnerabilities or requires more implementation effort. Sending lone CR or LF characters is probably fine so long as one is careful to assume that they may be treated as line breaks. Sending a NUL character I suspect could cause some issues due to lack of testing (but it also leaves room for security vulnerabilities to ignore it). In other words, binary-encoded messages probably already work to a large degree in the mail system. Which makes it extremely tempting (even for me) to ignore the specification requirements when composing messages; small wonder then that blatant violations of specifications are common.

This concludes my discussion of MIME. There are certainly many more complaints I have, but this should be sufficient to lay out why building a generic MIME-aware library by itself is hard, and why you do not want to write such a parser yourself. Too bad Thunderbird has at least two different ad-hoc parsers (not libmime or JSMime) that I can think of off the top of my head, both of which are wrong.

[1] I will be covering this in a later post, but the way that signed and encrypted data is represented in MIME actually makes it really easy to introduce flaws in cryptographic code (which, the last time I surveyed major email clients with support for cryptographic code, was done by all of them).
[2] Other types are of course possible in theory, but HTML is all anyone cares about in practice.
[3] There is also text/enriched, which was developed as a stopgap while HTML 3.2 was being developed. Its use in practice is exceedingly slim.
[4] This is one of the reasons I'm minded to make "prefer plain text" do degradation of natural HTML display instead of showing the plain text parts. Not that cleanly degrading HTML is easy.
[5] In the interests of full disclosure, the image/jpeg was actually a PNG image and the HTML claimed to be 7-bit UTF-8 but was actually 8-bit, and it contained a Unicode homograph attack.
[6] Of the major clients, Outlook uses Word's HTML rendering engine, which I recall once reading as being roughly equivalent to IE 5.5 in capability. Webmail is forced to do their own sanitization and sandboxing, and the output leaves something to desire; Gmail is the worst offender here, stripping out all but inline style. Thunderbird and SeaMonkey are nearly alone in using a high-quality layout engine: you can even send a <video> in an email to Thunderbird and have it work properly. :-)
[7] There is yEnc. Its mere existence does contradict several claims (for example, that adding new transfer encodings is infeasible due to install base of software), but it was developed for a slightly different purpose. Some implementation details are hostile to MIME, and although it has been discussed to death on the relevant mailing list several times, no draft was ever made that would integrate it into MIME properly.

November 20, 2013 07:54 PM

November 09, 2013

Tim Chevalier

TMI: Changing Roles

Six weeks short of what would have been my two-year anniversary as a Mozilla full-timer (January 1, 2014), I've decided to leave Mozilla to work at a startup, AlephCloud. I've worked at Mozilla for longer than I've worked at any other full-time job: first six months as an intern, and now a year and ten months as a Research Engineer.

Some people leave a job and say that they're ready to move on. I wasn't ready, and I still had a lot to contribute to Rust. I just didn't see any way that I could continue contributing to it as a Mozilla employee. I'm unlikely to elaborate on that further except in private over a beer or a cup of tea.

I still hope to contribute to Rust as a volunteer. I know it's BS to say "hope" when talking about work; if I end up feeling like I want to do it enough, I know that I will, and if I don't, I know that I won't. That said, if I find myself with time to contribute to any open source project at all, it'll be Rust. For sure, I'm committed to mentoring a Rust intern through GNOME OPW, and I'll still be committed to doing that even as a volunteer. Thus, I'll still be the contact person for anyone still interested in working on a Rust project through OPW. Larissa Shapiro will take over the role coordinator for this round of Mozilla's participation in OPW.

That said, I can't commit to continuing my work on rustpkg while a volunteer, so I'm hoping to be able to identify a new owner for it. I've talked with one person on the core team who would like to take over ownership, and will try to settle the question before my last day.

On a personal level, I've grown a lot over the past two years. I mean both that I've learned a lot about software, and that I've grown as a person. Moreover, I think I understand a lot more about how the software industry works than I did at the beginning of 2011. In general, my colleagues on the Mozilla Research team have set an example both for how to build software, and how to build a community. I wish nothing but success to the Rust project, and what's more, I intend to continue to be there to help the project advance, even if it'll be for a smaller percentage of my time.

My last day as a Mozilla employee will be Friday, November 15, and with just a weekend in between, I'm starting at AlephCloud on Monday, November 18, in the role of Principal Software Engineer. At AlephCloud, I will be programming in Haskell full-time. I wasn't exactly ever expecting to be active in the Haskell community again, but I look forward to doing work that is central to the mission of my company, and it's icing on the cake that the company's language of choice happens to be Haskell. Though I expect I'll be working from various locations, my new office will be in Sunnyvale, so I'm not going far. Especially since my two closest future colleagues both work remotely (outside the Bay Area), I'm hoping to experiment with living outside the Bay Area for a while (student loans don't pay themselves), but I'll only consider that after a couple of months from now.

comment count unavailable comments

November 09, 2013 02:38 AM

November 06, 2013

Brendan Eich

Today I saw the future (Update)

As noted at the Mozilla blog, OTOY and Amazon along with Autodesk and Mozilla have announced the next step in Amazon and OTOY’s GPU/cloud effort.

Demo videos:

This means developers can get started using ORBX.js with GPU-cloud encoding and downloadable decoding on all modern Web clients.

It also means that any of the Hollywood Six can start a streaming video service that reaches the most users across the Web (compared to any other purely Web-based service), using watermarking not DRM. More on this soon, if all goes as I hope.

Note that I’m an OTOY advisor. Not because of any compensation, but because I believe in their approach and their talent.

/be

November 06, 2013 12:59 AM

October 30, 2013

Brendan Eich

Cisco’s H.264 Good News

As I noted last year, one of the biggest challenges to open source software has been the patent status of video codecs. The most popular codec, H.264, is patent-encumbered and licensed by MPEG LA, under terms that prevent distributing it with open source products including Firefox. Cisco has announced today that they are going to release a gratis, high quality, open source H.264 implementation — along with gratis binary modules compiled from that source and hosted by Cisco for download. This move enables any open source project to incorporate Cisco’s H.264 module without paying MPEG LA license fees.
 
We are grateful for Cisco’s contribution, and we will add support for Cisco’s OpenH264 binary modules to Firefox soon. These modules will be usable by downstream distributions of Firefox, as well as by any other project. In addition, we will work with Cisco to put the OpenH264 project on a sound footing and to ensure that it is governed well. We have already been collaborating very closely with Cisco on our WebRTC implementation, and we are excited to see Cisco deepening their commitment to the Open Web.  Or, as Jonathan Rosenberg, Cisco CTO for Collaboration puts it,

Cisco has a long-standing history of supporting and integrating open standards, open formats and open source technologies as a model for delivering greater flexibility and interoperability to users. We look forward to collaborating with Mozilla to help bring H.264 to the Web and to the Internet.

Here’s a little more detail about how things are going to work: Cisco is going to release, under the BSD license, an H.264 stack, and build it into binary modules compiled for all popular or feasibly supportable platforms, which can be loaded into any application (including Firefox). The binary modules will be available for download from Cisco, and Cisco will pay for the patent license from the MPEG LA. Firefox will automatically download and install the appropriate binary module onto each user’s machine when needed, unless disabled in the user’s preferences.
 
Interoperability is critical on the Internet, and H.264 is the dominant video codec on the Web. The vast majority of HTML5 streaming video is encoded using H.264, and most softphones and videoconferencing systems use H.264. H.264 chipsets are widely available and can be found in most current smartphones, including many Firefox OS phones. Firefox already supports H.264 for the video element using platform codecs where they are available, but as noted in my last blog post on the topic, not all OSes ship with H.264 included. Provided we can get AAC audio decoders to match, using Cisco’s OpenH264 binary modules allows us to extend support to other platforms and uses of H.264.
 
While Cisco’s move helps add H.264 support to Firefox on all OSes, we will continue to support VP8, both for the HTML video element and for WebRTC. VP8 and H.264 are both good codecs for WebRTC, and we believe that at this point, users are best served by having both choices.
 
Of course, this is not a not a complete solution. In a perfect world, codecs, like other basic Internet technologies such as TCP/IP, HTTP, and HTML, would be fully open and free for anyone to modify, recompile, and redistribute without license agreements or fees. Mozilla is fully committed to working towards that better future. To that end, we are developing Daala, a fully open next generation codec. Daala is still under development, but our goal is to leapfrog H.265 and VP9, building a codec that will be both higher-quality and free of encumberances. Mozilla has assembled an engineering dream team to develop Daala, including Jean-Marc Valin, co-inventor of Opus, the new standard for audio encoding; Theora project lead Tim Terriberry; and recently Xiph co-founders Jack Moffitt, author of Icecast; and Monty Montgomery, the author of Ogg Vorbis.
 
Cullen Jennings, Cisco Fellow, Collaboration Group, says:

Cisco is very excited about the future of royalty free codecs. Daala is one of the most interesting ongoing technical developments in the codec space and we have been contributing to the project.

At Mozilla we always come back to the question of what’s good for the users and in this case that means interoperation of copious H.264 content across OSes and other browsers. We’ve already started looking at how to integrate the Cisco-hosted H.264 binary module, and we hope to have something ready for users in early 2014.
 
Watch this space for more exciting developments in WebRTC, Daala, and open web video.
 
/be

October 30, 2013 12:15 PM

October 18, 2013

Tim Chevalier

TMI: Rust and Servo participating in the GNOME Outreach Program for Women this year

I've mostly been quiet lately since a lot of what I've been doing is wrestling with Linux test bots that behave slightly differently from other Linux test bots, and that's not very useful to write about (plus, if I did, it would mostly be swearing). However, I do want to make an exciting announcement! Copied verbatim from the mailing list:
Hello,

This year, the Rust and Servo projects are participating in the GNOME
Outreach Program for Women (OPW). OPW is an opportunity for women to
take on a paid, full-time, three-month internship working on one of
several open-source projects. Interns can be located anywhere in the
world and will primarily work from home; some funding will be available
to travel to meet with the project mentors for a brief period of time.

This is the third time that Mozilla has participated as a source of
project mentors, but it's the first time that the Rust and Servo
projects specifically have participated. We're very excited about this
and encourage you to apply if you're a woman who's interested in Rust.
(OPW is an inclusive program: as per their information for applicants,
"This program is open to anyone who is a female assigned at birth and
anyone who identifies as a woman, genderqueer, genderfluid, or
genderfree regardless of gender presentation or assigned sex at birth.
Participants must be at least 18 years old at the start date of the
internship.") If you're not eligible, please spread the word to any
eligible individuals you know who have any interest in Rust or Servo.

No previous experience with Rust is required, but we do request some
experience with C and systems programming, and of course people who have
previously used Rust may make faster progress.

The deadline to apply is November 11.

The details are available at:

https://wiki.mozilla.org/GNOME_Outreach_December2013 (specific to Mozilla)
https://wiki.gnome.org/OutreachProgramForWomen (applying to OPW)

If you are not a woman, but interested in an internship working on Rust,
never fear! If you are currently enrolled in an undergraduate or
graduate program, you can apply for an internship at Mozilla Research
for the winter, spring, or summer term. (Note that OPW is open to both
students and non-students.) We will be announcing this program to the
mailing list as well as soon as the details are finalized.

If you have any questions, please join #opw on irc.mozilla.org and/or
contact Tim Chevalier (tchevalier at mozilla.com, tjc on IRC) (Rust contact
person) or Lars Bergstrom (lbergstrom at mozilla.com, lbergstrom on IRC)
(Servo contact person).
I'm coordinating Mozilla's involvement in OPW this year as well as any potential Rust projects (others may end up mentoring interns depending on the intern's preferred project, but the buck stops with me), so I'd be very grateful if readers would publicize this notice to relevant venues, as well as direct any questions to me (via email, IRC, or in a comment on this post).

comment count unavailable comments

October 18, 2013 10:03 PM

October 11, 2013

Joshua Cranmer

Why email is hard, part 2: internationalization

This post is part 2 of an intermittent series exploring the difficulties of writing an email client. Part 1 describes a brief history of the infrastructure, as well as the issues I have with it. This post is discussing internationalization, specifically supporting non-ASCII characters in email.

Internationalization is not a simple task, even if the consideration is limited to "merely" the textual aspect [1]. Languages turn out to be incredibly diverse in their writing systems, so software that tries to support all writing systems equally well ends up running into several problems that admit no general solution. Unfortunately, I am ill-placed to be able to offer personal experience with internationalization concerns [2], so some of the information I give may well be wrong.

A word of caution: this post is rather long, even by my standards, since the problems of internationalization are legion. To help keep this post from being even longer, I'm going to assume passing familiarity with terms like ASCII, Unicode, and UTF-8.

The first issue I'll talk about is Unicode normalization, and it's an issue caused largely by Unicode itself. Unicode has two ways of making accented characters: precomposed characters (such as U+00F1, ñ) or a character followed by a combining character (U+006E, n, followed by U+0303, ◌̃). The display of both is the same: ñ versus ñ (read the HTML), and no one would disagree that the share the meaning. To let software detect that they are the same, Unicode prescribes four algorithms to normalize them. These four algorithms are defined on two axes: whether to prefer composed characters (like U+00F1) or prefer decomposed characters (U+006E U+0303), and whether to normalize by canonical equivalence (noting that, for example, U+212A Kelvin sign is equivalent to the Latin majuscule K) or by compatibility (e.g., superscript 2 to a regular 2).

Another issue is one that mostly affects display. Western European languages all use a left-to-right, top-to-bottom writing order. This isn't universal: Semitic languages like Hebrew or Arabic use right-to-left, top-to-bottom; Japanese and Chinese prefer a top-to-bottom, right-to-left order (although it is sometimes written left-to-right, top-to-bottom). It thus becomes an issue as to the proper order to store these languages using different writing orders in the actual text, although I believe the practice of always storing text in "start-to-finish" order, and reversing it for display, is nearly universal.

Now, both of those issues mentioned so far are minor in the grand scheme of things, in that you can ignore them and they will still probably work properly almost all of the time. Most text that is exposed to the web is already normalized to the same format, and web browsers have gotten away with not normalizing CSS or HTML identifiers with only theoretical objections raised. All of the other issues I'm going to discuss are things that cause problems and illustrate why properly internationalizing email is hard.

Another historical mistake of Unicode is one that we will likely be stuck with for decades, and I need to go into some history first. The first Unicode standard dates from 1991, and its original goal then was to collect all of the characters needed for modern transmission, which was judged to need only a 16-bit set of characters. Unfortunately, the needs of ideographic-centric Chinese, Japanese, and Korean writing systems, particularly rare family names, turns out to rather fill up that space. Thus, in 1996, Unicode was changed to permit more characters: 17 planes of 65,536 characters each, of which the original set was termed the "Basic Multilingual Plane" or BMP for short. Systems that chose to adopt Unicode in those intervening 5 years often adopted a 16-bit character model as their standard internal format, so as to keep the benefits of fixed-width character encodings. However, with the change to a larger format, their fixed-width character encoding is no longer fixed-width.

This issue plagues anybody who works with systems that considered internationalization in that unfortunate window, which notably includes prominent programming languages like C#, Java, and JavaScript. Many cross-platform C and C++ programs implicitly require UTF-16 due to its pervasive inclusion into the Windows operating system and common internationalization libraries [3]. Unsurprisingly, non-BMP characters tend to quickly run into all sorts of hangups by unaware code. For example, right now, it is possible to coax Thunderbird to render these characters unusable in, say, your subject string if the subject is just right, and I suspect similar bugs exist in a majority of email applications [4].

For all of the flaws of Unicode [5], there is a tacit agreement that UTF-8 should be the character set to use for anyone not burdened by legacy concerns. Unfortunately, email is burdened by legacy concerns, and the use of 8-bit characters in headers that are not UTF-8 is more prevalent than it ought to be, RFC 6532 notwithstanding. In any case, email explicitly provides for handling a wide variety of alternative character sets without saying which ones should be supported. The official list [6] contains about 200 of them (including the UNKNOWN-8BIT character set), but not all of them see widespread use. In practice, the ones that definitely need to be supported are the ISO 8859-* and ISO 2022-* charsets, the EUC-* charsets, Windows-* charsets, GB18030, GBK, Shift-JIS, KOI8-{R,U}, Big5, and of course UTF-8. There are two other major charsets that don't come up directly in email but are important for implementing the entire suite of protocols: UTF-7, used in IMAP (more on that later), and Punycode (more on that later, too).

The suite of character sets falls into three main categories. First is the set of fixed-width character sets, most notably ASCII and the ISO 8859 suite of charsets, as well as UCS-2 (2 bytes per character) and UTF-32 (4 bytes per character). Since the major East Asian languages are all ideographic, which require a rather large number of characters to be encoded, fixed-width character sets are infeasible. Instead, many choose to do a variable-width encoding: Shift-JIS lets some characters (notably ASCII characters and half-width katakana) remain a single byte and uses two bytes to encode all of its other characters. UTF-8 can use between 1 byte (for ASCII characters) and 4 bytes (for non-BMP characters) for a single character. The final set of character sets, such as the ISO 2022 ones, use escape sequences to change the interpretation of subsequent characters. As a result, taking the substring of an encoding string can change its interpretation while remaining valid. This will be important later.

Two more problems related to character sets are worth mentioning. The first is the byte-order mark, or BOM, which is used to distinguish whether UTF-16 is written on a little-endian or big-endian machine. It is also sometimes used in UTF-8 to indicate that the text is UTF-8 versus some unknown legacy encoding. It is also not supposed to appear in email, but I have done some experiments which suggest that people use software that adds it without realizing that this is happening. The second issue, unsurprisingly [7], is that for some character sets (Big5 in particular, I believe), not everyone agrees on how to interpret some of the characters.

The largest problem of internationalization that applies in a general sense is the problem of case insensitivity. The 26 basic Latin letters all map nicely to case, having a single uppercase and a single lowercase variant for each letter. This practice doesn't hold in general—languages like Japanese lack even the notion of case, although it does have two kana variants that hold semantic differences. Rather, there are three basic issues with case insensitivity which showcase enough of its problems to make you want to run away from it altogether [8].

The simplest issue is the Greek sigma. Greek has two lowercase variants of the sigma character: σ and &varsigma (the "final sigma"), but a single uppercase variant, Σ. Thus mapping a string s to uppercase and back to lowercase is not equivalent to mapping s directly to lower-case in some cases. Related to this issue is the story of German ß character. This character evolved as a ligature of a long and short 's', and its uppercase form is generally held to be SS. The existence of a capital form is in some dispute, and Unicode only recently added it (ẞ, if your software supports it). As a result, merely interconverting between uppercase and lowercase versions of a string does not necessarily lead to a simple fixed point. The third issue is the Turkish dotless i (ı), which is the lowercase variant of the ASCII uppercase I character to those who speak Turkish. So it turns out that case insensitivity isn't quite the same across all locales.

Again unsurprisingly in light of the issues, the general tendency towards case-folding or case-insensitive matching in internationalized-aware specifications is to ignore the issues entirely. For example, asking for clarity on the process of case-insensitive matching for IMAP folder names, the response I got was "don't do it." HTML and CSS moved to the cumbersomely-named variant known as "ASCII-subset case-insensitivity", where only the 26 basic Latin letters are mapped to their (English) variants in case. The solution for email is also a verbose variant of "unspecified," but that is only tradition for email (more on this later).

Now that you have a good idea of the general issues, it is time to delve into how the developers of email rose to the challenge of handling internationalization. It turns out that the developers of email have managed to craft one of the most perfect and exquisite examples I have seen of how to completely and utterly fail. The challenges of internationalized emails are so difficult that buggier implementations are probably more common than fully correct implementations, and any attempt to ignore the issue is completely and totally impossible. In fact, the faults of RFC 2047 are my personal least favorite part of email, and implementing it made me change the design of JSMime more than any other feature. It is probably the single hardest thing to implement correctly in an email client, and it is so broken that another specification was needed to be able to apply internationalization more widely (RFC 2231).

The basic problem RFC 2047 sets out to solve is how to reliably send non-ASCII characters across a medium where only 7-bit characters can be reliably sent. The solution that was set out in the original version, RFC 1342, is to encode specific strings in an "encoded-word" format: =?charset?encoding?encoded text?=. The encoding can either be a 'B' (for Base64) or a 'Q' (for quoted-printable). Except the quoted-printable encoding in this format isn't quite the same quoted-printable encoding used in bodies: the space character is encoded via a '_' character instead, as spaces aren't allowed in encoded-words. Naturally, the use of spaces in encoded-words is common enough to get at least one or two bugs filed a year about Thunderbird not supporting it, and I wonder if this subtle difference between two quoted-printable variants is what causes the prevalence of such emails.

One of my great hates with regard to email is the strict header line length limit. Since the encoded-word form can get naturally verbose, particularly when you consider languages like Chinese that are going to have little whitespace amenable for breaking lines, the ingenious solution is to have adjacent encoded-word tokens separated only by whitespace be treated as the same word. As RFC 6857 kindly summarizes, "whitespace behavior is somewhat unpredictable, in practice, when multiple encoded words are used." RFC 6857 also suggests that the requirement to limit encoded words to only 74 characters in length is also rather meaningless in practice.

A more serious problem arises when you consider the necessity of treating adjacent encoded-word tokens as a single unit. This one is so serious that it reaches the point where all of your options would break somebody. When implementing an RFC 2047 encoding algorithm, how do you write the code to break up a long span of text into multiple encoded words without ever violating the specification? The naive way of doing so is to encode the text once in one long string, and then break it into checks which are then converted into the encoded-word form as necessary. This is, of course, wrong, as it breaks two strictures of RFC 2047. The first is that you cannot split the middle of multibyte characters. The second is that mode-switching character sets must return to ASCII by the end of a single encoded-word [9]. The smarter way of building encoded-words is to encode words by trying to figure out how much text can be encoded before needing to switch, and breaking the encoded-words when length quotas are exceeded. This is also wrong, since you could end up violating the return-to-ASCII rule if your don't double-check your converters. Also, if UTF-16 is used as the basis for the string before charset conversion, the encoder stands a good chance of splitting up creating unpaired surrogates and a giant mess as a result.

For JSMime, the algorithm I chose to implement is specific to UTF-8, because I can use a property of the UTF-8 implementation to make encoding fast (every octet is looked at exactly three times: once to convert to UTF-8, once to count to know when to break, and once to encode into base64 or quoted-printable). The property of UTF-8 is that the second, third, and fourth octets of a multibyte character all start with the same two bits, and those bits never start the first octet of a character. Essentially, I convert the entire string to a binary buffer using UTF-8. I then pass through the buffer, keeping counters of the length that the buffer would be in base64 form and in quoted-printable form. When both counters are exceeded, I back up to the beginning of the character, and encode that entire buffer in a word and then move on. I made sure to test that I don't break surrogate characters by making liberal use of the non-BMP character U+1F4A9 [10] in my encoding tests.

The sheer ease of writing a broken encoder for RFC 2047 means that broken encodings exist in the wild, so an RFC 2047 decoder needs to support some level of broken RFC 2047 encoding. Unfortunately, to "fix" different kinds of broken encodings requires different support for decoders. Treating adjacent encoded-words as part of the same buffer when decoding makes split multibyte characters work properly but breaks non-return-to-ASCII issues; if they are decoded separately the reverse is true. Recovering issues with isolated surrogates is at best time-consuming and difficult and at worst impossible.

Yet another problem with the way encoded-words are defined is that they are defined as specific tokens in the grammar of structured address fields. This means that you can't hide RFC 2047 encoding or decoding as a final processing step when reading or writing messages. Instead you have to do it during or after parsing (or during or before emission). So the parser as a result becomes fully intertwined with support for encoded-words. Converting a fully UTF-8 message into a 7-bit form is thus a non-trivial operation: there is a specification solely designed to discuss how to do such downgrading, RFC 6857. It requires deducing what structure a header has, parsing that harder, and then reencoding the parsed header. This sort of complicated structure makes it much harder to write general-purpose email libraries: the process of emitting a message basically requires doing a generic UTF-8-to-7-bit conversion. Thus, what is supposed to be a more implementation detail of how to send out a message ends up permeating the entire stack.

Unfortunately, the developers of RFC 2047 were a bit too clever for their own good. The specification limits the encoded-words to occurring only inside of phrases (basically, display names for addresses), unstructured text (like the subject), or comments (…). I presume this was done to avoid requiring parsers to handle internationalization in email addresses themselves or possibly even things like MIME boundary delimiters. However, this list leaves out one common source of internationalized text: filenames of attachments. This was ultimately patched by RFC 2231.

RFC 2231 is by no means a simple specification, since it attempts to solve three problems simultaneously. The first is the use of non-ASCII characters in parameter values. Like RFC 2047, the excessively low header line length limit causes the second problem, the need to wrap parameter values across multiple line lengths. As a result, the encoding is complicated (it takes more lines of code to parse RFC 2231's new features alone than it does to parse the basic format [11]), but it's not particularly difficult.

The third problem RFC 2231 attempts to solve is a rather different issue altogether: it tries to conclusively assign a language tag to the encoded text and also provides a "fix" for this to RFC 2047's encoded-words. The stated rationale is to be able to have screen readers read the text aloud properly, but the other (much more tangible) benefit is to ameliorate the issues of Unicode's Han unification by clearly identifying if the text is Chinese, Japanese, or Korean. While it sounds like a nice idea, it suffers from a major flaw: there is no way to use this data without converting internal data structures from using flat strings to richer representations. Another issue is that actually setting this value correctly (especially if your goal is supporting screen readers' pronunciations) is difficult if not impossible. Fortunately, this is an entirely optional feature; though I do see very little email that needs to be concerned about internationalization, I have yet to find an example of someone using this in the wild.

If you're the sort of person who finds properly writing internationalized text via RFC 2231 or RFC 2047 too hard (or you don't realize that you need to actually worry about this sort of stuff), and you don't want to use any of the several dozen MIME libraries to do the hard stuff for you, then you will become the bane of everyone who writes email clients, because you've just handed us email messages that have 8-bit text in the headers. At which point everything goes mad, because we have no clue what charset you just used. Well, RFC 6532 says that headers are supposed to be UTF-8, but with the specification being only 19 months old and part of a system which is still (to my knowledge) not supported by any major clients, this should be taken with a grain of salt. UTF-8 has the very nice property that text that is valid UTF-8 is highly unlikely to be any other charset, even if you start considering the various East Asian multibyte charsets. Thus you can try decoding under the assumption that is UTF-8 and switch to a designated fallback charset if decoding fails. Of course, knowing which designated fallback to use is a different matter entirely.

Stepping outside email messages themselves, internationalization is still a concern. IMAP folder names are another well-known example. RFC 3501 specified that mailbox names should be in a modified version of UTF-7 in an awkward compromise. To my knowledge, this is the only remaining significant use of UTF-7, as many web browsers disabled support due to its use in security attacks. RFC 6855, another recent specification (6 months old as of this writing), finally allows UTF-8 mailbox names here, although it too is not yet in widespread usage.

You will note missing from the list so far is email addresses. The topic of email addresses is itself worthy of lengthy discussion, but for the purposes of a discussion on internationalization, all you need to know is that, according to RFCs 821 and 822 and their cleaned-up successors, everything to the right of the '@' is a domain name and everything to the left is basically an opaque ASCII string [12]. It is here that internationalization really runs headlong into an immovable obstacle, for the email address has become the de facto unique identifier of the web, and everyone has their own funky ideas of what an email address looks like. As a result, the motto of "be liberal in what you accept" really breaks down with email addresses, and the amount of software that needs to change to accept internationalization extends far beyond the small segment interested only in the handling of email itself. Unfortunately, the relative newness of the latest specifications and corresponding lack of implementations means that I am less intimately familiar with this aspect of internationalization. Indeed, the impetus for this entire blogpost was a day-long struggle with trying to ascertain when two email addresses are the same if internationalized email address are involved.

The email address is split nicely by the '@' symbol, and internationalization of the two sides happens at two different times. Domains were internationalized first, by RFC 3490, a specification with the mouthful of a name "Internationalizing Domain Names in Applications" [13], or IDNA2003 for short. I mention the proper name of the specification here to make a point: the underlying protocol is completely unchanged, and all the work is intended to happen at roughly the level of getaddrinfo—the internal DNS resolver is supposed to be involved, but the underlying DNS protocol and tools are expected to remain blissfully unaware of the issues involved. That I mention the year of the specification should tell you that this is going to be a bumpy ride.

An internationalized domain name (IDN for short) is a domain name that has some non-ASCII characters in it. Domain names, according to DNS, are labels terminated by '.' characters, where each label may consist of up to 63 characters. The repertoire of characters are the ASCII alphanumerics and the '-' character, and labels are of course case-insensitive like almost everything else on the Internet. Encoding non-ASCII characters into this small subset while meeting these requirements is difficult for other contemporary schemes: UTF-7 uses Base64, which means 'A' and 'a' are not equivalent; percent-encoding eats up characters extremely quickly. So IDN use a different specification for this purpose, called Punycode, which allows for a dense but utterly unreadable encoding. The basic algorithm of encoding an IDN is to take the input string, apply case-folding, normalize using NFKC, and then encode with Punycode.

Case folding, as I mentioned several paragraphs ago, turns out to have some issues. The ß and &varsigma characters were the ones that caused the most complaints. You see, if you were to register, say, www.weiß.de, you would actually be registering www.weiss.de. As there is no indication of Punycode involved in the name, browsers would show the domain in the ASCII variant. One way of fixing this problem would be to work with browser vendors to institute a "preferred name" specification for websites (much like there exists one for the little icons next to page titles), so that the world could know that the proper capitalization is of course www.GoOgle.com instead of www.google.com. Instead, the German and Greek registrars pushed for a change to IDNA, which they achieved in 2010 with IDNA2008.

IDNA2008 is defined principally in RFCs 5890-5895 and UTS #46. The principal change is that the normalization step no longer exists in the protocol and is instead supposed to be done by applications, in a possibly locale-specific manner, before looking up the domain name. One reason for doing this was to eliminate the hard dependency on a specific, outdated version of Unicode [14]. It also helps fix things like the Turkish dotless I issue, in theory at least. However, this different algorithm causes some domains to be processed differently from IDNA2003. UTS #46 specifies a "compatibility mode" which changes the algorithm to match IDNA2003 better in the important cases (specifically, ß, &varsigma, and ZWJ/ZWNJ), with a note expressing the hope that this will eventually become unnecessary. To handle the lack of normalization in the protocol, registrars are asked to automatically register all classes of equivalent domain names at the same time. I should note that most major browsers (and email clients, if they implement IDN at all) are still using IDNA2003: an easy test of this fact is to attempt to go to ☃.net, which is valid under IDNA2003 but not IDNA2008.

Unicode text processing is often vulnerable to an attack known as the "homograph attack." In most fonts, the Greek omicron and the Latin miniscule o will be displayed in exactly the same way, so an attacker could pretend to be from, say, Google while instead sending you to Gοogle—I used Latin in the first word and Greek in the second. The standard solution is to only display the Unicode form (and not the Punycode form) where this is not an issue; Firefox and Opera display Unicode only for a whitelist of registrars with acceptable polices, Chrome and Internet Explorer only permits scripts that the user claims to read, and Safari only permits scripts that don't permit the homograph attack (i.e., not Cyrillic or Greek). (Note: this information I've summarized from Chromium's documentation; forward any complaints of out-of-date information to them).

IDN satisfies the needs of internationalizing the second half of an email address, so a working group was commissioned to internationalize the first one. The result is EAI, which was first experimentally specified in RFCs 5335-5337, and the standards themselves are found in RFCs 6530-6533 and 6855-6858. The primary difference between the first, experimental version and the second, to-be-implemented version is the removal of attempts to downgrade emails in the middle of transit. In the experimental version, provisions were made to specify with every internalized address an alternate, fully ASCII address to which a downgraded message could be sent if SMTP servers couldn't support the new specifications. These were removed after the experiment found that such automatic downgrading didn't work as well as hoped.

With automatic downgrading removed from the underlying protocol, the onus is on people who generate the emails—mailing lists and email clients—to figure out who can and who can't receive messages and then downgrade messages as appropriate for the recipients of the message. However, the design of SMTP is such that it is impossible to automatically determine if the client can receive these new kinds of messages. Thus, the options are to send them and hope that it works or to rely on the (usually clueless) user to inform you if it works. Clearly an unpalatable set of options, but it is one that can't be avoided due to protocol design.

The largest change of EAI is that the local parts of addresses are specified as a sequence of UTF-8 characters, omitting only the control characters [15]. The working group responsible for the specification adamantly refused to define a Unicode-to-ASCII conversion process, and thus a mechanism to make downgrading work smoothly, for several reasons. First, they didn't want to specify a prefix which could change the meaning of existing local-parts (the structure of local-parts is much less discoverable than the structure of all domain names). Second, they felt that the lack of support for displaying the Unicode variants of Punycode meant that users would have a much worse experience. Finally, the transition period would be hopefully short (although messy), so designing a protocol that supports that short period would worsen it in the long term. Considering that, at the moment of writing, only one of the major SMTP implementations has even a bug filed to support it, I think the working group underestimates just how long transition periods can take.

As far as changes to the message format go, that change is the only real change, considering how much effort is needed to opt-in. Yes, headers are now supposed to be UTF-8, but, in practice, every production MIME parser needs to handle 8-bit characters in headers anyways. Yes, message/global can have MIME encoding applied to it (unlike message/rfc822), but, in practice, you already need to assume that people are going to MIME-encode message/rfc822 in violation of the specification. So, in practice, the changes needed to a parser are to add message/global as an alias to message/rfc822 [16] and possibly tweaking some charset detection heuristics to prefer UTF-8. I would very much have liked the restriction on header line length removed, but, alas, the working group did not feel moved to make those changes. Still, I look forward to the day when I never have to worry about encoding text into RFC 2047 encoded-words.

IMAP, POP, and SMTP are also all slightly modified to take account of the new specifications. Specifically, internationalized headers are supposed to be opt-in only—SMTP are supposed to reject sending to these messages if it doesn't support them in the first place, and IMAP and POP are supposed to downgrade messages when requested unless the client asks for them to not be. As there are no major server implementations yet, I don't know how well these requirements will be followed, especially given that most of the changes already need to be tolerated by clients in practice. The experimental version of internationalization specified a format which would have wreaked havoc to many current parsers, so I suspect some of the strict requirements may be a holdover from that version.

And thus ends my foray into email internationalization, a collection of bad solutions to hard problems. I have probably done a poor job of covering the complete set of inanities involved, but what I have covered are the ones that annoy me the most. This certainly isn't the last I'll talk about the impossibility of message parsing either, but it should be enough at least to convince you that you really don't want to write your own message parser.

[1] Date/time, numbers, and currency are the other major aspects of internalization.
[2] I am a native English speaker who converses with other people almost completely in English. That said, I can comprehend French, although I am not familiar with the finer points that come with fluency, such as collation concerns.
[3] C and C++ have a built-in internationalization and localization API, derived from POSIX. However, this API is generally unsuited to the full needs of people who actually care about these topics, so it's not really worth mentioning.
[4] The basic algorithm to encode RFC 2047 strings for any charset are to try to shift characters into the output string until you hit the maximum word length. If the internal character set for Unicode conversion is UTF-16 instead of UTF-32 and the code is ignorant of surrogate concerns, then this algorithm could break surrogates apart. This is exactly how the bug is triggered in Thunderbird.
[5] I'm not discussing Han unification, which is arguably the single most controversial aspect of Unicode.
[6] Official list here means the official set curated by IANA as valid for use in the charset="" parameter. The actual set of values likely to be acceptable to a majority of clients is rather different.
[7] If you've read this far and find internationalization inoperability surprising, you are either incredibly ignorant or incurably optimistic.
[8] I'm not discussing collation (sorting) or word-breaking issues as this post is long enough already. Nevertheless, these also help very much in making you want to run away from internationalization.
[9] I actually, when writing this post, went to double-check to see if Thunderbird correctly implements return-to-ASCII in its encoder, which I can only do by running tests, since I myself find its current encoder impenetrable. It turns out that it does, but it also looks like if we switched conversion to ICU (as many bugs suggest), we may break this part of the specification, since I don't see the ICU converters switching to ASCII at the end of conversion.
[10] Chosen as a very adequate description of what I think of RFC 2047. Look it up if you can't guess it from context.
[11] As measured by implementation in JSMime, comments and whitespace included. This is biased by the fact that I created a unified lexer for the header parser, which rather simplifies the implementation of the actual parsers themselves.
[12] This is, of course a gross oversimplification, so don't complain that I'm ignoring domain literals or the like. Email addresses will be covered later.
[13] A point of trivia: the 'I' in IDNA2003 is expanded as "Internationalizing" while the 'I' in IDNA2008 is for "Internationalized."
[14] For the technically-minded: IDNA2003 relied on a hard-coded list of banned codepoints in processing, while IDNA2008 derives its lists directly from Unicode codepoint categories, with a small set of hard-coded exceptions.
[15] Certain ASCII characters may require the local-part to be quoted, of course.
[16] Strictly speaking, message/rfc822 remains all-ASCII, and non-ASCII headers need message/global. Given the track record of message/news, I suspect that this distinction will, in practice, not remain for long.

October 11, 2013 04:07 AM

October 08, 2013

Tim Chevalier

TMI: Summit and beyond

The last five days have been one full day of travel, preceded by three full days of Mozilla Summit in Toronto, preceded by another full day of travel. It's been quite a time, with not much sleep involved, but (for me) a surprising amount of code getting written and interesting sessions being attended and pure fun being had (karaoke night Friday and dance party Sunday -- if any of the organizers were reading, both of those were really good ideas). Actually, watching "Code Rush" -- a documentary made in 2000 about the very early history of Mozilla -- with Mozilla folks was quite an experience as well. You can watch it for free online; it's only about an hour, and I encourage you to if you have any interest at all in where we came from. (Passive sexism and other kinds of isms aside.)

So far as sessions, one of the highlights was Jack's talk on the Servo strategy on Saturday -- featuring about 30 minutes of lecture followed by a full 80 minutes of Q&A. The questions were the most focused, on-topic, respectful set of questions I've heard after any talk I've been to in recent memory (and it wasn't just because Jack prepared a list of suggested questions to ask in advance!) The other highlight for me was the excellent "Designing your project for participation" session with David Eaves, Emma Irwin, and Jess Klein on Sunday. One of the reasons why this session was so great was that it was mostly small-group discussion, so everybody got to be more engaged than they might have been if most of the time had been sitting and listening. I came away with a better understanding that a lot of different people are thinking about how to make the volunteer experience better in their specific project, and there are lessons to learn about that that are transferable. For example: volunteers are much more likely to keep participating if they get a prompt review for their first bug; and, it might actually be okay to ask volunteers to commit to finishing a particular task by a deadline.

During downtime, I got a few pull requests in, some of which even landed. One of the bigger ones, though, was #9732, for making automatically checked-out source files read-only and overhauling where rustpkg keeps temporary files. (That one hasn't been reviewed yet.) #9736, making it easier to get started with rustpkg by not requiring a workspace to be created (it will build files in the current directory if there's a crate file with the right name) is also important to me -- alas, it bounced on the Linux bot, like so many of my pull requests have been doing lately, so I'll have to investigate that when I'm back on the office network and able to log in to that bot tomorrow. Finally, I was excited to be able to fix #9756, an error-message-quality bug that's been biting me a lot lately, while I was on the plane back to San Francisco, and surprised that the fix was so easy!

Finally, I request that anyone reading this who's part of the Rust community read Lindsey Kuper's blog post about the IRC channel and how we should be treating each other. Lindsey notes, and I agree, that the incident that she describes is an anomaly in an otherwise very respectful and decorous IRC channel. However, as the Rust community grows, it's my job, and the job of the other core Rust team members, to keep it that way. I know that communities don't stay healthy by themselves -- every community, no matter how small, exists within a kyriarchy that rewards everybody for exercising unearned power and privilege. Stopping people from following those patterns of non-consensual domination and submission -- whether in such a (seemingly) small way as inappropriately calling attention to another person's gender on IRC, or in larger ways like sexual assault at conferences -- requires active effort. I can't change the world as a whole on my own, but I do have a lot of influence over a very small part of the world -- the same is true of most other human beings. So, I'm still thinking about how to put in that effort for Rust. The same is true for community processes as for code: patches welcome.

comment count unavailable comments

October 08, 2013 02:08 AM

October 06, 2013

Lindsey Kuper

Humanity or gtfo

Update, September 2014: Since this post is linked prominently from the Rust subreddit and other places, perhaps some context is in order. Here's a timeline of events:

On August 28, 2013, I said "please remember that not everyone in the IRC channel is a guy" to someone in #rust, in response to them addressing the people in the channel as "Guys". The person responded with "boobs or gtfo". I responded by asking the moderators to kick them out of the channel. (Kicking is a mild penalty; a kicked person can rejoin immediately.) They left the channel a short while later of their own accord. Later, I emailed some people on the Rust team about what happened, and one of them banned the user. You can find the chat logs linked below.

Several weeks later, in October 2013, I wrote this post. I wrote it because I have a long-running yearly tradition of writing about what goes on in my life during the last week in August -- or "had", I suppose; I haven't much felt like continuing the tradition since last year. For the first seven years that I wrote them, these last-week-of-August posts were largely about mundane, day-to-day things. I had expected 2013 to be no different. But when I sat down to write the post in October of that year, the main recollection I had from the last week of August was how bad it had felt to have someone say "boobs or gtfo" to me in the IRC channel of a project to which I had been contributing for two years. So, I just wrote about that.

The post got a lot of attention. A lot of people responded supportively. On the other hand, the top comment on Hacker News characterizes my saying "please remember that not everyone in the IRC channel is a guy" as an "attack", while, in the same breath, describing "boobs or gtfo" as a "poorly chosen attempt at ironic humor". So, you know, the usual.

Some time later, in April 2014, someone came into the Rust IRC channel and began spamming the channel with "boobs or gtfo" and "fuck lindsey" repeatedly. I tweeted about how I was getting IRC abuse for having written about IRC abuse. Various people who had previously been unaware of the whole saga found out about it. Again, I got a lot of attention, and a lot of people responded supportively, including the Rust team. (Among other things, they asked me to be a mod on #rust, and I accepted.) On the other hand, at least one person wrote a blog post about how quote-unquote "fucktarded" I am. (I won't be linking to it here; I'm sure you can find it if you really want to.) So, you know, the usual.

To sum up, there have been people who seem to think that I (or the #rust mods, or the Rust team) consider use of "guys" to be abusive, and who are upset with me (or the #rust mods, or the Rust team) for "attacking" people who use "guys". So, allow me to clarify:

Hope that clears things up a bit, folks! The original post from October 2013 follows.


I'm past due to write this year's My, What A Busy Week! post, but didn't feel like doing it in August and haven't felt like doing it since. Well, it had a good long run. I'd like to mark the passage of that week in what I hope is a less flip and more thoughtful way. The only really notable occurrence in my life during the last Monday-through-Saturday in August this year is that someone in the #rust channel on IRC said "boobs or gtfo" to me. So, let's talk about that for a minute.

I don't often choose the "don't use 'guys' to refer to mixed-gender groups" battle. Over the years, 'guys' has come to grate on me, but only in certain contexts, and even then, it only grates a tiny bit. It's almost never a battle worth choosing. So, when someone showed up in #rust and said [sic] "Guys, can you recommend library for getting something from iternets via http?", I almost let it go, because it just wasn't that big of a deal. But I was emboldened by my friend Adam's recent choice of the same battle on a (private) mailing list we're both on, so I decided to say something.

I was polite, I think. All I said was, "please remember that not everyone in the IRC channel is a guy." The initial response was, "Do you have any proofs?" While I was contemplating whether I'd be the only one who would find it amusing at that point if I sent him a link to the latter forty pages of this document ("Sure! Here are some of my favorites!"), "boobs or gtfo" showed up on my screen.

I said, out loud, "Holy shit!" My officemate, sitting across the room, said, "What?" But I couldn't actually make my voice work to tell him what had just happened, because by then, adrenaline was sweeping through me. It was ten seconds before I remembered how to talk. The only other times in my life that I can remember that happening have been in response to an immediate physical threat, like when a car was about to hit my bike.

It was unexpected. I hadn't been expecting this person to actually correct themselves, of course. But I had been expecting them to respond with something like "uh, 'guys' is a gender-neutral word", in which case my reaction would have been to shrug and say something like, "Yeah, I'm sure that's how you meant it." And I would have left it at that, because, again, it wasn't that big of a deal, and all I really aim to accomplish here is encourage people to think about it a little harder, maybe choose a different word next time. But that's not what happened. What happened was that when I made a polite request, a flood of hate came rushing out at me. And now it's hard for me to continue to pretend or assume that that hate doesn't boil under the surface of our community.

Later on, various members of the full-time Rust staff (who have ops in the channel) apologized to me, and banned the user, and added several more ops from various time zones so that it would be more likely that someone with ops would be around next time (since it had been lunchtime on the west coast and no one had been around). They also added a link to the conduct policy to the channel topic, to help reduce the chances of this sort of thing happening in the future. All that was great. The Rust team is awesome. But here's the thing: every time I point out something like this in a community I'm part of, whether it's the Rust community or any other, there's a part of me that insists on first checking to see how much social capital I have to spend there. How high up am I on the contributors list? Have I contributed to the next release yet? All right, I guess it's okay for me to say something -- as though it hurts the project to speak up about a community problem! And so I have a double-entry accounting system in my head for amount of code contributed and amount of abuse reported, and it's terrible and broken that I feel that that's necessary. The only qualification that any of us should need to be treated with humanity is that we are human.

I've been contributing to Rust for a long time. #rust is my community, and it's been my community since there were only twenty-odd people in it, instead of the three hundred-some we usually have today. I shouldn't have to worry about being attacked while standing on my own ground. And, indeed, right after it happened, a friend who worked on Rust with me last year privmsg'd me to say that it seemed to him that "something sacred had been violated".

Meanwhile, in the channel, my friends joked ruefully about "boobs or gtfo" being some sort of milestone -- that we hadn't been a real language before, and hooray, now we were. I laughed along with them. But I would like to make the wholly unoriginal and unrevolutionary suggestion that, if being a "real language" means that your longtime contributors get harassed by strangers in the official IRC channel for being female, then being a "real language" is a pretty fucking abysmal standard to aspire to. We'd all like to think that the Rust community is safe and welcoming to all, and for the most part, most of the time, I think we do okay -- or, at least, we do okay relative to the absymal standard that's been set. I almost feel sheepish complaining, since the occasional "boobs or gtfo" from some stranger on IRC is a laughably insignificant problem for our community to have. I still can barely believe it happened; it seems surreal. It's certainly not the #rust that I'm accustomed to. But if we want to keep it that way, or maybe even do better, then we'll need to work at it, and we'll certainly need to work harder now then we would have had to work two and a half years ago, and if we're very fortunate and Rust continues to grow in popularity, we'll have to work harder still.

And this is work that I believe is worth doing. As much as I like to talk about pattern matching and default methods and (recently) hygienic macros and everything else in Rust that I like and care about, there's no getting around the fact that the most important feature of any programming language is its community. At the Haskell Symposium two weeks ago, Bastiaan Heeren said, "The strength and weakness of Haskell is GHC and its community." This followed Ken Shan's program chair report in which he exhorted the Haskell community to do a better job of treating one another with humanity. Rust's community is the strength and weakness of Rust, too, and we have a tremendous opportunity to make it more of a strength and less of a weakness, if we want to.

October 06, 2013 11:53 PM

October 02, 2013

Tim Chevalier

TMI: Bringing rustpkg into the real world

Once #9654 lands (after ~ 24 hours in the queue, it finally started building only to fail because of an overly long line... and I thought I checked that every time I pushed a new version, but apparently not), we'll have completed the "build Servo with rustpkg" milestone. The next milestone is the first part of community adoption, which, roughly speaking, is the list of features we think rustpkg has to have in order for it to get widespread use in the Rust community (such as it is).

The first thing on that list, which I started working on today, is #6480: making locally-cached copies of source files that rustpkg automatically fetched from github (or, in the future, elsewhere) read-only. The intent behind this bug is to prevent a sad situation -- that really happened to a Go developer, and could happen again with a similar package manager -- where somebody makes local changes to packages that their projects depends on without realizing they're editing automatically-fetched files, and then loses those changes when they redistribute the code (without their local changes).

One solution is for rustpkg to make any source files that it automatically fetches read-only. Of course, anybody can just change the permissions, but that requires deliberate action and (with hope) will remind people that they might be making a poor life decision. That's easy enough to implement, and in fact, I implemented it today.

But I realized that this really goes along with another issue I filed, #9514. Servo depends on many packages that the Servo developers are also hacking on in tandem with hacking on Servo -- so if we started making everything read-only, that would make it hard to work with local (uncommitted) changes.

I think the right thing to do (and as per discussion in meetings, it sounds like it's the general consensus) is to make rustpkg do two things differently: (1) store fetched sources in a subdirectory of a workspace's build directory, not its src directly (it's less confusing if src is only things you created yourself); and (2) always search for sources under src before searching under build. That way, if you want to make local changes to a package that you refer to with an extern mod that points to a remote package, you can manually git clone the sources for it under a workspace's src directory, and then rustpkg will automatically find your locally-changed copy (which it will also never overwrite with newly fetched sources).

I looked at what Go does, and it looks like it does more or less what I want to do in rustpkg: "By default, get uses the network to check out missing packages but does not use it to look for updates to existing packages." So, the only case where rustpkg will pull a fresh copy of a repository off github is if there's no local copy of the sources, or installed version of the binaries for it, that can be found by searching the RUST_PATH.

comment count unavailable comments

October 02, 2013 06:13 AM

October 01, 2013

Tim Chevalier

TMI: Building C dependencies in custom build scripts

It's been a while since my last Rust blog post. The latest thing that's happened is that I just submitted a pull request that gives rustpkg packages that have custom build scripts the ability to have C dependencies (or, in fact, dependencies on any sort of artifact that you can write a Rust program to build). There's no magic here -- in fact, for the most part, it's just a matter of writing an example to show what needs to be done. Someday there will be a higher-level interface to rustpkg where you can just declare some source file names and rustpkg will do the rest, but not today. In the meantime, the build script explicitly sets up a workcache context and declares foreign source files as inputs, and so on.

The thing I did have to change in rustpkg itself was changing the install_pkg function (in the API that build scripts can call) so as to take a vector of extra dependencies. Currently, the types of dependencies supported are files (any source files, which get checked for freshness based on hashing their contents and metadata) and binaries (any files at all, which get checked for freshness based on their modification times).

Completing this bug means that (once it's reviewed and merged) I'll have completed the "build all of Servo" milestone for rustpkg, and on time, no less (albeit with less than 30 minutes remaining). Jack has been steadily working on actually building Servo with rustpkg, discovering many issues in the process, and once this pull request is merged, we should be able to do the rest.

comment count unavailable comments

October 01, 2013 06:43 AM

September 17, 2013

Tim Chevalier

TMI: The sweet taste of passing tests

I've got tests passing for my fix for #7879, making dependencies work more sensibly. I was proud of my voluminous status update for the past week, so it was a bit sad to only work on one bug today. It was a hard one, though, and I finished it! Aside from fixing tidy errors and such, anyway.

Amusingly/frustratingly enough, I did the real work over the weekend and the work I did today was almost completely centered around testability. I realized I can't use timestamps from stat() to ensure that a given file wasn't rebuilt. Someone else discovered this too: time granularity on Unix is only 1 second, so it's easy for two files that get created in quick succession, but sequentially, to have equal timestamps. Next idea was to compare hashes (which is what workcache already does!), but that won't do since if no dependencies have changed, but a file gets rebuilt anyway, it's likely to be bitwise identical. Jack suggested making the test case set the output to be read-only so that in case the second invocation of rustpkg tries to overwrite it, that would be an error. This required a little bit of fiddling due to details about conditions and tasks that I fought with before (see #9001), but the basic approach works.

So I can get a pull request in for this tomorrow, and happily, it's likely to fix several other bugs without additional work (or with minimal work): #7420, #9112, #8892, #9240, and I'm probably missing some.

comment count unavailable comments

September 17, 2013 02:34 AM

September 14, 2013

Joshua Cranmer

Why email is hard, part 1: architecture

Which is harder, writing an email client or writing a web browser? Several years ago, I would have guessed the latter. Having worked on an email client for several years, I am now more inclined to guess that email is harder, although I never really worked on a web browser, so perhaps it's just bias. Nevertheless, HTML comes with a specification that tells you how to parse crap that pretends to be HTML; email messages come with no such specification, which forces people working with email to guess based on other implementations and bug reports. To vent some of my frustration with working with email, I've decided to post some of my thoughts on what email did wrong and why it is so hard to work with. Since there is so much to talk about, instead of devoting one post to it, I'll make it an ongoing series with occasional updates (i.e., updates will come out when I feel like it, so don't bother asking).

First off, what do I mean by an email client? The capabilities of, say, Outlook versus Gaia Email versus Thunderbird are all wildly different, and history has afforded many changes in support. I'll consider anything that someone might want to put in an email client as fodder for discussion in this series (so NNTP, RSS, LDAP, CalDAV, and maybe even IM stuff might find discussions later). What I won't consider are things likely to be found in a third-party library, so SSL, HTML, low-level networking, etc., are all out of scope, although I may mention them where relevant in later posts. If one is trying to build a client from scratch, the bare minimum one needs to understand first is the basic message formatting, MIME (which governs attachments), SMTP (email delivery), and either POP or IMAP (email receipt). Unfortunately, each of these requires cross-referencing a dozen RFCs individually when you start considering optional or not-really-optional features.

The current email architecture we work with today doesn't have a unique name, although "Internet email" [1] or "SMTP-based email" are probably the most appropriate appellations. Since there is only one in use in modern times, there is no real need to refer to it by anything other than "email." The reason for the use of SMTP in lieu of any other major protocol to describe the architecture is because the heart of the system is motivated by the need to support SMTP, and because SMTP is how email is delivered across organizational boundaries, even if other protocols (such as LMTP) are used internally.

Some history of email, at least that lead up to SMTP, is in order. In the days of mainframes, mail generally only meant communicating between different users on the same machine, and so a bevy of incompatible systems started to arise. These incompatible systems grew to support connections with other computers as networking computers became possible. The ARPANET project brought with it an attempt to standardize mail transfer on ARPANET, separated into two types of documents: those that standardized message formats, and those that standardized the message transfer. These would eventually culminate in RFC 822 and RFC 821, respectively. SMTP was designed in the context of ARPANET, and it was originally intended primarily to standardize the messages transferred only on this network. As a result, it was never intended to become the standard for modern email.

The main competitor to SMTP-based email that is worth discussing is X.400. X.400 was at one time expected to be the eventual global email interconnect protocol, and interoperability between SMTP and X.400 was a major focus in the 1980s and 1990s. SMTP has a glaring flaw, to those who work with it, in that it is not so much designed as evolved to meet new needs as they came up. In contrast, X.400 was designed to account for a lot of issues that SMTP hadn't dealt with yet, and included arguably better functionality than SMTP. However, it turned out to be a colossal failure, although theories differ as to why. The most convincing to me boils down to X.400 being developed at a time of great flux in computing (the shift from mainframes to networked PCs) combined with a development process that was ill-suited to reacting quickly to these changes.

I mentioned earlier that SMTP eventually culminates in RFC 821. This is a slight lie, for one of the key pieces of the Internet, and a core of the modern email architecture, didn't exist. That is DNS, which is the closest thing the Internet has to X.500 (a global, searchable directory of everything). Without DNS, figuring out how to route mail via SMTP is a bit of a challenge (hence why SMTP allowed explicit source routing, deprecated post-DNS in RFC 2821). The documents which lay out how to use DNS to route are RFC 974, RFC 1035, and RFC 1123. So it's fair to say that RFC 1123 is really the point at which modern SMTP was developed.

But enough about history, and on to the real topic of this post. The most important artifact of SMTP-based architecture is that different protocols are used to send email from the ones used to read email. This is both a good thing and a bad thing. On the one hand, it's easier to experiment with different ways of accessing mailboxes, or only supporting limited functionality where such is desired. On the other, the need to agree on a standard format still keeps all the protocols more or less intertwined, and it makes some heavily-desired features extremely difficult to implement. For example, there is still, thirty years later, no feasible way to send a mail and save it to a "Sent" folder on your IMAP mailbox without submitting it twice [2].

The greatest flaws in the modern architecture, I think, lie in particular in a bevy of historical design mistakes which remain unmitigated to this day, in particular in the base message format and MIME. Changing these specifications is not out of the question, but the rate at which the changes become adopted is agonizingly slow, to the point that changing is generally impossible unless necessary. Sending outright binary messages was proposed as experimental in 1995, proposed as a standard in 2000, and still remains relatively unsupported: the BINARYMIME SMTP keyword only exists on one of my 4 SMTP servers. Sending non-ASCII text is potentially possible, but it is still not used in major email clients to my knowledge (searching for "8BITMIME" leads to the top results generally being "how do I turn this off?"). It will be interesting to see how email address internationalization is handled, since it's the first major overhaul to email since the introduction of MIME—the first major overhaul in 16 years. Intriguingly enough, the NNTP and Usenet communities have shown themselves to be more adept to change: sending 8-bit Usenet messages generally works, and yEnc would have been a worthwhile addition to MIME if its author had ever attempted to push it through. His decision not to (with the weak excuses he claimed) is emblematic of the resistance of the architecture to change, even in cases where such change would be pretty beneficial.

My biggest complaint with the email architecture isn't actually really a flaw in the strict sense of the term but rather a disagreement. The core motto of email could perhaps be summed up with "Be liberal in what you accept and conservative in what you send." Now, I come from a compilers background, and the basic standpoint in compilers is, if a user does something wrong, to scream at them for being a bloody idiot and to reject their code. Actually, there's a tendency to do that even if they do something technically correct but possibly unintentionally wrong. I understand why people dislike this kind of strict checking, but I personally consider it to be a feature, not a bug. My experience with attempting to implement MIME is that accepting what amounts to complete crap not only means that everyone has to worry about parsing the crap, but it actually ends up encouraging it. The attitude people get in bugs starts becoming "this is supported by <insert other client>, and your client is broken for not supporting it," even when pointed out that their message is in flagrant violation of the specification. As I understand it, HTML 5 has the luxury of specifying a massively complex parser that makes /dev/urandom in theory reliably parsed across different implementations, but there is no such similar document for the modern email message. But we still have to deal with the utter crap people claim is a valid email message. Just this past week, upon sifting through my spam folder, I found a header which is best described as =?UTF-8?Q? ISO-8859-1, non-ASCII text ?= (spaces included). The only way people are going to realize that their tools are producing this kind of crap is if their tools stop working altogether.

These two issues come together most spectacularly when RFC 2047 is involved. This is worth a blog post by itself, but the very low technically-not-but-effectively-mandatory limit on the header length (to aide people who read email without clients) means that encoded words need to be split up to fit on header lines. If you're not careful, you can end up splitting multibyte characters between different encoded words. This unfortunately occurs in practice. Properly handling it in my new parser required completely reflowing the design of the innermost parsing function and greatly increasing implementation complexity. I would estimate that this single attempt to "gracefully" handle wrong-but-of-obvious-intent scenario is worth 15% or so of the total complexity of decoding RFC 2047-encoded text.

There are other issues with modern email, of course, but all of the ones that I've collected so far are not flaws in the architecture as a whole but rather flaws of individual portions of the architecture, so I'll leave them for later posts.

[1] The capital 'I' in "Internet email" is important, as it's referring to the "Internet" in "Internet Standard" or "Internet Engineering Task Force." Thus, "Internet email" means "the email standards developed for the Internet/by the IETF" and not "email used on the internet."
[2] Yes, I know about BURL. It doesn't count. Look at who supports it: almost nobody.

September 14, 2013 10:50 PM

Tim Chevalier

TMI: Packages vs. crates

I landed four pull requests between middle-of-yesterday and now! See the schedule -- #6408, #8524, #7402, and #8672.

The remaining work that's necessary for building Servo for rustpkg is making dependencies between crates in the same package work right; building C libraries; and (added today) rustpkg test. I worked mostly on the first one today (but worked a little on making a test case for the second one).

A community member reported this bug about rustpkg going into an infinite loop when compiling a package with two crates in it, where one depends on the other. Digging in, I realized that various parts of rustpkg assume that a package contains either only one crate, or several crates that are all independent of each other. The key thing is that it was computing dependencies at the package level, not the crate level. So if you have a package P containing crates A and B, and B depends on A, rustpkg goes back and tries to compile all the crates in P first, rather than just compiling A.

So I'm working on that; it requires some refactoring, but nothing too complicated. I was thinking at first that I would have to topologically sort the dependencies, but after talking to Erick Tryzelaar on IRC, I realized that that approach (the declarative one) isn't necessary in the framework of workcache, where function calls express dependencies. The issue was just that I was treating packages, rather than crates, as the unit of recompilation.

This past week was the Servo work week, so today I also spent some time in person working with Jack to get more Servo packages to build with rustpkg. He found a bug in how extern mod works that I fixed, but only after he had to leave for the airport.

comment count unavailable comments

September 14, 2013 01:33 AM

September 13, 2013

Tim Chevalier

TMI: Busy day

Today I was able to finish the first three items on the "Build all of Servo" milestone on the schedule:
I'm working on putting build output in a target-specific subdirectory now, which is easy except for cleaning up all of my crufty test code that made assumptions in lots of different places about the directory structure. Along those lines, I'm also cleaning up some of the test code to make it less crufty.

I looked at #7879 (infinite loop compiling dependencies) too, and it's harder than I thought; right now, rustpkg assumes it's okay to build the crates in a multi-crate package in any order. That's not necessarily true (as the examples given in the original bug and in my comment show) and so rustpkg can end up telling workcache that there's a circular dependency between two files (when there's not). I think fixing this will require topologically sorting the crates in a package in dependency order, which in turn will actually require... writing a graph library in Rust, since there isn't one. So that will be fun, but I don't have time today.

comment count unavailable comments

September 13, 2013 01:42 AM

September 12, 2013

Tim Chevalier

TMI: Transitive dependencies

Today's pull request: making rustpkg find dependencies recursively. It wouldn't be much of a build system if it didn't do that, and it actually almost did without deliberate effort -- the missing piece was making the code that infers dependencies and searches for them actually search different workspaces other than the workspace of the package that has the dependencies. Other than that, it was just a matter of adding a test case, and then #8524 can close.

comment count unavailable comments

September 12, 2013 06:13 AM

September 10, 2013

Tim Chevalier

TMI: Finishing command-line arguments

I just submitted a pull request that implements most of the commonly-used rustc flags as flags for rustpkg. This took longer than it would have taken if I had taken the approach of just passing any "unknown" flag to rustc, but I think it's useful for rustpkg to do a bit of checking so as to tell you if you're passing nonsensical combinations of flag and command. For example, most rustc flags don't make sense unless you're either building or installing a package, so rustpkg errors out in those cases (for example, if you do rustpkg list --opt-level=3).

I didn't add a test case for every flag, and there's still some refactoring of boilerplate that could be done, but I prioritized getting this done on schedule over completeness.

comment count unavailable comments

September 10, 2013 11:52 PM

TMI: Command-line arguments

Now that workcache is in the queue, I've moved on to teaching rustpkg how to pass command-line flags to rustc. At first, I thought I was going to implement a syntax like:

rustpkg +RUSTC --linker myspeciallinker --emit-llvm -RUSTC build foo

(ripped off GHC's +RTS/-RTS options), where +RUSTC and -RUSTC delimit arguments that rustpkg should just pass straight to rustc. But then I realized that Rust's getopts library doesn't support this kind of flag. I didn't want to change it, and also realized that many flags are only valid when you're actually building or installing a package. So, I decided to enumerate the flags you can use in rustpkg (a subset of rustc's flags), as well as the subset of flags that work only with build, or with build and install.

So far, this approach is working fine; I'm writing a test for each flag, as well as a test that the same flag gets rejected when in the wrong context. I don't expect any surprises with this one.

comment count unavailable comments

September 10, 2013 12:22 AM

September 07, 2013

Tim Chevalier

TMI: workcache: finished!

Short post, since I stayed up late getting the workcache pull request ready. But: the workcache pull request is ready! Once this lands, I won't feel like I have to use finger air quotes anymore when saying that I'm working on a build system.

I thought I might be able to come up with something profound to say about it, but apparently not.

comment count unavailable comments

September 07, 2013 06:56 AM

September 06, 2013

Tim Chevalier

TMI: Concurrent conditions

Today, I finished the initial work to integrate rustpkg with the workcache library, so that it remembers dependencies and only recompiles sources when they or their dependencies have changed. I'll write about that in a later post, though; for now I want to talk about a bug, or infelicity, that I discovered today.

Conditions are Rust's answer to exceptions; they're a lot simpler and more predictable than exceptions, and the conditions tutorial is quite clear, so you should probably just go read that first.

I've tried to exercise conditions a fair bit in rustpkg, since they are a fairly new feature and it's good to be able to see how they work out. For testing purposes, I wrote one regression test that makes sure the right conditions get raised when you try to install a package that doesn't exist.

When I finished with workcache, that test failed. Which was weird at first, since avoiding recompilation shouldn't affect the behavior of a test that never compiles anything. Eventually, I realized the problem: conditions don't propagate to different tasks. Before adding workcache, rustpkg was totally single-task; but, when workcache does a unit-of-work-that-can-be-cached, it spawns a new task for it. That's great, because it means we get parallel compilation for free (at the crate level, anyway). But, it means that if the process of compiling one crate raises a condition, it will manifest as that entire task failing.

I'm not sure what to do about this in the long term, but for now I just changed that test case to use task::try, which catches task failure, and to test something weaker than it did before (that the task fails, rather than that it raises three particular conditions in sequence).

That solves my problem, but I'm not totally sure that this interaction between conditions and tasks was intentional, so I filed an issue. It seems to me like it's confusing to have to know whether a particular piece of code spawns multiple tasks in order to know whether the condition handlers you wrap around it will actually handle all the conditions that that code raises. That seems like an abstraction violation. On the other hand, maybe I'm misguided for thinking that conditions should be used for anything but signaling anomalous cases within a single task.

comment count unavailable comments

September 06, 2013 06:09 AM

September 04, 2013

Tim Chevalier

TMI: workcache progress

I'm continuing to make progress on workcache/rustpkg integration. Today I realized that I'd been doing things a bit wrong, and that it would be important for rustpkg to declare all the inputs (for a particular package) before starting to do actual work (building).

Let me back up a bit, though. workcache is a library for tracking cached function results. A function, in this case, is the process of generating build artifacts (libraries and/or executables) from inputs (source files). workcache knows about declared inputs, discovered inputs, and discovered outputs. In the case of rustpkg:


The code I'd written so far was interleaving declaring inputs with doing actual work (building, which produces discovered output). That won't do, because when you look up a cached function result, you're looking it up by the name of the function paired with the list of declared inputs. If you don't know all the inputs yet, you won't get the right function result, and you also won't save it in a way that lets it be found later.

So, I fixed that, and am putting it all back together now. It's slow going since workcache is not entirely documented (although it has more documentation than some Rust libraries do!) and I've had to add functionality to it.

comment count unavailable comments

September 04, 2013 01:12 AM

August 29, 2013

Tim Chevalier

TMI: Making rustpkg work

I know my daily posts are usually pretty low-level, so today I want to step back just a bit and explain a little bit of how rustpkg works. rustpkg is heavily modeled on the Go package manager.

rustpkg operates on workspaces. A workspace is a directory. It has several subdirectories: src, lib, bin, build. rustpkg creates the last three automatically if they don't exist.

If you are using rustpkg, you probably want to build and install Rust code that you wrote yourself. You probably also want to build and install code that other people wrote, living in version control repositories that are available online (for example, on Github).

When you check out a remote repository to build, you check it out into a workspace you already created. So if you want to use my library called "quux" (which doesn't really exist), you would do:

mkdir -p Workspace/src
cd Workspace/src
git clone http://github.com/catamorphism/quux
cd .. rustpkg install quux

This would make a directory called Workspace/src/quux. Here, quux is a package ID. A workspace can contain several different packages with different IDs.

rustpkg actually makes this a little bit easier for you. You can also do:

mkdir -p Workspace/src
cd Workspace
rustpkg install github.com/catamorphism/quux
n.b. Next paragraph edited for clarity on 8/29/2013

This would make a directory called Workspace/src/github.com/catamorphism/quux. Since the source of the package is a remote repository, here the package ID is a URL-fragment: github.com/catamorphism/quux. Supposing quux is a library, it would be installed in Workspace/.rust/lib/libquux-HASH-VERSION.dylib, where HASH is computed by rustc and VERSION is quux's version (as per the most recent git tag). You might assume that it would be installed in Workspace/lib/..., but since the source code was fetched remotely, rustpkg uses the first workspace in the Rust path as the destination workspace. By default, the first workspace in the Rust path is CWD/.rust,, where CWD is the current working directory. You can change the Rust path by changing the RUST_PATH environment variable, but I'll get to that.

You can use one workspace for many different projects. The idea is that the sources for each project live in its own subdirectory under Workspace/src (or whatever you want to call your workspace). Or, you can make one workspace per project. Then the src/ directory would contain one subdirectory for the sources for your project, and one subdirectory for the sources for each of its dependencies.

The motivation here is to avoid so-called "dependency hell" by compartmentalizing the dependencies for different projects, rather than storing all libraries in one system directory. So if project A needs version 1.2 of library Z, and project B needs version 3.6 of library Z, no problem! The downside is potential multiple copies of libraries; we'll see how that works out in practice.

You do have a choice if you want to avoid multiple copies: you can use the RUST_PATH. RUST_PATH is an environment variable that rustpkg knows about; it's a colon-separated list of workspaces. Like:

declare -x RUST_PATH=/home/tjc/rust:/home/tjc/servo:/home/tjc/miscellaneous-useful-libraries

(if you use bash, anyway.)

It's assumed that if you add a directory to the RUST_PATH, that directory is a workspace: that is, it has a src/ directory with one or more project-specific subdirectories. If it doesn't have a src/ directory, rustpkg won't find any source files -- or installed libraries -- there.

After I implemented all of that, Jack began trying to port some of the Servo submodules to rustpkg. He ran into the issue that rustpkg always installed the build artifacts into the same workspace as the sources, assuming the sources are in a workspace. He wanted to be able to have the source in one workspace, and tell rustpkg to put the build artifacts into a different workspace. As he pointed out, RUST_PATH represents both "where to find binaries" and "where to find sources". This was by design, but that doesn't mean it's necessarily the best design.

Jack wanted to be able to support the Servo submodule scenario without too much restructing. Once ported to rustpkg, the Servo repository will look something like:

servo/src/bin.rs
servo/deps/src/foo/lib.rs
...

bin.rs is the main crate module for the Servo executable. foo is an example submodule of Servo; lib.rs is its crate module. There are many other submodules under servo/deps/src. We want to be able to build foo with rustpkg and put its build output in servo/build and servo/lib, so that they can be found when building Servo with rustpkg.

At this point, I said: why not just put the deps directory under servo/src? Jack's answer is that the Servo repository has many subprojects, so the developers want to organize them somewhat. And if the deps directory has subdirectories, the subdirectory names become part of the package ID. Since these subprojects all have their own git repositories, it starts looking a little strange for the local directory structure to affect how you refer to the package.

Then we considered whether or not to just use the extern mod foo = ... mechanism for everything. That is, don't manually store the subprojects locally; rather, write a directive in the Servo crate like extern mod foo = "github.com/mozilla/foo"; that tells rustpkg "fetch the sources in the repository named mozilla/foo on github.com, cache a local copy, and build them". That way it's up to rustpkg how to organize the local files, and the developers won't have to think about it. However, this solution does not work well if the Servo developers often make local uncommitted changes to the libraries they depend on, concurrently with developing Servo. In this scenario, rustpkg will use the revision it found in the remote repository, and not the locally modified version.

The solution we came up with is to change what's allowed to appear in the RUST_PATH. So when building Servo, if somebody wanted to build a particular submodule -- rust-geom -- first, their RUST_PATH would look like:

RUST_PATH=$HOME/servo:$HOME/servo/src/support/geom/rust-geom

and they would type rustpkg build rust-geom to install rust-geom into the $HOME/servo workspace.

I have a pull request in the queue to implement this change. I wasn't sure whether we wanted it to be transitional -- until, perhaps, the Servo repository gets restructured to make it conform to rustpkg's view of the world -- or permanent, so I put it behind a command-line flag, --rust-path-hack. Perhaps instead, rustpkg should conform to Servo's view of the world, since we are motivated by wanting to build Servo and rustc with rustpkg.

The main advantage of allowing the RUST_PATH to only contain workspaces seems to be simplicity. But for a project with many submodules that may be organized in an arbitrarily complicated way, perhaps making the RUST_PATH more flexible makes it simpler to build packages.

comment count unavailable comments

August 29, 2013 05:46 AM

August 28, 2013

Tim Chevalier

TMI: What I learned about git today

Today, I learned something unpleasant about git. Specifically, that if you:

git clone somegarbage a/b/c
where somegarbage isn't a valid repo, it will happily create a/b anyway, and leave them sitting there, empty.

In the context of rustpkg, this was bad because it resulted in a src directory getting created, which resulted in a non-workspace directory getting assumed to be a workspace. Very confusing. There went several hours.

The right thing to do when we try to git clone a package, I think (Eridius on IRC suggested this first, and I initially rejected it) is to make a temporary directory, clone into that, and if the clone is successful, copy it into the actual destination.

Easy enough, right? No, because we don't have a "rename directory" function in the standard library yet; there's one in std::libc, but it should have a safe wrapper. Rather than taking that on today, though, I'm going home.

This was all in the context of implementing the RUST_PATH hack that I talked about yesterday. Jack says that when this patch lands, he'll be able to port 11 Servo packages to build with rustpkg!
Also, in the team meeting today, we discussed my new proposed schedule for rustpkg, and made some adjustments. I'm glad I mostly had the right idea about which bugs were high-priority.

comment count unavailable comments

August 28, 2013 03:30 AM

August 27, 2013

Tim Chevalier

TMI: I did all these things today!

I started off the day by blowing away my build directory in one of my workspaces because last Friday, I was getting a mysterious compiletest crash, sometimes complaining about "too many open files". Building from scratch did not make the crash go away, but asking Brian did; he said there would be a pull request fixing the problem soon, but in the meantime, I could run make check with RUST_THREADS=2, and that worked.

I landed #8712, which fixes #7241, which is making the scenario where copies of the same package exist in multiple workspaces work right. Normally, rustc errors out when it finds multiple crates that match a particular identifier. I changed filesearch so it treats searching directories in the RUST_PATH specially, and breaks out of the loop if it finds a matching crate in one of the RUST_PATH directories.

In the queue is #8773, which implements the --version flag for rustpkg. It's a little thing, but it's nice to have that work right, and it turned out to be just a matter of calling the right function in rustc-as-a-library.

I'm also progressing on #7075, using workcache. Last week I had the problem that I wanted to use borrowed pointers inside an @-function, which is verboten in Rust. I realized that the only reason I needed an @-function was that I was calling a function in libsyntax called each_view_item that I wrote, which uses the SimpleVisitor trait. Long story short, SimpleVisitor uses @-functions (probably because no one got around to rewriting that code), but Visitor is more general and takes &-functions (more or less any sort of closure). So I rewrote the code to use a Visitor, which also made it shorter, since Visitor is fully default-method-ized. Happiness! Now at least my code compiles, though the behavior isn't right yet.

I also finished a temporary change that Jack requested to make it easier to port Servo submodules to rustpkg; I changed the way that RUST_PATH works so you can list a package directory (that is, a subdirectory of a workspace) and not just a workspace in the RUST_PATH, and rustpkg will find source files there but install to a workspace. In addition, it doesn't behave this way unless you pass in the --rust-path-hack flag. I had Jack try it out, and he pointed out that one scenario worked, but one other scenario -- where the current working directory is not a workspace -- didn't. I guess it would be more consistent to allow the CWD to be a non-workspace when the hack is enabled, so I'll change that.

Finally, I did some work on #6408 (allowing package IDs to point into the middle of a repository), finishing the test case and figuring out where in the code I needed to start extending things. First of all, I can see that I need to fix find_and_install_dependencies, the function that scans through extern mods and infers dependencies, so that if it doesn't find an already-compiled library, it doesn't assume the sources for the library being imported live in the same workspace as the importer. I can't believe I didn't already have a test that would have found that, but that's how it goes.

Also, if you would like to read a much more pedagogical post about Rust that doesn't assume all kinds of obscure knowledge, you should read Lindsey Kuper's post on default methods!

comment count unavailable comments

August 27, 2013 12:35 AM

August 22, 2013

Tim Chevalier

TMI: Miscellaneous

Today: meetings, trying my best to get a miscellaneous pull request (collecting together small changes, including enabling some more rustpkg tests) to go through, working on workcache some. I changed the freshness function for files so that just changing the datestamp, even without changing the contents, counts as changing the file. But, I realized that I'm not declaring any discovered dependencies as inputs to workcache; only the files in the main package we're building. That won't do. Fixing it tomorrow.

comment count unavailable comments

August 22, 2013 02:10 AM

August 17, 2013

Tim Chevalier

TMI: Workcache!

I got rustpkg to use workcache to enough of a degree that one of the existing tests fails because it isn't rebuilding a file when it's not necessary to :-) The old test used "touch" to make a dependency look as if it had changed, then ran rustpkg and checked datestamps. Since workcache works based on a hash of the file contents, the test fails even though it works correctly.

So that's exciting! Despite Graydon explaining it to me, I hadn't quite grokked how workcache works till now. It's a quite neat idea: most build systems (including) make are declarative. You write a whole other program to explain how to build your program (albeit in a very weird programming language, usually). Workcache, and a system called fbuild that it's based on, are imperative. You express your dependencies in your usual programming language, specifying how to determine if a particular target needs to be rebuilt. That made more sense in my head than it did when I tried to write it down just now.

In any case, the missing piece that I filled in today was the freshness map, which is part of the state that you have to pass around for workcache to work. The freshness map is a hash table that maps "kinds" (just strings) onto functions. The "kinds" in this case are "source file", "executable or library file", "URL", and maybe a few other things. The function takes a name (for example, a file path) and a value (for example, a hash of the file's contents) and returns a boolean indicating whether the given name and hash are up-to-date. The "given" name and hash come from the workcache database (which is persistent) and the freshness function is the code that (in this case) reads in the contents of the file, hashes it, and compares the two hashes for equality.

Most of the time that took was understanding the code well enough to write the previous paragraph; once I understood that, the code was quite simple.

Next step is to change the test cases to reflect that rustpkg won't just rebuild if the timestamp changes[*], and make sure they change the contents of the file if they want to test behavior-if-a-file-changes. But I feel like things are moving now.

[*] Or maybe it should, I'm not sure, but in that case the freshness function would have to be changed so that the hash is a hash of both the file's contents and its metadata.

comment count unavailable comments

August 17, 2013 03:50 AM

August 16, 2013

Tim Chevalier

TMI: While my try build runs...

What I meant to do today: work on workcache.

What I did today: un-ignore most of the rustpkg tests -- except for the ones that depend on features that aren't implemented yet (workcache, and the do, test, and info commands). A bunch of the tests were xfailed because the tests weren't playing nicely with cross-compilation. That's fixed now, so they should work. But I spent most of today getting one test to work, which was one of the tests for custom build logic. The hard part had nothing to do with custom build logic, but rather to do with library names that contain dashes. Previously if you called your library foo-bar, rustpkg would generate a library called libfoo_bar-xxxxx-0.1.dylib (on Mac, anyway), where xxxxx is a hash. Now that we separate the name used as an identifier in Rust code from the actual package ID, we can call the library libfoo-bar-xxxxx-0.1.dylib, but since '-' also separates the name, hash, and version, the old parsing code didn't work anymore. Now it does! (I'm not completely sure that we should be naming libraries this way anyway, but hey, everything works now, or should.)

Running a try build now to make sure everything works on the bots where the host and target are potentially different, and then I'll submit a pull request. And, uh, tomorrow I'll work on workcache.

comment count unavailable comments

August 16, 2013 01:42 AM

August 15, 2013

Tim Chevalier

TMI: rustpkg status

Today, I looked at the metabug for rustpkg and noticed that all of the items marked as "necessary to call rustpkg 'ready for use'" are finished. I added one more that was overlooked, which is workcache support, and after discussion with Graydon, added several more issues. Still, most of those should be pretty easy to fix, so we're very close to the point where we'll want to say to the community "hey, start using this and test!" Some people, of course, already have, such as cmr with rustdoc_ng. Which I'm currently trying to build on my system with rustpkg but am making rustc segfault. Must be Wednesday.

comment count unavailable comments

August 15, 2013 01:28 AM

August 14, 2013

Tim Chevalier

TMI: A day of debugging-by-println

I really wanted to isolate that OS X spawning bug from yesterday, and today I filed #8498. It turns out the problem isn't about spawning at all, but rather, with Options that contain vector slices. The ProcessOptions struct in std::run, which represents data needed to spawn a process, has a field with type Option<&[(~str, ~str)]>: that is, it takes an option of a vector of pairs of strings representing name/value pairs for the environment. That's all fine, but the & makes it a borrowed vector, and there is apparently some bug where memory gets corrupted if you construct one and then match on it. I replaced & with ~ (which allocates more memory, since it means the process will own its environment vector, meaning that callers may have to make a copy) and that fixed the problem. Once tests finish, I'll check that in as a temporary workaround.

While browsing not-recently-updated bugs, I found #6702, an unclear-error-message bug in resolve, which was an easy fix.

The option/slice bug, which came up while working on rustpkg, took up some time, but tomorrow I really need to get back to rustpkg itself, hopefully finishing the workcache integration over the next few days.

comment count unavailable comments

August 14, 2013 01:21 AM

August 13, 2013

Tim Chevalier

TMI: Multithreaded Monday

Today: Bug triage! Leading to fixing this bug related to derived type errors, one of the remaining corner cases of something I worked on a while ago. Well, actually that was as a result of a question someone asked on IRC. Close enough! I also improved a type error message to do with static methods, allowing us to close one of the older bugs.

While working on rustpkg, I discovered something was weird with OS X process spawning. For rustpkg I was able to work around it, but I thought I should isolate the test case anyway. The conclusion, so far, is that something is really weird with OS X process spawning.

And at some point tomorrow, I have to finish some paper reviews, documentation for my recent rustpkg changes, preparation for interviewing a candidate on Wednesday, my weekly status report, and checking if any pull requests I could review are in the queue. All of which I meant to do today. Good times.

comment count unavailable comments

August 13, 2013 03:54 AM

August 10, 2013

Tim Chevalier

TMI: 14 green boxes means it's time to go home!

The extern-mod patch finally landed! The last bug: it turns out that something on Mac OS X wasn't right in our run code (that shells out to create a new process), and if you supplied environment variables, it was extending the environment and not overwriting it. On Linux, the code was correct, which meant shell commands failed because among other things, the PATH was getting set to "". I changed my test code that was calling the run code to explicitly fetch and extend the existing environment, and that was the last fix.

The next big thing to finish will be using workcache. I'm realizing that the workcache library itself needs some work before rustpkg can use it. The first thing to do was to change it so that it actually saves and reloads its database to disk, so that you can see changes in between runs of rustpkg, and not just within the same run. I did that, and now it's complaining about something missing from the file when it reads it back in, which is at least progress.

When running up-to-date master on my local machine, I was noticing a segfault while building the docs that obviously didn't happen on the bots, but it seems to be due to the new scheduler and the handling of stack overflows (I git-bisected and saw that the first bad commit was the "Turn on the new runtime" commit). There's a patch in the queue to increase the stack size, so I'm going to go home and hope it'll magically be fixed in the morning.

comment count unavailable comments

August 10, 2013 01:40 AM

August 07, 2013

Tim Chevalier

TMI: Quick update

The extern-mod patch of doom is blocked on... std::path tests? I moved a function from rustc into the standard library (get_relative_to), and moved its test cases as well. This caused the tests to fail. I don't know why.

The workcache branch is also blocked, because I don't understand the rules about what you can and can't capture in a ~ closure. I will probably need to consult with someone tomorrow.

In the background, doing some AST refactoring that is proving harder than I thought (as it always does).

comment count unavailable comments

August 07, 2013 02:06 AM

August 01, 2013

Tim Chevalier

TMI: Done! (until code review, anyway...)

Finished the big pull request for extending what extern mod can name, and other stuff! That was... way more work than I expected, even when taking into account Hofstadter's Law.

I'd say more, but I want to get out of the office. Next priority is workcache (you know, so rustpkg can actually be a non-ridiculous replacement for make...); I should also be able to get #6409 out of the way in the foreseeable future.

No wait, one more thing! The very next thing I do will to be to get rid of the local_path / remote_path hack. See, right now, in rustpkg, package IDs look something like:

struct PackageId {
    short_name: ~str,
    local_path: LocalPath,
    remote_path: RemotePath,
    version: Version
}

LocalPath and RemotePath are both just newtype'd Paths (where Path is a path on the filesystem -- basically a newtype'd string with some extra operations):

struct LocalPath (Path);
struct RemotePath (Path);

which in turn is special syntax Rust has for something like:

struct LocalPath { localpath: Path }

In any case, I implemented this because a lot of package names have dashes (hyphens) in them, but a Rust identifier can't contain a hyphen. Before #8176, because of extern mod directives, a package name had to be a legal Rust identifier. I implemented a hack in rustpkg that turns '-'s into '_'s, but, so it can still find the original directory on disk, keeps the original name with '-'s around as the RemotePath. Now that extern mods can have a path attached, though, I can just get rid of this special case. Hurrah!

comment count unavailable comments

August 01, 2013 02:37 AM

July 31, 2013

Tim Chevalier

TMI: Crawling towards doneness

Still "almost" ready to submit the pull request of doom that implements extern mod foo = something/interesting;. I got all tests passing, then (again, against my better judgment) started cleaning up some stuff, and totally broke filesearch in the process (changing it back to walk directories non-recursively instead of recursively, because I realized installing a library from a package with ID a/b/c as somedir/lib/a/b/libc-xxxxxx-x.x.dylib would be a mistake).

Not very interesting today, hopefully tomorrow will be more interesting.

comment count unavailable comments

July 31, 2013 01:55 AM

July 30, 2013

Tim Chevalier

TMI: So close, and yet...

The pull request I landed on Friday, to make more kinds of refspecs work, landed.

I got the whole test suite to pass with my branch-that-took-forever that extends the extern mod syntax! But as I often do, I decided I should address the many NOTEs[*] I'd added to the code before submitting a pull request. There's always the tension between getting it right and getting it in the codebase, but this time I felt like there were too many embarrassing things, tests passing or not.

One of the things is that I changed the PkgId constructor (shockingly, that's for the type that represents package IDs) to take the current directory as an argument, in case you give it something like foo/bar/baz and that turns out to be a git repository and you want to get the version. Well, that was super silly of me, because as with everything in rustpkg, we want to search the RUST_PATH rather than searching in the current directory in some ad hoc way. So I fixed that (I always enjoy getting paid to undo my own work) and now all but one of the rustpkg tests work, but I'll leave that for tomorrow so I can have a victory early in the day.

comment count unavailable comments

July 30, 2013 01:54 AM

July 27, 2013

Tim Chevalier

TMI: Feminist Friday

Today I headed to the San Francisco office for a day of co-working in the community space there, with Lukas, Liz, and some folks from the community. It was fun and I want to go again!

On the train on the way there, I finally got my extern-mod test case to compile successfully. Among other things, it turns out the Mac OS X dynamic linker isn't very happy when you rename a .dylib file. So now, rustpkg is doing what it should have been doing and preserving the long name (libwhatever-hash-version.dylib) in the installed library, instead of building build/packagename/libwhatever-hash-version.dylib to lib/libwhatever-version.dylib). I'm not sure why I was ever doing it that way, except my own ignorance!

One thing that I still need to do is make rustc print out an error message that suggests using rustpkg to download the package, if the user writes something like extern mod github.com/catamorphism/whatever and there's no locally cached copy. rustc isn't responsible for downloading anything off the network, git is, but currently it'll just treat it like any other path that doesn't exist.

I also rebased my branch to use workcache for avoiding recompilation, now that Graydon's pull request to make the workcache context sendable landed. I'm sure it's not going to work yet, since I just did the minimum necessary to invoke the workcache and make it typecheck. Still waiting for the build there.

Finally, I got my pull request for handling non-numeric tags into the queue. Graydon pointed out I shouldn't assume a given refspec is meant to be a tag, and that git can do a pretty good job of working out whether it's a tag, revision, or so on. So I fixed that, and now it's just waiting for bors to get to it.

comment count unavailable comments

July 27, 2013 02:57 AM

July 25, 2013

Tim Chevalier

TMI: Tag!

Submitted a pull request in the waning hours of the day, which wraps up support for tags. rustpkg now knows about tags that aren't versions. I refactored some of the git-fetching code while I was at it, and changed it so that the build output for foo goes in build/foo and not directly in build (this is more consistent, since rustpkg was already doing that for packages with compound names).

I worked more on extended extern mod names, but didn't finish that. Hacking on metadata::{filesearch, loader}: always a good time.

I also found this delightful bug to do with calling run process functions and specifying an environment, though I'm still not convinced it's not due to something weird happening on my machine. So if someone wanted to try reproducing it, that would be awesome.

comment count unavailable comments

July 25, 2013 01:52 AM

July 24, 2013

Tim Chevalier

TMI: tests passing is a good feeling, though...

I finally fixed all the fallout from changing rustpkg to use the target rustpkg instead of the host rustpkg in the tests (if you understood that sentence, congratulations). So tomorrow, I can actually get back to making rustc itself understand extern mod a = b/c/d/;.

In the background, worked a little more on making rustpkg understand tags that aren't version numbers. After talking to Graydon, we won't support branches after all; just tags. If a tag doesn't look like a version number, we'll match on it exactly rather than treating it as a number that can be compared with other versions.

comment count unavailable comments

July 24, 2013 01:45 AM

July 23, 2013

Tim Chevalier

TMI: More of the same!

I fixed the dependency problem, which was trivial to fix:

diff --git a/mk/tests.mk b/mk/tests.mk
index 770e728..c3659e7 100644
--- a/mk/tests.mk
+++ b/mk/tests.mk
@@ -345,7 +345,8 @@ $(3)/stage$(1)/test/rustctest-$(2)$$(X_$(2)):                                       \
 
 $(3)/stage$(1)/test/rustpkgtest-$(2)$$(X_$(2)):                                        \
                $$(RUSTPKG_LIB) $$(RUSTPKG_INPUTS)              \
-               $$(TLIB$(1)_T_$(2)_H_$(3))/$$(CFG_LIBRUSTC_$(2))
+               $$(TLIB$(1)_T_$(2)_H_$(3))/$$(CFG_LIBRUSTC_$(2)) \
+               $$(TBIN$(1)_T_$(2)_H_$(3))/rustpkg$$(X_$(2))
        @$$(call E, compile_and_link: $$@)

See, totally trivial! Um. Workcache changes haven't landed yet, so I didn't work on that. Rather, I spent the rest of the day futzing with directories, again, because I haven't done enough of that, as part of making-general-extern-mods work. Admittedly some of that had to do with the above change -- making the test suite able to run rustpkg from the target directory instead of the host directly. (Cross-compilation thing. Makes perfect sense. Yeah.)

Another recompile is running, but it's time to go home.

comment count unavailable comments

July 23, 2013 01:30 AM

July 20, 2013

Tim Chevalier

TMI: link names and hidden dependencies

Distressingly, Graydon pointed out that rustpkg tests were sporadically failing on the bots, and I realized that it was because some of the tests shell out to call the rustpkg executable -- but the Makefiles don't know about that dependency. So, the rustpkg executable doesn't always exist by the time the tests run. I better fix that.

Mostly I worked on #6407 some more. This is harder than it looks; among other things, I have to change what's expected in the link metadata so that a link name can be a qualified path. That's because if you're referring to things like extern mod foo = github.com/whatever/foo, the metadata loader expects the link name in the metadata in the compiled library to be github.com/whatever/foo. So I have to change the driver. And stuff. This necessitated some refactoring in rustpkg with respect to what paths get passed around and how it figures out where to put output files, which is totally unremarkable except in its annoyingness.

comment count unavailable comments

July 20, 2013 02:32 AM

July 16, 2013

Tim Chevalier

TMI: Tiny steps

I started trying to make rustpkg use workcache, but ran into a problem. So, workcache is a library for caching the results of functions. This is obviously a useful thing if you're trying to avoid recompiling outputs whose inputs haven't changed. To make this work, you pass the closure whose result you want to cache into a special wrapper function, called exec. Since if you really want to avoid re-doing work, you need to save the dependencies you've discovered persistently, exec also takes a "context" argument that includes a pointer to a database, among other things (which is just a file on disk). Also, exec takes a ~ closure, meaning that the closure's environment only has sendable things. You can think of sendable as meaning "no @-boxes", because @-boxed things are only task-local. But if the closure wants to invoke itself recursively, it needs to capture the workcache context inside it. And the workcache context isn't sendable; it contains @-boxes.

rustpkg discovers dependencies as it goes along, doing parsing, so you could start installing one package A, find another package B that needs to be installed, go ahead and install B, then go back to building A. So the natural way to rewrite this code using workcache involves passing around the workcache database to recursive calls. But that's not allowed, since it's not sendable and since the workcache wrapper requires a sendable closure.

Graydon agreed it would be better to rewrite the workcache code so that the context is sendable, so now I'm blocked on him doing that.

In the meantime, though, I also figured out the duplicate package problem from yesterday. It was really silly; I was adding in both the .rust subdirectory of the home directory and of the current working directory -- and all ancestors of the CWD. But what if the home directory is an ancestor of the current working directory (as one would normally expect it to be)? sadtrombone.wav

And in the meantime, finishing extern mod is blocked on fixing the duplicate-package bug, which is easier now that I've found it.

comment count unavailable comments

July 16, 2013 11:55 PM

TMI: Caching, duplicate packages, and refactoring

Since my last update, I've landed #7681, which adds support to rustpkg for local git repositories and changes the test suite to use them (getting rid of intermittent failures due to network breakage), and #7749, to implement the list and uninstall commands.

Now I'm working on a few different things at once:



comment count unavailable comments

July 16, 2013 01:03 AM

July 10, 2013

Tim Chevalier

TMI: Think globally, build locally

Submitted a pull request today! It's been a while. This one eliminates the dependency some of the rustpkg tests had on remote URLs, specifically github. Since Github tends to go down a lot, that was causing spurious test failures.

To fix it, I made rustpkg handle fetching from a local git repository, and changed the tests to test that instead. If you supply rustpkg with a package ID referring to a github repo that's outside the RUST_PATH, it will clone that into a workspace in the RUST_PATH and build from there.

I can now go home feeling a bit more accomplished.

comment count unavailable comments

July 10, 2013 01:36 AM

July 09, 2013

Tim Chevalier

TMI: local git repositories

Right now, I'm solely working on getting rid of the sporadic github connectivity errors in the rustpkg test suite. That means changing the tests to create a local github repo, and then changing the behavior of rustpkg build to (when given a local repo) clone it into the first workspace in RUST_PATH, then build that copy.

It's finicky, just getting all this automation of command-line stuff to work and making sure I'm not getting confused about which directories are which. Also, still emerging from the fog resulting from the four-day weekend.

comment count unavailable comments

July 09, 2013 03:01 AM

July 04, 2013

Tim Chevalier

TMI: Oops

After talking to Graydon, I determined that the work I've been doing on list and uninstall is misguided; there shouldn't be a database file, since it can get out of sync with the filesystem. Rather, list should enumerate the installed packages in the workspaces in the RUST_PATH, and uninstall should just delete the installed files.

In the meantime, the rustpkg build tests were failing because of sporadic github connectivity errors that were causing unrelated tests to fail, so I have to remove the tests that contact github. This is kind of annoying, since I'd like to have tests for URL-like package IDs.

Four-day weekend starting tomorrow! ...though I'll probably spend a lot of it working, since I'm behind.

comment count unavailable comments

July 04, 2013 12:13 AM

July 02, 2013

Tim Chevalier

TMI: Still with the package database

Hacked on the package database more. rustpkg list works! Still with the totally simple text-file scheme I mentioned yesterday, of course.

Now I'm implementing the uninstall command while making sure that uninstalling deletes the package from the package database. uninstall is also mildly interesting because the database stores the exact version (filling in 0.1 as the default), but I don't want you to have to specify an exact version when uninstalling; so rustpkg uninstall foo should delete foo-0.5 if there's exactly one version of foo, 0.5, installed. But then, what if there were multiple versions of foo? Should this delete all versions, the most recently-installed one, or error out asking you to be more specific? I'm not sure yet.

I'm still anxiously waiting for bors to merge the pull request I submitted yesterday, for rustpkg build with no arguments; the build infrastructure has been a bit finicky lately.

comment count unavailable comments

July 02, 2013 11:42 PM

TMI: Another month, another office

Today was my first day back in Mountain View, from which I've been gone since mid-March. Everything is pretty much the same. Except the new people. And there are gummy bears in the kitchens now.

I re-submitted #7419, my pull request to make rustpkg build with no argument work, after some frustrating fumbling around trying to figure out a problem with a test case hanging that turned out to be due to a bad merge that I did. I accidentally un-ignored a test I didn't mean to enable; oops. But with any luck, this change will make it into the impending 0.7 release!

I also worked on getting the package database to work. I decided to start small, by creating the package database either in ~/.rust or (if there's a RUST_PATH) the first directory in the path. The package database, for now, is just a text file that contains a list of package IDs separated by newlines. Simple as can be. Of course, we'll want to store more information about each package, and probably use JSON or something, but I just want to get something working first. I took the "just write the test!" advice to heart -- somehow writing a test is almost always more manageable than writing the non-test code -- and wrote a test that installs a package, then calls rustpkg list and checks that the package ID is in the output.

comment count unavailable comments

July 02, 2013 02:19 AM

June 28, 2013

Tim Chevalier

TMI: Miscellaneous cleanup

Spent most of today trying to figure out why tests failed on my RUST_PATH pull request. I eventually narrowed it down: I was using "" as the default when the RUST_PATH environment variable wasn't set, and using an iterator to parse the result (which is a colon-separated list of strings). When given the empty string, the iterator returns [""] rather than (as I'd expect) []. I wonder if that's actually a bug in the library. In any case, it was easy to work around once I saw what the problem was.

Also figured out why the non-rustpkg test cases I was trying to add yesterday were failing, or at least: the bug from yesterday went away when I rebased, but a new bug happened. That one turned out to be due to recent changes to terminfo that affect how warnings get rendered. The totally-obvious fix was to set my TERM to xterm (oh, emacs shell, why must you always be problematic).

It doesn't feel like a very productive day, but hey, I have four pull requests in the queue. Tomorrow, working on the package database, I guess?!

comment count unavailable comments

June 28, 2013 01:27 AM

June 27, 2013

Tim Chevalier

TMI: rustpkg and searching the current directory

Today, implemented #6405; in general, rustpkg commands should do something without an explicit package ID given on the command line. If you just say rustpkg build (or any other command), rustpkg should default to treating the current directory as a subdirectory of the source directory in a workspace, and building the package in the current directory. I submitted a pull request to fix that.

I also went through the 0.7 bugs and added test cases for the ones that had been mysteriously-fixed, but my branch where I'm adding the test cases fails in tests that don't have anything to do with anything I changed (and anyway, I only changed other test cases) so... for now, I'm stumped on that. Is anyone else seeing the rusttest unit tests for the rust command-line tool ICE?)

In the meantime, bors is slow and even my pull requests from yesterday haven't built yet :-(

comment count unavailable comments

June 27, 2013 01:01 AM

June 26, 2013

Tim Chevalier

TMI: RUST_PATH and package scripts

Today I finished two pull requests! The first one implements searching for packages using directories given by the RUST_PATH environment variable. The second one makes package scripts more useful by allowing them to call the default build and install logic after doing whatever custom steps are necessary.

Both of these pull requests are incomplete -- there are parts of the original bugs they address that aren't implemented. I didn't implement using RUST_PATH to determine the target directory for installs, and I also didn't teach the rustc driver itself about RUST_PATH. I opened up separate bugs for those (#7402 and #7398).

Likewise, for the package-script patch, I assumed in rustpkg that package scripts only can implement a custom hook for the install command, not for any other commands. I opened up #7401 for that.

I'm not really sure, from a bug-tracker-philosophy point of view, whether it's better to submit several pull requests over time for the same bug, or close a given bug after I've done a significant amount of work on it and open up newer, more specific bugs. Something just feels better about having older bugs closed.

Alex Crichton did some work on making sure rustc uses LLVM in a thread-safe way, which will allow us to actually re-enable the rustpkg tests (!) So I'm looking forward to that being finished too -- there's a pull request on that pending, and then I think all that's left is manually wiping the LLVM build directories on all the bots so that they rebuild LLVM.

comment count unavailable comments

June 26, 2013 05:40 AM

June 25, 2013

Tim Chevalier

TMI: rustpkg (what else?)

After a fun week of conferencing, I spent today working on implementing RUST_PATH and invoking the default build logic from package scripts (the latter is kind of necessary to make package scripts useful). The second one works now, and it's just a matter of automating the test case. The first one, well, it's not done yet and also the test case deadlocks mysteriously, so I have to sort that out.

I'm hacking from the LA area this week, and my first day back in Mountain View is a week from today.

comment count unavailable comments

June 25, 2013 05:32 AM

June 21, 2013

Tim Chevalier

TMI: Day 3 of Open Source Bridge, and a little bit of rustpkg

Open Source Bridge was the first open source conference I've attended, and it left me with high expectations for any other open source conferences I might attend! While sadly I don't think all other conferences are going to have primarily vegan food and stickers about intersectional feminism, I've been glad to be able to connect with people like me. That is to say: people who do open source software and care about equality and liberation from oppression. Most of the time I feel like I either have to hide the aspect of me that gets engrossed in puzzles and problem-solving, or else hide my queer, trans, dominant-paradigm-questioning self. This conference was the few places I've ever been where being both doesn't just seem like something that's vaguely tolerated, but rather, the norm.

...and that's the case even though the only session I attended in its entirety today was Ashe Dryden's keynote on diversity and open source culture :-D Listening to Ashe speak, it was hard to imagine how anybody could deny that the homogeneity of open-source participants is a real problem or that solving it is worth the effort.

While sitting around in the hacker lounge with the other Geek Feminism folks, I started working on the RUST_PATH environment variable. One problem we ran into was that the rustpkg tests shell out to run rustpkg and rustc, and because all the Rust tests get run in parallel, and there's an issue right now with rustc not using llvm thread-safely, stuff goes wrong on some platforms. Since the tests I'll have to add for RUST_PATH will modify the environment, that issue will also block me from checking in this work.

But for now, I'm off to the diversity birds-of-a-feather session.

comment count unavailable comments

June 21, 2013 01:27 AM

June 19, 2013

Tim Chevalier

TMI: Open Source Bridge talk: given!

Here are the slides for my talk, "Rust: a Friendly Introduction". I'd love to clarify anything that needs it in comments on this post.

comment count unavailable comments

June 19, 2013 11:57 PM

TMI: Day 1 of Open Source Bridge

The first day of Open Source Bridge was great! I noted on Twitter that I attended five talks in a row by women, and could have gone to a sixth if I hadn't wanted to sit out the last session to work on my slides. Also, at least two and maybe three different talks that mentioned intersectionality! This is the only open-source conference I've been to, but I'm getting the feeling it's not a typical one. Also, it's awesome that all the food is vegetarian (and almost all is vegan); it's wonderful to be able to go to a conference and be able to enjoy lunch without worrying that I'll accidentally eat something with meat in it.

For me, the highlight of the day was [personal profile] synecdochic's talk "Kicking Impostor Syndrome in the Head", which was as awesome as you'd expect. The best part was at the end where people in the audience named things our impostor syndrome had said to us at some point and everybody who'd also experienced the same thing raised their hand. Also great was Kronda Adair's talk "Expanding Your Empathy", about the basics of challenging oppression in conversations that happen in the tech world.

Tomorrow I'm looking forward to [personal profile] skud's keynote speech as well as the panel on diversity in open source. And, uh, I'm giving a talk, starting in less than 12 hours. I'll put the slides online soon after and there will be a recording on the conference web site eventually.

comment count unavailable comments

June 19, 2013 05:38 AM

June 18, 2013

Tim Chevalier

TMI: Practice Talk 2: Electric Boogaloo

I arrived in Portland yesterday and spent today working on my talk slides, then giving a practice talk at the New Relic office in downtown Portland. I don't think I'd ever been in that tall a building in Portland before (28th floor), and the audience exceeded my expectations both in size and attentiveness and quality of questions! Thanks to Tom Lee for, somehow, bringing in many interested people. I'm just hoping when I do a final run through the slides tomorrow I'll remember some of the questions and take them into account into adjusting the slides.

I had to skip over quite a few slides at the end to stay within the time limit, and while I probably won't get quite as many questions during the conference as I did during the practice talk, I suspect part of my time tomorrow is going to be spent tightening up the earlier part of the talk; I don't want to run out of time and have to skip pointers altogether!

Tomorrow, some combination of attending Open Source Bridge talks and finishing my talk (as is the day-before-talk conference tradition in general, I think), probably while taking advantage of the hacker lounge there.

comment count unavailable comments

June 18, 2013 06:03 AM

June 15, 2013

Tim Chevalier

TMI: Finishing a pull request... I remember what that was like

Accomplishment of tonight: pull request with new rustpkg tests. I have no idea how this took so long, but apparently automating things and manipulating directories and files is hard? I guess. I am so behind schedule.

With that finished (sort of; despite how long that took, there are still tons of test things to do), that gives me all of Saturday and Monday to finish my talk slides. I'm not terrified or anything, really.

comment count unavailable comments

June 15, 2013 05:53 AM

June 12, 2013

Tim Chevalier

TMI: 12 tests down, 11 to go

I spent all day getting this rustpkg test to work:
fn do_rebuild_dep_dates_change() {
    let workspace = create_local_package_with_dep("foo", "bar", &NoVersion);
    command_line_test("rustpkg", [~"build", ~"foo"], &workspace);
    let bar_date = datestamp(&lib_output_file_name(&workspace, "build", "bar"));
    touch_source_file(&workspace, "bar");
    command_line_test("rustpkg", [~"build", ~"foo"], &workspace);
    let new_bar_date = datestamp(&lib_output_file_name(&workspace, "build", "bar"));
    assert!(new_bar_date > bar_date);
}

(create_local_package_with_dep, command_line_test, datestamp, are touch_source_file are all functions I had to write as part of the source file.)

It works now! Trivially, because rustpkg always rebuilds everything -- it doesn't do the thing that's the point of make yet, of rebuilding files only when their dependencies change. So if everything else works right, the timestamps will always change. Getting everything else to work right was what took all day. The details are too boring for even me to write about; lots of wrangling with file paths.

I also worked a tiny bit more on my OS Bridge slides, but I need to put some serious time into that before Practice Talk 2: Electric Boogaloo on Monday.

comment count unavailable comments

June 12, 2013 12:05 AM

June 08, 2013

Tim Chevalier

TMI: One practice talk down, one to go...

Today:

Worked on my talk some more. Got through all the slides except for some extra ones at the end that I suspected I wasn't going to use anyway.

Gave a practice talk, via videoconference, for my team. Felt about 100 times better about the talk after that. Among other things, the amount of time I'd estimated it would take was correct down to the minute. (So I probably don't need those extra slides.) Got lots of helpful feedback, esp. from Graydon and Brian.

Cleaned up my bit-rotted pull requests, all of which are now currently in the bors queue (crosses fingers).

Over the weekend I'll hopefully wrap up the Great Rustpkg Test Overhaul and implement RUST_PATH. And work on the talk more, since I have one more practice talk to go in a week-and-change.

comment count unavailable comments

June 08, 2013 02:45 AM

June 07, 2013

Tim Chevalier

TMI: More talk preparations

Almost done with a very rough draft of my slides for my Open Source Bridge talk! Visually, they're crap, and reminding me why I'm not a graphic designer even though I pretend sometimes to have opinions about visual presentation. But they're good enough for me to give my practice talk to my team tomorrow; hopefully I'll have enough time to clean them up that the form won't distract everyone from the content.

The main thing I'm frustrated about is my own inability to find Good Examples. I could, of course, write examples that illustrate what I want to teach; mainly, I want to show how Rust's trait system works and how borrowed, managed, and owned pointers work. I think traits and the memory model are two of the biggest advantages of Rust, and rather than trying to cover everything about the language in an hour and a half (impossible), I want to focus on two features that are likely to be at least somewhat different from languages the audience members will have used. That's all very well, but I've had a hard time finding particularly good examples in existing code -- I found some small examples, but the slides I worked on today were ones I'd planned to use to present two "extended examples" that would each take several slides to cover. Traits are used less in existing code than I would have thought (being a relatively new feature) and are often just used as containers for methods rather than abstractions that have meaning that could be explained independently of code. Borrowed pointers are used everywhere, but most uses of them are actually pretty simple.

So that I would have something there, I used code from the Container and Map traits and the HashMap module in the standard library, but that feels unsatisfying since hash maps are such a textbook example. I really want to show what makes Rust different! So, I posted a query on the mailing list, and already got two good suggestions -- one from Jeaye pointing me to glfw-rs and one from Daniel Micay suggesting using iterators rather than hash maps if I'm going to use an example from the standard library. I think that's an especially good suggestion; iterators almost qualify as a third central-aspect-of-Rust aside from traits and borrowed pointers.

I give talks infrequently enough that I forget how much work preparing one is. rustpkg is getting sadly neglected, at least over the past two days. I envision weekend work on the horizon, and a lot of it.

comment count unavailable comments

June 07, 2013 05:02 AM

June 05, 2013

Tim Chevalier

TMI: Talk preparations

Today I worked a bit more on issue 5683 (which, for now, just means turning more of the rustpkg test cases into executable #[test] tests and not just English descriptions in a text file. We'll do something fancier later.

I attempted to figure out why my pull request for rustpkg versions failed tests on the Linux bot, but got nowhere, since the test that failed passes on my Mac, and rather than trying to remember how to ssh into one of our Linux machines, I decided to work on my talk instead.

My talk is in a little over two weeks. I got through about half the slides today, and will probably try to blow through the other half tomorrow. I mean just getting them into a rough form without any "fill this in" -- I expect to spend a lot of my time between now and June 19 working on the slides. I'm trying to turn off my inner critic and just write stuff down for now -- it's much easier to edit something that's rough than to turn nothing into something, which is what I'm doing now -- but it's hard, as usual. Also hard is striking the right balance between accessibility and realism with code examples. So far, I've found tiny examples to use from Servo, sprocketnes, and the Rust standard libraries. It's hard to find a non-trivial example that doesn't require explaining too many new concepts at once, though. So the ones I've pulled are very simple. Given that, I'm wondering how much it adds to be using real examples; I wonder if I should stick to using examples from Rust standard libraries or just make up my own examples. I originally thought the talk would be more interesting if I used examples from real Rust applications, but I just don't know if the details of those applications (not so interesting until you spend time immersed in them) justify the benefits of verisimilitude.

comment count unavailable comments

June 05, 2013 12:54 AM