A technical note on content versus metadata

I'm a software developer by day (and sometimes by late, late night); my pundit career here is only a hobby. As such, I know a bit more about the inner workings of the internet than most, and in particular, I've written software applicable to both websites and email systems. I've now seen two particular things pop up about the NSA surveillance tool XKeyscore that warrant further explanation for those who aren't familiar with how internet protocols work.
Fundamentally, internet protocols operate in layers. Each layer is designed to be read and processed by a different part of the network stack. The lowest layers deal with how to send and receive the electronic signals that carry a unit of data (we'll call it a packet) over the physical wire or wireless link between your computer and the rest of the internet. The next level up tells the device processing the packet where the packet came from and where it is going, and carries a chunk of data along with those addresses.

Those addresses are called IP (Internet Protocol) addresses. They are the first level of meaningful metadata for the NSA's purposes. If the NSA only collects metadata, they could be talking about IP addresses. A good analogy for an IP address is a telephone number -- it identifies a specific electronic device that can make calls (send packets) and receive calls (receive packets). An IP address usually doesn't change often, but does change sometimes; and it only identifies a device, not a specific person.
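To make the layering concrete, here is a minimal Python sketch of what "reading the IP layer" means: it pulls the source and destination addresses out of a raw IPv4 header and treats everything after the header as an opaque chunk of data. The sample packet and its addresses (taken from the ranges reserved for documentation) are invented for illustration; only the 20-byte header layout is standard.

```python
import ipaddress
import struct

def ip_metadata(packet: bytes) -> dict:
    """Extract the IP-layer metadata from a raw IPv4 packet."""
    # Fixed 20-byte header: version/IHL, TOS, total length, id, flags/fragment,
    # TTL, protocol, checksum, source address, destination address.
    (ver_ihl, _tos, total_len, _ident, _frag,
     _ttl, proto, _cksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    header_len = (ver_ihl & 0x0F) * 4  # IHL counts 32-bit words
    return {
        "source": str(ipaddress.IPv4Address(src)),
        "destination": str(ipaddress.IPv4Address(dst)),
        "protocol": proto,  # 6 = TCP, 17 = UDP
        "payload": packet[header_len:total_len],  # opaque to this layer
    }

# Hypothetical packet: a header plus five bytes of higher-layer data.
sample = struct.pack(
    "!BBHHHBBH4s4s", 0x45, 0, 20 + 5, 0, 0, 64, 6, 0,
    ipaddress.IPv4Address("192.0.2.1").packed,
    ipaddress.IPv4Address("198.51.100.7").packed,
) + b"hello"

info = ip_metadata(sample)
print(info["source"], "->", info["destination"])
```

Notice that the function never interprets the payload. At this layer the metadata really is just a pair of numbers.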

Traditionally law enforcement has found it easier to get judicial approval for a pen register (a device which records the calls to and from a particular phone number) than a wiretap (which records the voice content of phone calls). As such, when they approach internet surveillance requests, they like to draw analogies to a pen register and say that they want the "pen register" for the Internet activity of a specific person because that's easier to get than a full wiretap.

Following that analogy strictly, they would get a list of dates, times, and a to/from IP address pair. Lots of them -- millions, possibly billions. The IP address pair is a set of numbers, and about the only useful thing that can be determined from those numbers is what entity owns the IP address. That is usually an ISP or a business of some kind, which can then be asked via National Security Letter who actually uses that IP address.
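A strict "pen register for the internet" could be sketched roughly like this in Python. The record layout, the `FlowRecord` and `contacts_of` names, and the sample addresses are all assumptions made for illustration, not a description of any real system.

```python
from collections import Counter
from datetime import datetime
from typing import NamedTuple

class FlowRecord(NamedTuple):
    """One pen-register-style entry: when, from where, to where."""
    when: datetime
    src: str
    dst: str

# Hypothetical log entries using documentation-range addresses.
records = [
    FlowRecord(datetime(2013, 8, 1, 22, 5, 29), "192.0.2.1", "198.51.100.7"),
    FlowRecord(datetime(2013, 8, 1, 22, 6, 2), "192.0.2.1", "203.0.113.9"),
    FlowRecord(datetime(2013, 8, 1, 22, 9, 41), "192.0.2.1", "198.51.100.7"),
]

def contacts_of(address, log):
    """Count how often `address` exchanged packets with each other address."""
    counts = Counter()
    for rec in log:
        if rec.src == address:
            counts[rec.dst] += 1
        elif rec.dst == address:
            counts[rec.src] += 1
    return counts

print(contacts_of("192.0.2.1", records).most_common())
```

That is the whole data set under the strict analogy: who talked to whom and when, with nothing about what was said.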

At that point, the NSA knows who you are, or at least is close to finding out. They know which other machines on the internet you have been talking to, in a broad sense, but nothing about what either side was saying. It's the closest thing to a pen register for the internet you can think of. If you have been talking directly to computers owned by known terrorists, then they have probable cause to investigate further, and that's really all they need to know if they are only interested in investigating terrorists.

But the slides for XKeyscore show that it doesn't stop there. The slides show data from something called HTTP headers, and language in the article suggests that it can collect email addresses from inside an email message too:

When a Query is put into the system for a specific email account, XKeystore returns all the "metadata" -- addressees and, importantly, the subject line, which, of course, usually summarizes the basic content of the email -- and also scans the email for additional email addresses inside the email. Like if someone said "Contact these other parties" and then listed some emails. And even here, a slide indicates that sometimes the system makes an error and includes a few words it mistook for an email address.

HTTP (ie, the web) and SMTP (ie, email) are functionally very simple protocols. That's part of what has driven their success. Though they can transmit binary data (ie, a computer file), the protocols themselves run in plain text, and the basic data structures sent over both HTTP and SMTP are structured plain text. By structured, I mean the contents are easily readable by a human being but follow basic grammar rules strictly enough for a machine to understand them. Most other internet protocols (eg, for chat rooms) follow similar rules and principles.

Remember the layered approach to a data packet? At the lowest (interesting) level, the metadata consists of just the IP address of the sender and recipient. All that tells you is who operates the computer at either end (and perhaps what type of communication is taking place, ie, email, web, chat, etc); it's up to the higher layers to make the data mean something, and often the real person sending and receiving communications is not the same person as the one who owns and operates the computer on the other end.
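The "type of communication" hint usually comes from the destination port number, which strictly speaking sits in the TCP header one layer above IP. A rough Python sketch of that guess follows; the table is a small hand-picked sample of conventional port assignments and the function name is made up.

```python
# Conventional TCP ports for the kinds of traffic mentioned in this post.
WELL_KNOWN_PORTS = {
    25: "email (SMTP)",
    80: "web (HTTP)",
    110: "email retrieval (POP3)",
    143: "email retrieval (IMAP)",
    443: "web, encrypted (HTTPS)",
    6667: "chat (IRC)",
}

def guess_traffic_type(dst_port: int) -> str:
    """Guess what kind of conversation a packet belongs to from its port."""
    return WELL_KNOWN_PORTS.get(dst_port, "unknown")

print(guess_traffic_type(25))
print(guess_traffic_type(31337))
```

It is only a guess: nothing stops a server from running any protocol on any port.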

For example, looking just at the IP layer for an email would likely tell you the identity (by IP address) of the sender and the identity of that person's email server. You would then have to look at that email server, see who it is communicating with, and make an educated guess about which of the messages the server sends out corresponds to the one you are interested in. That message is sent to the recipient's email server, and you then have to wait for the recipient to check their email to find out their IP address.

HTTP and SMTP operate several layers above the basic IP address system, and they are the level where the data starts to mean something to human beings. Since we know the NSA's program collects email addresses, we know that they are looking at the SMTP layer at least -- possibly higher up, but definitely to the SMTP layer.

The application layer also contains metadata, and this is the metadata the NSA is talking about. It's important to understand this because the application layer metadata contains a lot more information than just who sent a packet and where the packet is going. Let's look at a typical email sent over SMTP:

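S: 220 ESMTP

C: HELO [client hostname]

S: 250

C: MAIL FROM: [sender's email address]

S: 250 ok

C: RCPT TO: [first recipient's email address]

S: 250 ok

C: RCPT TO: [second recipient's email address]

S: 250 ok

C: DATA

[ Next layer of email data is sent ]

C: .

S: 250 ok

C: QUIT

S: 221

Connection closed.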
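As a rough illustration of how little work it takes to harvest this layer, here is a minimal Python sketch that reads the client half of an SMTP dialogue and collects the envelope fields. The hostnames, addresses, and the `envelope_metadata` name are placeholders I made up; the `SIZE=` parameter is the optional message-size extension.

```python
import re

# Hypothetical client side of an SMTP dialogue (all addresses are placeholders).
CLIENT_LINES = [
    "HELO client.example.org",
    "MAIL FROM:<alice@example.org> SIZE=1024",
    "RCPT TO:<bob@example.com>",
    "RCPT TO:<carol@example.net>",
    "DATA",
]

# Matches the two commands that carry an address in angle brackets.
PATH = re.compile(r"^(MAIL FROM|RCPT TO):\s*<([^>]*)>(.*)$", re.IGNORECASE)

def envelope_metadata(lines):
    """Collect the claimed sender, the recipients, and the declared size."""
    meta = {"sender": None, "recipients": [], "size": None}
    for line in lines:
        if line.upper() == "DATA":
            break  # everything after DATA is the message itself
        match = PATH.match(line)
        if not match:
            continue
        verb, address, params = match.group(1).upper(), match.group(2), match.group(3)
        if verb == "MAIL FROM":
            meta["sender"] = address
            size = re.search(r"\bSIZE=(\d+)", params, re.IGNORECASE)
            if size:
                meta["size"] = int(size.group(1))
        else:
            meta["recipients"].append(address)
    return meta

print(envelope_metadata(CLIENT_LINES))
```

The parser stops at `DATA` on purpose. Nothing before that point is free-form text written by the user.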

You can see from the above message that you can identify the (claimed) sender, the intended recipient(s), and maybe (with some extensions) the size of the message. That's not much information, but it's easier than trying to track a single email message via the IP layer. You have everything you need in one package. Also, if the connection is not encrypted but does authenticate the user sending the email, the user's login and password may be available as metadata. Let's look at a sample HTTP transaction:


C: GET [path of the requested page] HTTP/1.1

C: Host: [name of the website]

S: HTTP/1.1 200 OK

S: Server: Apache-Coyote/1.1

S: Content-Type: text/html

S: Transfer-Encoding: chunked

S: Vary: Accept-Encoding

S: Date: Thu, 01 Aug 2013 22:05:29 GMT


S: 2000


S: [ next layer data -- content of the web page ]

The above exchange contains the only metadata available at the HTTP layer. The interesting thing here is that we have the URL requested (""). For a public webpage, anyone can plug in that URL and get, probably, the same or similar content as the user was looking at, if the server does not require a login. Remember that the URL identifies not just the server that is hosting the web page being loaded, it identifies the specific website and the specific page on that website being requested. Starting only with this layer of metadata, it is trivially possible (if not error-free) to retrieve the actual content provided to the user. If the server does require a login, the metadata at this layer may actually expose the user's credentials to the NSA (depending on the server configuration).
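Here is a minimal Python sketch of that step: rebuilding the full URL from nothing but the request line and the `Host` header. The sample request, the host name, and the `requested_url` function are invented for illustration, and the sketch assumes an unencrypted connection.

```python
def requested_url(request_text: str, scheme: str = "http") -> str:
    """Rebuild the URL a browser asked for from the HTTP request headers."""
    head = request_text.split("\r\n\r\n", 1)[0]  # headers end at the blank line
    lines = head.split("\r\n")
    _method, target, _version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return f"{scheme}://{headers['host']}{target}"

# Hypothetical request; none of these values come from a real capture.
sample = (
    "GET /articles/cake.html HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "User-Agent: ExampleBrowser/1.0\r\n"
    "\r\n"
)

print(requested_url(sample))
```

Anyone holding that string can paste it into a browser and see, in most cases, what the user saw.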

So, with regard to HTTP connections, the fact that the NSA can independently retrieve the full content of the pages being viewed means that only storing "metadata" is an almost meaningless distinction. They can trivially turn that metadata into a reasonably accurate copy of the page the user actually viewed.

But it gets worse. Take a look back at the sample SMTP connection above, and then reread this description of XKeyscore's capabilities:

When a Query is put into the system for a specific email account, XKeystore returns all the "metadata" -- addressees and, importantly, the subject line, which, of course, usually summarizes the basic content of the email -- and also scans the email for additional email addresses inside the email. Like if someone said "Contact these other parties" and then listed some emails. And even here, a slide indicates that sometimes the system makes an error and includes a few words it mistook for an email address.

Did you see anything in that SMTP connection which could not be readily identified as either being an email address or being some other part of the protocol? Anything that could possibly represent a user saying "Contact these other parties"? Anything that could be mistaken for an email address without actually breaking the SMTP protocol? Anything at all that is user-entered data besides the sender and recipient email addresses?

Clearly we need to go up one more layer. The NSA is looking through the actual content of the email message at this point. Here's a typical email message from a programmer's point of view:



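From: [sender's email address]
To: [recipient's email address]
Subject: This is a message about cake
Date: Thu, 01 Aug 2013 22:05:29 GMT

----------------------------------------

Contact these other parties

[ list of additional email addresses ]

Tell them the cake is a lie. There is no cake @ the end of the org chart meeting.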

The only thing at this layer that could be credibly described as "metadata" is the header of the message. As before, the header is structured data. The program you use to read your email needs to read and understand the lines on top, so they are designed to follow a strict format that can be easily parsed.

The header part of the email ends at the first blank line, and everything after that is entered freeform by the user -- ie, content, not metadata by any possible argument. (It is possible to add more layers on top of this -- but for a simple email, this is the end of the line).

If you are only looking at metadata, you need to look at the headers and ignore everything after the blank line. But everything in the headers is structured. The user enters some of the data (particularly the subject line, which can be revealing of the email's contents), but the email fields themselves need to be machine-readable and are mostly generated by the email application itself, with validation performed on the email addresses entered by the user.

In other words, a competent programmer can reliably parse out email addresses from the structured header fields with effectively no chance of getting user-entered content by mistake, unless the user was hand-crafting the email. All they have to do is stop reading the message at the first blank line (as I have marked in the example with a dividing line).
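A minimal Python sketch of that approach, using the standard library's email parser, might look like the following. The sample message and every address in it are placeholders, and `header_addresses` is a name I made up.

```python
from email.parser import HeaderParser
from email.utils import getaddresses

# Hypothetical message: structured headers, a blank line, then free-form text.
RAW = (
    "From: Alice <alice@example.org>\n"
    "To: bob@example.com, Carol <carol@example.net>\n"
    "Subject: This is a message about cake\n"
    "Date: Thu, 01 Aug 2013 22:05:29 GMT\n"
    "\n"
    "Contact these other parties\n"
    "dave@example.org\n"
    "Tell them the cake is a lie. There is no cake @ the end of the org chart meeting.\n"
)

def header_addresses(raw_message: str) -> list:
    """Return addresses from the headers only, never touching the body."""
    header_text = raw_message.split("\n\n", 1)[0]  # stop at the first blank line
    headers = HeaderParser().parsestr(header_text)
    fields = []
    for name in ("From", "To", "Cc"):
        fields.extend(headers.get_all(name, []))
    return [address for _realname, address in getaddresses(fields)]

print(header_addresses(RAW))
```

The address in the body and the stray "@" never reach the parser, so there is nothing for it to misread.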

In order to get occasional cases where XKeyscore retrieves "metadata" in the form of email addresses that turns out to be user-entered content instead, the NSA must be retrieving and parsing the content of the email. They may have coded their application to only show what they think are email addresses, but they are extracting those email addresses from the content, not from the headers. Which means they must be collecting and analyzing the content, not just the metadata.
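To show how that kind of mistake can arise, here is a Python sketch of a scanner that hunts for anything resembling an email address in free-form body text. The deliberately lenient pattern and the sample body are my own assumptions; I have no knowledge of how the real system matches addresses.

```python
import re

# Hypothetical message body: free-form text typed by the user.
BODY = (
    "Contact these other parties\n"
    "dave@example.org\n"
    "Tell them the cake is a lie. There is no cake @ the end of the org chart meeting.\n"
)

# Lenient on purpose: tolerates spaces around the @ sign, as a scanner might
# if it were trying to catch sloppily typed or lightly obfuscated addresses.
LOOSE_ADDRESS = re.compile(r"[\w.+-]+\s*@\s*[\w.-]+")

def scan_body(text):
    """Return every substring of the body that looks like an email address."""
    return [match.group(0) for match in LOOSE_ADDRESS.finditer(text)]

print(scan_body(BODY))
```

The scan finds the real address, and it also reports "cake @ the" as one. A false hit like that can only happen if the program is reading the body.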

It's like a pretty girl who wants to change clothes in your bedroom. Does she trust you not to look or does she find a screen or use a bathroom or closet so that you can't look? Does it matter if you promise not to look?

Clearly, the NSA has the ability to intercept email content, not just metadata; just as clearly, they are actually intercepting the full email content and collecting it for analysis. They are asking us to trust them not to look at the content, even though they already have it. Maybe they have built their application so that they can't look without getting permission, but according to Snowden, the permission system is a joke and a rubber stamp. We already know that Homeland Security does keyword scanning of content, and I'm betting the NSA is doing the same thing with its application, and if the right keywords are there -- or the right sender or recipient, two or three degrees away from a "suspected" terrorist -- the content is flagged for a closer look. Or the NSA analyst can make up his own justification and get it rubber stamped.

And we can't see how their application works, or have any way of knowing that it does what it says it does. In this analogy, the NSA is the guy wearing a nice Google Glass device, and he tells the pretty girl in his bedroom she can strip down right there in front of him and she will be perfectly safe -- he's written his own privacy app, you see, and when it detects a pretty girl in his field of view it doesn't let him look. He's just watching you to keep you safe, you see. He's not recording the whole thing and uploading it to his friends. (Or then again, maybe he is...)

I'm no pretty girl, and no terrorist either, but I sure as hell don't trust them not to look at my data if I say something with enough juicy keywords in it. Their own slides prove they have it, and they look at it, and we have only their word that their privacy app does what they say it does.

I don't trust them, and I don't trust their privacy app. The Constitution says they can't have our data without a warrant issued upon probable cause. It doesn't matter if they promise not to look; if they don't have it, they can't look.

It's time to defund the NSA.

Would you trust them with your daughter?

This entry was published Fri Aug 02 11:21:44 CDT 2013 by TriggerFinger and last updated 2014-03-16 02:50:19.0.
