I recently started another blog on data analysis at www.dis-order.net. I'm going to post those articles here too, so I can keep everything in one place. Sorry if there's anyone out there who reads both!
Over the last few days, Viacom and Google have been in the news quite a bit because of their court case. Viacom is suing Google for $1 billion over the amount of copyrighted material being posted on YouTube (which Google owns).
The Associated Press reports, "U.S. District Judge Louis L. Stanton authorized full access to the YouTube logs after Viacom Inc. and other copyright holders argued that they needed the data to show whether their copyright-protected videos are more heavily watched than amateur clips."
The EFF has statements from both sides.
While such legal issues aren't my specialty, I wanted to write about it because of the political ramifications of such a release of data. Also, the limitations behind data anonymisation are concerning here, even though Viacom says this data will be "anonymised" and not used to target specific individuals or users.
As far as I understand, Google will be handing over approximately 12 terabytes of data, in a database that includes when a video is played, each viewer's user name, and also their IP address. At this point, both sides have argued that a user's IP address cannot lead to identification of a specific person.
IP Addresses and User Names
If nothing is changed within the database, Viacom's lawyers will be able to see which user names watched which videos. I imagine it will be easy for them to also find out who posted the videos, either by simply visiting the site or by making a few basic assumptions about the data (e.g. "The first person to watch a video is likely to be the one who posted it.") that can be empirically tested.
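The "first viewer is probably the uploader" assumption is simple to express in code. This is only a rough sketch: the row layout (timestamp, video ID, user name) and the sample values are made up for illustration, since the real schema of the YouTube logs is not public.

```python
# Hypothetical log rows: (timestamp, video_id, user_name).
# The real log format is not public; this layout is an assumption.
view_log = [
    (105, "vid_a", "alice"),
    (100, "vid_a", "bob"),
    (200, "vid_b", "carol"),
    (210, "vid_b", "alice"),
]


def guess_uploaders(log):
    """Return {video_id: user_name} for the earliest recorded viewer of each video."""
    earliest = {}
    for timestamp, video_id, user in log:
        # Keep the row with the smallest timestamp seen so far for this video.
        if video_id not in earliest or timestamp < earliest[video_id][0]:
            earliest[video_id] = (timestamp, user)
    return {video: user for video, (_, user) in earliest.items()}


guesses = guess_uploaders(view_log)
```

Each guess could then be checked against the uploader shown on the public video page, which is what makes the assumption empirically testable.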
While this process will not compromise everyone's privacy, there will be users who can be tracked down through the information above. There are a number of ways to do this:
People with obvious user names. Some users use their real names, while others are building brands around their user names. For example, if you knew which videos lonelygirl15 watched, you could probably guess who watched them. The same is true for many other YouTube users. Furthermore, a great many users also post personal videos, and created their accounts without ever expecting to have their viewing patterns analyzed by lawyers. If I see that User123 watches Colbert Report videos on YouTube, I can check whether he or she has posted personal videos (say, from a trip to Costa Rica or playing beer pong) and easily track that person down.
IP addresses contain geographic information. In sparsely populated areas, your IP address can't be connected to you specifically, but it can narrow the list of potential people by quite a bit. If you live in New York City, you might be okay.
Usage patterns. Remember when AOL released "anonymised" search queries of a few hundred thousand users? Using nothing but the search terms, a news agency was able to track down specific users. A similar issue occurred when researchers de-anonymised Netflix data by comparing it with publicly available information on IMDb. The same can be done on YouTube. What's worse, one can easily use the social networks formed by comments, favourites, feeds, etc. to build a community-level view. Using this information, you can find clusters of users who are sharing or watching illegal content.
In my final year of university, I did a project focusing on community identification on YouTube and was able to build a basic crawler that found communities of users that would share anime clips. These were communities that were not formally organized, but could be found by analyzing comment patterns under each YouTube video. I won't go into the math, but it's very easy to do. This method might not find a specific individual, but can definitely find groups of friends (say, from your local high school) or fan clubs... Viacom could easily use a similar approach to track down groups of people who regularly view illegally uploaded content.
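To give a feel for how little is needed, here is a minimal sketch of one way to find such groups. It is not the method from my project, just a simple stand-in: link two users when they have commented on at least `min_shared` of the same videos, then take the connected groups. The data, the function name and the threshold are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical data: video_id -> set of users who commented on it.
comments = {
    "v1": {"ann", "ben", "cat"},
    "v2": {"ann", "ben"},
    "v3": {"ann", "ben", "cat"},
    "v4": {"dan", "eve"},
    "v5": {"dan", "eve"},
    "v6": {"ann", "dan"},
}


def find_communities(comments, min_shared=2):
    """Group users who repeatedly comment on the same videos."""
    # Count how many videos each pair of users has in common.
    shared = defaultdict(int)
    for users in comments.values():
        for a, b in combinations(sorted(users), 2):
            shared[(a, b)] += 1

    # Keep only pairs that co-comment often enough to count as a link.
    neighbours = defaultdict(set)
    for (a, b), count in shared.items():
        if count >= min_shared:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Connected components of the resulting graph are the communities.
    seen, communities = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            user = stack.pop()
            if user in group:
                continue
            group.add(user)
            stack.extend(neighbours[user] - group)
        seen |= group
        communities.append(group)
    return communities


communities = find_communities(comments)
```

On the toy data this separates ann, ben and cat from dan and eve, because the single video that ann and dan share is below the threshold. Real community detection would use something more robust than connected components, but the input is the same: who commented where.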
The Anonymity Myth
Suppose Google modified its database so that: (1) user names became a set of random characters, (2) so did IP addresses. This is often what people do when releasing data -- it's what AOL and Netflix did, for example.
Unfortunately, the usage-pattern and community methods above can still be used to track individuals down, because they depend on the underlying social network of the website and on the video content, neither of which is changed by scrambling names and addresses.
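A small sketch makes the point concrete. Below, user names and IP addresses are replaced with random tokens, yet the set of videos each (now nameless) user watched is exactly the same as before. The row layout and sample values are assumptions for illustration only.

```python
import secrets

# Hypothetical rows: (user_name, ip_address, video_id).
log = [
    ("alice", "10.0.0.1", "vid_a"),
    ("alice", "10.0.0.1", "vid_b"),
    ("bob", "10.0.0.2", "vid_a"),
]


def pseudonymise(log):
    """Replace each user name and IP address with a consistent random token."""
    user_map, ip_map = {}, {}
    scrambled = []
    for user, ip, video in log:
        if user not in user_map:
            user_map[user] = secrets.token_hex(8)
        if ip not in ip_map:
            ip_map[ip] = secrets.token_hex(8)
        scrambled.append((user_map[user], ip_map[ip], video))
    return scrambled


def viewing_profiles(log):
    """Return the sorted list of per-user video sets, ignoring who the users are."""
    profiles = {}
    for user, _ip, video in log:
        profiles.setdefault(user, set()).add(video)
    return sorted(sorted(videos) for videos in profiles.values())


scrambled_log = pseudonymise(log)
structure_unchanged = viewing_profiles(scrambled_log) == viewing_profiles(log)
```

Here `structure_unchanged` comes out `True`: the names are gone, but the viewing profiles that the AOL and Netflix attacks relied on are untouched.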
To really anonymise the data, one would need to randomize the underlying network structure as well. This would get rid of community structure and make it harder to track down groups of people with similar interests or backgrounds. Since Viacom is allegedly not interested in such information (and by the judge's ruling, is not allowed to search for it even if it wants to), getting rid of links and scrambling the network structure is fair game.
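One crude way to scramble the network, sketched under the assumption that the log is a list of (user, video) rows: shuffle the user column across all rows. Each video keeps its total view count and each user keeps their total number of views, but the link between a particular user and a particular video is broken. The function name and `seed` parameter are hypothetical.

```python
import random
from collections import Counter


def shuffle_viewers(log, seed=None):
    """Randomly reassign users to view rows, keeping per-video and per-user totals."""
    rng = random.Random(seed)
    users = [user for user, _video in log]
    rng.shuffle(users)
    return [(user, video) for user, (_old_user, video) in zip(users, log)]


# Hypothetical rows: (user, video_id).
log = [
    ("u1", "vid_a"), ("u2", "vid_a"), ("u3", "vid_a"),
    ("u1", "vid_b"), ("u4", "vid_b"),
]
shuffled = shuffle_viewers(log, seed=0)

views_before = Counter(video for _user, video in log)
views_after = Counter(video for _user, video in shuffled)
```

One caveat: this preserves total views per video but not necessarily unique viewers, since the shuffle can assign the same user to a video twice. If unique-viewer counts matter, they would have to be computed before scrambling.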
Modifying the video content would be trickier. Viacom is trying to argue that illegally posted videos are more popular on YouTube than legal videos. To label a video as "legal" or "illegal", one would need to watch the content, and no technology exists to do this automatically (if there were, Google could just use it and get this lawsuit over with).
To get around the problem of content compromising users' privacy, one option is to have Viacom submit a list of videos they feel contain illegal content. Google can then scramble all the video IDs and return a list showing which scrambled IDs represent illegal content, as judged by Viacom. This will ensure that Viacom cannot visit YouTube and track down who posted those videos or connect their data with other databases (say, check if users posting videos are also commenting elsewhere, or participating in other sites). While this process may sound tedious, Viacom will need to label the data in this way anyway for the company / team / group / lawyers they hire to actually analyze it.
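The scramble-and-label step could look roughly like this. It is a sketch under assumed names: `flagged_ids` stands for the list Viacom would submit, and the log is again a list of (user, video ID) rows.

```python
import secrets


def scramble_video_ids(view_log, flagged_ids):
    """Replace video IDs with random tokens and report which tokens were flagged."""
    mapping = {}
    scrambled_log = []
    for user, video in view_log:
        if video not in mapping:
            mapping[video] = secrets.token_hex(8)
        scrambled_log.append((user, mapping[video]))
    # Only the scrambled token and the flag leave the building, not the real ID.
    labels = {token: (video in flagged_ids) for video, token in mapping.items()}
    return scrambled_log, labels


# Hypothetical data.
view_log = [("u1", "vid_a"), ("u2", "vid_a"), ("u1", "vid_b"), ("u3", "vid_c")]
flagged_ids = {"vid_a", "vid_c"}

scrambled_log, labels = scramble_video_ids(view_log, flagged_ids)
```

The `mapping` dictionary never leaves Google in this scheme, so the recipient can compare flagged and unflagged videos without being able to look any of them up on the site.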
In the End, Does It Matter?
What bothers me most about all of this, however, is that if Viacom simply wants to prove that illegal content is more popular than legal content, can't they use simple view statistics, or have Google calculate the number of unique viewers per video? I imagine that's more than enough information to draw such a conclusion.
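For comparison, this is roughly all the computation the popularity question needs. The sketch assumes (user, video ID) rows and a hypothetical `flagged` set marking the allegedly infringing videos; only the final per-video numbers would ever need to be handed over.

```python
from collections import defaultdict


def per_video_stats(view_log):
    """Return {video_id: {"views": int, "unique_viewers": int}}."""
    views = defaultdict(int)
    viewers = defaultdict(set)
    for user, video in view_log:
        views[video] += 1
        viewers[video].add(user)
    return {
        video: {"views": views[video], "unique_viewers": len(viewers[video])}
        for video in views
    }


def mean_views(stats, video_ids):
    """Average view count over the given videos (0.0 if none are present)."""
    counts = [stats[v]["views"] for v in video_ids if v in stats]
    return sum(counts) / len(counts) if counts else 0.0


# Hypothetical data.
view_log = [
    ("u1", "vid_a"), ("u1", "vid_a"), ("u2", "vid_a"),
    ("u3", "vid_b"),
]
flagged = {"vid_a"}

stats = per_video_stats(view_log)
flagged_mean = mean_views(stats, flagged)
other_mean = mean_views(stats, set(stats) - flagged)
```

No user names or IP addresses appear in `stats`, which is the point: the comparison can be made from aggregates alone.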
Clearly, there's more behind this than meets the eye. Viacom initially wanted to see YouTube's source code, arguing that YouTube might be treating illegal content differently from legal content. Luckily, the judge didn't feel that the company's source code was relevant to this discussion.
We'll see what happens with this data, and how it is analyzed...