Hello World: July 2008

Thursday, July 24, 2008

Why I Love Linux

... ever since I got my Macbook and have been using OS X, my life feels a little different. Just last week, I had to pay for software, and this made me feel a bit uncomfortable. No, I don't steal software licenses -- I've been using Linux for the past two years and got used to just downloading free and open source software.

I won't try and convince you that Linux is something you should try. I've had many of these discussions and the fact that we promote it through Five Minutes to Midnight is enough for me.

Instead, I'll just say this: setting up an FTP server on Ubuntu is so easy it makes me smile. In fact, I set up two today, just for kicks. Thank you, ProFTPd, for making my file transferring easier today.

Wednesday, July 23, 2008

Linux Journal - August 2008

For the past year, I've been subscribing to the Linux Journal, and the August 2008 issue is one of the best I've seen for any magazine. This issue is devoted to "cool projects" (on Linux).

This is a pretty big deal to me, someone who isn't very focused on hardware development but loves programming. Why? First, this issue provides information on how to use the Wiimote as an input device. This isn't anything majorly new, but it's nice to have a step-by-step guide.

Secondly, there's an overview of Bug Labs' Linux computer, which is a basic cellphone-sized computer which you can attach other modules to, allowing you to build new gadgets and even swap them as the software is running. If you have a few hundred dollars to spend, this is a great way to start fooling around with gadgets -- or so I read.

A more familiar face in the magazine is gumstix, which provides very small computers with various components, such as Bluetooth, wifi, USB connections, and so on. This is very similar to Bug Labs, and in fact the costs are about the same, too.

Finally, there's E-Ink, which one can use for low-power displays. This is very promising, though the prototyping set costs $3000, which I imagine is a bit steep for most.

The great thing about these tools is how it may be possible to use them and build completely new gadgets. For someone like me, who isn't well-versed in hardware development, this is a great opportunity to get involved in actual prototype development rather than just making graphics and writing code on my laptop.

Even better is the fact that companies like Bug Labs are starting to really promote and focus on the idea of open hardware. I'm curious to see how long it takes for people to start playing around with this. Hopefully I'll get into it too, sometime soon.


Scanning the Web for Diseases

From www.dis-order.net.

Let's start with two over-simplifications. There's a lot of information on the Web... And, it's hard to analyze it. Indeed, regardless of which field of study you work in, the Web probably applies to you -- whether you're an economist studying auction systems, a computer scientist looking at technical infrastructure, or pretty much anything in between.

One very relevant application that tries to solve this problem for health practitioners is HealthMap. Funded by Google, it's a perfect example of the intersection between data mining and public policy.

HealthMap: An Overview

HealthMap scans various health sites and news directories, constantly looking for news related to health and diseases. It does this by scanning the actual text of the articles and, using a text classifier, tries to categorize every article into (1) a specific disease, and (2) a specific region. This is much harder than it sounds, as the software needs to know the difference between a team of American doctors studying a new outbreak in England, and a team of English doctors studying a new outbreak in America. While this is easy for humans to do, computers are often quite terrible at telling the difference.
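
To make the two-label idea concrete, here is a rough sketch of the simplest possible stand-in for a text classifier. It is not how HealthMap works: the keyword lists, the region table, and the "outbreak in X" rule are all made up for illustration, and a real system would learn its rules from labelled articles.

```python
import re

# Hypothetical keyword lists; a real system would learn these from labelled articles.
DISEASE_KEYWORDS = {
    "influenza": {"flu", "influenza", "h5n1"},
    "cholera": {"cholera"},
    "measles": {"measles"},
}

# Hypothetical lookup from a place word to a region label.
REGIONS = {"england": "England", "america": "United States", "kenya": "Kenya"}


def classify_disease(text):
    """Return the disease whose keywords appear most often, or None."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    scores = {
        disease: sum(word in keywords for word in words)
        for disease, keywords in DISEASE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def classify_region(text):
    """Use the place named right after 'outbreak in'; ignore other place words."""
    match = re.search(r"outbreak in (\w+)", text.lower())
    if match:
        return REGIONS.get(match.group(1))
    return None


article_a = "A team of American doctors is studying a new measles outbreak in England."
article_b = "A team of English doctors is studying a new flu outbreak in America."

label_a = (classify_disease(article_a), classify_region(article_a))
label_b = (classify_disease(article_b), classify_region(article_b))
```

The region rule only looks at the word following "outbreak in", which is why the doctors' nationality does not confuse it here. Any article phrased differently would slip through, which is exactly why this part is hard.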

Once this information is collected and synthesized, the site displays information on outbreaks as a Google Maps mashup, making it easy to check where outbreaks are happening and what is going on in specific regions.

An overview of the technology is provided in a recent paper in the Journal of the American Medical Informatics Association.

The great thing about this is that, using open source tools like Nutch for web crawling, WVTool for text analysis, and Weka for model generation, you can build prototypes of HealthMap-like tools in several weeks. Of course, getting your classifiers to accurately map articles to specific regions and diseases is often the hardest (and most important!) part.

The Challenge of Unstructured Data

HealthMap is a great response to the deluge of data that people and organizations have to deal with on the Web. Collecting data, analyzing it, and making it accessible is a major challenge in almost every field of study. Another example of a response to this is Issue Crawler, which allows one to explore political discussions online.

The main difference between these approaches and tools like Wikipedia and Who Is Sick? is that the latter use distributed networks of people to collect and organize information. I imagine that one great opportunity in the next few years will be combining the use of such "people power" with machine learning to build web services that help us deal with all this information and data.

Tuesday, July 22, 2008

Data Inaccuracies in Polls and Surveys

From www.dis-order.net.

Salon.com published an interesting article yesterday by Paul Maslin and Jonathan Brown, discussing an inaccuracy in the standard approach to political polling. They say that phone surveys only call landlines, which ignores people who only have cell phones. They have a fairly detailed discussion on why this is the case, and how much this can affect polls -- essentially, as the number of people who only use cell phones increases, polls can become less and less accurate. This is especially true since a specific demographic (younger, more technical people) tends to own cell phones and avoid landlines, meaning the polls can become quite biased (and thus inaccurate).
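
The size of this effect is simple arithmetic. Here is a minimal sketch, assuming the population splits cleanly into cell-only and landline households and that a landline poll measures the landline group perfectly. The numbers are invented for illustration and are not from the Salon article.

```python
def true_support(cell_only_share, support_cell_only, support_landline):
    """Population-wide support: a weighted average of the two groups."""
    return (cell_only_share * support_cell_only
            + (1 - cell_only_share) * support_landline)


def landline_poll_bias(cell_only_share, support_cell_only, support_landline):
    """What a landline-only poll reports minus the true population value."""
    return support_landline - true_support(
        cell_only_share, support_cell_only, support_landline
    )


# Made-up example: 60% support among cell-only voters, 48% among landline voters.
bias_at_15_percent = landline_poll_bias(0.15, 0.60, 0.48)
bias_at_30_percent = landline_poll_bias(0.30, 0.60, 0.48)
```

The bias works out to the cell-only share times the gap between the two groups, so it grows as more people drop their landlines, even if opinions within each group never change.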

Alternatives to Political Polling

So the first question that comes up is, "Are there alternatives to phone calls?" Even with the rise of the Internet, political polling is still very dependent on random phone calls. The basic problem is getting a random sample -- you can't do that with e-mails or site visits. So never trust those CNN or Fox News polls.

One alternative is using a prediction market. These act just like stock markets, but people buy and sell shares in a specific event -- you then make a profit if the event takes place, and lose money if it does not.

What is exciting about prediction markets is that, with enough people participating, they aggregate individuals' knowledge and can provide a reasonably accurate probability around a specific event. In fact, markets have been known to provide better predictions than those of experts. A lot of major companies, such as Google, HP, Best Buy, and many others, are using these now, and one can get a good overview by exploring Google's work, and Wolfers' and Zitzewitz's paper.
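
A rough sketch of the mechanics, assuming a contract that pays a fixed amount per share if the event happens and nothing otherwise. The 1.00 payout and the function names are my own choices for illustration, not the rules of any particular market.

```python
PAYOUT = 1.00  # assumed payout per share if the event takes place


def implied_probability(price, payout=PAYOUT):
    """Read the share price as the market's probability for the event."""
    return price / payout


def profit(price, shares, event_happened, payout=PAYOUT):
    """Gain (or loss) from buying `shares` at `price` once the outcome is known."""
    return shares * ((payout if event_happened else 0.0) - price)


def expected_profit(price, shares, believed_probability, payout=PAYOUT):
    """Average gain for a trader who believes the event has this probability."""
    return shares * (believed_probability * payout - price)


win = profit(0.65, 10, event_happened=True)
loss = profit(0.65, 10, event_happened=False)
```

A trader who thinks the event is more likely than the price suggests has a positive expected profit from buying, and that buying pushes the price up. This is the sense in which the price aggregates what participants know.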

So how accurate are these markets for the upcoming elections? Well, the Iowa Electronic Market pretty much shows a 50-50 split on the 2008 Presidential race, while Intrade.com gives Obama a 2-to-1 lead. Of course, not everyone participates in these markets, and I'm sure it is easy to argue that Obama supporters (read: younger, more technology-friendly people) are more likely to use sites like this.
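
To put those two readings on the same scale, odds can be converted to a probability. This is a minimal sketch, and it assumes that "2-to-1" means odds of two to one in Obama's favour, which is my interpretation of the figure rather than something either market states in that form.

```python
from fractions import Fraction


def odds_to_probability(in_favour, against):
    """Convert odds of in_favour:against into a probability."""
    return Fraction(in_favour, in_favour + against)


even_split = odds_to_probability(1, 1)   # a 50-50 race
two_to_one = odds_to_probability(2, 1)   # a 2-to-1 lead
```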

Is This Cell Phone Problem An Isolated One?

When reading newspapers or magazines, people often feel more comfortable with numbers than they do with qualitative or subjective discussions. This is a major problem -- yes, numbers do not lie, but the definitions used to get those numbers can often be misleading. The way surveys are designed, and the way "random" samples are chosen, can often bias results quite a bit.

One area where this is a very big problem is poverty measurements. Poverty is often defined with regard to how much of a family's income is spent on food and shelter. International comparisons, however, are murky -- the way you define baskets of goods (e.g. nutritional requirements, staple foods, etc.) can change quite a bit between countries. One of the biggest criticisms of surveys focusing on poverty has been that they are household surveys -- people without homes are often missed. Indeed, finding such people can be very tricky in the first place.

Oftentimes, running surveys and collecting data is extremely difficult. A great overview of this, in an international development context, is Martin Ravallion's "How Well Can Method Substitute for Data? Five Experiments in Poverty Analysis". Statisticians, mathematicians, and other researchers are constantly trying to find new analytical tools to make models and analysis more accurate, but bad data can rarely be fixed after it has been collected.

In general the important thing is to critically analyze the definitions and methods used in surveys and polls. The best piece of advice I ever got on this issue was that numbers and methodologies tell stories just like words do, and it is important to read between the lines.

YouTube, Viacom, and Data Concerns

I recently started another blog on data analysis at www.dis-order.net. I'm going to post those articles here too, so I can keep everything in one place. Sorry if there's anyone out there who reads both!

Over the last few days, Viacom and Google have been in the news quite a bit due to their trial. Viacom is suing Google for $1 billion due to the amount of copyrighted material being posted on YouTube (which Google owns).

The Associated Press reports, "U.S. District Judge Louis L. Stanton authorized full access to the YouTube logs after Viacom Inc. and other copyright holders argued that they needed the data to show whether their copyright-protected videos are more heavily watched than amateur clips."

The EFF has statements from both sides.

While such legal issues aren't my specialty, I wanted to write about it because of the political ramifications of such a release of data. Also, the limitations behind data anonymisation are concerning here, even though Viacom says this data will be "anonymised" and not used to target specific individuals or users.

As far as I understand, Google will be handing over approximately 12 terabytes of data, in a database that includes when a video is played, each viewer's user name, and also their IP address. At this point, both sides have argued that a user's IP address cannot lead to identification of a specific person.

IP Addresses and User Names

If nothing is changed within the database, Viacom's lawyers will be able to see individual user names that watched videos. I imagine it will be easy for them to also find out who posted the videos, either by simply visiting the site or making a few basic assumptions about the data (e.g. "The first person to watch a video is likely to be the one who posted it.") that can be empirically tested.
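
Here is a rough sketch of how that "first viewer" assumption could be applied and then checked. The log layout (timestamp, video ID, user name) and all the sample records are hypothetical; I have not seen the actual format of the YouTube logs.

```python
def first_viewers(log):
    """Map each video to the user with its earliest recorded view."""
    earliest = {}
    for timestamp, video_id, user in log:
        if video_id not in earliest or timestamp < earliest[video_id][0]:
            earliest[video_id] = (timestamp, user)
    return {video_id: user for video_id, (_, user) in earliest.items()}


def heuristic_accuracy(guesses, known_uploaders):
    """Share of videos with a known uploader where the guess was right."""
    checked = [video_id for video_id in known_uploaders if video_id in guesses]
    if not checked:
        return None
    hits = sum(guesses[video_id] == known_uploaders[video_id] for video_id in checked)
    return hits / len(checked)


# Hypothetical view log: (timestamp, video ID, user name).
sample_views = [
    (100, "vidA", "alice"), (105, "vidA", "bob"),
    (90, "vidB", "carol"), (95, "vidB", "alice"),
    (200, "vidC", "bob"), (210, "vidC", "dave"),
]
# Uploaders one could look up by simply visiting the site.
sample_uploaders = {"vidA": "alice", "vidB": "carol", "vidC": "dave"}

guesses = first_viewers(sample_views)
accuracy = heuristic_accuracy(guesses, sample_uploaders)
```

Checking the guesses against a handful of videos whose uploaders are publicly visible is what "empirically tested" would look like in practice.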

While this process will not compromise everyone's privacy, there will be users who can be tracked down through the information above. There are a number of ways to do this:

People with obvious user names. Some users use their real names, while others are building brands around their user names. For example, if you know which videos lonelygirl15 watched, you can probably guess who watched them. The same is true for many other YouTube users. Furthermore, a great many users also post personal videos, and created their accounts without ever expecting to have their viewing patterns analyzed by lawyers. Now if I see that User123 watches Colbert Report videos on YouTube, I can check if he or she posted personal videos (say, from a trip to Costa Rica or playing beer pong) and easily track that person down.

IP addresses contain geographic information. In sparsely populated areas, your IP address can't be connected to you specifically, but can narrow the list of potential people by quite a bit. If you live in New York City, you might be okay.

Usage patterns. Remember when AOL released "anonymised" search queries of a few hundred thousand users? Based on the terms that were put into the search engine, a news agency was able to track down specific users. A similar issue occurred when researchers de-anonymised Netflix data by comparing it to publicly available information on IMDb. The same can be done on YouTube. What's worse, one can easily use social networks based on comments, favourites, feeds, etc. to build a community-level view. Using this information, you can find clusters of users who are sharing or watching illegal content.

In my final year of university, I did a project focusing on community identification on YouTube and was able to build a basic crawler that found communities of users that would share anime clips. These were communities that were not formally organized, but could be found by analyzing comment patterns under each YouTube video. I won't go into the math, but it's very easy to do. This method might not find a specific individual, but can definitely find groups of friends (say, from your local high school) or fan clubs... Viacom could easily use a similar approach to track down groups of people who regularly view illegally uploaded content.
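
For a flavour of what this kind of analysis looks like, here is a minimal sketch. It is not the method from my project: it simply links two users when they have commented on at least `min_shared` of the same videos and then reports the connected groups. The threshold, the function name, and the sample data are all assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def find_communities(comments, min_shared=2):
    """Group users who commented on at least `min_shared` of the same videos."""
    # Who commented on each video?
    commenters = defaultdict(set)
    for video_id, user in comments:
        commenters[video_id].add(user)

    # Count how many videos each pair of users has in common.
    shared = defaultdict(int)
    for users in commenters.values():
        for a, b in combinations(sorted(users), 2):
            shared[(a, b)] += 1

    # Keep only the pairs that co-occur often enough.
    neighbours = defaultdict(set)
    for (a, b), count in shared.items():
        if count >= min_shared:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Connected components of the resulting graph are the "communities".
    communities, seen = [], set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            user = stack.pop()
            if user in group:
                continue
            group.add(user)
            stack.extend(neighbours[user] - group)
        seen |= group
        communities.append(sorted(group))
    return communities


# Hypothetical comment records: (video ID, user name).
sample_comments = [
    ("v1", "ann"), ("v1", "ben"), ("v1", "cat"),
    ("v2", "ann"), ("v2", "ben"), ("v2", "cat"),
    ("v3", "dan"), ("v3", "eve"),
    ("v4", "dan"), ("v4", "eve"),
    ("v5", "ann"), ("v5", "dan"),
]
communities = find_communities(sample_comments)
```

With the threshold at two shared videos, the single video that ann and dan both commented on is not enough to merge the two groups. Lowering the threshold to one would join everyone into a single community.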

The Anonymity Myth

Suppose Google modified its database so that: (1) user names became a set of random characters, (2) so did IP addresses. This is often what people do when releasing data -- it's what AOL and Netflix did, for example.

Unfortunately, the last two methods above can still be used to track individuals down, because they depend on the underlying social network of the website, and the video content as well.
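
A small sketch shows why. Assuming the scrambling replaces each user name and IP address with a consistent random token (my assumption about how such a release would be done, with a made-up log layout), every user's complete viewing history survives untouched:

```python
import secrets


def pseudonymise(log):
    """Replace each user name and IP address with a consistent random token."""
    user_tokens, ip_tokens = {}, {}
    scrubbed = []
    for user, ip, video_id in log:
        user_token = user_tokens.setdefault(user, secrets.token_hex(8))
        ip_token = ip_tokens.setdefault(ip, secrets.token_hex(8))
        scrubbed.append((user_token, ip_token, video_id))
    return scrubbed


def viewing_histories(log):
    """List each user's set of watched videos, ignoring who the user is."""
    histories = {}
    for user, _ip, video_id in log:
        histories.setdefault(user, set()).add(video_id)
    return sorted(sorted(videos) for videos in histories.values())


# Hypothetical log: (user name, IP address, video ID).
raw_log = [
    ("User123", "10.0.0.1", "colbert_clip"),
    ("User123", "10.0.0.1", "costa_rica_trip"),
    ("anon9", "10.0.0.2", "colbert_clip"),
]
scrubbed_log = pseudonymise(raw_log)
```

The token that replaced User123 still links the Colbert clip to the Costa Rica video, so the personal video gives the viewer away just as before.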

To really anonymise the data, one would need to randomize the underlying network structure as well. This would get rid of community structure and make it harder to track down groups of people with similar interests or backgrounds. Since Viacom is allegedly not interested in such information (and by the judge's ruling, is not allowed to search for it even if it wants to), getting rid of links and scrambling the network structure is fair game.
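
One simple way to scramble the structure, sketched here under my own assumptions about the log layout, is to shuffle the viewer column across all rows. Each video keeps its total view count and each user keeps their total number of views, but who watched what is randomized.

```python
import random
from collections import Counter


def shuffle_viewers(log, seed=None):
    """Randomly reassign viewers to views, keeping the video column fixed."""
    rng = random.Random(seed)
    users = [user for user, _video_id in log]
    rng.shuffle(users)
    return [(user, video_id) for user, (_old_user, video_id) in zip(users, log)]


def views_per_video(log):
    """Total number of views recorded for each video."""
    return Counter(video_id for _user, video_id in log)


# Hypothetical log: (user name, video ID).
network_log = [
    ("ann", "v1"), ("ben", "v1"), ("cat", "v1"),
    ("ann", "v2"), ("ben", "v2"),
    ("dan", "v3"),
]
shuffled_log = shuffle_viewers(network_log, seed=1)
```

One caveat: a shuffle like this can assign the same user to a video twice, so the number of unique viewers per video is not preserved. That statistic would have to be computed before scrambling.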

Modifying the video content would be trickier. Viacom is trying to make the argument that illegally posted videos are more popular on YouTube than legal videos. To label a video as "legal" or "illegal", one would need to watch the content, and no technology exists to do this automatically (if there were, then Google should just use that and get this lawsuit over with).

To get over the problem of content compromising users' privacy, one option is to have Viacom submit a list of videos they feel contain illegal content. Google can then scramble all the video IDs and return a list showing which scrambled IDs represent illegal content, as judged by Viacom. This will ensure that Viacom cannot visit YouTube and track down who posted those videos or connect their data with other databases (say, check if users posting videos are also commenting elsewhere, or participating in other sites). While this process may sound tedious, Viacom will need to label data in such a way for the company / team / group / lawyers they hire to actually analyze the data.
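
The exchange described above can be sketched in a few lines. The function names, the random-token scheme, and the log layout are assumptions for illustration; the point is only that the mapping from real to scrambled IDs never leaves Google.

```python
import secrets


def scramble_video_ids(video_ids):
    """Assign every video a random token; this mapping stays with Google."""
    return {video_id: secrets.token_hex(8) for video_id in video_ids}


def release(view_log, flagged_by_viacom):
    """Return the log with scrambled video IDs, plus the scrambled flagged set."""
    mapping = scramble_video_ids({video_id for _viewer, video_id in view_log})
    scrambled_log = [(viewer, mapping[video_id]) for viewer, video_id in view_log]
    flagged = {mapping[v] for v in flagged_by_viacom if v in mapping}
    return scrambled_log, flagged


# Hypothetical view log (viewer, video ID) and a hypothetical list from Viacom.
id_log = [("u1", "vidA"), ("u2", "vidA"), ("u1", "vidB")]
released_log, released_flags = release(id_log, {"vidA"})

# Viacom can still count views of flagged content without knowing which video it is.
flagged_views = sum(video in released_flags for _viewer, video in released_log)
```

In a real release the viewer column would need scrambling too; it is left alone here to keep the sketch focused on the video IDs.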

In the End, Does It Matter?

What bothers me most about all of this, however, is that if Viacom simply wants to prove that illegal content is more popular than legal content, can't they use simple view statistics, or have Google calculate the number of unique viewers per video? I imagine that's more than enough information to draw such a conclusion.
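
Here is roughly what that calculation would look like if Google ran it in-house and handed over only the totals. The log layout, the flagged set, and the sample data are hypothetical.

```python
from collections import defaultdict


def unique_viewers(view_log):
    """Count distinct viewers per video; repeat views by one user count once."""
    viewers = defaultdict(set)
    for user, video_id in view_log:
        viewers[video_id].add(user)
    return {video_id: len(users) for video_id, users in viewers.items()}


def average_unique_viewers(counts, flagged):
    """Mean unique viewers for flagged videos and for all other videos."""
    def mean(values):
        return sum(values) / len(values) if values else 0.0

    flagged_counts = [n for video_id, n in counts.items() if video_id in flagged]
    other_counts = [n for video_id, n in counts.items() if video_id not in flagged]
    return mean(flagged_counts), mean(other_counts)


# Hypothetical view log: (user name, video ID).
stats_log = [
    ("u1", "vidA"), ("u2", "vidA"), ("u1", "vidA"), ("u3", "vidA"),
    ("u1", "vidB"),
    ("u2", "vidC"), ("u3", "vidC"),
]
viewer_counts = unique_viewers(stats_log)
flagged_mean, other_mean = average_unique_viewers(viewer_counts, {"vidA"})
```

Only the per-video counts need to be shared for this comparison, so no user names or IP addresses would ever have to change hands.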

Clearly, there's more behind this than meets the eye. Viacom initially wanted to see YouTube's source code, arguing that YouTube might be treating illegal content in a different way than legal content. Luckily the judge didn't feel that the company's source code was as relevant in this discussion.

We'll see what happens with this data, and how it is analyzed...

Thursday, July 17, 2008

1, 2, 3, 4...

What's better than a famous Canadian singer?

... A famous Canadian singer singing a remake of her song on Sesame Street!

And what's better than a Canadian singer singing a remake of her song on Sesame Street?

... A famous Canadian singer singing a remake of her song on Sesame Street and teaching children about the number four. Yes, that's two squared!

I love it.