Tuesday, May 19, 2009

Political Blog Networks and the U.S. Presidential Election

I had the great fortune of giving a talk at the Nuffield Networks seminar series today, titled "Political Blog Networks and the US Presidential Election". It was meant as an overview of some of the work I did at the IBM TJ Watson Research Center last year, though it also covered new work I'm doing in sentiment analysis and complex networks. Overall, I'm quite pleased with the talk. It allowed me to find a focus for some of the work I've been doing over the last few weeks. Below is a brief list of some of the important points I wanted to raise during the talk.

Machine Learning is Important for Social Science

I think the most exciting part of my presentation, though also a part that was quite low-key, is the potential that machine learning holds for social science research. Labeling blog posts by hand is useful, but labor-intensive and sometimes expensive. Tools like Amazon's Mechanical Turk provide a cheap alternative for labeling, but even this method is not scalable to two million or more blog posts.

I will not argue that machine learning can be the saviour of such social research, but rather that if it is used intelligently and correctly, it can help elucidate some of the trends within massive social systems (such as the blogosphere). By no means is this the death knell for human labels or qualitative research. Instead, I see the two working hand-in-hand.

Forget Word Vectors... Use Graph Theory

The subtitle is a bit strong, and maybe a little sensationalist. No, we shouldn't be avoiding word frequencies, multinomial distributions, or natural language processing... Keep these wonderful things, but also include the graph structure behind the blog posts and other data sets you are using! I remember writing a bit about some potential tools before, and I still think this is quite important.
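To make "keep the words, add the graph" a little more concrete, here is a minimal sketch of what a combined feature set could look like. It is not the code behind the talk: the blog names, post text, and hyperlinks are made up, and the `features` helper and its `word:`/`graph:` key names are hypothetical choices for illustration only.

```python
from collections import Counter

# Hypothetical data: one post per blog, plus (source, target) hyperlinks.
posts = {
    "blog_a": "election polls tighten",
    "blog_b": "new polls show election lead",
    "blog_c": "debate recap",
}
links = [("blog_a", "blog_b"), ("blog_c", "blog_b"), ("blog_b", "blog_a")]

# Simple graph features: how often each blog is linked to, and links out.
in_degree = Counter(target for _, target in links)
out_degree = Counter(source for source, _ in links)


def features(blog):
    """Combine word counts with in/out-degree for a single blog."""
    word_counts = Counter(posts[blog].split())
    feats = {f"word:{word}": count for word, count in word_counts.items()}
    feats["graph:in_degree"] = in_degree[blog]    # 0 if never linked to
    feats["graph:out_degree"] = out_degree[blog]  # 0 if it links to no one
    return feats


print(features("blog_b"))
```

Degree counts are about the simplest graph features you could pick. The point is only that they sit alongside the word counts in the same feature dictionary rather than replacing them.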

On a related point, predicting edges between nodes, while much harder (in my opinion) than predicting sentiment of a specific blog post, is still worth trying. There's a great paper that will be presented at the upcoming International Conference on Machine Learning, and it is worth reading.

Accuracy is Dead! Long Live Accuracy!

One of the biggest challenges with this type of approach is how difficult it is to actually make predictions and, more importantly, to validate models that predict rare events. Hyperlinks between bloggers are rare: almost every possible pair of bloggers never links. So when you're predicting hyperlinks, you can have a model with 99% accuracy by simply saying that no blogger will ever hyperlink to anyone. Accurate? Yes. Useless? Definitely.
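A tiny sketch of the problem, using made-up numbers: assume 1,000 candidate blogger pairs of which only 10 actually link (a hypothetical 1% rate, chosen to match the 99% figure). The `evaluate` helper is my own illustration, not a library function. Precision and recall are shown only as two common measures that expose the failure, not as the solution.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels, 1 = link exists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when nothing is predicted or nothing is true.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical data: 10 real links among 1,000 candidate pairs.
y_true = [1] * 10 + [0] * 990
# The "model" that says no blogger ever links to anyone.
always_no = [0] * 1000

accuracy, precision, recall = evaluate(y_true, always_no)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

The always-no model scores 0.99 on accuracy but 0.00 on recall, since it never finds a single real link.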

Unfortunately, it's a bit difficult to justify the use of inaccurate machine learning models for social science research. That being said, I'm confident some creative and interesting solutions exist to this problem.
