Wednesday, July 23, 2008

Scanning the Web for Diseases


Let's start with two over-simplifications. There's a lot of information on the Web... and it's hard to analyze. Indeed, regardless of which field you work in, the Web probably applies to you -- whether you're an economist studying auction systems, a computer scientist looking at technical infrastructure, or pretty much anything in between.

One application that tackles this problem for health practitioners is HealthMap. Funded by Google, it's a perfect example of the intersection of data mining and public policy.

HealthMap: An Overview

HealthMap scans various health sites and news directories, constantly looking for news related to health and diseases. It does this by scanning the actual text of the articles and, using a text classifier, tries to categorize each article by (1) a specific disease and (2) a specific region. This is much harder than it sounds: the software needs to tell the difference between a team of American doctors studying a new outbreak in England and a team of English doctors studying a new outbreak in America. That distinction is easy for humans, but computers are often quite bad at it.
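To make the idea concrete, here's a minimal sketch of the kind of text classification involved -- a tiny multinomial Naive Bayes classifier written from scratch. This is not HealthMap's actual system, and the training headlines and labels are made up for illustration:

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """Train a multinomial Naive Bayes model from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # per-label word frequencies
    label_counts = Counter()            # documents seen per label
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the most likely label for `text` under the model."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior plus log likelihood, with Laplace smoothing
        # so unseen words don't zero out a label's score.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) /
                              (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set: which country is the outbreak actually in?
docs = [
    ("new outbreak reported in england", "england"),
    ("outbreak spreads across england towns", "england"),
    ("doctors report outbreak in america", "america"),
    ("american cities see new outbreak", "america"),
]
model = train(docs)
print(classify("outbreak in england", model))  # -> england
```

With only four toy documents this works on easy cases; the hard part, as noted above, is exactly the ambiguous headlines where word overlap alone isn't enough.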

Once this information is collected and synthesized, the site displays information on outbreaks as a Google Maps mashup, making it easy to check where outbreaks are happening and what is going on in specific regions.
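The mapping step itself is conceptually simple once articles carry (disease, region) tags. A sketch of turning such records into marker data for a map mashup might look like this -- the region coordinates here are rough, hand-picked values, not HealthMap's actual geocoding:

```python
# Hypothetical region-to-coordinate lookup for illustration only.
REGION_COORDS = {"england": (52.4, -1.5), "america": (39.8, -98.6)}

def to_markers(reports):
    """Turn (disease, region) reports into map-marker dicts."""
    markers = []
    for disease, region in reports:
        lat, lng = REGION_COORDS.get(region, (0.0, 0.0))
        markers.append({"title": "%s in %s" % (disease, region),
                        "lat": lat, "lng": lng})
    return markers

print(to_markers([("cholera", "england")]))
```

Each marker dict then becomes one pin on the map, so browsing by region is just a matter of where you pan.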

An overview of the technology is provided in a recent paper in the Journal of the American Medical Informatics Association.

The great thing about this is that with open source tools -- Nutch for web crawling, WVTool for text analysis, and Weka for model generation -- you can build a prototype of a HealthMap-like tool in a few weeks. Of course, making the classifiers accurate -- correctly mapping articles to specific regions and diseases -- is often the hardest (and most important!) part.
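The overall pipeline shape -- fetch articles, classify each, aggregate by region for display -- can be sketched in a few lines. Here naive keyword matching stands in for the trained classifiers (a real prototype would plug in models built with Weka or similar), and the article texts are invented:

```python
from collections import defaultdict

# Keyword matching as a hypothetical stand-in for trained classifiers.
DISEASES = ("cholera", "influenza", "measles")
REGIONS = ("england", "america")

def classify_disease(text):
    return next((d for d in DISEASES if d in text.lower()), "unknown")

def classify_region(text):
    return next((r for r in REGIONS if r in text.lower()), "unknown")

def aggregate(articles):
    """Group article counts by region, then by disease, for display."""
    counts = defaultdict(lambda: defaultdict(int))
    for text in articles:
        counts[classify_region(text)][classify_disease(text)] += 1
    return counts

articles = [
    "Cholera outbreak reported in England",
    "Influenza cases rise across America",
    "Second cholera cluster found in England",
]
print(dict(aggregate(articles)["england"]))  # -> {'cholera': 2}
```

Swapping the keyword stubs for real statistical classifiers is where nearly all of the engineering effort (and the accuracy problem described above) lives.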

The Challenge of Unstructured Data

HealthMap is a great response to the deluge of data that people and organizations have to deal with on the Web. Collecting data, analyzing it, and making it accessible is a major challenge in almost every field of study. Another example of a response to this is Issue Crawler, which allows one to explore political discussions online.

The main difference between these approaches and tools like Wikipedia and Who Is Sick? is that the latter use distributed networks of people to collect and organize information. I imagine that one great opportunity in the next few years will be combining such "people power" with machine learning to build web services that help us deal with all this information and data.
