MedTweets - overview
Named Entity Recognition (NER) is an important task in Natural Language Processing. It detects entities such as organizations, locations and person names in plain text. This project aims to recognize medical entities in social media text. For instance, "penicillin" will be identified as a drug in the tweet "Getting a bit bored of this headache, hope the penicillin kicks in soon". Social media amasses huge volumes of user-generated data, including many discussions of health problems, which gives the project an excellent source for health data mining and exploration. Nonetheless, NER on tweets is non-trivial, because social media text is generally much noisier than conventional datasets (e.g., articles in medical databases). Non-standard words, incorrect capitalization and ungrammatical sentences all degrade the performance of off-the-shelf NER tools operating on social media data.
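To make the task concrete, the following is a minimal sketch of entity tagging by exact dictionary lookup. It is not the tagger built in this project; the `LEXICON` contents and the `tag_entities` function are hypothetical names introduced only for illustration.

```python
import re

# Hypothetical toy lexicon (an assumption for illustration only).
# A real tagger would need a far larger vocabulary and a learned model.
LEXICON = {
    "penicillin": "DRUG",
    "headache": "SYMPTOM",
    "influenza": "DISEASE",
}

def tag_entities(text):
    """Return (token, label) pairs for every word found in LEXICON."""
    entities = []
    for match in re.finditer(r"[A-Za-z]+", text):
        # Lowercase before lookup to tolerate erratic capitalization.
        label = LEXICON.get(match.group().lower())
        if label is not None:
            entities.append((match.group(), label))
    return entities

tweet = "Getting a bit bored of this headache, hope the penicillin kicks in soon"
print(tag_entities(tweet))
# [('headache', 'SYMPTOM'), ('penicillin', 'DRUG')]
```

Exact lookup of this kind finds nothing in a tweet that spells the drug "penicilin", which illustrates why noisy social media text calls for more robust features and methods.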
This project aims to detect medical entities in English Twitter messages. We primarily target three types of entities: diseases, symptoms and pharmacological drugs. We have compiled an annotated dataset and built a live (IBM internal) demo to show the trends of medical entities over time and across different locations. We are continuously improving the accuracy of this medical entity tagger by exploring novel features and methods. The utility of this tagger includes, but is not limited to, the following areas:
- Bio-surveillance, such as monitoring influenza and detecting bio-terrorist attacks.
- Detection of localized health threats, such as water contamination, and monitoring of disease.
- Monitoring of adverse drug reactions, such as immunization reactions, and tracking of illegal drugs.