OJPHI: Vol. 5


Journal Information
Journal ID (publisher-id): OJPHI
ISSN: 1947-2579
Publisher: University of Illinois at Chicago Library

Article Information
©2013 the author(s)
Open access: This is an Open Access article. Authors own copyright of their articles appearing in the Online Journal of Public Health Informatics. Readers may copy articles without permission of the copyright owner(s), as long as the author and OJPHI are acknowledged in the copy and the copy is used for educational, not-for-profit purposes.
Electronic publication date: 4 April 2013
Collection publication date: 2013
Volume: 5
E-location ID: e65
Publisher ID: ojphi-05-65

#wheezing: A Content Analysis of Asthma-Related Tweets

Gwendolyn Gillingham*¹

Michael A. Conway²

Wendy W. Chapman²

Michael B. Casale³

Kathryn B. Pettigrew³

¹Linguistics, UCSD, La Jolla, CA, USA;

²UCSD - Division of Biomedical Informatics, La Jolla, CA, USA;

³West Health Institute, La Jolla, CA, USA

*Gwendolyn Gillingham, E-mail: gwen.gillingham@ling.ucsd.edu

Abstract

Objective

We present a Content Analysis project using Natural Language Processing to aid in Twitter-based syndromic surveillance of Asthma.

Introduction

Recently, a growing number of studies have made use of Twitter to track the spread of infectious disease. These investigations show that there are reliable spikes in traffic related to keywords associated with the spread of infectious diseases like Influenza [¹], as well as other Syndromes [²]. However, little research has been done using Social Media to monitor chronic conditions like Asthma, which do not spread from sufferer to sufferer. We therefore test the feasibility of using Twitter for Asthma surveillance, using techniques from NLP and machine learning to achieve a deeper understanding of what users Tweet about Asthma, rather than relying only on keyword search.

Methods

We retrieved a large volume of Tweets from the Twitter API. Search terms included “asthma” and several misspellings of that word; terms for common medical devices associated with Asthma, such as “inhaler” and “nebulizer”; and names of prescription drugs used to treat the condition, including “albuterol” and “Singulair.” A randomly sampled subset of these Tweets (N=3511) was annotated for content, based on an annotation scheme that coded for the following elements: the Experiencer of Asthma symptoms (Self, Family, Friend, Named Other, Unidentified, and All-Non-Self, which was the union of these last four categories); aspects of the type of information being conveyed by each Tweet (Medication, Triggers, Physical Activity, Contacting of a Medical Practitioner, Allergies, Questions, Suggestions, Information, News, Spam); as well as Negative Sentiment, Future temporality, and Non-English content. Further details on the annotation scheme can be found at http://idiom.ucsd.edu/~ggilling/annotation.pdf. Inter-annotator agreement on a subset of the Tweets (N=403) fell in an acceptable range for all categories (Cohen’s Kappa >0.6). Once annotation was complete, the Tweets’ texts were stemmed and converted into vectors of unigram and bigram counts. These were then stripped of sparse terms (all those appearing in fewer than 1 in 200 Tweets), which left each Tweet represented as a multi-dimensional vector of counts over the remaining terms. Statistical machine-learning classifiers, including K-nearest neighbors, Naive Bayes and Support Vector Machines (SVM), were then trained on the unigram and bigram models.
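As an illustration only, the feature-construction step described above (stemming, n-gram counting, and removal of terms appearing in fewer than 1 in 200 Tweets) might be sketched as follows. The abstract does not name the stemmer or toolkit used, so `crude_stem` is a deliberately simplistic stand-in, and the function names, the `one_in` parameter, and the sample Tweets are all hypothetical.

```python
from collections import Counter


def crude_stem(token):
    # Assumption: a toy suffix stripper standing in for a real stemmer.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    # Lowercase, replace non-alphanumeric characters with spaces, then stem.
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return [crude_stem(tok) for tok in cleaned.split()]


def ngrams(tokens, n):
    # n=1 yields unigrams, n=2 yields bigrams (adjacent word pairs).
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_count_vectors(tweets, n, one_in=200):
    """Return (vocabulary, count vectors) after dropping sparse terms.

    A term is kept only if it occurs in at least 1 in `one_in` Tweets.
    """
    per_tweet = [Counter(ngrams(tokenize(t), n)) for t in tweets]
    doc_freq = Counter()
    for counts in per_tweet:
        doc_freq.update(counts.keys())  # count each term once per Tweet
    vocab = sorted(t for t, df in doc_freq.items() if df * one_in >= len(tweets))
    vectors = [[counts[t] for t in vocab] for counts in per_tweet]
    return vocab, vectors


if __name__ == "__main__":
    # Made-up Tweets; one_in=2 is used so the tiny sample shows the filter.
    sample = [
        "My asthma is acting up",
        "Need my inhaler, asthma again",
        "Wheezing badly",
    ]
    vocab, vectors = build_count_vectors(sample, n=1, one_in=2)
    print(vocab)    # ['asthma', 'my']
    print(vectors)  # [[1, 1], [1, 1], [0, 0]]
```

On the same sample with `n=2`, no bigram occurs in two Tweets, so the bigram vocabulary is empty after filtering. This toy case shows how word pairs can be much sparser than single words.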

Results

SVM with 10-fold cross-validation achieved the greatest prediction accuracy with the unigram model, as shown in Table 1. Categories that showed the greatest reduction in classification error using the unigram model were Non-English, Self, All-Non-Self, Medication, Symptoms and Spam. The majority of these categories showed very high Precision, as well as fairly high Recall, for the unigram model. Unexpectedly, the bigram model fared far worse than the unigram model, which suggests that individual words in these Tweets were more reliably predictive of content than pairs of words, which occurred less frequently.
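For readers unfamiliar with the evaluation procedure, 10-fold cross-validation can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how folds were drawn or which software was used, and a small K-nearest-neighbors classifier (one of the model families named in Methods) stands in for the SVM. All function names are hypothetical; any trainer with the same `train_fn(train_x, train_y) -> predict` shape could be substituted.

```python
import random
from collections import Counter


def k_fold_indices(n_items, k=10, seed=0):
    # Shuffle item indices once, then deal them into k disjoint folds.
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_knn(train_x, train_y, k=3):
    # Stand-in classifier (assumption): majority vote among the k nearest
    # training vectors by squared Euclidean distance.
    def predict(x):
        by_distance = sorted(
            range(len(train_x)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
        )
        votes = Counter(train_y[i] for i in by_distance[:k])
        return votes.most_common(1)[0][0]

    return predict


def cross_validated_error(features, labels, train_fn, k=10, seed=0):
    # Each fold is held out once; the model is trained on the other folds
    # and scored only on items it has not seen.
    errors = 0
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train_x = [x for i, x in enumerate(features) if i not in held]
        train_y = [y for i, y in enumerate(labels) if i not in held]
        predict = train_fn(train_x, train_y)
        errors += sum(predict(features[i]) != labels[i] for i in held_out)
    return errors / len(labels)
```

Because every Tweet is scored exactly once by a model that never saw it during training, the resulting error rate is a fairer estimate of performance on new Tweets than training-set error.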

Conclusions

Text classification increases the utility of Twitter as a data source for studying chronic conditions such as Asthma. Using these methods, we can automatically reject Tweets that are non-English or Spam. We can also determine who is experiencing symptoms: the Twitter user or another individual. Fairly simple models are able to predict with good certainty whether a user is talking about their Symptoms, their Medication, or Triggers for their Asthma, as well as whether they are expressing Negative sentiment about their condition. We demonstrate that Social Media such as Twitter is a promising means by which to conduct surveillance for chronic conditions such as Asthma.

Table 1: Performance of Classifiers on Unigram and Bigram Models

Dimension	Baseline Error (= 1 - Majority Class Count / N)	Unigram Model Error	Bigram Model Error	Unigram Precision	Unigram Recall
Non-English	0.19	0.07	0.17	0.9	0.82
Self	0.22	0.16	0.18	0.94	0.76
All-Non-Self	0.19	0.14	0.17	0.94	0.59
Medication	0.15	0.098	0.16	0.89	0.77
Symptoms	0.21	0.16	0.17	0.9	0.76
Spam	0.07	0.055	0.17	0.93	0.43
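The quantities reported in Table 1 can be computed as in the following sketch. It assumes each dimension is a binary label and that the annotated category (for example, Spam) is the positive class; the abstract does not state this explicitly. The function names and example labels are hypothetical.

```python
from collections import Counter


def baseline_error(labels):
    # Error of always predicting the most frequent class:
    # 1 - (majority class count / N).
    majority_count = Counter(labels).most_common(1)[0][1]
    return 1 - majority_count / len(labels)


def precision_recall(gold, predicted, positive=1):
    # Precision: share of predicted positives that are truly positive.
    # Recall: share of true positives that the classifier found.
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up labels: 4 positive and 6 negative Tweets.
    gold = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
    print(precision_recall(gold, predicted))  # (0.75, 0.75)
```

Read this way, a row with high Precision but lower Recall describes a classifier that is rarely wrong when it assigns the category but misses a share of the Tweets that belong to it.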


1.	Chew C, Eysenbach G. Pandemics in the Age of Twitter: Content Analysis of Tweets in the H1N1 Outbreak. PLoS ONE. 2010;5(11):e14118.
2.	Collier N, Doan S. Syndromic Classification of Twitter Messages. Proc. eHealth 2011; Malaga, Spain; November 21–23, 2011.


Article Categories: ISDS 2012 Conference Abstracts

Keywords: social media, natural language processing, asthma, content analysis.