TY  - JOUR
AU  - Briscoe, Erica
AU  - Appling, Scott
AU  - Clarkson, Edward
AU  - Lipskiy, Nikolay
AU  - Tyson, James
AU  - Burkholder, Jacqueline
PY  - 2017/05/02
Y2  - 2024/03/28
TI  - Semantic Analysis of Open Source Data for Syndromic Surveillance
JF  - Online Journal of Public Health Informatics
JA  - OJPHI
VL  - 9
IS  - 1
SE  - Language processing, classifiers, and syndrome definitions
DO  - 10.5210/ojphi.v9i1.7651
UR  - https://ojphi.org/ojs/index.php/ojphi/article/view/7651
SP  - 
AB  - Objective
The objective of this analysis is to leverage recent advances in natural language processing (NLP) to develop new methods and system capabilities for processing social media (Twitter messages) for situational awareness (SA), syndromic surveillance (SS), and event-based surveillance (EBS). Specifically, we evaluated the use of human-in-the-loop semantic analysis to assist public health (PH) SA stakeholders in SS and EBS using massive amounts of publicly available social media data.
Introduction
Social media messages are often short, informal, and ungrammatical. They frequently involve text, images, audio, or video, which makes the identification of useful information difficult. This complexity reduces the efficacy of standard information extraction techniques [1]. However, recent advances in NLP, especially methods tailored to social media [2], have shown promise in improving real-time PH surveillance and emergency response [3]. Surveillance data derived from semantic analysis, combined with traditional surveillance processes, has the potential to improve event detection and characterization. The CDC Office of Public Health Preparedness and Response (OPHPR), Division of Emergency Operations (DEO) and the Georgia Tech Research Institute have collaborated on the advancement of PH SA through development of new approaches in using semantic analysis for social media.
Methods
To understand how computational methods may benefit SS and EBS, we studied an iterative refinement process in which the data user actively cultivated text-based topics ("semantic culling") in a semi-automated SS process. This "human-in-the-loop" process was critical for creating accurate and efficient extraction functions in large, dynamic volumes of data. The general process involved identifying a set of expert-supplied keywords, which were used to collect an initial set of social media messages. For purposes of this analysis, researchers applied topic modeling to categorize related messages into clusters. Topic modeling uses statistical techniques to semantically cluster and automatically determine salient aggregations. A user then semantically culled messages according to their PH relevance.
In June 2016, researchers collected 7,489 worldwide English-language Twitter messages (tweets) and compared three sampling methods: a baseline random sample (C1, n=2700), a keyword-based sample (C2, n=2689), and one gathered after semantically culling C2 topics of irrelevant messages (C3, n=2100). Researchers used a software tool, Luminoso Compass [4], to sample and perform topic modeling using its real-time modeling and Twitter integration features. For C2 and C3, researchers sampled tweets that the Luminoso service matched to both clinical and layman definitions of Rash, Gastro-Intestinal syndromes [5], and Zika-like symptoms. Layman terms were derived from clinical definitions using plain-language medical thesauri. ANOVA statistics were calculated using SPSS software, version. Post-hoc pairwise comparisons were completed using Tukey's honest significant difference (HSD) test.
Results
An ANOVA was conducted, finding the following mean relevance values: 3% (+/- 0.01%), 24% (+/- 6.6%), and 27% (+/- 9.4%) for C1, C2, and C3, respectively. Post-hoc pairwise comparison tests showed that the percentages of event-related tweets discovered using the C2 and C3 methods were significantly higher than for the C1 method (random sampling) (p < 0.05). This indicates that the human-in-the-loop approach provides benefits in filtering social media data for SS and EBS. Notably, this increase is based on a single iteration of semantic culling; subsequent iterations could be expected to increase the benefits.
Conclusions
This work demonstrates the benefits of incorporating non-traditional data sources into SS and EBS. It was shown that an NLP-based extraction method in combination with human-in-the-loop semantic analysis may enhance the potential value of social media (Twitter) for SS and EBS. It also supports the claim that advanced analytical tools for processing non-traditional SA, SS, and EBS sources, including social media, have the potential to enhance disease detection, risk assessment, and decision support by reducing the time it takes to identify public health events.
ER  - 