TY  - JOUR
AU  - Lix, Lisa
AU  - Munakala, Sree Nihit
AU  - Singer, Alexander
PY  - 2017/05/02
Y2  - 2024/03/28
TI  - Automated Classification of Alcohol Use by Text Mining of Electronic Medical Records
JF  - Online Journal of Public Health Informatics
JA  - OJPHI
VL  - 9
IS  - 1
SE  - Language processing, classifiers, and syndrome definitions
DO  - 10.5210/ojphi.v9i1.7648
UR  - https://ojphi.org/ojs/index.php/ojphi/article/view/7648
SP  - 
AB  - Objective: The research objective was to develop and validate an automated system to extract and classify patient alcohol use based on unstructured (i.e., free) text in primary care electronic medical records (EMRs). Introduction: EMRs are a potentially valuable source of information about a patient’s history of health risk behaviors, such as excessive alcohol consumption or smoking. This information is often found in the unstructured (i.e., free) text of physician notes. It may be difficult to classify and analyze health risk behaviors because there are no standardized formats for this type of information [1]. As well, the completeness of the data may vary across clinics and physicians. The application of automated classification tools for this type of information could be useful for describing patterns within the population and developing disease risk prediction models. Natural Language Processing (NLP) tools are currently used to process EMR free text in an automated and systematic way. However, these tools have primarily been applied to classify information about the presence or absence of disease diagnoses [2]. The application of NLP tools to health risk behaviors, particularly alcohol use information from primary care EMRs, has thus far received limited attention. Methods: Study data were from the Manitoba regional network of the Canadian Primary Care Sentinel Surveillance Network (CPCSSN) for the period from 1998 to 2016. CPCSSN is a national primary care surveillance network for chronic diseases comprised of 11 regional networks with publicly funded healthcare systems. Currently, a total of 53 clinics and more than 260 physicians provide data to CPCSSN in Manitoba. We classified each record based on unstructured text from physician notes into the following mutually exclusive categories: current drinker, not a current drinker, and unknown [1]. A standardized de-identification process was applied to each record prior to applying an NLP tool to the data. Text classification used a support vector machine (SVM) applied to both unigrams (i.e., single words) and mixed grams (i.e., unigrams, and pairs of words known as bigrams) from a bag-of-words model in which each record is quantified by the relative frequency of occurrence of each word in the record [3]. The training dataset for the SVM was comprised of 2000 records classified by two primary care physicians. These physicians were initially trained using an independent sample of 200 EMR text strings containing specific reference to alcohol use. Cohen’s kappa statistic, a chance-adjusted measure, was used to estimate agreement. Internal validation of the SVM was conducted using 10-fold cross-validation techniques. Model performance was assessed using recall and precision statistics, as well as the F-measure statistic, which is a function of their average. All analyses were conducted using the R open-source software package. Results: A total of 57,663 unique records were included in the study. The estimate of the kappa statistic for the physician training phase was 0.98, indicating excellent agreement. Subsequent classification of the training dataset by the physicians resulted in 1.7% of records assigned as not a current drinker, 16.8% as current drinker, and 81.5% as unknown. Average estimates of recall for the 10 validation folds using unigrams were 0.62 for not current drinkers, 0.86 for current drinkers, and 0.98 for the unknown category. Average estimates of recall using mixed grams were 0.48, 0.84, and 0.97, respectively. Estimates of precision were higher with mixed grams than unigrams for the not currently drinking category, but the opposite was true for the current drinker category. There was no difference in precision between the two methods for the unknown category. The F-measure statistic was higher for classification of current drinkers using unigrams (0.89) than mixed grams (0.83), although the differences for the unknown category were negligible (0.98 versus 0.97). Application of the SVM with unigrams to the entire dataset resulted in 15.3% of records classified as current drinkers, 2.0% classified as not current drinkers, and 82.7% as unknown. Conclusions: This study developed an automated system to classify unstructured text about alcohol consumption into mutually-exclusive alcohol use categories. However, we found that only a small percentage of primary care records contained documentation about alcohol consumption, which limits the utility of the automated tool and the data source for disease risk prediction or alcohol use prevalence estimation [1]. While our automated approach is useful for processing existing EMR data, systematic documentation of alcohol consumption will benefit from standardized entry fields and terms to produce clinically meaningful information that will improve the understanding of health risk behaviors in primary care populations.
ER  - 