Identification of Sufferers of Rare Diseases Using Medical Claims Data

How to Cite

Chen, J., & Dubrawski, A. (2017). Identification of Sufferers of Rare Diseases Using Medical Claims Data. Online Journal of Public Health Informatics, 9(1).


ISDS Annual Conference Proceedings 2017. This is an Open Access article distributed under the terms of the Creative Commons Attribution-Noncommercial 3.0 Unported License, permitting all non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ISDS 2016 Conference Abstracts

Jieshi Chen* and Artur Dubrawski
Auton Lab, Carnegie Mellon University, Pittsburgh, PA, USA

Objective

To identify sufferers of a rare and hard-to-diagnose disease by detecting sequential patterns in historical medical claims.

Introduction

Patients who suffer from rare diseases can remain undiagnosed for prolonged periods of time. In the process, they are often subjected to tentative treatments for ailments they do not have, risking an escalation of their actual condition and side effects from therapies they do not need. Early and accurate detection of these cases would enable follow-ups for precise diagnoses, mitigating the costs of unnecessary care and improving patients’ outcomes.

Methods

A sequential rule learning algorithm [1] was applied to a medical claims dataset of about 1,700 patients who were pre-selected as having medical histories indicative of Gaucher Disease (GD); only 25 of these patients were confirmed positives. About 168,000 medical claims and 142,000 pharmaceutical claims were featurized into sequences of asynchronous events and regularly sampled time series as inputs for the model, such that an occurrence of a certain diagnosis code in a medical claim was counted as one event along the timeline of the patient’s medical history. A similar method was applied to other key attributes of claims data, including procedure codes, National Drug Codes, Diagnosis Related Groupers, etc. These types of events, as well as their temporal statistics, e.g.
moving frequencies, peaks, change points, etc., formed the input feature space for the algorithm, which was trained to adjudicate each test case and estimate its likelihood of having GD. A random forest algorithm was also applied to the same feature set to comparatively evaluate the utility of the sequential aspects of the data. The models were evaluated with 10-fold cross-validation.

Results

Figure 1 shows the Receiver Operating Characteristic (ROC) curves of the temporal rule model, whose Area Under the Curve score exceeds 81%, significantly outperforming the random forest and default models. Considering the practical costs of performing follow-up genetic tests, we prefer a model achieving high positive recall at low risk of false detection. Our model correctly identifies more than 25% of known positive cases at a false positive rate well within 0.1%, while the performance of a more popular alternative is indistinguishable from random. This demonstrates the utility of the sequential structure of medical claims in identifying patients who suffer from rare diseases.

Our algorithm infers from data highly interpretable rules that it uses in case adjudication. Figure 2 illustrates one of them. The root node of the case adjudication tree (Event.7969) reflects the ICD-9 diagnosis code of “Other nonspecific abnormal findings”. Among the 14 patients who have this particular ICD-9 code present in their claim history, 36% are confirmed GD sufferers. Compared to the default prevalence of 1.47% in our pre-selected data set, this rule lifts the estimated likelihood of GD roughly 25 times. The rule further develops into two child nodes. The left child node adds the condition of having any outpatient claim observed within 43 claims recorded near the occurrence of the root node event. It isolates 5 patients, all of whom are GD-positive. The right child shows that 3 patients without Event.7969 in their claim history but prescribed NDC 62756-0137-02 (Gabapentin by Sun Pharmaceutical Industries Ltd.) are all GD-positive.
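The lift reported for the root node can be checked directly from the counts given in the text. The sketch below is only an illustration of that arithmetic: the function name `lift` is hypothetical, the cohort size of 1,700 is the approximate figure stated in Methods, and the count of 5 positives among the 14 matching patients is inferred from the quoted 36% rather than stated explicitly.

```python
def lift(rule_positives, rule_total, base_positives, base_total):
    """Ratio of the positive rate among rule matches to the base prevalence."""
    rule_rate = rule_positives / rule_total
    base_rate = base_positives / base_total
    return rule_rate / base_rate


# Counts taken from the text: about 1,700 pre-selected patients, 25 confirmed
# positives, 14 patients with Event.7969 in their claim history.
# ASSUMPTION: 5 of those 14 are positive (inferred from the quoted "36%").
base_rate = 25 / 1700
root_rate = 5 / 14
root_lift = lift(5, 14, 25, 1700)

print(f"base prevalence: {base_rate:.2%}")
print(f"root node positive rate: {root_rate:.1%}")
print(f"lift: {root_lift:.1f}x")
```

With these inputs the lift comes out a little above 24, in line with the roughly 25-fold figure obtained from the rounded percentages (36% divided by 1.47%).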

The rule in Figure 2 is just one example of a simple and easy-to-implement business rule that is capable of identifying previously undiagnosed sufferers of rare diseases.

Conclusions

Our model successfully utilizes sequential relationships among events recorded in medical claims data and reveals interpretable patterns that can identify sufferers of rare diseases with high confidence. The algorithm scales well to large volumes of medical claims data, and it remains sensitive despite the very low prevalence of target cases in the data.

Figure 1: ROC diagrams of models trained to identify GD patients, shown with a decimal logarithmic scale on the false positive rate axis.
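The operating point emphasized in Results (recall of positives at a capped false positive rate) can be read off a ranked list of model scores. The following is a minimal sketch of that computation, not the authors' evaluation code: the function `recall_at_fpr` and the toy scores and labels are hypothetical and serve only to show the mechanics. As a rough guide, assuming about 1,675 negatives (1,700 minus 25) scored together, a 0.1% cap leaves room for at most one false positive.

```python
def recall_at_fpr(scores, labels, max_fpr):
    """Best recall reachable by thresholding scores while keeping FPR <= max_fpr."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    # Rank cases from most to least suspicious.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    best_recall = 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Consume all cases tied at this score so each threshold is well defined.
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        # The false positive count only grows, so stop once the cap is exceeded.
        if fp / n_neg > max_fpr:
            break
        best_recall = tp / n_pos
    return best_recall


# Toy data, made up for illustration only (1 = confirmed positive).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0]

strict = recall_at_fpr(toy_scores, toy_labels, max_fpr=0.0)
relaxed = recall_at_fpr(toy_scores, toy_labels, max_fpr=0.2)
print(f"recall with no false positives allowed: {strict:.2f}")
print(f"recall at FPR <= 20%: {relaxed:.2f}")
```

Plotting such recall values against the false positive rate on a logarithmic axis, as in the figure described above, makes differences at very small false positive rates visible that a linear axis would compress into the left edge.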