A Spatial Biosurveillance Synthetic Data Generator in R


  • Drew Levin Sandia National Laboratories, Albuquerque, NM, USA
  • Patrick Finley Sandia National Laboratories, Albuquerque, NM, USA




ObjectiveTo develop a spatially accurate biosurveillance synthetic datagenerator for the testing, evaluation, and comparison of new outbreakdetection techniques.IntroductionDevelopment of new methods for the rapid detection of emergingdisease outbreaks is a research priority in the field of biosurveillance.Because real-world data are often proprietary in nature, scientists mustutilize synthetic data generation methods to evaluate new detectionmethodologies. Colizza et. al. have shown that epidemic spread isdependent on the airline transportation network [1], yet current datagenerators do not operate over network structures.Here we present a new spatial data generator that models thespread of contagion across a network of cities connected by airlineroutes. The generator is developed in the R programming languageand produces data compatible with the popular `surveillance’ softwarepackage.MethodsColizza et. al. demonstrate the power-law relationships betweencity population, air traffic, and degree distribution [1]. We generate atransportation network as a Chung-Lu random graph [2] that preservesthese scale-free relationships (Figure 1).First, given a power-law exponent and a desired number of cities,a probability mass function (PMF) is generated that mirrors theexpected degree distribution for the given power-law relationship.Values are then sampled from this PMF to generate an expecteddegree (number of connected cities) for each city in the network.Edges (airline connections) are added to the network probabilisticallyas described in [2]. Unconnected graph components are each joinedto the largest component using linear preferential attachment. Finally,city sizes are calculated based on an observed three-quarter power-law scaling relationship with the sampled degree distribution.Each city is represented as a customizable stochastic compartmentalSIR model. Transportation between cities is modeled similar to [2].An infection is initialized in a single random city and infection countsare recorded in each city for a fixed period of time. A consistentfraction of the modeled infection cases are recorded as daily clinicvisits. These counts are then added onto statically generated baselinedata for each city to produce a full synthetic data set. Alternatively,data sets can be generated using real-world networks, such as the onemaintained by the International Air Transport Association.ResultsDynamics such as the number of cities, degree distribution power-law exponent, traffic flow, and disease kinetics can be customized.In the presented example (Figure 2) the outbreak spreads over a 20city transportation network. Infection spreads rapidly once the morepopulated hub cities are infected. Cities that are multiple flights awayfrom the initially infected city are infected late in the process. Thegenerator is capable of creating data sets of arbitrary size, length, andconnectivity to better mirror a diverse set of observed network types.ConclusionsNew computational methods for outbreak detection andsurveillance must be compared to established approaches. Outbreakmitigation strategies require a realistic model of human transportationbehavior to best evaluate impact. These actions require test data thataccurately reflect the complexity of the real-world data they wouldbe applied to. The outbreak data generated here represents thecomplexity of modern transportation networks and are made to beeasily integrated with established software packages to allow for rapidtesting and deployment.Randomly generated scale-free transportation network with a power-lawdegree exponent ofλ=1.8. City and link sizes are scaled to reflect their weight.An example of observed daily outbreak-related clinic visits across a randomlygenerated network of 20 cities. Each city is colored by the number of flightsrequired to reach the city from the initial infection location. These generatedcounts are then added onto baseline data to create a synthetic data set forexperimentation.KeywordsSimulation; Network; Spatial; Synthetic; Data




How to Cite

Levin, D., & Finley, P. (2017). A Spatial Biosurveillance Synthetic Data Generator in R. Online Journal of Public Health Informatics, 9(1). https://doi.org/10.5210/ojphi.v9i1.7583



Evaluation through epidemic simulation