A user-friendly tool to transform large-scale administrative data into wide table format using a MapReduce program with a Pig Latin-based script

Background
Secondary use of large-scale administrative data is increasingly popular in health services and clinical research, where user-friendly tools for data management are in great demand. MapReduce technology such as Hadoop is a promising tool for this purpose, but its use has been limited by the lack of user-friendly functions for transforming large-scale data into wide table format, in which each subject is represented by one row. Since the original specification of Pig provides very few functions for column field management, we developed a novel system called GroupFilterFormat to handle the definition of fields and data content based on a Pig Latin script. We also developed, as an open-source project, several user-defined functions to transform the table format using GroupFilterFormat and to handle processing that depends on date conditions.

Results
We prepared dummy discharge summary data for 2.3 million inpatients and medical activity log data for 950 million events, and used the Elastic Compute Cloud environment provided by Amazon to run processing speed and scaling benchmarks. In the speed benchmark, the response time was significantly reduced, and a linear relationship was observed between the quantity of data and processing time in both a small and a very large dataset. The scaling benchmark showed clear scalability: in our system, doubling the number of nodes resulted in a 47% decrease in processing time.

Conclusions
Our newly developed system is widely accessible as an open resource. It is simple and easy to use for researchers who are accustomed to the declarative command syntax of commercial statistical software and Structured Query Language. Although the system needs further refinement to allow more flexibility in scripts and to improve the efficiency of data processing, it shows promise in facilitating the application of MapReduce technology to efficient processing of large-scale administrative data in health services and clinical research.
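
The wide table transformation described above (one row per subject, with events filtered by a date condition) can be illustrated with a minimal Python sketch. This is not the authors' GroupFilterFormat system or their Pig Latin script; the function `to_wide`, the field names, and the sample records are all hypothetical and only show the group, filter, and format steps on a tiny in-memory dataset.

```python
from collections import defaultdict
from datetime import date

# Hypothetical long-format activity log: (patient_id, event_date, activity_code).
events = [
    ("P001", date(2011, 4, 1), "CT"),
    ("P001", date(2011, 4, 3), "SURGERY"),
    ("P002", date(2011, 4, 2), "CT"),
    ("P001", date(2011, 4, 9), "CT"),
]


def to_wide(rows, codes, start, end):
    """Group events by patient, keep those dated within [start, end],
    and emit one row per patient with a count column for each code."""
    counts = defaultdict(lambda: dict.fromkeys(codes, 0))
    for patient_id, event_date, code in rows:
        # Filter step: keep only requested codes inside the date window.
        if code in codes and start <= event_date <= end:
            counts[patient_id][code] += 1
    # Format step: one dictionary (row) per patient, sorted by identifier.
    return [{"patient_id": pid, **counts[pid]} for pid in sorted(counts)]


wide = to_wide(events, ["CT", "SURGERY"], date(2011, 4, 1), date(2011, 4, 7))
for row in wide:
    print(row)
```

In a MapReduce setting the grouping by patient identifier would correspond to the shuffle between the map and reduce phases, with the filtering and counting done per group; the sketch does everything in a single process for readability.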
