Alphacruncher’s framework for data-driven research
Alexandru Popescu
Back in 1962, John Tukey pointed in his work “The Future of Data Analysis” to the existence of an as-yet unrecognized science whose subject of interest was learning from data, or “data analysis”:1
“… my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.”
More recently, in his 2011 book “Data Analysis: What Can Be Learned from the Past 50 Years,” Peter J. Huber further emphasized the need for a holistic approach to data analysis:
“Unfortunately, the data analyst rarely has any control at all over the earliest phases of an analysis, namely over planning and design of the data collection, as well as the act of collecting the data. The situation is not improved by the fact that far too few statisticians are able and willing to take a holistic view of data analysis.”
Alphacruncher’s cloud platform provides an easily accessible, fully integrated digital research environment that supports data-analytic workflows holistically, combining appropriate data services, high-performance computing facilities and unified digital curation.
Integrated Research Environment
What is an integrated research environment?2 Researchers in each digital science need a wide range of tools and services to get their work done – taken together, a service environment. Generally speaking, such an environment includes a wide set of services, such as:
- Data Discovery: ensures quick and accurate identification of relevant data.
- Data Registration: systematically identifies data with a unique Digital Object Identifier (DOI) in an appropriate data model (a registration sketch follows this list).
- Data Citation: allows data to be referenced easily online, handling, e.g., confidentiality, verification, authentication and access issues.
- Data Integration: combines different data sources into a unified view.
- Data Sharing: allows efficient sharing of data and research workflows with other researchers.
- Scientific Computing: provides a suitable computational framework for data-analytic research workflows.
- Scientific Workflow Management: characterizes and manages each step of the research process, allowing coordination with the workflows of other researchers.
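To make the Data Registration service concrete, here is a minimal sketch of how a dataset could be assigned a DOI through the DataCite REST API. The repository credentials, prefix and metadata below are placeholders, not Alphacruncher’s actual integration:

```python
import requests

# Minimal sketch: minting a DOI for a research dataset via the DataCite
# REST API (https://api.datacite.org). All credentials and metadata here
# are hypothetical placeholders.
DATACITE_URL = "https://api.datacite.org/dois"
REPOSITORY_ID = "EXAMPLE.REPO"  # hypothetical repository account
PASSWORD = "change-me"          # hypothetical password

payload = {
    "data": {
        "type": "dois",
        "attributes": {
            "prefix": "10.5072",  # DataCite's test prefix
            "titles": [{"title": "Tailored dataset for project X"}],
            "creators": [{"name": "Doe, Jane"}],
            "publisher": "Alphacruncher",
            "publicationYear": 2016,
            "types": {"resourceTypeGeneral": "Dataset"},
        },
    }
}

response = requests.post(
    DATACITE_URL,
    json=payload,
    auth=(REPOSITORY_ID, PASSWORD),
    headers={"Content-Type": "application/vnd.api+json"},
)
response.raise_for_status()
print("Registered DOI:", response.json()["data"]["id"])
```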
Alphacruncher’s research environment integrates the above services and tools according to established scientific standards into a cloud-based infrastructure for data-analytic research. It is built on services organized along three main pillars:
- Data
- Computation
- Curation
Data Services
Data preparation, visualization and consumption are among the most time-consuming parts of empirical research. Database research applications are slowly finding their way into this domain, driven by the need to handle large amounts of scientific data.
Alphacruncher’s cloud platform supports the data-analytic research cycle with a wide set of scientific data warehouse services optimized for empirical research, helping researchers to quickly capture, analyze and curate their data in a hassle-free way. Consistent with best practices in science, tailored data sets are generated for each research project and are directly accessible in a standardized way from the preferred application. Integrated data exploration is further enhanced by intuitive data visualization and discovery tools.
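As an illustration of such standardized access, the following sketch reads a project-specific data mart over a plain SQL interface. The connection string, schema and table names are hypothetical, standing in for whatever connection details a project is provisioned with:

```python
from sqlalchemy import create_engine, text

# Minimal sketch: reading a tailored project dataset through a standard
# SQL interface. DSN, schema and table names are hypothetical.
engine = create_engine("postgresql://researcher@datahub.example.org/projectdb")

with engine.connect() as conn:
    # Each project sees only its own tailored, versioned data marts.
    rows = conn.execute(
        text("SELECT ticker, trade_date, close_price "
             "FROM project_marts.daily_prices "
             "WHERE trade_date >= :start"),
        {"start": "2015-01-01"},
    ).fetchall()

print(f"Fetched {len(rows)} observations")
```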
Alphacruncher’s integrated data warehouse relies on Data Vault 2.0 to ensure the extensibility and scale needed for modern data-driven research.3 This architecture suits structured, semi-structured and unstructured data, regardless of its shape, size or format, and provides transparent data governance that ensures reproducibility and replicability of scientific results. Reproducibility concerns the recalculation of scientific results by independent scientists using the original datasets and methods. Replication independently implements scientific experiments to validate specific findings and is the cornerstone of scientific truth discovery.4
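The core of the Data Vault 2.0 modelling style is easy to show in miniature: hubs hold immutable business keys, links record relationships between hubs, and satellites store descriptive attributes with full load history. The sketch below creates such a sample schema in an in-memory database; the table and column names are illustrative, not Alphacruncher’s actual schema:

```python
import sqlite3

# Minimal Data Vault 2.0 sample schema: hubs (business keys), links
# (relationships) and satellites (historized descriptive attributes).
# Every table carries a load date and a record source for auditability.
ddl = """
CREATE TABLE hub_security (
    security_hk   TEXT PRIMARY KEY,   -- hash of the business key
    security_id   TEXT NOT NULL,      -- business key (e.g., an ISIN)
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE hub_exchange (
    exchange_hk   TEXT PRIMARY KEY,
    exchange_code TEXT NOT NULL,
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE link_listing (            -- security listed on an exchange
    listing_hk    TEXT PRIMARY KEY,
    security_hk   TEXT REFERENCES hub_security(security_hk),
    exchange_hk   TEXT REFERENCES hub_exchange(exchange_hk),
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE sat_security_details (    -- full history of attributes
    security_hk   TEXT REFERENCES hub_security(security_hk),
    load_date     TEXT NOT NULL,
    name          TEXT,
    sector        TEXT,
    PRIMARY KEY (security_hk, load_date)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
print("Data Vault sample schema created")
```

Because satellites are keyed on both the business key hash and the load date, new facts are only ever appended, never overwritten – this is what makes the “single version of facts” cited in footnote 3 possible.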
Computation Services
The data deluge experienced across scientific domains is fundamentally changing scientific research. The prominent paradigms for data-intensive applications, high-performance computing (HPC) and Apache Hadoop, were designed to support high-end parallel computing and cheap data storage and retrieval, respectively. Alphacruncher fully embraces the paradigm of cloud computing, blending traditional HPC and big data computing into a scalable computing infrastructure that is fully integrated with Alphacruncher’s data services, is user-friendly enough to offer a smooth learning curve even for beginners, and yields substantial productivity gains in data-analytic research projects.5
The National Institute of Standards and Technology (NIST) defines cloud computing as:
“A model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.”6
Cloud computing empowers users to provision virtually unlimited computational resources on demand over the Internet. This makes it a compelling technology for tackling the issues raised by the growing size and complexity of scientific applications: high variance in usage, large data volumes, high and unpredictable computational loads, flash crowds, and time-varying computation and storage requirements.
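The provision-and-release model in the NIST definition can be illustrated with any public cloud SDK. The sketch below uses AWS’s boto3 purely as an example – Alphacruncher’s platform is not AWS, and the image ID and instance type are placeholders:

```python
import boto3

# Illustration only: the on-demand provisioning model NIST describes,
# shown with AWS's boto3 SDK. Image ID and instance type are placeholders.
ec2 = boto3.client("ec2", region_name="eu-central-1")

# Rapidly provision a compute node for a burst of data-analytic work ...
reservation = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder machine image
    InstanceType="c4.8xlarge",        # compute-optimized, HPC-style node
    MinCount=1,
    MaxCount=1,
)
instance_id = reservation["Instances"][0]["InstanceId"]
print("Provisioned:", instance_id)

# ... and release it again as soon as the computation is done,
# so resources (and costs) track the actual workload.
ec2.terminate_instances(InstanceIds=[instance_id])
```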
Curation Services
Weakly documented analytic research is often difficult to reproduce or replicate. Research replicability requires transparent and consistent digital curation of data structures and research workflows. Research workflows are automations of abstract research processes in which data is processed by different logical data-processing activities according to a given set of rules.7 By digitally tracking all relevant aspects of research workflows with open-source solutions and turning them into transparent data-analytic processes, Alphacruncher’s curation services preserve the integrity of each research output and data source across all phases of the research cycle. This enables efficient research dissemination and replication, whether for collaboration or peer review, and systematic access to verifiable research and reliable data.
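The kind of tracking this implies can be sketched in a few lines: each logical processing activity records its inputs, outputs and content hashes, so the full chain from raw data to result can later be verified and replayed. The design below is illustrative, not Alphacruncher’s actual implementation:

```python
import hashlib
import json
import time

def fingerprint(obj) -> str:
    """Stable content hash of any JSON-serializable artifact."""
    blob = json.dumps(obj, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

class WorkflowLog:
    """Records provenance for every step of a research workflow."""

    def __init__(self):
        self.steps = []

    def run(self, name, func, inputs):
        """Execute one logical processing activity and log its provenance."""
        output = func(inputs)
        self.steps.append({
            "step": name,
            "timestamp": time.time(),
            "input_hash": fingerprint(inputs),
            "output_hash": fingerprint(output),
        })
        return output

# A toy two-step workflow: filter raw observations, then aggregate them.
log = WorkflowLog()
raw = [3.1, 2.7, 4.8, 3.9]
cleaned = log.run("clean", lambda xs: [x for x in xs if x > 3.0], raw)
result = log.run("mean", lambda xs: sum(xs) / len(xs), cleaned)
print(json.dumps(log.steps, indent=2))
```

Replaying the workflow on the original data and comparing the recorded hashes is exactly the reproducibility check described above.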
1. David Donoho, “50 Years of Data Science,” 2015, http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf.
2. Costantino Thanos, “Global Research Data Infrastructures: Towards a 10-Year Vision for Global Research Data Infrastructures. Final Roadmap,” March 2012, http://www.grdi2020.eu/Repository/FileScaricati/6bdc07fb-b21d-4b90-81d4-d909fdb96b87.pdf.
3. While the goal of the “single version of truth” is to provide an integrated, cleaned version of the organizational information, that is, the aggregated and condensed data in a given context, the “single version of facts” provides all the data, all the time. Dan Linstedt and Michael Olschimke, Building a Scalable Data Warehouse with Data Vault 2.0: Implementation Guide for Microsoft SQL Server 2014 (Morgan Kaufmann, 2015).
4. Roger D. Peng, Francesca Dominici, and Scott L. Zeger, “Reproducible Epidemiologic Research,” American Journal of Epidemiology 163, no. 9 (May 2006): 783–89, doi:10.1093/aje/kwj093.
5. Cloud Computing for Data-Intensive Applications (Springer New York, 2014), http://link.springer.com/10.1007/978-1-4939-1905-5.
6. Peter Mell and Tim Grance, “The NIST Definition of Cloud Computing,” 2011, http://faculty.winthrop.edu/domanm/csci411/Handouts/NIST.pdf.
7. Ewa Deelman et al., “Workflows and E-Science: An Overview of Workflow System Features and Capabilities,” Future Generation Computer Systems 25, no. 5 (2009): 528–40, doi:10.1016/j.future.2008.06.012.