Data Science: A New Paradigm in the Age of Big-Data Science and Analytics

As an emergent field of inquiry, Data Science serves both the information technology world and the applied sciences. The term Data Science is often treated as synonymous with Big-Data; however, Data Science is the application of solutions found through mathematical and computational research, while Big-Data Science describes problems concerning the analysis of data with respect to volume, variety, and velocity (the 3Vs). Although little scientific theory has yet been developed for Data Science, there is great opportunity for growth. Data Science is proving to be of paramount importance to the IT industry because of the growing need to understand the enormous amount of data being produced and awaiting analysis. In short, data is everywhere and comes in various formats. Scientists currently use statistical and AI techniques such as machine learning to understand massive datasets and, naturally, attempt to find relationships among them. In the past 10 years, the development of software systems within the cloud computing paradigm, using tools like Hadoop and Apache Spark, has aided in making tremendous advances in Data Science as a discipline [Z. Sun, L. Sun and K. Strang, Big data analytics services for enhancing business intelligence, Journal of Computer Information Systems (2016), doi: 10.1080/08874417.2016.1220239]. These advances have enabled both scientists and IT professionals to use cloud computing infrastructure to process petabytes of data on a daily basis. This is especially true for large private companies such as Walmart, Nvidia, and Google. This paper seeks to address pragmatic ways of looking at how Data Science, with respect to Big-Data Science, is practiced in the modern world. We also examine how mathematics and computer science help shape Big-Data Science’s terrain.
We highlight how mathematics and computer science have significantly impacted the development of Data Science approaches and tools, and how those approaches pose new questions that can drive new research areas within these core disciplines involving data analysis, machine learning, and visualization.
