Every day we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. IDC sized the digital universe (information that is either created or captured in digital form and then replicated) at 161 exabytes in 2006, growing to 988 exabytes in 2010, a compound annual growth rate (CAGR) of 57%. A variety of system architectures have been implemented for data-intensive computing and large-scale data analysis applications, including parallel and distributed relational database management systems, which have been available to run on shared-nothing clusters of processing nodes for more than two decades. However, most data growth is in unstructured data, and new processing paradigms with more flexible data models were needed. Several solutions have emerged, including the MapReduce architecture pioneered by Google and now available in the open-source implementation Hadoop, used by Yahoo, Facebook, and others. Roughly 20% of the world's servers go into the huge data centers run by the “Big 5”: Google, Microsoft, Yahoo, Amazon, and eBay [1].
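To make the paradigm concrete, the sketch below mimics the canonical word-count example from the MapReduce paper [1] in plain Python. It illustrates only the user-visible programming model (a map function emitting key/value pairs, a shuffle that groups intermediate values by key, and a reduce function that aggregates them); the function and variable names are illustrative, not part of any Hadoop API, and a real Hadoop job would instead implement Mapper and Reducer classes and let the framework distribute splits, shuffling, and reduces across a cluster.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(mapped_pairs):
    """Shuffle: group intermediate values by key, as the framework
    would do between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts emitted for one word."""
    return (key, sum(values))

if __name__ == "__main__":
    # Two toy "input splits"; a real cluster would process many such
    # splits in parallel on separate worker nodes.
    splits = ["big data needs new processing paradigms",
              "mapreduce is one such processing paradigm"]
    mapped = [pair for split in splits for pair in map_phase(split)]
    grouped = shuffle(mapped)
    counts = dict(reduce_phase(k, v) for k, v in grouped.items())
    print(counts)
```

Because map and reduce operate on arbitrary key/value pairs rather than a fixed relational schema, the model offers the more flexible data handling for unstructured data that the abstract points to.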
[1] Sanjay Ghemawat et al., "MapReduce: Simplified Data Processing on Large Clusters," OSDI, 2004.
[2] Gregor von Laszewski et al., "Towards building a cloud for scientific applications," Adv. Eng. Softw., 2011.
[3] B. Achiriloaie et al., 1961.
[4] Francine Berman et al., "Got data?: a guide to data preservation in the information age," CACM, 2008.
[5] Reagan Moore et al., "Data-intensive computing," 1998.
[6] Abraham Silberschatz et al., "Distributed file systems: concepts and examples," CSUR, 1990.