Flexible ingest framework: A scalable architecture for dynamic routing through composable pipelines

In this paper we describe a flexible and scalable big data ingestion framework based on Apache Spark. It is flexible in that meta-information about the data is used to build custom processing pipelines at run-time. It is scalable in that it leverages Apache Spark with minimal additional overhead. These capabilities allow a user to set up custom big data processing pipelines that can handle changing data types without recompiling code in an operational environment. This is particularly advantageous in secure environments where recompilation is undesirable or infeasible.
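To make the idea of run-time pipeline construction concrete, the following is a minimal sketch in Scala on Spark, assuming the meta-information amounts to an ordered list of stage names that index into a registry of reusable transforms. The stage names (dropNulls, deduplicate), the registry, and the input/output paths are illustrative placeholders, not part of the framework described in the paper.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// A stage is a composable DataFrame -> DataFrame transform.
trait Stage extends (DataFrame => DataFrame)

// Example stages (hypothetical; real stages would encapsulate domain logic).
object DropNulls extends Stage {
  def apply(df: DataFrame): DataFrame = df.na.drop()
}
object Deduplicate extends Stage {
  def apply(df: DataFrame): DataFrame = df.dropDuplicates()
}

// Registry mapping stage names (as they might appear in ingest metadata) to stages.
val registry: Map[String, Stage] = Map(
  "dropNulls"   -> DropNulls,
  "deduplicate" -> Deduplicate
)

// Compose a pipeline at run-time from metadata-supplied stage names, so a new
// data type only requires new metadata rather than recompiled code.
def buildPipeline(stageNames: Seq[String]): DataFrame => DataFrame =
  stageNames.flatMap(registry.get)
            .foldLeft((df: DataFrame) => df)(_ andThen _)

// Usage: route an incoming batch through the pipeline its metadata describes.
val spark = SparkSession.builder.appName("ingest-sketch").master("local[*]").getOrCreate()
val input    = spark.read.json("data/incoming")        // placeholder source
val pipeline = buildPipeline(Seq("dropNulls", "deduplicate"))
pipeline(input).write.parquet("data/ingested")         // placeholder sink
```

Because each stage is an ordinary Spark transformation, composition adds essentially no overhead beyond Spark's own planning; the design choice is simply to defer the choice and order of transformations to data-driven metadata rather than compiled code.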
