When sweet and cute isn't enough anymore: Solving scalability issues in Python Pandas with Grizzly
The giant panda is very popular, not least because of its (seemingly) cute and friendly face and behavior. However, owing to its poor diet, it is slow and clumsy. In this sense, Python Pandas is much like the bear: the framework has a nice, user-friendly appearance but, under the hood, requires a lot of resources (memory, CPU time) even to process data sets of moderate size. Relational DBMSs are highly optimized for storing and querying large amounts of data, but complex analysis tasks are often difficult or even impossible to express in SQL. Thus, easy-to-learn scripting languages such as Python and R have become very popular and are the de facto standard for data science tasks. Pandas DataFrames re-implement operations known from SQL, such as projection, selection, join, and grouping. Therefore, a sequence of Pandas operations could also be expressed as SQL and executed in a DBMS or in SparkSQL.
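To make the correspondence between DataFrame operations and SQL concrete, the following minimal sketch runs the same selection, join, grouping, and projection once in Pandas and once as a hand-written SQL query against an in-memory SQLite database. It does not use Grizzly and is not meant to reflect its API; the tables `orders` and `customers`, their columns, and the data are hypothetical and chosen only for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical toy data; table and column names are illustrative assumptions.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [10.0, 25.0, 5.0, 40.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "country": ["DE", "DE", "FR"],
})

# Pandas version: selection, join, grouping, then projection/renaming.
pandas_result = (
    orders[orders["amount"] > 8.0]             # selection (WHERE)
    .merge(customers, on="customer_id")        # join
    .groupby("country", as_index=False)["amount"]
    .sum()                                     # grouping + aggregation
    .rename(columns={"amount": "total"})       # projection with alias
    .sort_values("country")
    .reset_index(drop=True)
)

# SQL version: load the same data into SQLite and let the DBMS do the work.
conn = sqlite3.connect(":memory:")
orders.to_sql("orders", conn, index=False)
customers.to_sql("customers", conn, index=False)

sql = """
SELECT c.country AS country, SUM(o.amount) AS total
FROM orders AS o
JOIN customers AS c ON o.customer_id = c.customer_id
WHERE o.amount > 8.0
GROUP BY c.country
ORDER BY c.country
"""
sql_result = pd.read_sql_query(sql, conn)
conn.close()

print(pandas_result)
print(sql_result)
```

In the Pandas version every intermediate result is materialized in the Python process, whereas in the SQL version only the final aggregate has to leave the database. That difference is the motivation the abstract gives for translating Pandas-style operations into SQL.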