Paper Information - TensorFlow-Serving: Flexible, High-Performance ML Serving

We describe TensorFlow-Serving, a system to serve machine learning models inside Google which is also available in the cloud and via open-source. It is extremely flexible in terms of the types of ML platforms it supports, and ways to integrate with systems that convey new models and updated versions from training to serving. At the same time, the core code paths around model lookup and inference have been carefully optimized to avoid performance pitfalls observed in naive implementations. Google uses it in many production deployments, including a multi-tenant model hosting service called TFS^2.

[1] Douglas C. Schmidt, et al. Reactor: an object behavioral pattern for concurrent event demultiplexing and event handler dispatching, 1995.

[2] Willy Zwaenepoel, et al. Flash: An efficient and portable Web server, 1999, USENIX Annual Technical Conference, General Track.

[3] David E. Culler, et al. SEDA: an architecture for well-conditioned, scalable internet services, 2001, SOSP.

[4] Christopher Frost, et al. Spanner: Google's Globally-Distributed Database, 2012, OSDI.

[5] Deepak Agarwal, et al. LASER: a scalable response prediction platform for online advertising, 2014, WSDM.

[6] Michael I. Jordan, et al. The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox, 2014, CIDR.

[7] D. Sculley, et al. What’s your ML test score? A rubric for ML production systems, 2016.

[8] Xin Wang, et al. Clipper: A Low-Latency Online Prediction Serving System, 2016, NSDI.

[9] David A. Patterson, et al. In-datacenter performance analysis of a tensor processing unit, 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[10] Graham Neubig, et al. On-the-fly Operation Batching in Dynamic Computation Graphs, 2017, NIPS.