A Machine Learning Approach to Detecting Start Reading Location of eBooks

Machine Learning and NLP (Natural Language Processing) have aided the development of new and improved user experience features in many applications. We address the problem of automatically identifying the "Start Reading Location" (SRL) of eBooks, i.e. the location of the logical beginning or start of main content. This improves eBook reading experience by taking users automatically to the logical start location without requiring them to flip through several front-matter sections such as "Dedication" and "About the Author". Automatic identification of SRL is complex since many eBooks do not adhere to any well-defined convention with respect to section naming, formatting and layout patterns. We formulate SRL as a classification problem based on detailed rule-based and NLP-based classification schemes. Our models are being used in production for Kindle eBooks and have led to a 400% increase in coverage (number of books which had SRL stamped) compared to what could be achieved earlier through an entirely manual process, while also maintaining a high accuracy of 95%.

[1]  Muhammad Mahbubur Rahman,et al.  Understanding the Logical and Semantic Structure of Large Documents , 2017, SDM 2017.

[2]  Tianqi Chen,et al.  XGBoost: A Scalable Tree Boosting System , 2016, KDD.

[3]  Tomas Mikolov,et al.  Bag of Tricks for Efficient Text Classification , 2016, EACL.

[4]  Yoram Singer,et al.  Boosting and Rocchio applied to text filtering , 1998, SIGIR '98.

[5]  Chih-Wei Chen,et al.  Recognition and Evaluation of Clinical Section Headings in Clinical Documents Using Token-Based Formulation with Conditional Random Fields , 2015, BioMed research international.

[6]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[7]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[8]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[9]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[10]  Noah A. Smith,et al.  Transition-Based Dependency Parsing with Stack Long Short-Term Memory , 2015, ACL.

[11]  Harry Zhang,et al.  The Optimality of Naive Bayes , 2004, FLAIRS.

[12]  Anna Kazantseva,et al.  Hierarchical Topical Segmentation with Affinity Propagation , 2014, COLING.

[13]  Jean Carletta,et al.  Assessing Agreement on Classification Tasks: The Kappa Statistic , 1996, CL.

[14]  Mihai Surdeanu,et al.  The Stanford CoreNLP Natural Language Processing Toolkit , 2014, ACL.

[15]  Jaime Carbonell,et al.  Multi-Document Summarization By Sentence Extraction , 2000 .