Going In-Depth: Finding Longform on the Web

tl;dr: Longform articles are extended, in-depth pieces that often serve as feature stories in newspapers and magazines. In this work, we develop a system to automatically identify longform content across the web. Our novel classifier is highly accurate despite huge variation within longform in topic, voice, and editorial taste. It is also scalable and interpretable, requiring a surprisingly small set of features based only on language and parse structures, length, and document interest. We implement our system at scale and use it to identify a corpus of several million longform documents. Using this corpus, we provide the first web-scale quantitative study of longform, giving new insight into questions posed by the media about the past and current state of this famed literary medium.
