Paper information - Going In-Depth: Finding Longform on the Web

tl;dr: Longform articles are extended, in-depth pieces that often serve as feature stories in newspapers and magazines. In this work, we develop a system to automatically identify longform content across the web. Our novel classifier is highly accurate despite huge variation within longform in terms of topic, voice, and editorial taste. It is also scalable and interpretable, requiring a surprisingly small set of features based only on language and parse structures, length, and document interest. We implement our system at scale and use it to identify a corpus of several million longform documents. Using this corpus, we provide the first web-scale study with quantifiable and measurable information on longform, giving new insight into questions posed by the media on the past and current state of this famed literary medium.

Isabelle Stanton | Virginia Smith | Miriam Connor