BillSum: A Corpus for Automatic Summarization of US Legislation

Automatic summarization methods have been studied across a variety of domains, including news and scientific articles. Yet legislation has not previously been considered for this task, even though the US Congress and state governments release tens of thousands of bills every year. In this paper, we introduce BillSum, the first dataset for summarization of US Congressional and California state bills. We explain the properties of the dataset that make it more challenging to process than text from other domains. We then benchmark extractive methods that use neural sentence representations and traditional contextual features. Finally, we demonstrate that models built on Congressional bills can be used to summarize California bills, showing that methods developed on this dataset can transfer to states without human-written summaries.
