The PYTHY Summarization System: Microsoft Research at DUC 2007

PYTHY is a trainable extractive summarization engine that learns a log-linear sentence ranking model by maximizing three metrics of sentence goodness. Two of the metrics are based on ROUGE scores against model summaries; the third is based on the Semantic Content Unit (SCU) weights, obtained during the Pyramid evaluations, of sentences selected by past peers. In addition to sentences from the document set, our system considers simplified sentences for inclusion in the generated summaries. The feature weights of the model are optimized on the DUC 2005 data, and the final feature set for the submitted system was selected according to ROUGE-2 scores against the DUC 2006 model summaries. For the DUC update task, the model was augmented with a novelty detection classifier.
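
As a rough illustration of what ranking with a log-linear sentence model involves, the sketch below scores each candidate sentence by a weighted sum of its feature values and sorts the candidates by that score. The feature names (query_overlap, position, is_simplified), the weights, and the sentence identifiers are hypothetical placeholders chosen for the example. They are not the system's actual features or trained weights, and the training procedure is not shown.

```python
def score(weights, features):
    """Linear score w . f(s): the log of the unnormalized log-linear probability."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def rank_sentences(weights, candidates):
    """Return the candidates sorted from highest to lowest model score."""
    return sorted(candidates, key=lambda c: score(weights, c["features"]), reverse=True)


# Hypothetical weights; in a trained system these would be learned from data.
weights = {"query_overlap": 2.0, "position": -0.5, "is_simplified": 0.3}

# Hypothetical candidate pool: original sentences plus one simplified variant.
candidates = [
    {"id": "d1.s1",
     "features": {"query_overlap": 0.4, "position": 0.0, "is_simplified": 0.0}},
    {"id": "d1.s1.simplified",
     "features": {"query_overlap": 0.5, "position": 0.0, "is_simplified": 1.0}},
    {"id": "d2.s7",
     "features": {"query_overlap": 0.1, "position": 1.0, "is_simplified": 0.0}},
]

ranked = rank_sentences(weights, candidates)
ranked_ids = [c["id"] for c in ranked]
```

Here the original and simplified sentences compete in the same candidate pool, so a simplified variant is selected only when the model scores it above the other candidates.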