Predicting Manhole Events in Manhattan: A Case Study in Extended Knowledge Discovery

We present an extended knowledge discovery (EKD) process developed as part of the Columbia/Con Edison project on manhole event prediction. This process can assist with real-world prioritization problems that involve raw data in the form of noisy documents requiring significant amounts of pre-processing. The documents are linked to a set of instances to be ranked according to prediction criteria. In the case of manhole event prediction, which is a new application for machine learning, the goal is to rank the electrical grid structures in Manhattan (manholes and service boxes) according to their vulnerability to serious manhole events such as fires, explosions and smoking manholes. Our ranking results are currently being used to help prioritize repair work on the Manhattan electrical grid. We were able to apply statistical machine learning to this problem due to three elements that define EKD: very early problem definition grounded in evidence, use of the problem definition to drive data processing and assembly of a database from raw text, and conferencing with domain experts through tools that preserve a link between modeling results, domain entities, and source texts.
