Word Embeddings for the Construction Domain

We introduce word vectors for the construction domain. Our vectors were obtained by running word2vec on an 11M-word corpus that we built from scratch from freely accessible online sources of construction-related text. We first explore the embedding space and show that our vectors capture meaningful construction-specific concepts. We then evaluate our vectors against vectors trained on the 100-billion-word Google News corpus, within the framework of an injury report classification task. Without any parameter tuning, our embeddings give competitive results and outperform the Google News vectors in many cases. Using a keyword-based compression of the reports also leads to a significant speed-up with only a limited loss in performance. We publicly release our corpus and the data set we created for the classification task, in the hope that future studies will use them for benchmarking and for building on our work.
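
The overall pipeline described above can be illustrated with a short sketch: training domain-specific word2vec vectors on a plain-text construction corpus and then representing each injury report by the average of its word vectors for a simple classifier. This is a minimal sketch, assuming gensim 4.x and scikit-learn; the file names, vector dimensionality, and the k-nearest-neighbor classifier are illustrative assumptions, not the exact pipeline used in the paper.

```python
# Minimal sketch: domain-specific word2vec training + averaged-embedding
# classification of injury reports. File names, hyperparameters, and the
# kNN classifier are hypothetical choices for illustration.
import numpy as np
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from sklearn.neighbors import KNeighborsClassifier

# 1) Train vectors on a construction-domain corpus (one document per line).
with open("construction_corpus.txt", encoding="utf-8") as f:
    sentences = [simple_preprocess(line) for line in f]

model = Word2Vec(
    sentences,
    vector_size=300,   # hypothetical dimensionality
    window=5,
    min_count=5,
    sg=1,              # skip-gram variant of word2vec
    workers=4,
)
model.wv.save("construction_vectors.kv")

# 2) Represent a report as the average of the vectors of its in-vocabulary words.
def embed(text, wv):
    tokens = [t for t in simple_preprocess(text) if t in wv]
    if not tokens:
        return np.zeros(wv.vector_size)
    return np.mean([wv[t] for t in tokens], axis=0)

# 3) Fit a simple classifier on the labeled injury reports (placeholder data loading).
# reports, labels = load_injury_reports()          # hypothetical helper
# X = np.vstack([embed(r, model.wv) for r in reports])
# clf = KNeighborsClassifier(n_neighbors=5).fit(X, labels)
```

Pretrained Google News vectors can be swapped in for the comparison by loading them with gensim's KeyedVectors loader instead of training a new model; the keyword-based compression mentioned above would simply restrict the averaging step to a subset of extracted keywords per report.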
