A flexible and extensible approach to automated CAD/CAM format classification

There are hundreds of distinct 3D, CAD, and engineering file formats. As engineering design and analysis have become increasingly digital, the proliferation of file formats has created many problems for data preservation, data exchange, and interoperability. In some situations, physical file objects exist on legacy media and must be identified and interpreted for reuse. In other cases, file objects may have varying representational expressiveness. We introduce the problem of automated file recognition and classification in emerging digital engineering environments, where all design, manufacturing, and production activities are "born digital." The result is that massive quantities and varieties of data objects are created during the product lifecycle. This paper presents an approach to automated identification of engineering file formats. The approach operates independently of any modeling tools and can identify families of related file objects as well as variations in versions. The problem is challenging because no a priori knowledge about the nature of the physical file object can be assumed. Applications of these methods include forensic analysis, data translation, digital curation, and long-term data management.

Highlights:
- Provides support for emerging applications in long-term data management.
- Compression-based classification enables specification-free format identification.
- Classification accuracy is best when the normalized compression distance (NCD) and the first 16KB of each file are used.
- The classifier is highly effective at distinguishing among very similar formats.
- Computational time is comparable to or better than that of approaches based on known signatures.
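The compression-based idea named in the highlights can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes zlib (DEFLATE) as the compressor, the standard normalized compression distance formula NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), and a simple nearest-neighbor rule over labeled reference files. The names `compressed_size`, `ncd`, `classify`, and `PREFIX_BYTES` are hypothetical and chosen only for this example.

```python
import zlib
from typing import Mapping, Sequence

# Only the first 16 KB of each file is compared, following the highlights.
PREFIX_BYTES = 16 * 1024


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data (assumed compressor, level 9)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(sample: bytes, references: Mapping[str, Sequence[bytes]]) -> str:
    """Return the label of the reference file nearest to the sample under NCD.

    Nearest-neighbor voting is an assumption of this sketch; the source
    does not state which decision rule the classifier uses.
    """
    head = sample[:PREFIX_BYTES]
    best_label = None
    best_distance = float("inf")
    for label, examples in references.items():
        for example in examples:
            distance = ncd(head, example[:PREFIX_BYTES])
            if distance < best_distance:
                best_label, best_distance = label, distance
    if best_label is None:
        raise ValueError("references must contain at least one example")
    return best_label
```

In use, `references` would map a format label to the raw bytes of a few known files of that format, and `classify` would be called on the bytes of an unknown file. No format specification or signature is consulted; the compressor alone measures how much structure the two byte prefixes share.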
