A method for calculation of optimum data size and bin size of histogram features in fault diagnosis of mono-block centrifugal pump

Research highlights: This paper illustrates a method to choose the number of bins and the minimum number of samples required to train the classifier with statistical stability. Power analysis was used to find the minimum number of samples required. The J48 algorithm was used to validate the results of the power analysis and to find the optimum number of bins.

Mono-block centrifugal pumps play a key role in various applications. Any deviation in the function of a centrifugal pump can lead to monetary loss, so fault diagnosis and condition monitoring of pumps are important issues that cannot be ignored. Over the past 25 years, much research has focused on vibration-based techniques. The machine learning approach is one of the most widely used techniques for fault diagnosis from vibration signals. It involves a set of connected activities: data acquisition, feature extraction, feature selection, and feature classification. Training and testing the classifier are the two important activities in feature classification. When histogram features are used to represent the vibration signals, no guideline has so far been proposed for choosing the number of bins and the number of samples required to train the classifier. This paper illustrates a systematic method to choose the number of bins and the minimum number of samples required to train the classifier with statistical stability, so as to get the best classification accuracy. In this study, power analysis was employed to find the minimum number of samples required, and a decision tree algorithm, J48, was used to validate the results of the power analysis and to find the optimum number of bins.
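The two quantities discussed above, the number of histogram bins and the minimum number of training samples, can be illustrated with a minimal sketch. It is not the paper's procedure: the function names `histogram_features` and `min_samples_per_class` are hypothetical, the bin range `lo`/`hi` is assumed to be chosen by the user, and the sample-size function substitutes the textbook normal-approximation formula for comparing two group means, which may differ from the power analysis the authors applied to their feature set.

```python
import math
from statistics import NormalDist


def histogram_features(signal, n_bins, lo, hi):
    """Count the samples of `signal` in each of `n_bins` equal-width bins on [lo, hi].

    The bin counts form the feature vector for one vibration record.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in signal:
        if x < lo or x > hi:
            continue  # samples outside the chosen range are ignored
        # The top edge (x == hi) is folded into the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def min_samples_per_class(effect_size, alpha=0.05, power=0.80):
    """Samples per group needed to detect a standardized mean difference.

    Assumption: two-sided, two-sample z-test approximation,
    n = 2 * ((z_{1-alpha/2} + z_{power}) / effect_size) ** 2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


if __name__ == "__main__":
    record = [0.0, 0.1, 0.5, 0.9, 1.0]  # toy data, not measured vibration
    print(histogram_features(record, n_bins=2, lo=0.0, hi=1.0))
    print(min_samples_per_class(effect_size=0.5))
```

In a study like the one described, the bin count would be varied, a classifier retrained for each setting, and the setting with the best accuracy kept, while the sample-size estimate gives a lower bound on how many records per condition to collect.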
