Prediction of essential genes in G20 using machine learning models

Despite the exponential growth in bioscience data, one of the key challenges for machine learning engineers remains the incompleteness of bioscience datasets (biodata). For a specific bioscience problem (e.g. biofilm formation, drug response, organism survival), it is very difficult to find a good, consistent dataset that captures the numerous variables involved in the process. Systems biology data points are measured with different protocols in different settings, which makes their integration difficult and unreliable. This paper focuses on using machine learning (ML) models and a data mining (DM) workflow to predict gene essentiality in G20. Developing next-generation, nano-scale coatings to control biofilm formation on technologically relevant materials is a major challenge today; such coatings can help control microbial corrosion of materials or guide the engineering of better ones. Tackling this problem requires a detailed understanding of bacterial survival mechanisms. Computational methods for predicting essential genes can make it easier and faster to obtain reliable results.

Method: Our main hypothesis is that a machine learning model driven by a minimal, problem-specific set of information can achieve a high prediction score. To test it, we first set up a complete data mining workflow to extract gene features from G20. We then derive 10192 features from the gene sequences and protein sequences, divided into 25 relevant subgroups. For each subgroup, we build a couple of machine learning models.

Result: We identified 69 relevant subgroups of features using our feature selection algorithm. We tested model performance on each of these subgroups, and our predictions achieved an accuracy of up to 98%. These subgroups of features can help researchers select good variables for their respective experiments.
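As an illustration of the feature-derivation step, the following is a minimal sketch of how a few sequence-based features could be computed from one gene and its protein product. It is not the workflow used in this work: the chosen features (sequence lengths, GC content, dinucleotide frequencies, amino acid composition), the function names `kmer_frequencies` and `gene_features`, and the toy sequences are all assumptions made for the example.

```python
from collections import Counter
from itertools import product

DNA_ALPHABET = "ACGT"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_frequencies(seq, alphabet, k, prefix):
    """Relative frequency of every k-mer over `alphabet` found in `seq`.

    Hypothetical helper: windows containing symbols outside the
    alphabet (e.g. 'N' in DNA) are ignored in the normalisation.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[kmer] for kmer in kmers)
    return {
        f"{prefix}_{kmer}": (counts[kmer] / total if total else 0.0)
        for kmer in kmers
    }


def gene_features(dna_seq, protein_seq):
    """Build one flat feature dictionary for a single gene (illustrative only)."""
    dna = dna_seq.upper()
    features = {
        "gene_length": len(dna),
        "protein_length": len(protein_seq),
        "gc_content": (dna.count("G") + dna.count("C")) / len(dna) if dna else 0.0,
    }
    # One assumed feature subgroup per call: dinucleotides, then amino acids.
    features.update(kmer_frequencies(dna, DNA_ALPHABET, 2, "dna"))
    features.update(kmer_frequencies(protein_seq, AMINO_ACIDS, 1, "aa"))
    return features


# Toy input, not real G20 data.
example = gene_features("ATGGCCAAGTAA", "MAK")
```

Each call to `kmer_frequencies` yields one group of related features, which loosely mirrors the idea of organising features into subgroups and training a model per subgroup. The real subgroups and feature definitions would have to come from the full workflow.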