There are millions of known chemical compounds. Huge numbers of these substances are used in industry, in agriculture, in the home, as medicines, and in countless other applications. Finding novel compounds with specific properties, such as a new pharmaceutical with fewer side effects than an existing one, is a major focus of many research teams around the world. Software is often used to scan databases of known chemicals, but it can also predict the properties of previously unknown substances, which might then be synthesised in a laboratory if those properties fit the brief.
Now, writing in the International Journal of Information and Communication Technology, Faisal Saeed of the College of Computer Science and Engineering at Taibah University in Medina, Saudi Arabia, explains that predicting the characteristics of a new molecular structure in silico, that is, by computer, still presents major challenges to drug discovery teams. In his paper, Saeed suggests that machine learning might ease this bottleneck by finding new ways to identify novel substances with particular physiological properties that might make them useful as new pharmaceuticals for a wide range of diseases and conditions.
Saeed has demonstrated that combining classifiers might work well. He has tested different machine learning methods on diverse molecular datasets. The methods include naïve Bayes, sequential minimal optimisation, Bayesian network, decision tree, support vector machine, K-nearest neighbours, random tree, and reduced error pruning tree (REPTree). The tests used different combinations of base classifiers to assess how well they would work against different types of dataset.
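To give a feel for what one of these base classifiers does, here is a minimal Python sketch of K-nearest neighbours applied to molecules. It is an illustration only, not the setup used in the paper. It assumes each molecule is represented as a set of "on" fingerprint bits and that similarity is measured with the Tanimoto coefficient, a common choice in cheminformatics. The toy molecules, labels, and function names are all hypothetical.

```python
from collections import Counter


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def knn_predict(train, query, k=3):
    """Majority vote among the k training molecules most similar to the query.

    train is a list of (fingerprint_bits, label) pairs.
    """
    ranked = sorted(train, key=lambda item: tanimoto(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: bit numbers stand in for structural features.
TRAIN = [
    (frozenset({1, 2, 3, 4}), "active"),
    (frozenset({1, 2, 3, 5}), "active"),
    (frozenset({2, 3, 4, 6}), "active"),
    (frozenset({10, 11, 12}), "inactive"),
    (frozenset({10, 11, 13}), "inactive"),
    (frozenset({11, 12, 14}), "inactive"),
]

print(knn_predict(TRAIN, frozenset({1, 2, 3, 6})))   # prints "active"
print(knn_predict(TRAIN, frozenset({10, 12, 14})))   # prints "inactive"
```

Note that every prediction compares the query against every molecule in the training set, so the work grows with the size of the database.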
The K-nearest neighbour (KNN) approach, Saeed found, worked far better than any of the other approaches tested. Moreover, the ensemble learning method AdaBoost, with KNN as its base classifier, was the most effective of the KNN approaches. The downside is that this type of base classifier approach requires a lot of computing power to process a diverse dataset and to predict the biological activity of the molecules in that dataset. As a next step, it might be possible to improve efficiency and reduce computing costs by adding a pre-processing step before the intensive analysis of the dataset is carried out.
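The idea of boosting a KNN classifier can be sketched in a few lines of Python. The article does not say how AdaBoost was wired to KNN, so the version below is one possible arrangement and an assumption throughout: labels are +1 (active) and -1 (inactive), molecules are sets of fingerprint bits compared by Tanimoto similarity, and each neighbour's vote is scaled by its current boosting weight. All names and data are hypothetical.

```python
import math


def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def weighted_knn(train, weights, query, k=3):
    """Vote among the k nearest molecules, each vote scaled by its weight."""
    order = sorted(range(len(train)),
                   key=lambda i: tanimoto(train[i][0], query), reverse=True)
    score = sum(weights[i] * train[i][1] for i in order[:k])
    return 1 if score >= 0 else -1


def adaboost_knn(train, rounds=5, k=3, eps=1e-10):
    """Return a list of (alpha, weights) pairs, one per boosting round."""
    n = len(train)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        preds = [weighted_knn(train, weights, fp, k) for fp, _ in train]
        err = sum(w for w, p, (_, y) in zip(weights, preds, train) if p != y)
        if err >= 0.5:          # no better than chance: stop boosting
            break
        alpha = 0.5 * math.log((1 - err + eps) / (err + eps))
        ensemble.append((alpha, list(weights)))
        if err == 0:            # training set fitted: nothing left to reweight
            break
        # Raise the weight of misclassified molecules, lower the rest.
        weights = [w * math.exp(-alpha * y * p)
                   for w, p, (_, y) in zip(weights, preds, train)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def boosted_predict(train, ensemble, query, k=3):
    """Combine the per-round votes, each scaled by that round's alpha."""
    score = sum(a * weighted_knn(train, w, query, k) for a, w in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical toy data; the last molecule is an active that resembles inactives.
TRAIN = [
    (frozenset({1, 2, 3, 4}), 1), (frozenset({1, 2, 3, 5}), 1),
    (frozenset({1, 2, 4, 5}), 1),
    (frozenset({10, 11, 12}), -1), (frozenset({10, 11, 13}), -1),
    (frozenset({10, 12, 13}), -1),
    (frozenset({10, 11, 14}), 1),
]

ENSEMBLE = adaboost_knn(TRAIN)
QUERY = frozenset({10, 11, 14, 15})
print(weighted_knn(TRAIN, [1 / 7] * 7, QUERY))   # plain KNN: prints -1
print(boosted_predict(TRAIN, ENSEMBLE, QUERY))   # boosted KNN: prints 1
```

In this sketch every boosting round re-ranks each training molecule against all the others, so the training cost scales roughly with the number of rounds times the square of the dataset size. That is consistent with the heavy computing demand mentioned above, although the paper's actual implementation may differ.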
Saeed, F. (2022) ‘Machine learning methods for predicting the biological activities of molecules in high diverse databases’, Int. J. Information and Communication Technology, Vol. 21, No. 2, pp.170–180.