Review | Published: 16 June 2021
A comprehensive review on the application of artificial intelligence in drug discovery
Ashrulochan Sahoo & Ghulam Mehdi Dar*
artificial intelligence; computer-aided drug design; deep learning; drug discovery; drug repurposing; healthcare; machine learning
2. Drug discovery and AI
Over the past three decades, the drug discovery process has evolved manyfold. Immense advancements in chemical engineering and biological science, together with the active use of computers, have laid the foundation of modern drug discovery. From serendipitous drug discovery to Computer-Aided Drug Discovery (CADD), medicinal chemistry has come a long way through peaks and valleys. Although bringing a new drug to market has been fraught with hurdles in past years, it is expected that in this decade the use of AI will deliver its best results yet [12, 13]. Figure 1 presents a brief timeline of advancements in the field of AI.
Figure 1: A brief timeline of AI development.
2.1 Approaches in drug discovery
Figure 2: Phases of drug discovery using data from different sources.
The process of drug discovery can be broadly classified into four major stages: (i) target recognition, (ii) target development, (iii) pre-clinical studies, and (iv) clinical studies. Every stage contains many layers of experiments to find the best fit, and each stage depends on the others. We will go through these stages with respect to the ML approach.
2.2 Need for drug discovery
Health is the primary asset of a prosperous civilization. As fast as humans add revolutionary chapters to civilization, some safeguards are compromised, knowingly or unknowingly. This can lead to severe disasters, e.g., the recent COVID-19 pandemic caused by SARS-CoV-2. Beyond COVID-19, existing chronic diseases such as cancer, diabetes, gastroesophageal reflux disease (GERD), various cardiovascular diseases, epilepsy, and acquired immune deficiency syndrome caused by human immunodeficiency virus infection (HIV-AIDS) are taking a serious toll on every group of the population worldwide. Rare diseases such as thalassemia, hemophilia, and primary immune deficiency diseases in children are also serious issues. Tropical diseases such as leishmaniasis, leprosy, lymphatic filariasis, dengue, and guinea worm disease need to be addressed as well [20, 22]. Though we have successfully controlled some deadly diseases of past decades, new threats and diseases continue to emerge, and we should be prepared. Drug discovery, whether through novel drug synthesis or drug repurposing, remains our means of defense against such potential diseases (figure 3).
Figure 3: The need for new drugs; they can be new drugs by rational drug development or repurposed drugs.
2.3 Problems with conventional approaches of drug discovery
Huge advancements in technology and in managerial approaches within the research and development (R&D) sector have considerably uplifted modern drug discovery operations and diversified their perspectives. Nevertheless, the drug-to-market ratio has dropped drastically, falling around 80-fold (in inflation-adjusted terms) compared with the 1970s drug approval rate, owing to the complexity, length, and expense of the process. Drug discovery involves many experiments and studies carried out by a variety of professionals. Every year, pharmaceutical companies invest money to develop new drugs for diverse diseases; with luck, one project out of a hundred reaches the market. It takes approximately 12 years, an outlay of about US$3 billion, and a great deal of manpower to bring out a new drug candidate. The fate of the unsuccessful projects is termed the “R&D inefficiency of the pharmaceutical sector”. This R&D inefficiency also depends on many factors, e.g., geographical location, the criticality of diseases, market regulation policies, and the availability of active pharmaceutical ingredients (APIs). According to Scannell et al., the R&D inefficiency arises from four factors: (i) the ‘better than the Beatles’ problem refers to the ever-higher bar that already approved drugs set for new drugs in terms of approval, adoption, and reimbursement; (ii) the ‘cautious regulator’ problem refers to the constant tightening of drug safety regulations by the respective authorities; (iii) the ‘throw money at it’ tendency is the tendency of companies to keep adding people and other resources to R&D in order to stay ahead of the competition; (iv) the ‘basic-research-brute-force’ bias is the tendency to overestimate what advances in basic research and brute-force screening can deliver. Because of these inefficiencies, consumers face hefty prices for medicines: they pay both for the successful drugs and for the failed projects.
As a consequence, some tropical diseases were neglected in their early stages and gradually became serious burdens for certain populations [24, 26].
The shortage of advanced biomolecular tools such as chemical probes and antibodies is also an important setback in molecular drug discovery. In-depth biological understanding is limited to a small number of proteins. One in three proteins remains understudied; their function in human biology and their role in disease remain an enigma. To date, only 11% of the human proteome has been characterized; this is recognized as a causality dilemma, which keeps a major portion of proteomic and genomic studies in the shadows. Ultimately, this causality dilemma slows down the progress of modern drug discovery.
Like a tree, science is growing, and with each passing decade new disciplines emerge from it like branches. Knowledge and experience within each field are also increasing swiftly; thousands of new articles are added to each branch of science every year. The MEDLINE database (a repository of medical knowledge) grows by more than eight hundred thousand records every year. The ZINC library (a free database of compounds for virtual screening, VS) grew more than a thousandfold between 2005 and 2019, from 700,000 entries to 1.3 billion entries. Pharmacological data, Protein Data Bank (PDB) entries, in vitro HTS data, molecular drug design data, experimental chemistry data, and toxicology data are increasing accordingly. Each stage of drug R&D involves mining and analyzing data to create hypotheses and then testing them experimentally. Taken as a whole, these data are becoming very complicated. The human brain, whether working individually or in a team, is poorly equipped to create and process such multivariable, complex hypotheses over millions of data points rapidly and flawlessly. It is also exhausting and time-consuming for humans to monitor every stage of drug discovery (DD). In addition, there is an experience gap between experts of different subfields, which affects DD in many ways. Figure 4 illustrates the contribution of different scientific disciplines in supplying the necessary data for molecular drug design.
Figure 4: Data mining for drug discovery from different fields of science.
From target identification to clinical trials and approval, these are the current setbacks that slow the progress of new medicines in the pharmaceutical sector. However, combining expert-driven study with data-driven study in R&D methods has delivered encouraging breakthroughs, and the adoption of AI in the drug discovery process has given a ray of hope. Computational tools have proved to be an efficient, cost-effective alternative to conventional drug discovery approaches (table 1). Gisbert et al. demonstrated the broad applicability of ML approaches in chemocentric and molecular informatics studies in three steps: (i) selection of problem-specific descriptor sets to capture the essential properties of the molecules involved; (ii) molecular-property-driven scoring or metric schemes to compare the encoded molecules; (iii) implementation of suitable ML algorithms to identify the features that separate active compounds qualitatively and quantitatively. The use of AI enhances the speed and scalability of modern drug discovery approaches.
Table 1: Comparison between conventional approach of DD and AI-driven DD.
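The three-step workflow attributed to Gisbert et al. above (descriptors, similarity metric, ML algorithm) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the character n-gram "fingerprint" is a toy stand-in for real molecular descriptors, the training labels are invented rather than measured activities, and the k-nearest-neighbour vote is only one of many possible learning algorithms.

```python
from collections import Counter

def fingerprint(smiles: str, n: int = 2) -> frozenset:
    """Step (i): toy descriptor set, the character n-grams of a SMILES string."""
    return frozenset(smiles[i:i + n] for i in range(len(smiles) - n + 1))

def tanimoto(a: frozenset, b: frozenset) -> float:
    """Step (ii): Tanimoto (Jaccard) similarity between two fingerprints."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def knn_predict(query: str, training: list, k: int = 3) -> str:
    """Step (iii): label a query by majority vote of its k most similar neighbours."""
    fp = fingerprint(query)
    ranked = sorted(training,
                    key=lambda item: tanimoto(fp, fingerprint(item[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical, hand-labelled toy data (labels are illustrative only).
TRAINING = [
    ("CCO", "inactive"),
    ("CCCO", "inactive"),
    ("CCCCO", "inactive"),
    ("c1ccccc1O", "active"),
    ("c1ccccc1N", "active"),
    ("c1ccccc1C", "active"),
]

prediction = knn_predict("c1ccccc1CO", TRAINING)
print(prediction)
```

In practice, step (i) would rely on chemically meaningful descriptors or fingerprints computed by a cheminformatics toolkit; the string n-grams here only keep the sketch self-contained.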
3. AI in rational drug discovery: a paradigm shift
4. Progress of AI in rational drug discovery
4.1 Use of AI in ligand-receptor binding affinity prediction
Different ML- and DL-based models have been proposed to predict the binding score. Two SVM-based models are SVR-Score and ID-Score. Several RF-based scoring functions have been developed to study receptor-ligand binding affinity, e.g., RF-Score (Ballester and Mitchell), RF-IChem, SFCscoreRF, and B2BScore. Among these, RF-Score performed better and has been incorporated into the large-scale protein-ligand docking website DockThor (https://www.dockthor.lncc.br/v2/). Ballester et al. further confirmed the better performance of RF-Score-v3 in comparison to X-Score and other members of a set of 16 classical scoring functions. RF uses decision trees (DTs) as base learners. Because the individual trees are flexible, high-variance models built on different random samples of the data, the correlation between trees is reduced; this, in turn, improves the accuracy of the score predicted by the ensemble as a whole. For rescoring purposes, the ANN-based models NNScore and CScore have been developed.
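The link between tree correlation and ensemble accuracy can be made explicit with a standard result for the average of identically distributed predictors. The symbols are assumptions introduced here, not notation from the cited works: B is the number of trees, T_b(x) the prediction of tree b, σ² the variance of a single tree's prediction, and ρ the pairwise correlation between trees.

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

As B grows, the second term vanishes, so the remaining variance of the ensemble is governed by ρ; lowering the correlation between trees therefore lowers the prediction variance of the forest.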
In structure-based drug design, the tools used for molecular docking operate on diverse sampling algorithms, docking methods, and simulation methods. These tools also use various scoring parameters and functions to predict the most accurate binding score. The methods and functions depend on the three-dimensional structural features of the ligand, which are evaluated by applying rotational and translational vectors. The main limitation of the above-mentioned ANN-, RF-, and SVM-based ML models is their need to represent molecules as fixed-length vectors. The development of convolutional neural networks (CNNs) in DL has reduced this limitation, as CNNs can extract features directly from 2D and 3D molecular structures. Cang and Wei developed a multichannel topological neural network, TopologyNet, using a topological strategy, and Ragoza et al. developed a CNN model. The CNN builds a 3D grid for each protein-ligand complex and stores the atom densities at every grid point, while the topological strategy represents the 3D biomolecular geometry as 1D topological invariants, a reduced-dimensionality formulation that preserves the important biological properties of the molecule.
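The grid representation mentioned above, in which atom densities are stored at every grid point, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation used by Ragoza et al.: the grid size, resolution, Gaussian width, and the example coordinates are all hypothetical, and a single density channel is used where a real model would typically keep separate channels per atom type.

```python
import math

def voxelize(atoms, grid_size=8, resolution=1.0, sigma=1.0):
    """Return a grid_size^3 nested list of summed Gaussian atom densities.

    atoms: iterable of (x, y, z) coordinates in angstroms, assumed to be
    shifted so that the box spans [0, grid_size * resolution) on each axis.
    """
    grid = [[[0.0] * grid_size for _ in range(grid_size)]
            for _ in range(grid_size)]
    for ax, ay, az in atoms:
        for i in range(grid_size):
            for j in range(grid_size):
                for k in range(grid_size):
                    # Coordinates of the centre of voxel (i, j, k)
                    cx, cy, cz = ((n + 0.5) * resolution for n in (i, j, k))
                    d2 = (cx - ax) ** 2 + (cy - ay) ** 2 + (cz - az) ** 2
                    # Each atom contributes a Gaussian density to every voxel
                    grid[i][j][k] += math.exp(-d2 / (2.0 * sigma ** 2))
    return grid

# Two hypothetical atoms placed at voxel centres (1, 1, 1) and (5, 5, 5)
atoms = [(1.5, 1.5, 1.5), (5.5, 5.5, 5.5)]
grid = voxelize(atoms)
print(round(grid[1][1][1], 3), round(grid[3][3][3], 3))
```

The resulting nested list plays the role of the 3D input tensor that a CNN would convolve over; voxels close to an atom carry high density and voxels far from all atoms carry values near zero.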
Although the different ML- and DL-based AIs for receptor-ligand affinity prediction have outperformed the classical methods, this remains a challenging topic in rational drug discovery. Because every score carries some partiality or bias, it is hard to depend on a single-step procedure and a single model (table 2 compares parametric differences between various known classical scoring functions).
Table 2: Comparison between different scoring functions (SFs) [34, 39].
Prediction of the binding score is the crucial step, and its results feed into subsequent studies; hence, it should be carried out with the utmost attention. The goal is to build a model that predicts accurate binding scores from molecular features and stability, even when the receptor is inactive. Among the scoring approaches discussed above, DL algorithms have the potential to work across a wide range of settings, which will be a great advantage in the rational drug discovery process. ML techniques based on Gaussian processes, combined with quantum effects and biophysics, will also be useful in this regard. Success in protein-ligand docking will lift the curtain on the understudied phenomena of protein-protein interaction.
4.2 Use of AI in de novo small molecular drug design
Drug design refers to the molecular arrangement and rearrangement of the obtained hits, leads, or fragments. The design is carried out precisely, taking into account the availability of chemicals, the set of interacting groups desired for the proposed biological function, intellectual property rights (IPR), and standard safety parameters. The early drug design approaches were structure-based. Most of the resulting drugs were prone to synthetic infeasibility, poor drug metabolism, and insufficiently reduced toxicity. The recent de novo drug design approaches are ligand-based. Another approach is called the ‘inverse QSAR’ approach [41-44].
The early studies on deep generative models benefited greatly from applying RNNs to template sets and novel scaffolds [47, 48]. Segler et al. demonstrated that RNNs can teach themselves from training data to represent molecules as SMILES (figure 5); the model learned the grammar of valid SMILES representations and generated chemical molecules with different scaffolds and similar properties. Yuan et al. reported a new library generation method using character-level recurrent neural networks (char-RNN), known as Machine-based Identification of Molecules Inside Characterized Space (MIMICS). In MIMICS, the char-RNN was trained to learn the notable features in the SMILES strings of a given set of chemicals; thus, it can eliminate molecules with unwanted properties. In 2018, Popova et al. used a stack-augmented recurrent neural network (stack-RNN, an extension of the RNN architecture with a persistent memory unit) to generate a library of novel ligands against Janus protein kinase 2 (JAK2), a non-receptor tyrosine kinase.
Figure 5: RNN based models in drug design. The RNN model learns the features of desired chemicals from SMILES strings and filters the active molecules from inactive compounds.
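To make the character-level treatment of SMILES more concrete, the following Python sketch shows only the data preparation that a char-RNN of this kind would rely on: building a character vocabulary and turning each string into (current character, next character) training pairs. It does not implement an RNN and is not the code of Segler et al., Yuan et al., or Popova et al. The start and end markers (`^` and `&`) and the example molecules are assumptions chosen for illustration, and the markers are assumed not to occur in the SMILES strings themselves.

```python
START, END = "^", "&"  # hypothetical sequence markers, assumed absent from the SMILES

def build_vocab(smiles_list):
    """Map every character seen in the data (plus the markers) to an integer index."""
    chars = sorted({ch for s in smiles_list for ch in s} | {START, END})
    return {ch: i for i, ch in enumerate(chars)}

def next_char_pairs(smiles, vocab):
    """Return (input_index, target_index) pairs for next-character prediction."""
    seq = [vocab[ch] for ch in START + smiles + END]
    return list(zip(seq[:-1], seq[1:]))

# Ethanol, benzene, and acetic acid as a tiny illustrative training set
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
vocab = build_vocab(smiles_list)
pairs = next_char_pairs("CCO", vocab)

# Decode the pairs back to characters to show what the model is asked to learn
inverse = {i: ch for ch, i in vocab.items()}
print([(inverse[a], inverse[b]) for a, b in pairs])
```

A model trained on such pairs learns which character is likely to follow a given prefix, which is how it picks up SMILES grammar such as matching ring-closure digits and parentheses. Note that a purely character-level scheme splits two-letter atoms such as Cl into two tokens.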
4.2.3 AAE and VAE based
4.2.5 Limitation of SMILES and rise of molecular graph approach
4.3 Use of AI to predict pharmacological and physicochemical feature of molecules
In the early drug discovery process, evaluating molecules with in silico models, rather than synthesizing all of them and observing their interactions in in vivo or in vitro assays, has saved time and expenditure manyfold. Big Data, ML, DL, and quantum chemistry approaches are now successfully used to predict physicochemical properties, e.g., lipophilicity (log P, log D), aqueous solubility (log S), intrinsic permeability, ionization constant, melting point, and boiling point, as well as pharmacological properties such as absorption, distribution, metabolism, excretion, and toxicity (ADMET) (figure 6) [61-63].
Figure 6: Physicochemical and pharmacological feature prediction by ML approach. The designed molecular library is converted to .sdf or .sml format and then imported to the machine. The machine is trained on different data points from various sources, e.g., DrugBank, Votano, PAMPA, etc. The AI processes the data using encoded ANN and exports the ADMET and physiological properties as graphs and charts for comparison purposes.
To predict the octanol-water partition coefficient (log P), ALOGPS uses associative neural networks, a combination of a feed-forward network and k-nearest neighbors (kNN). Undirected graph recursive neural networks (UG-RNNs) and graph-based CNNs are used to predict aqueous solubility. RS-predictor (using hierarchical descriptors and quantum chemical, atom-based descriptors), SMARTCyp and Xenosite (combining ANN with topological, quantum chemical, and SMARTCyp descriptors), CypRules, MetaSite, Metapred, and WhichCyp are available tools to predict sites of metabolism. Many ML methods are used for toxicity studies, e.g., SVM, relevance vector machine (RVM), regularized-RF, RVM boosting (RVMBoost), SVM boosting (SVMBoost), AdaBoost, and C5.0 trees. The web tools and packages DL-AOT, pkCSM (which uses graph-based structural signatures), admetSAR, LimTox, and Toxtree are available for toxicity studies in de novo drug design.
4.4 Use of AI in de novo chemical synthesis
4.5 Use of AI in pre-clinical and clinical trials
In clinical trials, the failure rate of proposed drugs is very high due to (i) inefficient volunteer selection; and (ii) the inability to monitor observations effectively. To address these shortcomings, ML and DL approaches have been proposed to prepare the study, regulate the required parameters, and constantly monitor trial success rates. Various AI tools are used to predict human-relevant biomarkers of diseases in order to recruit a specific patient population in Phase II/III trials [77, 78]. The machine is designed to record every change in the patient’s medical condition electronically. IBM Watson uses a DL-based clinical trial matching system to maintain and analyze structured and unstructured electronic medical records of patients in order to create and select suitable patient profiles. PrOCTOR predicts toxicity probability. AiCure is a mobile application used to monitor phase II clinical trial data of schizophrenia patients; it showed a 25% improvement in monitoring data compared to traditional ‘modified directly observed therapy’.
4.6 Use of AI in drug repurposing
Repurposing approved drugs and drugs under development (including failed projects) is a new, smart, and logical approach in the rational drug discovery process for meeting the unmet therapeutic needs of unexpected, rare, and neglected diseases. Repurposing works because (i) different diseases share molecular pathways and genetic factors, and (ii) drugs have multiple targets. Repurposing needs a lot of data from various perspectives [80, 81], and computational techniques are best suited to handle it. The main classes of algorithms used in drug repurposing studies are supervised, unsupervised, and semi-supervised learning. Models in use for drug repurposing include the supervised model DTINet; the unsupervised model MANTRA; the semi-supervised model LapRLS and its advanced version NetLapRLS; LPMIHN; BLM with neighbor-based interaction-profile inferring (BLM-NII); the network consistency-based prediction (NetCBP) method; and the network-based deepDR [83-87]. Though these models have promised better performance, their predictions have yet to be realized in practice.
5. Future of AI in drug discovery
Partnerships between the pharmaceutical sector and AI organizations are facilitating research (table 3), and a number of startups are emerging. However, some challenges remain in the current scenario of rational drug discovery. Peter et al. identified five major challenges: (i) data governance; (ii) lack of a single unifying problem; (iii) insufficient skill sets; (iv) the traditional scientific approach; and (v) absence of investment. These challenges are natural yet concerning. It is believed that, with time, more advanced machine learning approaches can address them.
Table 3: Partnership between pharmaceutical industry and AI industry [24, 35, 93-95].
Automation by AI is a burning issue, as it is feared to cause unemployment among a large part of the population. In experts’ view, the AI in use today, artificial narrow intelligence (ANI), is not capable of replacing humans but will augment them. With AI, a person can save a lot of time and devote the saved time to creative work. Especially in the rational drug discovery process, robots or machines can compile data with an understanding of the subject matter. They can filter the compiled data and present them to scientists, who can then think about the bigger picture of the study without worrying about the data or wasting time compiling them. Robots or machines will help drug designers design better, greener chemicals and make the synthesis process easier. At every point, the machine can monitor and verify the process and the data to minimize mistakes.
No potential conflict of interest was reported by the authors.
Department of Pharmaceutical Sciences and Natural Products, Central University of Punjab, Bathinda - 151401, Punjab, India
Ghulam Mehdi Dar
Department of Biochemistry, Govind Ballabh Pant Institute of Post-graduate Medical Education and Research, Jawahar Lal Nehru Marg, Rajghat, New Delhi - 110002, Delhi, India
Ghulam Mehdi Dar
Cite this article
Sahoo A, Dar GM (2021). A comprehensive review on the application of artificial intelligence in drug discovery. T. Appl. Biol. Chem. J; 2(2):34-48. https://doi.org/10.52679/tabcj.2021.0007
Received: 01 May 2021 | Revised: 10 June 2021 | Accepted: 12 June 2021 | Published: 16 June 2021