Productivity and efficiency are watchwords for many industrial sectors in today’s fast-paced, competitive business world, and these themes resonate strongly in the pharmaceutical and biopharmaceutical sectors too. After all, taking a new drug from discovery to market takes around 10 years and costs an estimated US$2.6 billion. Significantly, that figure includes the money spent on the nine out of ten candidate therapies that fail somewhere between phase I clinical trials and regulatory approval.
So, research scientists and pharma business leaders are certainly keen to take advantage of any new approach that shortens the process, or that ensures they concentrate their efforts on the novel molecular entities with the best chance of making it through to the market.
Although the adoption of Artificial Intelligence (AI) and Machine Learning (ML) is still in its infancy, many believe that these tools promise to revolutionize drug development. Opportunities abound to overcome obstacles in the R&D pipeline, but significant challenges remain to be solved. In this short article, I will look at where AI/ML can best be applied in pharma, and highlight the underlying fundamentals of data storage, access, and computation that will help to unlock its potential.
Where can AI/ML play?
AI/ML has been proposed for use throughout the life cycle of a pharmaceutical product. Among many other publications, I can recommend a comprehensive review article in the journal Drug Discovery Today (https://dx.doi.org/10.1016%2Fj.drudis.2020.10.010) that presents a great overview of recent successes in this area.
As you can imagine, this is a large topic, so I will focus here on its potential impact on drug discovery and development alone (Fig 1).
Getting data storage right
To reap the benefits of AI/ML approaches, the fundamentals of data storage, flexible access, and computational/analytical tooling must be considered.
Firstly, the datasets are massive, with individual source files that may be hundreds or even thousands of gigabytes in size. A case in point is the use of large datasets such as the UK Biobank (which includes full genomic sequences, medical histories, laboratory results, diagnostic imaging, GP notes, and more for ~500,000 individuals) to look for connections between genetic variation and negative or positive health outcomes. Genetic evidence supporting a functional role in disease for a given gene is very powerful because it is “in human” evidence.
This is a form of correlation analysis known as an ‘association study’. Researchers use association studies to derive a statistical measure of the relationship between a single genetic variant and many phenotypes (a phenome-wide association study, PheWAS), or between many variants and a single phenotype (a genome-wide association study, GWAS).
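To make the idea concrete, a single variant-phenotype association test is, at its core, a contingency-table statistic; a GWAS repeats such a test across millions of variants. The sketch below is a minimal, self-contained illustration using a Pearson chi-square statistic on invented cohort counts (the function name and data are hypothetical, and real pipelines use specialized tools and regression models with covariates):

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table:
    [[carriers_affected, carriers_unaffected],
     [non_carriers_affected, non_carriers_unaffected]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    # Expected counts if variant status and phenotype were independent
    expected = [[(a + b) * (a + c) / n, (a + b) * (b + d) / n],
                [(c + d) * (a + c) / n, (c + d) * (b + d) / n]]
    observed = [[a, b], [c, d]]
    return sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
               for i in range(2) for j in range(2))

# Hypothetical cohort: 40/100 variant carriers affected vs 20/100 non-carriers
stat = chi_square_2x2([[40, 60], [20, 80]])
print(round(stat, 2))  # prints 9.52 - larger values suggest a stronger association
```

Scaling this one cheap test to millions of variants across hundreds of thousands of individuals, and to every phenotype in a biobank, is precisely what drives the storage and compute requirements discussed below.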
While it is safe to say that simply handling the amount of data contained in population-scale biobanks requires planning and can be costly, the truly serious effort begins when attempting to perform large-scale PheWAS and GWAS. Making connections between different population-scale datasets, and supporting multi-user access, are important requirements too.
To handle these types of analysis, researchers will typically implement large clusters comprising hundreds of powerful servers working together to provide answers within a reasonable time period. The principal concerns in choosing an implementation strategy should be scalability, ease of use, and cost-effectiveness.
Ideally, there is a core cluster that supports data access and small computations, plus the ability to rapidly stand up a super-cluster of hundreds of machines on demand using “spot” resources that are far cheaper than reserved capacity. This on-demand elasticity is one of the greatest benefits of public clouds (such as Amazon Web Services, AWS), which also offer a cost-effective way to store data on a platform shared across organizations.
For machine learning and AI, the ability to quickly stand up and shut down resources is even more critical, since iterative analyses can fluctuate in length and be invoked at different points in a workflow. Thus, the entire life cycle of a spot worker – starting up, feeding it data, storing results, and shutting down – is a critical cost element in large-scale computation.
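The economics of that spot-worker cycle can be sketched with some simple arithmetic. The figures below are illustrative assumptions only (spot discounts and interruption overheads vary widely by provider, instance type, and time), not published pricing:

```python
def cluster_cost(n_workers, hours, on_demand_rate, spot_discount=0.7,
                 interruption_overhead=0.10):
    """Rough cost comparison between on-demand and spot workers.

    spot_discount and interruption_overhead are illustrative assumptions:
    spot capacity is often heavily discounted, but interruptions force
    some recomputation, which claws back part of the saving.
    """
    on_demand = n_workers * hours * on_demand_rate
    spot = on_demand * (1 - spot_discount) * (1 + interruption_overhead)
    return on_demand, spot

# Hypothetical burst: 200 workers for 6 hours at $1.00/hour each
full, spot = cluster_cost(200, 6, 1.00)
print(full, round(spot, 2))  # prints 1200.0 396.0
```

Even with a 10% interruption penalty folded in, the spot approach in this toy model costs roughly a third of the on-demand price, which is why minimizing startup, data-feeding, and shutdown overheads matters so much.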
A fundamental element in the architecture of an elastic compute environment is elastic storage. There are many commercial offerings, and it behooves the system architect to:
- Understand all of the steps where researchers may need to substitute algorithms, capture intermediate results in analyses, and iterate through potentially millions of models.
- Benchmark solutions for the appropriate level of user friendliness: command line, language support, etc.
- Benchmark before building, since substantial savings can be baked into an approach at this stage.
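A benchmark of candidate storage need not be elaborate to be informative. The sketch below is a minimal sequential-read throughput probe (the function name is my own, and a real comparison would point it at each candidate filesystem mount, vary block sizes, and repeat runs to average out caching effects):

```python
import os
import tempfile
import time

def benchmark_read(path, block_size=1 << 20):
    """Time a sequential read of `path` and return throughput in MB/s."""
    size = os.path.getsize(path)
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(block_size):  # read in 1 MB blocks until EOF
            pass
    elapsed = time.perf_counter() - start
    return size / (1024 * 1024) / max(elapsed, 1e-9)

# Write a small scratch file as a stand-in for a candidate storage mount
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(8 * 1024 * 1024))  # 8 MB of test data
rate = benchmark_read(tmp.name)
print(f"{rate:.1f} MB/s")
os.unlink(tmp.name)  # clean up the scratch file
```

Running such a probe against each candidate elastic file system, with file sizes and access patterns that match the real analysis workload, turns the "benchmark before building" advice into a concrete, repeatable comparison.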
In this way, data can be stored, analysed, and shared faster and more productively, providing the basic infrastructure needed to leverage the potential of AI/ML.
A new computing infrastructure
Against this background of needs and existing infrastructure constraints, Paradigm4 is tackling the data storage issues highlighted here with the creation of a novel elastic cloud file system for flexible and more resource-efficient data storage and sustainable, scalable computing. Head to the website to read more about how we have confronted the challenge here.