How machine learning is revolutionising genomic research
Have you ever wondered why you are as tall as you are? Or how you inherited a physical trait that your sibling didn’t? What about why some people are more susceptible to disease? The answer lies in our biological make-up, the human genome.
By understanding how human traits and diseases are driven at the gene level, we can better understand not only ourselves but also how to treat or cure various diseases. However, with more than three billion letters in the human genome, any one of which could contribute to a disease or trait, a thorough analysis is a monumental task.
Traditional methods such as genome-wide association studies (GWAS) have enabled researchers to identify over 50,000 variants associated with traits and conditions such as heart disease, height and diabetes, but many of these remain only partially explained.
For example, in a research pool of 22,000 individuals with motor neurone disease, each individual carries around 2 million genetic differences, which adds up to roughly 80 million features that need analysing across the pool and a resulting matrix of around 1.7 trillion data points.
Evaluating such an astronomical amount of data was previously near-impossible, but a novel approach developed at CSIRO through a collaboration between its Health and Biosecurity and Data61 divisions has changed that.
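The size of that matrix follows from multiplying the number of individuals by the number of features. A quick check in Python, using only the figures quoted above (the variable names are illustrative):

```python
# Figures quoted in the article for the motor neurone disease research pool
individuals = 22_000
features = 80_000_000  # "roughly 80 million" genetic features

# One cell per (individual, feature) pair
matrix_cells = individuals * features

print(f"{matrix_cells:,} cells")  # 1,760,000,000,000, i.e. about 1.7 trillion
```

Because the 80 million figure is itself rounded, the product should be read as an order-of-magnitude estimate rather than an exact count.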
How machine learning is analysing genetic data
VariantSpark is a software platform that uses a distributed machine learning (ML) framework to generate insights from high-dimensional biological data. The ML component of VariantSpark is an implementation of RandomForest, a well-known machine learning algorithm that can be used for classification and regression problems.
Unlike other methods, the application of RandomForest in VariantSpark scales with the number of features in a dataset, enabling the analysis of increasingly large datasets.
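To give a feel for how a random forest can rank genetic features, here is a minimal sketch in plain Python. It is not VariantSpark's code or API: it is a deliberately simplified forest of depth-one trees (stumps) run on made-up genotype data, and every name in it (`stump_forest_importance`, `mtry`, the toy dataset) is an assumption chosen for illustration.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_stump(rows, labels, candidate_features):
    """Pick the candidate feature whose split (genotype == 0 vs > 0)
    reduces Gini impurity the most."""
    parent = gini(labels)
    best_feature, best_gain = None, 0.0
    for f in candidate_features:
        left = [y for x, y in zip(rows, labels) if x[f] == 0]
        right = [y for x, y in zip(rows, labels) if x[f] > 0]
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        gain = parent - child
        if gain > best_gain:
            best_feature, best_gain = f, gain
    return best_feature, best_gain

def stump_forest_importance(rows, labels, n_trees=200, mtry=3, seed=0):
    """Average impurity reduction per feature over many one-split trees.
    Each tree sees a bootstrap sample and a random subset of features."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    importance = [0.0] * n_features
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        candidates = rng.sample(range(n_features), mtry)  # random feature subset
        feature, gain = best_stump(sample_rows, sample_labels, candidates)
        if feature is not None:
            importance[feature] += gain
    return [value / n_trees for value in importance]

# Toy data (hypothetical): 300 individuals, 10 variants coded 0/1/2.
# The trait is driven entirely by variant 3; the other variants are noise.
data_rng = random.Random(42)
rows = [[data_rng.randint(0, 2) for _ in range(10)] for _ in range(300)]
labels = [1 if row[3] > 0 else 0 for row in rows]

importance = stump_forest_importance(rows, labels)
top_feature = max(range(10), key=lambda f: importance[f])
print("most important variant:", top_feature)
```

On this toy data the planted variant comes out on top. A full random forest grows deeper trees, which is what lets it pick up combinations of variants rather than one variant at a time.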
“VariantSpark can analyse 3,000 individuals with over 80 million genetic features in under 30 minutes, which means that we now require 80 per cent fewer genetic samples to detect a statistically significant signal,” says Dr Denis Bauer, Head of Transformational Bioinformatics at CSIRO, and ML specialist.
“VariantSpark has already been used to detect some of the ‘mis-spellings’ in the human genome that cause dementia and ALS.” – Dr Denis Bauer
VariantSpark is the first method to explore all potential genetic variations and interactions, which could provide the missing puzzle piece in explaining how the genome influences complex diseases such as diabetes or Alzheimer’s.
The platform also uses efficient multi-layer parallelisation, a feature that enables its analysis to scale to whole-genome, population-scale datasets with a hundred million genomic variants and a hundred thousand samples.
Compared to traditional GWAS, VariantSpark can provide crucial insights 3.6 times faster and more accurately identify genomic variants associated with complex phenotypes, and it is the only method able to scale to ultra-high dimensional genomic data in a manageable time.
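The article does not describe how the parallel layers are organised, so the following is only a sketch of one plausible layer: splitting the features into chunks, scoring each chunk independently, and combining the partial results. The scoring function, chunking scheme and all names are assumptions for illustration, not VariantSpark's implementation.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def association_score(column, labels):
    """Absolute difference in mean genotype between cases (1) and controls (0)."""
    cases = [g for g, y in zip(column, labels) if y == 1]
    controls = [g for g, y in zip(column, labels) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

def best_in_chunk(chunk, labels):
    """Return (score, feature_index) for the strongest feature in one chunk."""
    return max((association_score(col, labels), idx) for idx, col in chunk)

def best_feature_parallel(columns, labels, n_chunks=4):
    """Partition features into chunks, score the chunks concurrently,
    then reduce the per-chunk winners to a single overall winner."""
    indexed = list(enumerate(columns))
    chunks = [indexed[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partial = list(pool.map(lambda chunk: best_in_chunk(chunk, labels), chunks))
    return max(partial)

# Toy data (hypothetical): 200 individuals, 40 variants stored feature by feature.
rng = random.Random(1)
labels = [i % 2 for i in range(200)]
columns = [[rng.randint(0, 2) for _ in labels] for _ in range(40)]
columns[17] = [2 * y for y in labels]  # plant a perfect signal at variant 17

score, feature = best_feature_parallel(columns, labels)
print(f"strongest variant: {feature} (score {score})")
```

Threads are used here only to show the split, score and combine structure. In a distributed system the chunks would sit on different machines, so that no single machine has to hold every feature.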
A world first
In a world first, VariantSpark processed and analysed one trillion points of genomic data from a synthetic dataset (provided by Amazon Web Services) comprising 100,000 individuals (samples) and ten million variants.
No other technology platform had been able to process such a mammoth number of samples at once, explained Dr Bauer. “Our research shows VariantSpark is the only method able to scale to ultra-high dimensional genomic data in a manageable time.”
“It was able to process this information in 15 hours, while it would likely take the fastest competitors more than 100,000 years to process such a volume of data,” said Dr Bauer.
“This is a significant milestone, as it means VariantSpark can be scaled up to analyse population-level datasets and drive better healthcare outcomes.”
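The figures quoted for this run can be sanity-checked with simple arithmetic. The sketch below uses only numbers from the article, treats "more than 100,000 years" as a lower bound, and assumes 365-day years.

```python
# Dataset size quoted for the synthetic benchmark
individuals = 100_000
variants = 10_000_000
data_points = individuals * variants
print(f"{data_points:,} data points")  # 1,000,000,000,000 (one trillion)

# Run times quoted by Dr Bauer
hours_variantspark = 15
hours_competitor = 100_000 * 365 * 24  # lower bound: 100,000 years in hours

# Implied minimum speed-up over the fastest competitors
speedup = hours_competitor / hours_variantspark
print(f"at least {speedup:,.0f} times faster")  # at least 58,400,000 times faster
```

The speed-up is an implied lower bound derived from the quote, not a measured benchmark result.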
The study is available here.
VariantSpark and COVID-19
With a 30,000-letter genome and tens of thousands of sequences, the interactions in the genome of SARS-CoV-2, the virus that causes COVID-19, were impossible to assess using standard statistical models, says Dr Bauer. By applying machine learning in the form of VariantSpark, researchers at CSIRO identified mutations in the virus’s genome that could influence characteristics of the disease, such as the severity of symptoms and the spread of infection.
“Machine learning is a critical tool in our toolbox, because the virus genome works as a whole, with a change in one location having the potential to be amplified in a location several thousand letters away,” she explains.
“As such, the genome needs to be analysed as a whole, which machine learning methods are particularly good at by forming complex interactions and deriving information from them.”
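A small made-up example shows why looking at one position at a time can miss such interactions. In the sketch below, a hypothetical trait appears only when exactly one of two sites is mutated. This is an extreme, artificial pattern chosen for clarity and is not taken from CSIRO's analysis.

```python
from itertools import product

# Toy "genomes": every combination of two binary sites, 25 copies of each
samples = [(a, b) for a, b in product([0, 1], repeat=2) for _ in range(25)]

# Hypothetical trait: present only when exactly one of the two sites is mutated
trait = [a ^ b for a, b in samples]

def trait_rate(rows, labels, predicate):
    """Fraction of the selected samples that show the trait."""
    selected = [y for x, y in zip(rows, labels) if predicate(x)]
    return sum(selected) / len(selected)

# Looking at site A alone: the trait is equally common either way
rate_a_mutated = trait_rate(samples, trait, lambda x: x[0] == 1)
rate_a_normal = trait_rate(samples, trait, lambda x: x[0] == 0)

# Looking at both sites together: the trait is fully determined
rate_only_a = trait_rate(samples, trait, lambda x: x == (1, 0))
rate_both = trait_rate(samples, trait, lambda x: x == (1, 1))

print(rate_a_mutated, rate_a_normal)  # 0.5 0.5
print(rate_only_a, rate_both)         # 1.0 0.0
```

A test of either site on its own sees no signal at all, while a method that considers the sites jointly recovers the pattern completely.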
While the analysis is still in its early days, the insights extracted by VariantSpark are playing an essential role in identifying regions that may be susceptible to a drug or vaccine.
“We can basically track the evolution of the virus on a daily basis, giving us insights into which areas of the genome are associated with crucial function and might be good targets for drugs or vaccines.” – Dr Denis Bauer
How to access VariantSpark
VariantSpark is available on AWS Marketplace and GitHub, allowing for self-serve deployment to national and international organisations. This type of deployment is ideal for organisations looking to become more resilient in a post-COVID world.
The Transformational Bioinformatics team’s paper on VariantSpark, which can be found here, has been accepted for publication in the Oxford journal GigaScience.