── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ recipes::step() masks stats::step()
• Use tidymodels_prefer() to resolve common conflicts.
Prepare the data
The GEOquery library allows us to download the normalised methylation beta values from NCBI GEO.
library(GEOquery)
library(tidyverse)
library(skimr)

# retrieve the dataset - note getGEO() always returns a list with one
# element per platform, even if there is only one platform
geo <- getGEO("GSE164127")[[1]]
Genomics datasets for machine learning tend to have many variables/features (e.g. CpGs, genes, proteins) and relatively few observations.
# we have lots of CpGs
dim(exprs(geo))
[1] 37554 908
Beta values represent the percentage of measured beads with that site methylated. Beta values range between 0 (completely unmethylated) and 1 (completely methylated). Note the data is pre-normalised for us. In best practice we’d pre-process the train and test data completely independently, i.e. not normalised together at all.
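As a quick sanity check, a one-line sketch confirming the values do sit in [0, 1]:

# beta values should all fall between 0 and 1
range(exprs(geo), na.rm = TRUE)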
We can extract the table of samples and the beta values of every CpG.
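For the ExpressionSet that getGEO() returns, the standard Biobase accessors do this; a minimal sketch (the samples and betas names are just illustrative):

# sample annotation table, one row per sample
samples <- pData(geo)

# beta value matrix: CpGs in rows, samples in columns
betas <- exprs(geo)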
To make this faster to run, and to show that we can do ML on smaller datasets, let’s use just one of the bat species to train on.
Let’s train a model using data from:
• Greater spear-nosed bat (Phyllostomus hastatus)
To test how generalisable the model is we try to use the model across species to predict the age of:
• Big brown bat (Eptesicus fuscus)
# helper function to extract a data matrix for a particular bat species
processData <- function(species, geo){
  geo_filtered <- geo[, geo$organism_ch1 == species]
  methyl_filtered <- as.data.frame(t(exprs(geo_filtered)))
  methyl_filtered$age <- sqrt(as.numeric(geo_filtered$`age (years):ch1`) + 1)
  return(methyl_filtered)
}

# Let's keep only 1k random CpGs to help training speed for this workshop
set.seed(42)
keep <- sample.int(nrow(geo), 1000)
geo <- geo[keep, ]

# get the data for model building and testing
methyl_spearbat <- processData("Phyllostomus hastatus", geo)
methyl_bigbrownbat <- processData("Eptesicus fuscus", geo)
Create the training split
We keep 20% of the data for a final test. The remaining 80% will be used to train the parameters of the model.
# Split the data into train and test
methyl_spearbat_split <- initial_split(methyl_spearbat, prop = 0.8, strata = age)
methyl_spearbat_train <- training(methyl_spearbat_split)
methyl_spearbat_test <- testing(methyl_spearbat_split)
Warning: The number of observations in each quantile is below the recommended threshold of 20.
• Stratification will use 3 breaks instead.
Similar to before, we define the outcome and centre and scale the rest of the predictors.
# define the recipe
methyl_recipe <- recipe(methyl_spearbat_train) %>%
  update_role(everything(), new_role = "predictor") %>%
  update_role(age, new_role = "outcome") %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())
Select the model
Let’s use a GLMNet model, which allows us to penalise the inclusion of variables to prevent overfitting and keep the model sparse. This is useful if we want to identify the minimal panel of biological features that are sufficient to get a good prediction, e.g. for a biomarker panel.
mixture = 1 is known as a lasso model. In this model we need to tune the penalty (lambda), which controls the down-weighting of variables (regularisation).
tune() marks the penalty parameter as needing optimisation.
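As a minimal sketch, the lasso specification and penalty tuning could look like this in tidymodels (the fold count and penalty grid here are assumptions; best_mod and final_fitted match the names used below):

glmnet_model <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

methyl_workflow <- workflow() %>%
  add_recipe(methyl_recipe) %>%
  add_model(glmnet_model)

# tune the penalty by cross-validation on the training data
set.seed(42)
folds <- vfold_cv(methyl_spearbat_train, v = 5)
tuned <- tune_grid(methyl_workflow,
                   resamples = folds,
                   grid = grid_regular(penalty(), levels = 30))

# refit with the best penalty and evaluate on the held-out test split
best_mod <- select_best(tuned, metric = "rmse")
final_fitted <- methyl_workflow %>%
  finalize_workflow(best_mod) %>%
  last_fit(methyl_spearbat_split)

final_fitted %>%
  collect_predictions() %>%
  rmse(truth = age, estimate = .pred)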
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.823
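With the finalised workflow in hand, the cross-species test promised above is a straight prediction on the big brown bat data; a sketch, with illustrative object names:

# pull the fitted workflow out of the last_fit() result
final_wf <- extract_workflow(final_fitted)

# predict the (transformed) ages of the big brown bat samples
bigbrown_preds <- methyl_bigbrownbat %>%
  select(age) %>%
  bind_cols(predict(final_wf, new_data = methyl_bigbrownbat))

rmse(bigbrown_preds, truth = age, estimate = .pred)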
What CpGs are important?
We can use the coefficients from the model to determine what CpGs are driving the prediction. We can also see how many variables have been retained in the model using our tuned penalty value.
library(vip)
library(cowplot)

# get the importance from glmnet using the selected penalty
importance <- final_fitted %>%
  extract_fit_parsnip() %>%
  vi(lambda = best_mod$penalty) %>%
  mutate(
    Importance = abs(Importance),
    Variable = fct_reorder(Variable, Importance)
  )

# how many CpGs are retained
table(importance$Importance > 0)
FALSE TRUE
987 13
Plot the importance of the top CpGs and their direction
# plot the top 10 CpGs
importance %>%
  slice_max(Importance, n = 10) %>%
  ggplot(aes(x = Importance, y = Variable, fill = Sign)) +
  geom_col() +
  scale_x_continuous(expand = c(0, 0)) +
  labs(y = NULL) +
  theme_cowplot()
Plot the top predictive CpG beta values versus age
This highlights how you can use machine learning to identify a small number of discriminative features.
# helper function to plot a CpG's beta values against age
plotCpG <- function(cpg, dat){
  ggplot(dat, aes(x = !!sym(cpg), y = age)) +
    geom_point() +
    theme_cowplot()
}

# plot the most important CpGs
importance %>%
  slice_max(Importance, n = 4) %>%
  pull(Variable) %>%
  as.character() %>%
  map(plotCpG, methyl_spearbat) %>%
  plot_grid(plotlist = .)