The training dataset is in the file ‘PNWPredPhylData.csv’. The taxa we will use to build the classifier are Ascaphus montanus and A. truei, Chonaphe armata, Dicamptodon aterrimus and D. tenbrosus, Microtus richardsoni, Prophysaon coeruleum, Plethodon idahoensis and P. vandykei, and Salix melanopsis. (Click on the names to see what they look like, and get some more information on each of the taxa).
The data for taxa to predict is in the file ‘PredictDataFull.csv’. The taxa we will predict on will be Alnus rubra, Haplotrema vancouverense, Prophysaon andersoni, P. dubium, P.vanattae/ P.humile. (Click on the names to see what they look like, and get some more information on each of the taxa).
Let’s import the data!
PNWdata <- read.table("PNWPredPhylData.csv", sep=",", header=T)
For one of the variables in the dataset, max_size, R has recorded the variable as an integer. This will be a problem later, because for some species in the dataset for which we will make predictions, this variable is not an integer. Let’s go ahead and convert this variable to the numeric class.
PNWdata$maxSize <- as.numeric(PNWdata$maxSize)
The randomForest
package implements Breiman’s random forest algorithm (based on Breiman and Cutler’s original Fortran code) for classification and regression (Breiman 2001).
library(randomForest)
Now, let’s build a forest to predict whether or not a species harbors cryptic diversity using bioclimatic data and trait data.
RF_model <- randomForest(group ~ bio1_1 + bio4_1 + bio5_1 + bio6_1 + bio7_1 + bio12_1
+ bio15_1 + bio17_1 + taxon + dispStage + selfOut + dispersion
+ tropicLevel + maxSize, data=PNWdata, ntree=500)
RF_model
Can we accurately predict the presence or absense of cryptic diversity using these variables?.
We can look at the OOB error rates.
RF_model
##
## Call:
## randomForest(formula = group ~ bio1_1 + bio4_1 + bio5_1 + bio6_1 + bio7_1 + bio12_1 + bio15_1 + bio17_1 + taxon + dispStage + selfOut + dispersion + tropicLevel + maxSize, data = PNWdata, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 0%
## Confusion matrix:
## C NC class.error
## C 6653 0 0
## NC 0 1550 0
plot(RF_model)
It looks like this predictor works very well, and we don’t need to do much to improve our RF classifier. Let’s try a different method of evaluating the classifier.
For this classifier, there are multiple observations per species in the classifier. This may give us overly high confidence based on out of the bag error rates, because other observations for the same species may be driving the prediction for that species. We can use a cross-validation approach to further assess the accuracy of the classifier. Code for this is shown below.
#Create dataframes omitting each taxon from the classifier (for a total of 7 new
# dataframes). We will use these datasets to train each of the RF classifiers
# that we will use for cross-validation.
PNWdata_noascaphus <- PNWdata[which(PNWdata$complex != "Ascaphus.montanus.truei"),]
PNWdata_noconaphe <- PNWdata[which(PNWdata$complex != "Conaphe.armataEW"),]
PNWdata_nodicamptadon <- PNWdata[which(PNWdata$complex != "Dicamptodon.ater.tene"),]
PNWdata_nomicrotus <- PNWdata[which(PNWdata$complex != "Microtus.richardsoni"),]
PNWdata_noplethodon <- PNWdata[which(PNWdata$complex != "Plethodon.idaho.vandy"),]
PNWdata_nopcoer <- PNWdata[which(PNWdata$complex != "Prophysaon.coeruEW"),]
PNWdata_nosalix <- PNWdata[which(PNWdata$complex != "Salix.melanopsis"),]
#Create dataframes with only one taxon (a total of 7 new dataframes). These will
# be used as 'unknown' data in our cross-validation approach.
PNWdata_ascaphus <- PNWdata[which(PNWdata$complex == "Ascaphus.montanus.truei"),]
PNWdata_conaphe <- PNWdata[which(PNWdata$complex == "Conaphe.armataEW"),]
PNWdata_dicamptadon <- PNWdata[which(PNWdata$complex == "Dicamptodon.ater.tene"),]
PNWdata_microtus <- PNWdata[which(PNWdata$complex == "Microtus.richardsoni"),]
PNWdata_plethodon <- PNWdata[which(PNWdata$complex == "Plethodon.idaho.vandy"),]
PNWdata_pcoer <- PNWdata[which(PNWdata$complex == "Prophysaon.coeruEW"),]
PNWdata_salix <- PNWdata[which(PNWdata$complex == "Salix.melanopsis"),]
#Build seven new RF classifiers, each one omitting a different taxon.
RF_model_noAscaphus <- randomForest(group~bio1_1 + bio4_1 + bio5_1
+ bio6_1 + bio7_1 + bio12_1 + bio15_1
+ bio17_1 + dispStage+ selfOut + dispersion
+ tropicLevel + maxSize+taxon, data = PNWdata_noascaphus,
ntree = 500, importance = T)
RF_model_noconaphe <- randomForest(group~bio1_1 + bio4_1 + bio5_1
+ bio6_1 + bio7_1 + bio12_1 + bio15_1
+ bio17_1 + dispStage+ selfOut + dispersion
+ tropicLevel + maxSize+taxon, data = PNWdata_noconaphe,
ntree = 500, importance = T)
RF_model_nodicamptadon <- randomForest(group~bio1_1 + bio4_1 + bio5_1
+ bio6_1 + bio7_1 + bio12_1 + bio15_1
+ bio17_1 + dispStage+ selfOut + dispersion
+ tropicLevel + maxSize+taxon, data = PNWdata_nodicamptadon,
ntree = 500, importance = T)
RF_model_nomicrotus <- randomForest(group~bio1_1 + bio4_1 + bio5_1
+ bio6_1 + bio7_1 + bio12_1 + bio15_1
+ bio17_1 + dispStage+ selfOut + dispersion
+ tropicLevel + maxSize+taxon, data = PNWdata_nomicrotus,
ntree = 500, importance = T)
RF_model_noplethodon <- randomForest(group~bio1_1 + bio4_1 + bio5_1
+ bio6_1 + bio7_1 + bio12_1 + bio15_1
+ bio17_1 + dispStage+ selfOut + dispersion
+ tropicLevel + maxSize+taxon, data = PNWdata_noplethodon,
ntree = 500, importance = T)
RF_model_nopcoer <- randomForest(group~bio1_1 + bio4_1 + bio5_1
+ bio6_1 + bio7_1 + bio12_1 + bio15_1
+ bio17_1 + dispStage+ selfOut + dispersion
+ tropicLevel + maxSize+taxon, data = PNWdata_nopcoer,
ntree = 500, importance = T)
RF_model_nosalix <- randomForest(group~bio1_1 + bio4_1 + bio5_1
+ bio6_1 + bio7_1 + bio12_1 + bio15_1
+ bio17_1 + dispStage+ selfOut + dispersion
+ tropicLevel + maxSize+taxon, data = PNWdata_nosalix,
ntree = 500, importance = T)
#For each RF classifier, make predictions for the taxon that was omitted,
# and combine these predictions with the dataframes (so that we can plot them later).
pred_ascaphus <- predict(RF_model_noAscaphus, newdata=PNWdata_ascaphus, type="prob")
ascaphusPredictions <- cbind(pred_ascaphus, PNWdata_ascaphus)
pred_conaphe <- predict(RF_model_noconaphe, newdata=PNWdata_conaphe, type="prob")
conaphePredictions <- cbind(pred_conaphe, PNWdata_conaphe)
pred_dicamptadon <- predict(RF_model_nodicamptadon, newdata=PNWdata_dicamptadon,
type="prob")
dicamptadonPredictions <- cbind(pred_dicamptadon, PNWdata_dicamptadon)
pred_microtus <- predict(RF_model_nomicrotus, newdata=PNWdata_microtus, type="prob")
microtusPredictions <- cbind(pred_microtus, PNWdata_microtus)
pred_plethodon <- predict(RF_model_noplethodon, newdata=PNWdata_plethodon, type="prob")
plethodonPredictions <- cbind(pred_plethodon, PNWdata_plethodon)
pred_pcoer <- predict(RF_model_nopcoer, newdata=PNWdata_pcoer, type="prob")
pcoerPredictions <- cbind(pred_pcoer, PNWdata_pcoer)
pred_salix <- predict(RF_model_nosalix, newdata=PNWdata_salix, type="prob")
salixPredictions <- cbind(pred_salix, PNWdata_salix)
#Create one dataframe with all of the predictions, and make a boxplot
crossvalpredictions <- rbind(ascaphusPredictions,conaphePredictions,dicamptadonPredictions,
microtusPredictions,plethodonPredictions,
pcoerPredictions,salixPredictions)
levels(crossvalpredictions$complex) <- c("Ascaphus", "Conaphe", "Dicamptodon",
"Microtus", "Plethodon", "P. coeruleum","Salix")
library(ggplot2)
ggplot(crossvalpredictions, aes(y=C, x=complex, fill = complex)) +
geom_boxplot() + scale_y_continuous(limits = c(0, 1)) + theme(legend.position="top")
Even using cross-validation, it looks like we predicted everything correctly. Let’s look at which variables are driving the classifier.
Let’s see what predictor variables are contributing the most to the model.
varImpPlot(RF_model)
Taxon, trophic level and dispersal stage contribute the most to the model, and the bioclimatic variables contribute the least.
Now, we can make predictions for species that have not yet been studied using genetic data.
#Read in the data for the unstudied taxa
PNWunstudied <- read.table("/Users/msmith/Desktop/Workshop/PredictDataFull.csv",
sep=",", header=T)
#If we don't have the same levels in our new data that we did in the data used
# to train the classifier, then we will get an error.
levels(PNWunstudied$dispStage)<- levels(PNWdata$dispStage)
levels(PNWunstudied$tropicLevel)<- levels(PNWdata$tropicLevel)
levels(PNWunstudied$selfOut)<- levels(PNWdata$selfOut)
levels(PNWunstudied$dispersion)<- levels(PNWdata$dispersion)
levels(PNWunstudied$taxon)<- levels(PNWdata$taxon)
pred <- predict(RF_model, newdata=PNWunstudied, type="prob")
mypredictions <- cbind(pred, PNWunstudied)
ggplot(mypredictions, aes(y=C, x=Species, fill = Species)) +
geom_boxplot() + scale_y_continuous(limits = c(0, 1)) +
theme(legend.position="left") + theme(text = element_text(size=8))
Breiman L. Random Forests. Machine Learning 2001, 45, 5-32.
Espindola A, Ruffley M, Smith M, Carstens BC, Tank D, Sullivan J. 2016. Identifying cryptic diversity with predictive phylogeography. Proceedings of the Royal Society of London B 2016, 283, 1841.