1. The data

The training dataset is in the file ‘PNWPredPhylData.csv’. The taxa we will use to build the classifier are Ascaphus montanus and A. truei, Chonaphe armata, Dicamptodon aterrimus and D. tenbrosus, Microtus richardsoni, Prophysaon coeruleum, Plethodon idahoensis and P. vandykei, and Salix melanopsis. (Click on the names to see what they look like, and get some more information on each of the taxa).

The data for taxa to predict is in the file ‘PredictDataFull.csv’. The taxa we will predict on will be Alnus rubra, Haplotrema vancouverense, Prophysaon andersoni, P. dubium, P.vanattae/ P.humile. (Click on the names to see what they look like, and get some more information on each of the taxa).

  • Species: Species name
  • x , y: geographic coordinates
  • bio1_1: Annual Mean Temperature
  • bio4_1: Temperature Seasonality (standard deviation *100)
  • bio5_1: Max Temperature of Warmest Month
  • bio6_1: Min Temperature of Coldest Month
  • bio7_1: Temperature Annual Range (BIO5-BIO6)
  • bio12_1: Annual Precipitation
  • bio15_1: Precipitation Seasonality (Coefficient of Variation)
  • bio17_1: Precipitation of Driest Quarter
  • taxon: this is the taxonomic rank of the species
  • group: whether a species is recognized as Cryptic (C) or Non-Cryptic (NC), based on previous molecular studies.
  • complex: Because the species that are cryptic belong to a species complex (that now is formed by the new cryptic species), this column represents the name of the complex, if present.
  • dispStage: dispersal stage. This indicates at what developmental stage dispersal occurs. Options are ‘adult’, ‘juvenile’, and ‘embryo’.
  • selfOut: Whether the species reproduces mainly by ‘selfing’, ‘outcrossing’, or both (‘out/self’).
  • dispersion: Means of disperal. Options are ‘wind’, ‘winde/water’, and ‘self’.
  • tropicLevel: Indicates the tropic level of the species. Options are ‘primary’, ‘herbivore’, ‘predator’, and ‘detritivore’.
  • maxSize: maximum size of the organism. In cm.

Let’s import the data!

PNWdata <- read.table("PNWPredPhylData.csv", sep=",", header=T)

2. Check data format

For one of the variables in the dataset, max_size, R has recorded the variable as an integer. This will be a problem later, because for some species in the dataset for which we will make predictions, this variable is not an integer. Let’s go ahead and convert this variable to the numeric class.

PNWdata$maxSize <- as.numeric(PNWdata$maxSize)

3. Build a random forest classifier

The randomForest package implements Breiman’s random forest algorithm (based on Breiman and Cutler’s original Fortran code) for classification and regression (Breiman 2001).

library(randomForest)

Now, let’s build a forest to predict whether or not a species harbors cryptic diversity using bioclimatic data and trait data.

RF_model <- randomForest(group ~ bio1_1 + bio4_1 + bio5_1 + bio6_1 + bio7_1 + bio12_1 
                         + bio15_1 + bio17_1 + taxon + dispStage + selfOut + dispersion 
                         + tropicLevel + maxSize, data=PNWdata, ntree=500)

RF_model

4. Evaluating the classifier

Can we accurately predict the presence or absense of cryptic diversity using these variables?.

We can look at the OOB error rates.

RF_model
## 
## Call:
##  randomForest(formula = group ~ bio1_1 + bio4_1 + bio5_1 + bio6_1 +      bio7_1 + bio12_1 + bio15_1 + bio17_1 + taxon + dispStage +      selfOut + dispersion + tropicLevel + maxSize, data = PNWdata,      ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 0%
## Confusion matrix:
##       C   NC class.error
## C  6653    0           0
## NC    0 1550           0
plot(RF_model)

It looks like this predictor works very well, and we don’t need to do much to improve our RF classifier. Let’s try a different method of evaluating the classifier.

a) Does crossvalidation change our confidence in the classifier?

For this classifier, there are multiple observations per species in the classifier. This may give us overly high confidence based on out of the bag error rates, because other observations for the same species may be driving the prediction for that species. We can use a cross-validation approach to further assess the accuracy of the classifier. Code for this is shown below.

#Create dataframes omitting each taxon from the classifier (for a total of 7 new 
# dataframes). We will use these datasets to train each of the RF classifiers 
# that we will use for cross-validation.
PNWdata_noascaphus <- PNWdata[which(PNWdata$complex != "Ascaphus.montanus.truei"),] 
PNWdata_noconaphe <- PNWdata[which(PNWdata$complex != "Conaphe.armataEW"),]
PNWdata_nodicamptadon <- PNWdata[which(PNWdata$complex != "Dicamptodon.ater.tene"),]
PNWdata_nomicrotus <- PNWdata[which(PNWdata$complex != "Microtus.richardsoni"),]
PNWdata_noplethodon <- PNWdata[which(PNWdata$complex != "Plethodon.idaho.vandy"),]
PNWdata_nopcoer <- PNWdata[which(PNWdata$complex != "Prophysaon.coeruEW"),]
PNWdata_nosalix <- PNWdata[which(PNWdata$complex != "Salix.melanopsis"),]

#Create dataframes with only one taxon (a total of 7 new dataframes). These will 
# be used as 'unknown' data in our cross-validation approach.
PNWdata_ascaphus <- PNWdata[which(PNWdata$complex == "Ascaphus.montanus.truei"),]
PNWdata_conaphe <- PNWdata[which(PNWdata$complex == "Conaphe.armataEW"),]
PNWdata_dicamptadon <- PNWdata[which(PNWdata$complex == "Dicamptodon.ater.tene"),]
PNWdata_microtus <- PNWdata[which(PNWdata$complex == "Microtus.richardsoni"),]
PNWdata_plethodon <- PNWdata[which(PNWdata$complex == "Plethodon.idaho.vandy"),]
PNWdata_pcoer <- PNWdata[which(PNWdata$complex == "Prophysaon.coeruEW"),]
PNWdata_salix <- PNWdata[which(PNWdata$complex == "Salix.melanopsis"),]

#Build seven new RF classifiers, each one omitting a different taxon.
RF_model_noAscaphus <- randomForest(group~bio1_1 + bio4_1 + bio5_1
                                  +  bio6_1 + bio7_1 + bio12_1 + bio15_1 
                                  + bio17_1 + dispStage+ selfOut + dispersion 
                                  + tropicLevel + maxSize+taxon, data = PNWdata_noascaphus, 
                                  ntree = 500, importance = T)
RF_model_noconaphe <- randomForest(group~bio1_1 + bio4_1 + bio5_1 
                                   + bio6_1 + bio7_1 + bio12_1 + bio15_1 
                                   + bio17_1 + dispStage+ selfOut + dispersion 
                                   + tropicLevel + maxSize+taxon, data = PNWdata_noconaphe, 
                                   ntree = 500, importance = T)
RF_model_nodicamptadon <- randomForest(group~bio1_1 + bio4_1 + bio5_1 
                                       + bio6_1 + bio7_1 + bio12_1 + bio15_1 
                                       + bio17_1 + dispStage+ selfOut + dispersion 
                                       + tropicLevel + maxSize+taxon, data = PNWdata_nodicamptadon, 
                                       ntree = 500, importance = T)
RF_model_nomicrotus <- randomForest(group~bio1_1 + bio4_1 + bio5_1 
                                    + bio6_1 + bio7_1 + bio12_1 + bio15_1 
                                    + bio17_1 + dispStage+ selfOut + dispersion 
                                    + tropicLevel + maxSize+taxon, data = PNWdata_nomicrotus, 
                                    ntree = 500, importance = T)
RF_model_noplethodon <- randomForest(group~bio1_1 + bio4_1 + bio5_1 
                                     + bio6_1 + bio7_1 + bio12_1 + bio15_1 
                                     + bio17_1 + dispStage+ selfOut + dispersion 
                                     + tropicLevel + maxSize+taxon, data = PNWdata_noplethodon, 
                                     ntree = 500, importance = T)
RF_model_nopcoer <- randomForest(group~bio1_1 + bio4_1 + bio5_1 
                                 + bio6_1 + bio7_1 + bio12_1 + bio15_1 
                                 + bio17_1 + dispStage+ selfOut + dispersion 
                                 + tropicLevel + maxSize+taxon, data = PNWdata_nopcoer, 
                                 ntree = 500, importance = T)
RF_model_nosalix <- randomForest(group~bio1_1 + bio4_1 + bio5_1 
                                 + bio6_1 + bio7_1 + bio12_1 + bio15_1 
                                 + bio17_1 + dispStage+ selfOut + dispersion 
                                 + tropicLevel + maxSize+taxon, data = PNWdata_nosalix, 
                                 ntree = 500, importance = T)

#For each RF classifier, make predictions for the taxon that was omitted, 
# and combine these predictions with the dataframes (so that we can plot them later).
pred_ascaphus <- predict(RF_model_noAscaphus, newdata=PNWdata_ascaphus, type="prob")
ascaphusPredictions <- cbind(pred_ascaphus, PNWdata_ascaphus)
pred_conaphe <- predict(RF_model_noconaphe, newdata=PNWdata_conaphe, type="prob")
conaphePredictions <- cbind(pred_conaphe, PNWdata_conaphe)
pred_dicamptadon <- predict(RF_model_nodicamptadon, newdata=PNWdata_dicamptadon, 
                            type="prob")
dicamptadonPredictions <- cbind(pred_dicamptadon, PNWdata_dicamptadon)
pred_microtus <- predict(RF_model_nomicrotus, newdata=PNWdata_microtus, type="prob")
microtusPredictions <- cbind(pred_microtus, PNWdata_microtus)
pred_plethodon <- predict(RF_model_noplethodon, newdata=PNWdata_plethodon, type="prob")
plethodonPredictions <- cbind(pred_plethodon, PNWdata_plethodon)
pred_pcoer <- predict(RF_model_nopcoer, newdata=PNWdata_pcoer, type="prob")
pcoerPredictions <- cbind(pred_pcoer, PNWdata_pcoer)
pred_salix <- predict(RF_model_nosalix, newdata=PNWdata_salix, type="prob")
salixPredictions <- cbind(pred_salix, PNWdata_salix)

#Create one dataframe with all of the predictions, and make a boxplot
crossvalpredictions <- rbind(ascaphusPredictions,conaphePredictions,dicamptadonPredictions,
                             microtusPredictions,plethodonPredictions,
                             pcoerPredictions,salixPredictions)
levels(crossvalpredictions$complex) <- c("Ascaphus", "Conaphe", "Dicamptodon", 
                                         "Microtus", "Plethodon", "P. coeruleum","Salix")
library(ggplot2)
ggplot(crossvalpredictions, aes(y=C, x=complex, fill = complex)) + 
  geom_boxplot() + scale_y_continuous(limits = c(0, 1)) + theme(legend.position="top")

Even using cross-validation, it looks like we predicted everything correctly. Let’s look at which variables are driving the classifier.

5. Variable Importance

Let’s see what predictor variables are contributing the most to the model.

varImpPlot(RF_model)

Taxon, trophic level and dispersal stage contribute the most to the model, and the bioclimatic variables contribute the least.

6. Predicting Unknown Responses

Now, we can make predictions for species that have not yet been studied using genetic data.

#Read in the data for the unstudied taxa
PNWunstudied <- read.table("/Users/msmith/Desktop/Workshop/PredictDataFull.csv", 
                           sep=",", header=T)

#If we don't have the same levels in our new data that we did in the data used 
# to train the classifier, then we will get an error.
levels(PNWunstudied$dispStage)<- levels(PNWdata$dispStage)
levels(PNWunstudied$tropicLevel)<- levels(PNWdata$tropicLevel)
levels(PNWunstudied$selfOut)<- levels(PNWdata$selfOut)
levels(PNWunstudied$dispersion)<- levels(PNWdata$dispersion)
levels(PNWunstudied$taxon)<- levels(PNWdata$taxon)

pred <- predict(RF_model, newdata=PNWunstudied, type="prob")
mypredictions <- cbind(pred, PNWunstudied)

ggplot(mypredictions, aes(y=C, x=Species, fill = Species)) + 
  geom_boxplot() + scale_y_continuous(limits = c(0, 1)) + 
  theme(legend.position="left") + theme(text = element_text(size=8))

7. References

Breiman L. Random Forests. Machine Learning 2001, 45, 5-32.

Espindola A, Ruffley M, Smith M, Carstens BC, Tank D, Sullivan J. 2016. Identifying cryptic diversity with predictive phylogeography. Proceedings of the Royal Society of London B 2016, 283, 1841.