Another Exercise Dataset

Background

An important area of research in biology and conservation is the development of new methods to identify the presence of cryptically diverse taxa within a system of study. Recently, there have been several new methods developed to this end. Most of these, however, require molecular data, which in many cases is unavailable or difficult to obtain. In a paper we published in 2016 (Espindola et al., 2016), we investigated the ability of Random Forests to assist in the identification of lineages that harbor crytic diversity.

Specifically, the study focused on the disjunct temperate rainforests of the Pacific Northwest of North America. This ecosystem is formed by two disjunct regions of temperate rainforest, one in the Coastal and Cascades Mountain ranges, and the other inland in the Northern Rocky Mountains; the two zones are isolated by the Columbia Basin desert steppes. In this ecosystem, it has been shown that many taxa harbor cryptic diversity structured across the disjunction between the coastal and inland parts of the rainforest (i.e., across the Columbia Basin). Based on that knowledge, we explored whether or not Random Forest analyses (Breiman, 2001), using climatic and taxonomic information, could classify taxa as cryptic or non-cryptic, and predict whether or not unstudied taxa harbor cryptic diversity. To do this, we collected locality data for each taxon that had been studied using genetic data. Using the available genetic data, we classified taxa as cryptic or non-cryptic, and we used these categories as the response variable in the Random Forest classifier. We used taxonomic rank (e.g., Mollusks for snails and slugs; Amphibian for salamanders and frogs; etc.) and extracted bioclimatic variables from the worldclim dataset to use as predictor variables.

We are providing this dataset for you to apply what you learned at the workshop. Specifically, we would like you to:

1- Create a Random Forest analysis that will allow you to build an accurate classifier. For this, you can use whichever improvement method that you consider appropriate.

2- Predict whether or not unstudied taxa harbor cryptic diversity.

The data

The training dataset is in the file ‘PNWPredPhylData.csv’. The taxa we will use to build the classifier are Ascaphus montanus and A. truei, Chonaphe armata, Dicamptodon aterrimus and D. tenbrosus, Microtus richardsoni, Prophysaon coeruleum, Plethodon idahoensis and P. vandykei, and Salix melanopsis. (Click on the names to see what they look like, and to get some more information on each of the taxa).

The data for taxa to predict is in the file ‘PredictDataFull.csv’. The taxa to be predicted will be Alnus rubra, Haplotrema vancouverense, Prophysaon andersoni, P. dubium, P.vanattae/ P.humile. (Click on the names to see what they look like, and to get some more information on each of the taxa).

What’s in the file?

  • Species: Species name
  • x , y: geographic coordinates
  • bio1_1: Annual Mean Temperature
  • bio4_1: Temperature Seasonality (standard deviation *100)
  • bio5_1: Max Temperature of Warmest Month
  • bio6_1: Min Temperature of Coldest Month
  • bio7_1: Temperature Annual Range (BIO5-BIO6)
  • bio12_1: Annual Precipitation
  • bio15_1: Precipitation Seasonality (Coefficient of Variation)
  • bio17_1: Precipitation of Driest Quarter
  • taxon: this is the taxonomic rank of the species
  • group: whether a species is recognized as Cryptic (C) or Non-Cryptic (NC), based on previous molecular studies.
  • complex: Because the species that are cryptic belong to a species complex (that now is formed by the new cryptic species), this column represents the name of the complex, if present.
  • dispStage: dispersal stage. This indicates at what developmental stage dispersal occurs. Options are ‘adult’, ‘juvenile’, and ‘embryo’.
  • selfOut: Whether the species reproduces mainly by ‘selfing’, ‘outcrossing’, or both (‘out/self’).
  • dispersion: Means of disperal. Options are ‘wind’, ‘wind/water’, and ‘self’.
  • tropicLevel: Indicates the tropic level of the species. Options are ‘primary’, ‘herbivore’, ‘predator’, and ‘detritivore’.
  • maxSize: maximum size of the organism. In cm.

References

Breiman L. Random Forests. Machine Learning 2001, 45, 5-32.

Espindola A, Ruffley M, Smith M, Carstens BC, Tank D, Sullivan J. 2016. Identifying cryptic diversity with predictive phylogeography. Proceedings of the Royal Society of London B 2016, 283, 1841.