Title: | Classify Occurrences by Confidence Levels in the Species ID |
---|---|
Description: | Classify occurrence records based on confidence levels of species identification. In addition, implement tools to filter occurrences inside grid cells and to manually check for possible errors with an interactive shiny application. |
Authors: | Arthur Vinicius Rodrigues [aut, cre] |
Maintainer: | Arthur Vinicius Rodrigues <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.5.2 |
Built: | 2025-01-31 04:10:32 UTC |
Source: | https://github.com/avrodrigues/naturalist |
A GBIF raw dataset containing 508 occurrence records for the tree fern Alsophila setosa.
A.setosa
A data frame with 508 rows and 45 variables
GBIF.org (08 July 2019) GBIF Occurrence Download doi:10.15468/dl.6jesg0
A spatial polygon with the boundaries of Brazil
BR
A 'SpatialPolygonsDataFrame' with 1 feature
Classifies occurrence records into levels of confidence in species identification
classify_occ(
  occ,
  spec = NULL,
  na.rm.coords = TRUE,
  crit.levels = c("det_by_spec", "not_spec_name", "image", "sci_collection",
                  "field_obs", "no_criteria_met"),
  ignore.det.names = NULL,
  spec.ambiguity = "not.spec",
  institution.code = "institutionCode",
  collection.code = "collectionCode",
  catalog.number = "catalogNumber",
  year = "year",
  date.identified = "dateIdentified",
  species = "species",
  identified.by = "identifiedBy",
  decimal.latitude = "decimalLatitude",
  decimal.longitude = "decimalLongitude",
  basis.of.record = "basisOfRecord",
  media.type = "mediaType",
  occurrence.id = "occurrenceID",
  institution.source,
  year.event,
  scientific.name,
  determined.by,
  latitude,
  longitude,
  basis.of.rec,
  occ.id
)
occ |
data frame with occurrence records information. |
spec |
data frame with specialists' names. See details. |
na.rm.coords |
logical. If |
crit.levels |
character. Vector with levels of confidence in decreasing
order. The criteria allowed are |
ignore.det.names |
character vector indicating strings in
|
spec.ambiguity |
character. Indicates how to deal with ambiguity in
specialists names. |
institution.code |
column name of |
collection.code |
column name of |
catalog.number |
column name of |
year |
Column name of |
date.identified |
Column name of |
species |
column name of |
identified.by |
column name of |
decimal.latitude |
column name of |
decimal.longitude |
column name of |
basis.of.record |
column name with the specific nature of the data record. See details. |
media.type |
column name of |
occurrence.id |
column name of |
institution.source |
deprecated, use |
year.event |
deprecated, use |
scientific.name |
deprecated, use |
determined.by |
deprecated, use |
latitude |
deprecated, use |
longitude |
deprecated, use |
basis.of.rec |
deprecated, use |
occ.id |
deprecated, use |
The spec data frame must have separate columns for LastName, Name and Abbrev. See the create_spec_df function for an easy way to produce this data frame.
When ignore.det.names = NULL (default), the function ignores the strings "RRC ID Flag", "NA", "", "-" and "_". When a character vector is provided, the function adds the default strings to the provided vector and ignores all of these strings instead of treating them as the name of a taxonomist.
The function classifies the occurrence records into six levels of confidence in species identification. The six levels are:
det_by_spec - the identification was made by a specialist who is present in the list of specialists provided in the spec argument;
not_spec_name - the identification was made by a person whose name is not among the specialist names provided in spec;
image - the occurrence has no identifier name but has an associated image;
sci_collection - the occurrence has no identifier name but is preserved in a scientific collection;
field_obs - the occurrence has no identifier name but was identified by field observation;
no_criteria_met - no other criterion was met.
The order of the levels in the crit.levels character vector, from highest to lowest confidence, determines the classification level order.
The basis.of.record column is a character vector holding one of the following record types: PRESERVED_SPECIMEN, PreservedSpecimen, HUMAN_OBSERVATION or HumanObservation, as in the GBIF 'basisOfRecord' field.
The media.type column uses the same pattern as the GBIF mediaType column, where the value stillImage indicates that an image is associated with the record.
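The decision cascade described above can be illustrated in base R. This is a simplified sketch, not the naturaList implementation: the function name classify_sketch, the argument spec_last_names and the toy records are hypothetical, the specialist match is a plain substring test on the last name, and specialist-name ambiguity (spec.ambiguity) is not handled.

```r
# Simplified sketch of the confidence-level cascade; NOT the package's code.
# `classify_sketch` and `spec_last_names` are hypothetical names.
classify_sketch <- function(occ, spec_last_names,
                            ignore = c("RRC ID Flag", "NA", "", "-", "_")) {
  id_by <- occ$identifiedBy
  # A record has an identifier name only if the string is not ignorable
  has_name <- !is.na(id_by) & !(id_by %in% ignore)
  # Crude specialist match: some specialist last name occurs in the string
  is_spec <- vapply(id_by, function(x) {
    !is.na(x) && any(vapply(spec_last_names,
                            function(nm) grepl(nm, x, fixed = TRUE),
                            logical(1)))
  }, logical(1), USE.NAMES = FALSE)
  has_image <- grepl("stillImage", occ$mediaType, fixed = TRUE)
  in_coll <- occ$basisOfRecord %in% c("PRESERVED_SPECIMEN", "PreservedSpecimen")
  field <- occ$basisOfRecord %in% c("HUMAN_OBSERVATION", "HumanObservation")

  level <- rep("no_criteria_met", nrow(occ))
  # Assign from lowest to highest priority so higher levels overwrite lower ones
  level[!has_name & field] <- "field_obs"
  level[!has_name & in_coll] <- "sci_collection"
  level[!has_name & has_image] <- "image"
  level[has_name] <- "not_spec_name"
  level[has_name & is_spec] <- "det_by_spec"
  level
}

# Toy records (invented for illustration)
occ <- data.frame(
  identifiedBy  = c("M. Silva", "J. Doe", NA, "", "-", NA),
  mediaType     = c(NA, NA, "stillImage", NA, NA, NA),
  basisOfRecord = c("PRESERVED_SPECIMEN", "PRESERVED_SPECIMEN",
                    "HUMAN_OBSERVATION", "PreservedSpecimen",
                    "HumanObservation", "UNKNOWN"),
  stringsAsFactors = FALSE
)
levels_out <- classify_sketch(occ, spec_last_names = "Silva")
print(levels_out)
```

The third record shows why the order matters: it qualifies both as a field observation and as a record with an image, and it receives the higher of the two levels.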
The occ data frame plus the classification of each record in a new column, named naturaList_levels.
Arthur V. Rodrigues
data("A.setosa")
data("speciaLists")
occ.class <- classify_occ(A.setosa, speciaLists)
This function compares the area occupied by a species before and after the cleaning procedure, according to the chosen filter level. The comparison can be made by measuring area in geographical space and in environmental space.
clean_eval(
  occ.cl,
  geo.space,
  env.space = NULL,
  level.filter = c("1_det_by_spec"),
  r,
  species = "species",
  decimal.longitude = "decimalLongitude",
  decimal.latitude = "decimalLatitude",
  scientific.name,
  longitude,
  latitude
)
occ.cl |
data frame with occurrence records information already
classified by |
geo.space |
a SpatialPolygons* or sf object defining the geographical space |
env.space |
a SpatialPolygons* or sf object defining the environmental
space. Use the |
level.filter |
a character vector with the levels of the 'naturaList_levels' column used to filter the occurrence data set. |
r |
a raster with 2 layers representing the environmental variables. If
|
species |
column name of |
decimal.longitude |
column name of |
decimal.latitude |
column name of |
scientific.name |
deprecated, use |
longitude |
deprecated, use |
latitude |
deprecated, use |
A list in which:
area - data frame with the area remaining after cleaning, as a proportion of the area before cleaning. The values vary from 0 to 1. Column r.geo.area is the remaining area for all species in the geographic space and column r.env.area is the remaining area in the environmental space.
comp - data frame with the composition of species in sites (cells from raster layers) before cleaning (comp$comp.BC) and after cleaning (comp$comp.AC). The number of rows equals the number of cells in r, and the number of columns equals the number of species in occ.cl.
rich - data frame with a single column with the richness of each site
site.coords - data frame with the sites' coordinates. It makes it easier to build raster layers from the results using rasterFromXYZ
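To make the relationship between the composition and richness outputs concrete, here is a small base R sketch on invented presence/absence matrices. The objects comp_bc, comp_ac and the three species names are hypothetical, and the ratio of occupied cells is only a cell-count stand-in for the area proportions. The package's own area computation is not reproduced here.

```r
# Toy presence/absence matrices: rows are sites (raster cells), columns are
# species. All values are invented for illustration.
comp_bc <- matrix(c(1, 1, 0,
                    1, 0, 1,
                    0, 1, 1,
                    1, 1, 1),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(NULL, c("sp_a", "sp_b", "sp_c")))

# Pretend that cleaning removed some records, emptying a few cells
comp_ac <- comp_bc
comp_ac[c(2, 4), "sp_a"] <- 0
comp_ac[3, "sp_c"] <- 0

# Richness of each site is the number of species present in the cell
rich_bc <- rowSums(comp_bc)
rich_ac <- rowSums(comp_ac)

# Proportion of occupied cells retained per species (stand-in for area ratio)
remaining <- colSums(comp_ac) / colSums(comp_bc)

print(rich_bc)
print(rich_ac)
print(round(remaining, 2))
```

Values of remaining close to 1 mean cleaning barely changed the occupied cells of a species, while values close to 0 mean most of them depended on the records that were removed.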
## Not run:
library(sp)
library(raster)
library(dplyr) # provides %>% and filter()

data("speciaLists") # list of specialists
data("cyathea.br") # occurrence dataset

# classify
occ.cl <- classify_occ(cyathea.br, speciaLists)

# delimit the geographic space
# land area
data("BR")

# Transform occurrence data into SpatialPoints
spdf.occ.cl <- sp::SpatialPoints(occ.cl[, c("decimalLongitude", "decimalLatitude")])

# load climate data
data("r.temp.prec") # mean temperature and annual precipitation

df.temp.prec <- raster::as.data.frame(r.temp.prec)

### Define the environmental space for analysis
# this function will create a boundary of available environmental space,
# analogous to the continent boundary in the geographical space
env.space <- define_env_space(df.temp.prec, buffer.size = 0.05)

# filter by year to be consistent with the environmental data
occ.class.1970 <- occ.cl %>%
  dplyr::filter(year >= 1970)

### run the evaluation
cl.eval <- clean_eval(occ.class.1970,
                      env.space = env.space,
                      geo.space = BR,
                      r = r.temp.prec)

# area results
head(cl.eval$area)

### richness maps
## it makes sense only if there is more than one species
rich.before.clean <- raster::rasterFromXYZ(cbind(cl.eval$site.coords,
                                                 cl.eval$rich$rich.BC))
rich.after.clean <- raster::rasterFromXYZ(cbind(cl.eval$site.coords,
                                                cl.eval$rich$rich.AC))

raster::plot(rich.before.clean)
raster::plot(rich.after.clean)

### species area map
comp.bc <- as.data.frame(cl.eval$comp$comp.BC)
comp.ac <- as.data.frame(cl.eval$comp$comp.AC)

c.villosa.bc <- raster::rasterFromXYZ(cbind(cl.eval$site.coords,
                                            comp.bc$`Cyathea villosa`))
c.villosa.ac <- raster::rasterFromXYZ(cbind(cl.eval$site.coords,
                                            comp.ac$`Cyathea villosa`))

raster::plot(c.villosa.bc)
raster::plot(c.villosa.ac)
## End(Not run)
Creates a specialist data frame ready for use in classify_occ from a character vector containing the specialists' names.
create_spec_df(spec.char)
spec.char |
a character vector with specialist names |
A data frame whose columns hold the surname, the given names and the abbreviations of the given names. If a full name contains any special character, such as an accent mark, two rows are provided for that name, with and without the special characters. See examples.
# Example using Latin accent marks
data(spec_names_ex)
spec_names_ex
create_spec_df(spec_names_ex)
A filtered GBIF dataset containing 3851 occurrence records for fern species of the genus Cyathea in Brazil. We filtered the data after downloading it from GBIF to ensure that all occurrence records are from Brazil.
cyathea.br
A data frame with 3851 rows and 50 variables
GBIF.org (07 March 2021) GBIF Occurrence Download doi:10.15468/dl.qrhynv
Based on two continuous environmental variables, it defines a bi-dimensional environmental space.
define_env_space(env, buffer.size, plot = TRUE)
env |
matrix or data frame with two columns containing two environmental variables. The variables must be numeric, even for data frames. |
buffer.size |
numeric value indicating the size of the buffer around each point, which delimits the border of the environmental space around that point. See details. |
plot |
logical. Whether to plot the polygon. Default is TRUE. |
The environmental variables are standardized by range, so that each environmental variable varies from 0 to 1. Then a buffer of size equal to buffer.size is delimited around each point in this space, and a polygon is drawn to link these buffers. The function returns the polygon needed to link all points, and the area of the polygon indicates the environmental space based on the variables used.
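The range standardization step can be sketched in base R as follows. The helper range01 and the toy values are hypothetical, and the buffering and polygon construction are not reproduced. The sketch only shows why buffer.size is expressed on a 0 to 1 scale.

```r
# Range standardization: rescale a numeric vector to the interval [0, 1].
# `range01` is a hypothetical helper, not a function exported by the package.
range01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Toy environmental data (invented values): temperature and precipitation
env <- data.frame(temp = c(12, 18, 24, 30),
                  prec = c(800, 1200, 1000, 2000))

env_std <- as.data.frame(lapply(env, range01))
print(env_std)
```

After this step both variables share the same scale, so a buffer.size of 0.05 corresponds to 5% of the observed range of each variable, regardless of the original units.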
An object of sfc_POLYGON class
## Not run:
library("raster")

# load climate data
data("r.temp.prec")
env.data <- raster::as.data.frame(r.temp.prec)

define_env_space(env.data, 0.05)
## End(Not run)
Filters the occurrences, keeping the one with the most reliable species identification in each cell of the environmental space. This function is based on the function envSample provided by Varela et al. (2014) and was adapted for the naturaList package to select the occurrence with the most reliable species identification in each environmental grid cell.
env_grid_filter(
  occ.cl,
  env.data,
  grid.res,
  institution.code = "institutionCode",
  collection.code = "collectionCode",
  catalog.number = "catalogNumber",
  year = "year",
  date.identified = "dateIdentified",
  species = "species",
  identified.by = "identifiedBy",
  decimal.latitude = "decimalLatitude",
  decimal.longitude = "decimalLongitude",
  basis.of.record = "basisOfRecord",
  media.type = "mediaType",
  occurrence.id = "occurrenceID"
)
occ.cl |
data frame with occurrence records information already
classified by |
env.data |
data frame with one row per occurrence record and one column per environmental variable |
grid.res |
numeric vector. Each value represents the width of each bin
in the scale of the environmental variable. The order in this vector is
assumed to be the same as the order of the variables in the |
institution.code |
column name of |
collection.code |
column name of |
catalog.number |
column name of |
year |
Column name of |
date.identified |
Column name of |
species |
column name of |
identified.by |
column name of |
decimal.latitude |
column name of |
decimal.longitude |
column name of |
basis.of.record |
column name with the specific nature of the data record. See details. |
media.type |
column name of |
occurrence.id |
column name of |
Data frame with the same columns as occ.cl.
Varela et al. (2014). Environmental filters reduce the effects of sampling bias and improve predictions of ecological niche models. *Ecography*. 37(11) 1084-1091.
## Not run:
library(naturaList)
library(tidyverse)

data("cyathea.br")
data("speciaLists")
data("r.temp.prec")

occ <- cyathea.br %>%
  filter(species == "Cyathea atrovirens")

occ.cl <- classify_occ(occ, speciaLists, spec.ambiguity = "is.spec")

# temperature and precipitation data
env.data <- raster::extract(
  r.temp.prec,
  occ.cl[, c("decimalLongitude", "decimalLatitude")]
) %>%
  as.data.frame()

# the temperature bins are 5 degrees wide and the precipitation bins are 100 mm wide
grid.res <- c(5, 100)

occ.filtered <- env_grid_filter(
  occ.cl,
  env.data,
  grid.res
)
## End(Not run)
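The idea of keeping one record per environmental bin can be sketched in base R. This is an illustration under stated assumptions, not the package's algorithm: env_grid_sketch is a hypothetical function, bins are formed by integer division of each variable by its bin width, level labels are assumed to carry a numeric prefix (as in "1_det_by_spec"), and ties within a cell are resolved by row order.

```r
# Sketch: keep the best-classified record in each environmental grid cell.
# `env_grid_sketch` is hypothetical; tie-breaking here is simply row order.
env_grid_sketch <- function(occ.cl, env.data, grid.res) {
  # Bin index of each record along each environmental variable
  bins <- Map(function(x, w) floor(x / w), unname(as.list(env.data)), grid.res)
  # One label per record identifying its cell in the environmental grid
  cell <- do.call(paste, c(bins, sep = "_"))
  # Sort by cell, then by level label ("1_..." sorts before "2_...")
  ord <- order(cell, occ.cl$naturaList_levels)
  # First record of each cell after sorting is the best-classified one
  keep <- ord[!duplicated(cell[ord])]
  occ.cl[sort(keep), , drop = FALSE]
}

# Toy classified records (invented values)
occ.cl <- data.frame(
  species = "Cyathea atrovirens",
  naturaList_levels = c("4_sci_collection", "1_det_by_spec",
                        "2_not_spec_name", "1_det_by_spec"),
  stringsAsFactors = FALSE
)
env.data <- data.frame(temp = c(21, 23, 27, 28),
                       prec = c(1510, 1580, 1530, 1720))

occ.kept <- env_grid_sketch(occ.cl, env.data, grid.res = c(5, 100))
print(occ.kept)
```

The first two records fall in the same bin (20 to 25 degrees, 1500 to 1600 mm), so only the one identified by a specialist is kept. The other two records occupy distinct bins and are both retained.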
This function facilitates the search for non-taxonomist strings in the 'identified.by' column of an occurrence records data set.
get_det_names(
  occ,
  identified.by = "identifiedBy",
  freq = FALSE,
  decreasing = TRUE,
  determined.by
)
occ |
data frame with occurrence records information. |
identified.by |
column name of |
freq |
logical. If |
decreasing |
logical. sort strings in decreasing order of frequency.
Default = |
determined.by |
deprecated, use |
Character vector containing the strings in the identified.by column of occ. If freq = TRUE, it returns a data frame with two columns: 'strings' and 'frequency'.
data("A.setosa")
get_det_names(A.setosa, freq = TRUE)
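A rough base R equivalent of this kind of frequency table is sketched below. The function det_names_sketch and the toy data are hypothetical, and details such as how the package treats missing values or orders tied strings are assumptions.

```r
# Sketch: list identifier strings, optionally with their frequencies.
# `det_names_sketch` is hypothetical and drops NA values (an assumption).
det_names_sketch <- function(occ, identified.by = "identifiedBy",
                             freq = FALSE, decreasing = TRUE) {
  x <- occ[[identified.by]]
  if (!freq) {
    return(sort(unique(x)))
  }
  tab <- sort(table(x), decreasing = decreasing)
  data.frame(strings = names(tab),
             frequency = as.integer(tab),
             stringsAsFactors = FALSE)
}

# Toy records (invented): "-" is a typical non-taxonomist string
occ <- data.frame(
  identifiedBy = c("Silva", "-", "Silva", "Souza", "-", "Silva"),
  stringsAsFactors = FALSE
)
det.freq <- det_names_sketch(occ, freq = TRUE)
print(det.freq)
```

Strings such as "-" that show up in this table are candidates for the ignore.det.names argument of classify_occ.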
In each grid cell, selects the occurrence with the highest confidence level in species identification, as assigned by the classify_occ function.
grid_filter(
  occ.cl,
  grid.resolution = c(0.5, 0.5),
  r = NULL,
  institution.code = "institutionCode",
  collection.code = "collectionCode",
  catalog.number = "catalogNumber",
  year = "year",
  date.identified = "dateIdentified",
  species = "species",
  identified.by = "identifiedBy",
  decimal.latitude = "decimalLatitude",
  decimal.longitude = "decimalLongitude",
  basis.of.record = "basisOfRecord",
  media.type = "mediaType",
  occurrence.id = "occurrenceID",
  institution.source,
  year.event,
  scientific.name,
  determined.by,
  latitude,
  longitude,
  basis.of.rec,
  occ.id
)
occ.cl |
data frame with occurrence records information already
classified by |
grid.resolution |
numeric vector with width and height of grid cell in decimal degrees. |
r |
raster from which the grid cell resolution is derived. |
institution.code |
column name of |
collection.code |
column name of |
catalog.number |
column name of |
year |
Column name of |
date.identified |
Column name of |
species |
column name of |
identified.by |
column name of |
decimal.latitude |
column name of |
decimal.longitude |
column name of |
basis.of.record |
column name with the specific nature of the data record. See details. |
media.type |
column name of |
occurrence.id |
column name of |
institution.source |
deprecated, use |
year.event |
deprecated, use |
scientific.name |
deprecated, use |
determined.by |
deprecated, use |
latitude |
deprecated, use |
longitude |
deprecated, use |
basis.of.rec |
deprecated, use |
occ.id |
deprecated, use |
Data frame with the same columns as occ.cl.
Arthur V. Rodrigues
## Not run:
data("A.setosa")
data("speciaLists")

occ.class <- classify_occ(A.setosa, speciaLists)
occ.grid <- grid_filter(occ.class)
## End(Not run)
Allows the user to delete occurrence records and to select occurrence points by classification level or by drawing spatial polygons.
map_module(
  occ.cl,
  action = "clean",
  institution.code = "institutionCode",
  collection.code = "collectionCode",
  catalog.number = "catalogNumber",
  year = "year",
  date.identified = "dateIdentified",
  species = "species",
  identified.by = "identifiedBy",
  decimal.latitude = "decimalLatitude",
  decimal.longitude = "decimalLongitude",
  basis.of.record = "basisOfRecord",
  media.type = "mediaType",
  occurrence.id = "occurrenceID",
  institution.source,
  year.event,
  scientific.name,
  determined.by,
  latitude,
  longitude,
  basis.of.rec,
  occ.id
)
occ.cl |
Data frame with occurrence records information already
classified by |
action |
a string, either '"clean"' or '"flag"', which defines what the 'map_module' function does with the occurrence dataset. Default is '"clean"'. If the string is '"clean"', the returned dataset contains only the occurrence records selected by the user. If the string is '"flag"', a column named 'map_module_flag' is added to the output dataset, with the tags 'selected' and 'deleted' following the choices made by the user in the application. |
institution.code |
column name of |
collection.code |
column name of |
catalog.number |
column name of |
year |
Column name of |
date.identified |
Column name of |
species |
column name of |
identified.by |
column name of |
decimal.latitude |
column name of |
decimal.longitude |
column name of |
basis.of.record |
column name with the specific nature of the data record. See details. |
media.type |
column name of |
occurrence.id |
column name of |
institution.source |
deprecated, use |
year.event |
deprecated, use |
scientific.name |
deprecated, use |
determined.by |
deprecated, use |
latitude |
deprecated, use |
longitude |
deprecated, use |
basis.of.rec |
deprecated, use |
occ.id |
deprecated, use |
Data frame with the same columns as occ.cl.
Arthur V. Rodrigues
## Not run:
data("A.setosa")
data("speciaLists")

occ.class <- classify_occ(A.setosa, speciaLists)
occ.selected <- map_module(occ.class)
occ.selected
## End(Not run)
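The difference between the two values of action can be illustrated without the interactive application. In the sketch below, apply_action_sketch is a hypothetical helper and the logical vector selected stands in for the choices a user would make on the map. Only the shape of the two outputs follows the description of the action argument.

```r
# Sketch of the two output modes; `apply_action_sketch` is hypothetical and
# `selected` stands in for the user's interactive choices.
apply_action_sketch <- function(occ.cl, selected, action = c("clean", "flag")) {
  action <- match.arg(action)
  if (action == "clean") {
    # Return only the records the user kept
    return(occ.cl[selected, , drop = FALSE])
  }
  # Keep every record and tag it instead of dropping it
  occ.cl$map_module_flag <- ifelse(selected, "selected", "deleted")
  occ.cl
}

# Toy classified records (invented values)
occ.cl <- data.frame(
  occurrenceID = c("a1", "a2", "a3"),
  naturaList_levels = c("1_det_by_spec", "4_sci_collection", "2_not_spec_name"),
  stringsAsFactors = FALSE
)
selected <- c(TRUE, FALSE, TRUE)

occ.clean <- apply_action_sketch(occ.cl, selected, action = "clean")
occ.flag <- apply_action_sketch(occ.cl, selected, action = "flag")
print(occ.clean)
print(occ.flag)
```

Flagging keeps the full data set, which is convenient when the selection should remain reversible or auditable. Cleaning yields a data set that is ready for analysis.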
Raster of Annual Mean Temperature (bio1) and Annual Precipitation (bio12). Layers were downloaded from the WorldClim database and cropped to the extent of cyathea.br with a buffer of 100 km.
r.temp.prec
A raster with two layers
Example of specialist names with accent marks
spec_names_ex
character
A dataset containing the specialists in ferns and lycophytes of Brazil, formatted for use with the naturaList package. This data set serves as a format example for the spec argument of classify_occ.
speciaLists
A data frame with 27 rows and 8 columns:
Last name of the specialist.
Columns with the names of specialist. Could be repeated as long as needed. In this data Name* was repeated three times.
Columns with the names of specialist.
Columns with the names of specialist.
Columns with the names of specialist.
Columns with the abbreviation (one character) of the names of specialists. Could be repeated as long as needed. In this data Abbrev* was repeated three times.
Columns with the abbreviation (one character) of the names of specialists.
Columns with the abbreviation (one character) of the names of specialists.
The specialists' names were derived from the authors of the paper: doi:10.1590/2175-7860201566410