---
title: "Introduction to naturaList"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{natutaList_vignette}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#"
)
```

The aim of `{naturaList}` package is to implement a classification of occurrence records based on the suitability in the species identification record. The quality of classification is ranked up to six levels of confidence. Additionally, `{naturaList}` package provides tools to filter the occurrence data based on these classification levels, identify the possible specialists in the taxa and evaluate the effects of the filtering procedure on different descriptors of species spatial distribution of occurrence records (area of distribution and niche breadth). With `{naturaList}` package the users can filter large occurrence data based on well established and clear criterion, evaluate possible effect of data processing on downstream analysis and explore spatial occurrence data through an interactive interface.

## Installation
Install the package:

```{r install, eval=F, echo = T}
install.packages("naturaList")
```

## Classify occurrence records based on confidence in the species identification

`{naturaList}` has as the core function `classify_occ()`. The rationale of the classification is that the most reliable identification of a specimen is made by a specialist in the taxa. To classify an occurrence at this level of confidence, the `classify_occ()` function needs of an occurrence and a specialist dataset.
The other levels in which data can be classified are derived from information contained in the occurrence dataset. The default order for classification in confidence levels is:

 * Level 1 - species was identified by a specialist, if not;
 * Level 2 - who identified the species was a not specialist name, if not;
 * Level 3 - occurrence record has an image associated, if not;
 * Level 4 - the specimen is preserved in a scientific collection, if not;
 * Level 5 - the identification was done in filed observation, if not;
 * Level 6 - no criteria was met.
 
The user can alter this order, depending on his/her objectives, except for the Level 1 that is always a species determined by a specialist.

As example, we will use the datasets in `{naturaList}`: `A.setosa`, as the occurrence dataset, and `speciaLists`, as the specialist dataset. In the `A.setosa` there are occurrence records for *Alsophila setosa*, a tree fern of the Brazilian Atlantic Forest. This dataset were downloaded from [Global Biodiversity Information Facility (GBIF)]( https://doi.org/10.15468/dl.6jesg0). The `speciaLists` is a dataset with specialists of ferns and lycophytes of Brazil, which we gathered from the authors of this [paper](https://doi.org/10.1590/2175-7860201566410). 

```{r setup, eval=FALSE}
# Load package and data
library(naturaList)

data("A.setosa")
data("speciaLists")

# see the size of datasets
dim(A.setosa) # see ?A.setosa for details
dim(speciaLists) # see ?speciaLists for details
```

Classification using the default order of confidence levels
```{r classify, eval=F, echo = T}
# classification
occ.class <- classify_occ(A.setosa, speciaLists)
dim(occ.class)
```
 

You can check how many occurrences was classified in each level:
```{r levels, eval=F, echo = T}
table(occ.class$naturaList_levels)
```

## Create a specialist dataset

You can easily create a specialist dataset using `create_spec_df()`. You just need to provide a character vector with the names of specialists, and the output is a dataset formatted be used in `classify_occ()`. 

In this example, we use the names of four famous Brazilian musicians. Note that the Latin accent mark is provided, and even a nickname (e.g. Tom Jobim). 

```{r create_spec_df, eval=F, echo = T}
# create a specialist table example
br.musicians <- c("Caetano Veloso", "Antônio Carlos Tom Jobim", 
                  "Gilberto Gil", "Vinícius de Morais")

spec_df <- create_spec_df(br.musicians)
spec_df
```

## Verify if the strings in the 'identifiedBy' column of occurrence dataset is a name 

It might occur that some strings in the 'identifiedBy' column of the occurrence dataset do not correspond to a taxonomist name. Strings as such `"Unknown"` often is included in the 'identifiedBy' data field. It is important then that such strings be ignored by the `classify_occ()`, if not this function could flag an occurrence record as determined by a taxonomist when it was not. 

To cope with this issue, `get_det_names()` can be used to verify which strings are not taxonomists names. This function returns all unique strings in the 'identifiedBy' column of the dataset. Based on this list of names, you could create a character vector with the strings to be ignored by `classify_occ()`, providing it to the `ignore.det.names` argument. See also the `?classify_occ` for more details. 

```{r get_det_names, eval=F, echo = T}

# check out if there are strings which are not taxonomists
get_det_names(A.setosa)

# include these strings in a object
ig.names <- c("Sem Informação" , "Anonymous")

# use 'ignore.det.names' to ignore those strings in classify_occ()
occ.class <- classify_occ(A.setosa, speciaLists, ignore.det.names = ig.names)


table(occ.class$naturaList_levels)

```