---
title: "Evaluating the filtering process with naturaList"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{natutaList_clean_eval_vignette}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  eval = TRUE,
  echo = TRUE,
  message = FALSE,
collapse = TRUE,
comment = "#"
)
```

In this article we will show how `{naturaList}` package can be used to process large dataset of species occurrence records. Specifically we will show how we can filter occurrence data accordingly to a list of specialists and [compare the effect](#comp_eval) of the filtering process regarding the differences in area of distribution of species, richness and the niche space before and after the filtering process.

Before read this article, it is important to see the [Introduction to naturaList](https://avrodrigues.github.io/naturaList/articles/natutaList_vignette.html).

This evaluation of cleaning procedure is done mainly through the use of `clean_eval()` function. So, in summary, in this article you will learn how to:

-   [Process](#env_data) geographical and environmental data in order to adequate its format to `{naturaList}` functions, specifically `clean_eval()` function;
-   [Compare](#comp_eval) the spatial occurrence, richness and environmental niche of species before and after the cleaning process.

## Installation and required packages {#install}

For this article the user must load or install the following packages:

```{r install, message=FALSE}
library(naturaList)
library(raster)
library(rnaturalearth)
library(dplyr)
```


## Classification process

We will use occurrence data of the fern genus *Cyathea* in Brazil, that comprises a group of fern species. This dataset is available in the `{naturalist}`. We used this dataset to exemplify how to use a dataset with several species in `{naturaList}`. 

The list of specialists we will use was derived from the authors of a paper describing the diversity of ferns and lycophytes in Brazil (see `?speciaLists`). 

Both datasets of occurrence of *Cyathea* species in Brazil and the specialists of ferns and lycophytes are available  in the `{naturaList}`. 
They can be loaded using the following code:

```{r load_data}

data("cyathea.br") # occurrence dataset
data("speciaLists") # specialists dataset

```

Now, we can use the `classify_occ()` function to perform the classification process for all species at the same time.

```{r classification}

occ.class <- classify_occ(occ = cyathea.br, spec = speciaLists)

```

We can see how many occurrences were classified in each one of the six confidence levels

```{r table_res}

table(occ.class$naturaList_levels)

```

## Downloading environmental data {#env_data}

We will evaluate the effects of cleaning occurrence data based in the classification made with `classify_occ()` in both geographical and environmental space. To this purpose we need data on environmental variables. So, we downloaded bioclim variables from worldclim (using the `{raster}` package) and select two layers (bio1 and bio12) that represents temperature and precipitation. This raster are available as the `r.temp.prec` data in the package. It will be used to calculate the environmental space using the function `define_env_space()` from `{naturaList}` package.

```{r clim_data}

# load climate data
data("r.temp.prec")
df.temp.prec <- raster::as.data.frame(r.temp.prec)
    

# Transform occurrence data in SpatialPointsDataFrame
spdf.occ.cl <- SpatialPoints(occ.class[, c("decimalLongitude", "decimalLatitude")])

```

## Defining environmental space

Now that we have data of temperature and precipitation, we can define the environmental space using the function `define_env_space()` in the `{naturaList}` package.

```{r env_space}

### Define the environmental space for analysis
env.space <- define_env_space(df.temp.prec, buffer.size = 0.05, plot = F)

```

In this function, the environmental variables are standardized by range, which turns the range of each environmental variable from 0 to 1. Then, it is delimited a `buffer.size` around each point in this space and a polygon is draw to link these buffers. You can play with the value in the `buffer.size` argument and set the `plot = TRUE` to visualize the output polygon. The objective with this function is to create a boundary in the environmental space available for the species. 


## Defining geographical space

We also need to define the geographical space for comparison before and after the cleaning process.

```{r geo_space}

# delimit the geographic space
# land area
data('BR')

```

## Comparing before and after cleaning process {#comp_eval}

Here we will use the function `clean_eval()` to evaluate the effect of filtering process in two characteristics of occurrence of species: the species distribution, calculated as being the convex hull polygon occupied by a species; and the environmental niche breadth, calculated as being the convex hull polygon obtained from two environmental variables.

Both spatial distribution and niche space are compared between the data before and after the cleaning process. The ratio of area after cleaning and before cleaning in both spatial distribution and niche space are given for each species as one of the results of `clean_eval()` function.

```{r compare}
# filter by year to be consistent with the environmental data
occ.class.1970 <- 
  occ.class %>% 
  dplyr::filter(year >= 1970)

# cleaning evaluation process
cl.eval_all <- clean_eval(occ.cl = occ.class.1970,
                              env.space = env.space,
                              geo.space = BR,
                              r = r.temp.prec) 

# the amount of area remained after cleaning process 
area_remained <- cl.eval_all$area 

```

The area is a value that range from 0 to 1, in which 0 indicate that none of the area of distribution of the species left after the cleaning process. On the other hand, values of 1 indicate that all the area occupied by a species remains the same after the cleaning process. In this example we cleaned our data to maintain only the occurrence points of species that was identified by a specialist. But it can be altered in the `level.filter` argument (see `?clean_eval`).

If you are working with more than one species, one example of the usage of `clean_eval()` results is to plot in a map the richness containing in each grid before and after the cleaning process. This can be done with the following code:

```{r clean_evaluation}
rich.before.clean <- rasterFromXYZ(cbind(cl.eval_all$site.coords,
                                         cl.eval_all$rich$rich.BC))
rich.after.clean <- rasterFromXYZ(cbind(cl.eval_all$site.coords,
                                         cl.eval_all$rich$rich.AC))


```

This can be plotted in two maps and compare the effects of filtering process in the richness of community

```{r plotRich}

plot(rich.before.clean, main = "Richness before cleaning")
plot(rich.after.clean, main = "Richness after cleaning")

```