# A (space) travel into the Tidyverse 🚀
### Filippo Gambarota
### University of Padova - Psicostat
### 10/11/2020

---

# Contents

* ### What is the **tidyverse** and the **tidy approach**

* ### The main packages and functions

* ### Other **tidy** packages

* ### Some examples

---
class: inverse, middle, center

# What is the Tidyverse?

---

# Tidyverse

.pull-left[
<div class="figure" style="text-align: center">
<img src="img/hadley-wickham.jpg" alt="Hadley Wickham - RStudio Data Scientist" width="2720" />
Hadley Wickham - RStudio Data Scientist
</div>

]
.pull-right[
* The tidyverse is an opinionated collection of R packages designed for data science. All packages **share an underlying design philosophy, grammar, and data structures**

]

---

# The big picture

---

# What is the Tidy approach?

---

# What is the tidy approach?

* ## The best way to format data is the **long format**

* ## Concatenate operations with **pipes**

* ## Focus on a **functional programming approach**

---

# Long-format data

* ### Each row is an **observation** and each column is a **variable**

.pull-left[

```
## # A tibble: 6 x 4
##      id      A      B      C
##   <int>  <dbl>  <dbl>  <dbl>
## 1     1 -1.48  -1.31   0.825
## 2     2 -1.01   1.01   0.921
## 3     3  0.484  0.676 -0.750
## 4     4  1.86  -1.05  -0.700
## 5     5 -0.249 -0.621 -0.467
## 6     6  2.43   0.716 -1.35 
```

]

--
.pull-right[

```
## # A tibble: 6 x 3
##      id cond     cov
##   <int> <chr>  <dbl>
## 1     1 A     -1.48 
## 2     1 B     -1.31 
## 3     1 C      0.825
## 4     2 A     -1.01 
## 5     2 B      1.01 
## 6     2 C      0.921
```

]
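
A minimal sketch of how a wide table like the first one could be reshaped into the long one. The `wide` tibble and its random values are made up for illustration; only the column names follow the slide.

```r
library(tidyr)
library(tibble)

set.seed(1)  # arbitrary seed; the values are hypothetical
wide <- tibble(id = 1:6, A = rnorm(6), B = rnorm(6), C = rnorm(6))

# One row per id/condition pair: column names go to `cond`, cell values to `cov`
long <- pivot_longer(wide, cols = c(A, B, C),
                     names_to = "cond", values_to = "cov")
head(long)
```
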

---

# Concatenate operations with **pipes**

* Pipes are operators from the `magrittr` package that aim to **improve code readability and maintainability** [1]

* There are several different **pipes**, but the most widely used (and useful) is `%>%`

* Pipes are integrated with all **tidyverse** functions and packages


.footnote[[1] [magrittr website](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html)
]

---

# Concatenate operations with **pipes**

* The `%>%` pipe is another way to **call a function on an argument**

* If `.f` is a function and `.x` is an object, then `.f(.x)` is equivalent to `.x %>% .f()`

```r
mean(iris$Sepal.Length)
```

```
## [1] 5.843333
```

```r
iris$Sepal.Length %>% 
  mean()
```

```
## [1] 5.843333
```
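
The equivalence becomes more visible with several steps. A small sketch (the choice of `sqrt()` and `round()` is arbitrary, just to have a chain of calls):

```r
library(magrittr)

x <- iris$Sepal.Length

# Nested calls are read from the inside out
nested <- round(mean(sqrt(x)), 2)

# With the pipe, the same steps are read from left to right;
# the left-hand side becomes the first argument of the next call
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(2)

identical(nested, piped)
```
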

---

# Concatenate operations with **pipes**

The previous example is too simple to show the benefit: the pipe adds nothing there. Let's consider a more complex task:

```r
head(dat)
```

```
## # A tibble: 6 x 6
## id cond1 cond2 value cov1 cov2
## <int> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 A 1 113. -0.973 -1.34
## 2 1 A 2 104. -0.973 -1.34
## 3 1 A 3 109. -0.973 -1.34
## 4 1 B 1 81.9 -0.973 -1.34
## 5 1 B 2 105. -0.973 -1.34
## 6 1 B 3 85.6 -0.973 -1.34
```

1. aggregate the data by a `factor` using `mean()`
2. create new columns from operations on existing columns
3. rename a variable

---
# Without pipes and tidyverse

```r
dat <- aggregate(value ~ id + cond1 + cov1 + cov2, data = dat, FUN = mean) # mean of value by id and cond1
dat$cov1 <- dat$cov1 - mean(dat$cov1) # center cov1
dat$cov2 <- (dat$cov2 - mean(dat$cov2)) / sd(dat$cov2) # standardize cov2 (z-score)
names(dat)[1] <- "subject" # rename id
head(dat)
```

```
##   subject cond1       cov1       cov2     value
## 1       2     A -0.8441294 -2.1237517 100.13083
## 2       2     B -0.8441294 -2.1237517  94.19992
## 3       2     C -0.8441294 -2.1237517 106.81969
## 4       1     A -0.9245966 -0.7455017 108.37841
## 5       1     B -0.9245966 -0.7455017  90.80302
## 6       1     C -0.9245966 -0.7455017 104.49854
```

* This works, but it is **redundant**, **difficult to read**, and requires a **series of assignment operations**

* Some columns are not in the desired order

* To keep the result, you must either create a new object (e.g., `dat_agg`) or overwrite `dat`

---

# With pipes and tidyverse

```r
dat %>% 
  mutate(cov1 = cov1 - mean(cov1),
         cov2 = (cov2 - mean(cov2))/sd(cov2)) %>% 
  rename("subject" = id) %>% 
  group_by(subject, cond1, cov1, cov2) %>% 
  summarise(mean = mean(value),
            sd = sd(value)) %>% 
  ungroup() %>% 
  head()
```

```
## `summarise()` regrouping output by 'subject', 'cond1', 'cov1' (override with `.groups` argument)
```

```
## # A tibble: 6 x 6
## subject cond1 cov1 cov2 mean sd
## <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 A 0.188 -0.356 90.8 3.87
## 2 1 B 0.188 -0.356 101. 19.6 
## 3 1 C 0.188 -0.356 106. 5.32
## 4 2 A -0.301 -0.466 105. 21.1 
## 5 2 B -0.301 -0.466 100. 9.66
## 6 2 C -0.301 -0.466 108. 6.93
```

---

# With pipes and tidyverse

* The `dat` object is not modified

* Operations follow an **easy-to-read workflow**

* To store the result, put `<-` at the beginning of the pipeline, as in `dat <- dat %>% ...`
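
A minimal sketch of storing the result of a pipeline in a new object. The `dat` built here is a hypothetical stand-in with the same column names as the slides, and `dat_agg` is just an illustrative name. The `.groups = "drop"` argument (dplyr >= 1.0.0) returns an ungrouped result, so the regrouping message and the final `ungroup()` are not needed.

```r
library(dplyr)

set.seed(2020)  # arbitrary seed; the values are made up
# Hypothetical data: 4 subjects x 3 levels of cond1 x 3 levels of cond2
dat <- expand.grid(id = 1:4,
                   cond1 = c("A", "B", "C"),
                   cond2 = c("1", "2", "3"),
                   stringsAsFactors = FALSE)
dat$value <- rnorm(nrow(dat), mean = 100, sd = 10)

# The pipeline result is stored in a new object; `dat` is left untouched
dat_agg <- dat %>%
  rename(subject = id) %>%
  group_by(subject, cond1) %>%
  summarise(mean = mean(value),
            sd = sd(value),
            .groups = "drop")

head(dat_agg)
```
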

---

# Functional Programming

Leaving technical details aside, the idea of functional programming can be illustrated by comparing a `for` loop with a function from the `apply` family [2]

.footnote[[2] From the Hadley Wickham talk - [Managing many models with R](https://www.youtube.com/watch?v=rz3_FDVt9eg)
]

```r
means <- vector("double", ncol(mtcars))
medians <- vector("double", ncol(mtcars))

for(i in seq_along(mtcars)) {
 means[[i]] <- mean(mtcars[[i]], na.rm = TRUE)
 medians[[i]] <- median(mtcars[[i]], na.rm = TRUE)
}
```

```r
# lapply() returns a list with one element per column
means <- lapply(mtcars, mean, na.rm = TRUE)
medians <- lapply(mtcars, median, na.rm = TRUE)
```
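
Note that `lapply()` returns a list, while the loop fills a numeric vector. As a sketch, base R's `vapply()` gives the same numeric vector as the loop and checks the type of each result (the `_loop` and `_vapply` names are only for illustration):

```r
# Loop version: pre-allocate, then fill element by element
means_loop <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  means_loop[[i]] <- mean(mtcars[[i]], na.rm = TRUE)
}

# Functional version: FUN.VALUE declares that each result is a single number
means_vapply <- vapply(mtcars, mean, FUN.VALUE = numeric(1), na.rm = TRUE)

# Same values; vapply() additionally keeps the column names
all.equal(means_loop, unname(means_vapply))
```
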

---

# Functional Programming with purrr::

<img src="img/furrr.svg" width="200" style="display: block; margin: auto;" />

* `purrr` is a package that provides a series of **apply-like** functions for performing complex operations concisely

* `furrr` offers the same interface as `purrr`, built on the `future` package, so that operations can be run in parallel

```r
means <- map_dbl(mtcars, mean)
means[1:4]
```

```
##       mpg       cyl      disp        hp 
##  20.09062   6.18750 230.72188 146.68750
```

```r
# sapply(mtcars, function(x) mean(x))
```
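
A short sketch of a few more `purrr` idioms: the typed `map_*()` variants, the `~ .x` formula shorthand for anonymous functions, and mapping over a list of data frames. The use of `mtcars` split by `cyl` is only an illustrative choice.

```r
library(purrr)

# Typed variant: always returns a double vector (or fails)
medians <- map_dbl(mtcars, median)

# Formula shorthand for an anonymous function; .x is the current element
ranges <- map_dbl(mtcars, ~ max(.x) - min(.x))

# Fit one linear model per group of cylinders, then extract the slopes
fits <- mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x))

slopes <- map_dbl(fits, ~ coef(.x)[["wt"]])
slopes
```
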

---

# Main packages and functions

---

# Main packages and functions

---

# Tidyr

* Manipulate datasets to have **tidy** data

* Main functions are `pivot_longer()`, `pivot_wider()`, and `separate()`

* Other useful functions include `drop_na()` and `nest()`
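
A minimal sketch of the two helper functions on a made-up tibble `d` (the data and object names are hypothetical):

```r
library(tidyr)
library(tibble)

d <- tibble(id = c(1, 1, 2, 2),
            value = c(10, NA, 30, 40))

# drop_na() removes the rows that contain missing values
complete_rows <- drop_na(d)

# nest() packs the `value` column into a list-column, one row per id
nested <- nest(d, data = value)
nested
```
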


---

## Pivoting datasets

### From long to wide dataset

```r
dat %>% 
  head()
```

```
## # A tibble: 6 x 6
## id cond1 cond2 value cov1 cov2
## <int> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 A 1 89.2 0.607 -0.507
## 2 1 A 2 95.2 0.607 -0.507
## 3 1 A 3 87.9 0.607 -0.507
## 4 1 B 1 113. 0.607 -0.507
## 5 1 B 2 78.7 0.607 -0.507
## 6 1 B 3 112. 0.607 -0.507
```

---

# Tidyr::pivot_wider()

From long to wide dataset

```r
dat %>% 
  pivot_wider(names_from = c(cond1, cond2), values_from = value) %>%
  head()
```

```
## # A tibble: 6 x 12
## id cov1 cov2 A_1 A_2 A_3 B_1 B_2 B_3 C_1 C_2 C_3
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.607 -0.507 89.2 95.2 87.9 113. 78.7 112. 109. 109. 99.7
## 2 2 0.117 -0.612 121. 81.4 113. 110. 90.7 99.4 101. 107. 115. 
## 3 3 -0.410 -0.568 99.6 113. 117. 102. 92.1 101. 102. 116. 106. 
## 4 4 1.11 -0.341 104. 107. 103. 89.1 83.0 90.9 100. 91.9 79.4
## 5 5 0.304 1.67 114. 82.4 93.5 116. 135. 93.7 109. 91.2 103. 
## 6 6 1.47 0.383 86.8 92.2 104. 90.6 118. 115. 102. 103. 115.
```

---

# Tidyr::pivot_longer()

From wide to long dataset

```r
dat %>% 
  pivot_wider(names_from = c(cond1, cond2), values_from = value) %>% 
  pivot_longer(4:12, names_to = "cond", values_to = "value") %>% 
  head()
```

```
## # A tibble: 6 x 5
## id cov1 cov2 cond value
## <int> <dbl> <dbl> <chr> <dbl>
## 1 1 0.607 -0.507 A_1 89.2
## 2 1 0.607 -0.507 A_2 95.2
## 3 1 0.607 -0.507 A_3 87.9
## 4 1 0.607 -0.507 B_1 113. 
## 5 1 0.607 -0.507 B_2 78.7
## 6 1 0.607 -0.507 B_3 112.
```

---

# Tidyr::separate()

Split one column into multiple columns according to a separator pattern

```r
dat %>% 
  pivot_wider(names_from = c(cond1, cond2), values_from = value) %>% 
  pivot_longer(4:12, names_to = "cond", values_to = "value") %>% 
  separate(cond, into = c("cond1", "cond2"), sep = "_") %>% 
  head()
```

```
## # A tibble: 6 x 6
## id cov1 cov2 cond1 cond2 value
## <int> <dbl> <dbl> <chr> <chr> <dbl>
## 1 1 0.607 -0.507 A 1 89.2
## 2 1 0.607 -0.507 A 2 95.2
## 3 1 0.607 -0.507 A 3 87.9
## 4 1 0.607 -0.507 B 1 113. 
## 5 1 0.607 -0.507 B 2 78.7
## 6 1 0.607 -0.507 B 3 112.
```
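
As a possible alternative, `pivot_longer()` can split the column names while reshaping, through `names_to` with several names plus `names_sep`, which replaces the separate `separate()` step. The sketch below uses a small made-up `wide` tibble and selects the columns by exclusion rather than by position:

```r
library(tidyr)
library(tibble)

# Hypothetical wide data with the same naming scheme (cond1_cond2)
wide <- tibble(id = 1:2,
               cov1 = c(0.6, 0.1),
               A_1 = c(89, 121),
               A_2 = c(95, 81),
               B_1 = c(113, 110))

long <- wide %>%
  pivot_longer(cols = -c(id, cov1),           # every column except id and cov1
               names_to = c("cond1", "cond2"), # two columns from each name
               names_sep = "_",
               values_to = "value")
long
```
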

---

# Dplyr

* This is a very comprehensive package with several functions to create new columns, aggregate datasets, select rows, etc.

* `mutate()` adds new variables that are functions of existing variables.
* `select()` picks variables based on their names.
* `filter()` picks cases based on their values.
* `summarise()` reduces multiple values down to a single summary.
* `arrange()` changes the ordering of the rows.
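
The five verbs can be chained in a single pipeline. A sketch on the built-in `mtcars` dataset (chosen only because it is always available; the new column names are illustrative):

```r
library(dplyr)

res <- mtcars %>%
  as_tibble(rownames = "model") %>%         # keep the car names as a column
  filter(cyl %in% c(4, 6)) %>%              # pick rows by value
  select(model, cyl, mpg, wt) %>%           # pick columns by name
  mutate(wt_kg = wt * 453.592) %>%          # wt is in 1000 lbs; convert to kg
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg),           # one row per group
            mean_wt_kg = mean(wt_kg),
            n = n(),
            .groups = "drop") %>%
  arrange(desc(mean_mpg))                   # order rows by a summary

res
```
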


---

# Dplyr::mutate()

Create new columns with complex operations and functions

```r
dat %>% 
  mutate(new_col = some_functions(other_col))
```

```r
dat %>% 
  mutate(new_col = cov1 * cov2) %>% 
  head()
```

```
## # A tibble: 6 x 7
## id cond1 cond2 value cov1 cov2 new_col
## <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 A 1 89.2 0.607 -0.507 -0.308
## 2 1 A 2 95.2 0.607 -0.507 -0.308
## 3 1 A 3 87.9 0.607 -0.507 -0.308
## 4 1 B 1 113. 0.607 -0.507 -0.308
## 5 1 B 2 78.7 0.607 -0.507 -0.308
## 6 1 B 3 112. 0.607 -0.507 -0.308
```

---

# Dplyr::select()

Select columns in a more readable way

```r
dat %>% 
  pivot_wider(names_from = c(cond1, cond2), values_from = value) %>% 
  select(id, starts_with("A"), ends_with("3")) %>% 
  head()
```

```
## # A tibble: 6 x 6
## id A_1 A_2 A_3 B_3 C_3
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 89.2 95.2 87.9 112. 99.7
## 2 2 121. 81.4 113. 99.4 115. 
## 3 3 99.6 113. 117. 101. 106. 
## 4 4 104. 107. 103. 90.9 79.4
## 5 5 114. 82.4 93.5 93.7 103. 
## 6 6 86.8 92.2 104. 115. 115.
```

---

# Dplyr::filter()

Select rows with multiple conditions

```r
dat %>% 
  filter(cond1 == "A" & cond2 == "2") %>% # cond2 is a character column
  head()
```

```
## # A tibble: 6 x 6
## id cond1 cond2 value cov1 cov2
## <int> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 A 2 95.2 0.607 -0.507
## 2 2 A 2 81.4 0.117 -0.612
## 3 3 A 2 113. -0.410 -0.568
## 4 4 A 2 107. 1.11 -0.341
## 5 5 A 2 82.4 0.304 1.67 
## 6 6 A 2 92.2 1.47 0.383
```

---

# Dplyr::arrange()

Reorder rows based on one or more columns

```r
dat %>% 
  group_by(cond1, cond2) %>% 
  summarise(mean = mean(value),
            sd = sd(value)) %>% 
  arrange(cond1) %>% 
  head()
```

```
## `summarise()` regrouping output by 'cond1' (override with `.groups` argument)
```

```
## # A tibble: 6 x 4
## # Groups: cond1 [2]
## cond1 cond2 mean sd
## <chr> <chr> <dbl> <dbl>
## 1 A 1 99.3 11.5 
## 2 A 2 96.9 10.5 
## 3 A 3 104. 9.50
## 4 B 1 104. 9.47
## 5 B 2 96.9 17.1 
## 6 B 3 104. 14.4
```

---

# Dplyr::case_when()

A vectorised generalisation of nested `ifelse()` calls, written in a more compact way

```r
dat %>% 
  mutate(new_fac = case_when(
    value > 100 & cond1 == "A" ~ "level1",
    value < 50 & cond2 == "2" ~ "level2",
    value != 100 & cond2 == "1" & cov1 > 0.5 ~ "level3",
    TRUE ~ "level4" # fallback when no condition above is met
  )) %>% 
  head()
```

```
## # A tibble: 6 x 7
## id cond1 cond2 value cov1 cov2 new_fac
## <int> <chr> <chr> <dbl> <dbl> <dbl> <chr> 
## 1 1 A 1 89.2 0.607 -0.507 level3 
## 2 1 A 2 95.2 0.607 -0.507 level4 
## 3 1 A 3 87.9 0.607 -0.507 level4 
## 4 1 B 1 113. 0.607 -0.507 level3 
## 5 1 B 2 78.7 0.607 -0.507 level4 
## 6 1 B 3 112. 0.607 -0.507 level4
```
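
Conditions are evaluated in order and the first one that is `TRUE` wins, so the order matters. A small sketch on a made-up vector `x`, including how a missing value falls through to the final `TRUE ~` branch:

```r
library(dplyr)

x <- c(5, 50, 150, NA)

size <- case_when(
  x > 100   ~ "large",
  x > 10    ~ "medium",  # only reached when x <= 100
  !is.na(x) ~ "small",
  TRUE      ~ "missing"  # NA comparisons are never TRUE, so NA lands here
)

size
```
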

---

# GGplot2

* Easy to integrate with workflow and pipelines

* Layers are added with `+` rather than `%>%` [3]

* Amazing way to combine different layers of plot components


.footnote[[3] Hadley Wickham [talk](https://www.youtube.com/watch?v=vYwXMnC03I4&t=2203s) about "mistakes" in developing the tidyverse
]

---

# GGplot2

```r
dat %>% 
  select(cov1, cov2) %>% 
  ggplot(aes(x = cov1, y = cov2)) +
  geom_point()
```
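
A sketch of how several layers are combined with `+`. The built-in `mtcars` dataset is used so that the example runs on its own; the specific layers and labels are illustrative choices.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                                             # layer 1: raw data
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # layer 2: linear fit
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon") +
  theme_minimal()

p
```
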

---

# Broom

* Manipulate **fitted models** and return **tidy data**

* Very useful for extracting information from multiple models in an easy way

* Extensions include `broom.mixed` for `lme4` objects and `broomExtra`, which also covers `brms` models


---

# Broom::tidy() and Broom::glance()

```r
fit <- lm(value ~ cond1 * cond2 + cov1 + cov2, data = dat)
tidy(fit) %>% head()
```

```
## # A tibble: 6 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 100. 3.68 27.3 6.03e-42
## 2 cond1B 4.60 5.10 0.902 3.70e- 1
## 3 cond1C 2.31 5.10 0.453 6.52e- 1
## 4 cond22 -2.39 5.10 -0.469 6.41e- 1
## 5 cond23 4.73 5.10 0.927 3.57e- 1
## 6 cov1 -1.93 1.71 -1.13 2.62e- 1
```

```r
glance(fit)
```

```
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.101 -0.0132 11.4 0.884 0.552 10 -341. 706. 736. 10284.
## # … with 2 more variables: df.residual <int>, nobs <int>
```
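
Because `tidy()` returns a tibble, the results of several models can be stacked into a single table. A sketch with one model per number of cylinders in `mtcars` (the dataset and the model formula are illustrative choices, not the ones from the slides):

```r
library(dplyr)
library(purrr)
library(broom)

coefs <- mtcars %>%
  split(.$cyl) %>%                     # list of data frames, one per cyl
  map(~ lm(mpg ~ wt, data = .x)) %>%   # one model per group
  map(tidy) %>%                        # one coefficient table per model
  bind_rows(.id = "cyl")               # stack them, keeping the group label

coefs
```
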

---

# Other packages

### String manipulation

<img src="img/stringr.svg" width="250" style="display: block; margin: auto;" />

### Work with dates

<img src="img/lubridate.svg" width="250" style="display: block; margin: auto;" />
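
A tiny sketch of what the two packages look like in practice. The `cond` vector mimics the column names used earlier, and the date is the one of this presentation, parsed as day/month/year:

```r
library(stringr)
library(lubridate)

cond <- c("A_1", "B_2", "C_3")

starts_with_a <- str_detect(cond, "^A")      # regular expression match
with_dash     <- str_replace(cond, "_", "-") # replace the first match

d <- dmy("10/11/2020")  # parse a day/month/year string into a Date
month(d)
d + days(30)            # date arithmetic
```
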

---

# Other packages

### data.table power, dplyr code

<img src="img/dtplyr.svg" width="500" style="display: block; margin: auto;" />
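
A minimal sketch of the idea, assuming the `dtplyr` and `data.table` packages are installed: `lazy_dt()` wraps the data, the usual `dplyr` verbs are translated to `data.table` code, and `as_tibble()` triggers the computation.

```r
library(dplyr)
library(dtplyr)

res <- mtcars %>%
  lazy_dt() %>%                        # nothing is computed yet
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  as_tibble()                          # run the translated data.table code

res
```
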

### Modelling (e.g., broom)

<img src="img/tidymodels.svg" width="250" style="display: block; margin: auto;" />

---

# Other packages

### Tidybayes

<img src="img/tidybayes.png" width="400" style="display: block; margin: auto;" />

* Some **geoms** for `ggplot2` and functions to manage fitted **Bayesian models**

* Supports `rstanarm` and `brms`


---

# Some Resources

## R4DS - R for Data Science

* Best book for the **tidy** approach and for **data science** in general

* Especially the [many models](https://r4ds.had.co.nz/many-models.html) chapter


---

# Talks

* [David Robinson - Ten Tremendous Tricks in the Tidyverse](https://www.youtube.com/watch?v=NDHSBUN_rVU)

* [David Robinson - Teach the Tidyverse to Beginners](https://www.youtube.com/watch?v=dT5A0sAWc2I)

* [Hadley Wickham - Managing many models with R](https://www.youtube.com/watch?v=rz3_FDVt9eg&t)

* [Hadley Wickham - Mistakes of the Tidyverse](https://www.youtube.com/watch?v=vYwXMnC03I4&t)

* [Hadley Wickham - Data visualization and data science](https://www.youtube.com/watch?v=9YTNYT1maa4&t)

* [Emily Robinson - The lesser known stars of the Tidyverse](https://www.youtube.com/watch?v=ax4LXQ5t38k)

---

### .large[[filippo.gambarota@phd.unipd.it](mailto:filippo.gambarota@phd.unipd.it)]

[Download PDF slides](tidyverse_presentation.pdf)

.tiny[Slides made with the [Xaringan](https://github.com/yihui/xaringan) package by [Yihui Xie](https://yihui.name/)]