Data Manipulation and Relational Joins

---

layout: true
  
<div class="my-footer">
<span>
<a href="https://places.education" target="_blank">https://places.education</a>
</span>
</div>

---

## Recap

---

## Arithmetic operators in R

Operator | Description
---------|------------
x + y | Addition of x and y
x - y | Subtraction of y from x
x * y | Multiplication of x and y
x / y | Division of x by y
x^y or x**y | x raised to the power y
x%%y | Remainder of the division of x by y (modulo)
x%/%y | Integer division of x by y (quotient rounded down)
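
A minimal sketch of the operators in the table, using illustrative values for `x` and `y` that are not part of the original slides. The last two lines show that `%/%` rounds toward negative infinity, so the quotient and remainder always satisfy `x == (x %/% y) * y + x %% y`.

```r
# Illustrative values (an assumption for this example)
x <- 17
y <- 5

x + y    # 22
x - y    # 12
x * y    # 85
x / y    # 3.4
x^y      # 1419857
x %% y   # 2  (remainder)
x %/% y  # 3  (integer quotient)

# With a negative dividend the quotient is rounded down, not truncated
neg_quotient  <- (-17) %/% 5  # -4
neg_remainder <- (-17) %% 5   # 3
```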

---

## Comparison operators in R

Operator | Meaning
---------|--------
== | equal to
!= | not equal to
\> | greater than
< | less than
\>= | greater than or equal to
<= | less than or equal to

> Comparison operators return a logical value, TRUE or FALSE (or NA when one of the operands is missing).

---

## Logical operators in R

.small[
Operator|Description|Explanation
--------|-----------|-----------
& | Logical AND | Vectorized version. Compares two vectors element by element and returns a vector of TRUE and FALSE values
&#124;| Logical OR | Vectorized version. Compares two vectors element by element and returns a vector of TRUE and FALSE values
! | Logical NOT | Logical negation. Returns a single logical value or a vector of TRUE / FALSE values.
]

> Also known as Boolean operators, they combine several relational conditions in a single expression and return logical values (TRUE or FALSE).
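
As a small sketch, the comparison and logical operators can be combined element by element. The vector `ages` and its values are illustrative assumptions, not data from the slides.

```r
# Illustrative vector (an assumption for this example)
ages <- c(15, 22, 37, 64, 81)

adult       <- ages >= 18              # FALSE TRUE TRUE TRUE TRUE
working_age <- ages >= 18 & ages < 65  # FALSE TRUE TRUE TRUE FALSE
dependent   <- ages < 18 | ages >= 65  # TRUE FALSE FALSE FALSE TRUE
minor       <- !adult                  # TRUE FALSE FALSE FALSE FALSE

# A logical vector can be used to subset the original vector
ages[working_age]  # 22 37 64
```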

---

## Some statistical functions for summarising data

.pull-left[
Function | Description
---------|------------
`min()` | minimum
`max()` | maximum
`range()` | minimum and maximum
`mean()` | mean
`sum()` | sum
`median()` | median
`sd()` | standard deviation
`IQR()` | interquartile range
]

.pull-right[
Function | Description
---------|------------
`quantile()` | quantiles (quartiles by default)
`var()` | variance
`cor()` | correlation
`summary()` | summary statistics
`rowMeans()` | row means
`colMeans()` | column means
`rowSums()` | row sums
`colSums()` | column sums
]
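
A quick sketch of a few of these functions in base R. The vector `scores` and the matrix `m` are made-up values chosen so the results are easy to check by hand.

```r
# Illustrative data (an assumption for this example)
scores <- c(2, 4, 4, 5, 7, 9, 11)

range(scores)   # 2 11 (minimum and maximum)
mean(scores)    # 6
median(scores)  # 5
var(scores)     # 10
sd(scores)      # sqrt(10), about 3.16
IQR(scores)     # 4 (third quartile 8 minus first quartile 4)

# Row and column summaries work on matrices and numeric data frames
m <- matrix(1:6, nrow = 2)  # rows: (1, 3, 5) and (2, 4, 6)
rowSums(m)   # 9 12
colMeans(m)  # 1.5 3.5 5.5
```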
---

## Handling missing data

R can store the value `NA` (Not Available) in vectors and data frames to represent data that are not yet known.

> `x == NA` always returns `NA`, never `TRUE`, even when `x` itself is missing. Use `is.na(x)` to test for missing values.
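
A minimal sketch of how `NA` behaves in base R, using an illustrative vector `x`. Comparisons with `NA` propagate the missing value, while `is.na()` and the `na.rm` argument handle it explicitly.

```r
# Illustrative vector with one missing value (an assumption for this example)
x <- c(4, NA, 7)

x == NA        # NA NA NA: the comparison itself is unknown
is.na(x)       # FALSE TRUE FALSE
sum(is.na(x))  # 1 missing value

mean(x)                                # NA: the missing value propagates
mean_without_na <- mean(x, na.rm = TRUE)  # 5.5
```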

---

# .hand[We...]

---

## Data: Women in science

Information on 10 women in science who changed the world

.small[
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Ada Lovelace </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Marie Curie </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Janaki Ammal </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Chien-Shiung Wu </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Katherine Johnson </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Rosalind Franklin </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Vera Rubin </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Gladys West </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Flossie Wong-Staal </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Jennifer Doudna </td>
  </tr>
</tbody>
</table>
]

.footnote[
Source: [Discover Magazine](https://www.discovermagazine.com/the-sciences/meet-10-women-in-science-who-changed-the-world)
]

---

## Inputs

### Professions

```r
professions
```

```
## # A tibble: 10 × 2
##   name              profession           
##   <chr>             <chr>                
## 1 Ada Lovelace      Mathematician        
## 2 Marie Curie       Physicist and Chemist
## 3 Janaki Ammal      Botanist             
## 4 Chien-Shiung Wu   Physicist            
## 5 Katherine Johnson Mathematician        
## 6 Rosalind Franklin Chemist              
## # ℹ 4 more rows
```

---

## Inputs
### Dates

```r
dates
```

```
## # A tibble: 8 × 3
##   name              birth_year death_year
##   <chr>                  <dbl>      <dbl>
## 1 Janaki Ammal            1897       1984
## 2 Chien-Shiung Wu         1912       1997
## 3 Katherine Johnson       1918       2020
## 4 Rosalind Franklin       1920       1958
## 5 Vera Rubin              1928       2016
## 6 Gladys West             1930         NA
## # ℹ 2 more rows
```

---

## Inputs
### Works

```r
works
```

```
## # A tibble: 9 × 2
##   name              known_for                                    
##   <chr>             <chr>                                        
## 1 Ada Lovelace      first computer algorithm                     
## 2 Marie Curie       theory of radioactivity,  discovery of eleme…
## 3 Janaki Ammal      hybrid species, biodiversity protection      
## 4 Chien-Shiung Wu   confim and refine theory of radioactive beta…
## 5 Katherine Johnson calculations of orbital mechanics critical t…
## 6 Vera Rubin        existence of dark matter                     
## # ℹ 3 more rows
```

---

## Desired output

```
## # A tibble: 10 × 5
##   name              profession    birth_year death_year known_for
##   <chr>             <chr>              <dbl>      <dbl> <chr>    
## 1 Ada Lovelace      Mathematician         NA         NA first co…
## 2 Marie Curie       Physicist an…         NA         NA theory o…
## 3 Janaki Ammal      Botanist            1897       1984 hybrid s…
## 4 Chien-Shiung Wu   Physicist           1912       1997 confim a…
## 5 Katherine Johnson Mathematician       1918       2020 calculat…
## 6 Rosalind Franklin Chemist             1920       1958 <NA>     
## # ℹ 4 more rows
```

---

## Input data frames

.pull-left[

```r
names(professions)
```

```
## [1] "name"       "profession"
```

```r
names(dates)
```

```
## [1] "name"       "birth_year" "death_year"
```

```r
names(works)
```

```
## [1] "name"      "known_for"
```
]
.pull-right[

```r
nrow(professions)
```

```
## [1] 10
```

```r
nrow(dates)
```

```
## [1] 8
```

```r
nrow(works)
```

```
## [1] 9
```
]

---

# Joining data frames

---

## Joining data frames

```r
something_join(x, y)
```

- `left_join()`: all rows from x
- `right_join()`: all rows from y
- `full_join()`: all rows from both x and y
- `semi_join()`: all rows from x where there are matching values in y, keeping only the columns from x
- `inner_join()`: all rows from x where there are matching values in y, returning every combination of rows when there are multiple matches
- `anti_join()`: all rows from x where there are no matching values in y, never duplicating rows of x
- ...
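
To make the difference between the matching joins concrete, here is a small sketch with a key that appears twice in `y`. The table `y_dup` and its values are hypothetical, and `by = "id"` is spelled out so the join key is explicit.

```r
library(dplyr)

x <- tibble(id = c(1, 2, 3), value_x = c("x1", "x2", "x3"))
# Hypothetical table in which id 1 appears twice
y_dup <- tibble(id = c(1, 1, 4), value_y = c("y1a", "y1b", "y4"))

# inner_join() returns one row per matching combination: id 1 appears twice
inner <- inner_join(x, y_dup, by = "id")

# semi_join() only filters x: one row for id 1, and only the columns of x
semi <- semi_join(x, y_dup, by = "id")

# anti_join() keeps the rows of x without a match: ids 2 and 3
anti <- anti_join(x, y_dup, by = "id")
```

`semi_join()` and `anti_join()` are filtering joins: they never add columns or duplicate rows of `x`.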
 
---

## Data

For the next few slides...

.pull-left[

```r
x
```

```
## # A tibble: 3 × 2
##      id value_x
##   <dbl> <chr>  
## 1     1 x1     
## 2     2 x2     
## 3     3 x3
```
]
.pull-right[

```r
y
```

```
## # A tibble: 3 × 2
##      id value_y
##   <dbl> <chr>  
## 1     1 y1     
## 2     2 y2     
## 3     4 y4
```
]

---

## `left_join()`

.pull-left[
<img src="img/left-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]
.pull-right[

```r
left_join(x, y)
```

```
## # A tibble: 3 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2     
## 3     3 x3      <NA>
```
]

---

## `left_join()`

```r
professions %>%
* left_join(dates)
```

```
## # A tibble: 10 × 4
##   name              profession            birth_year death_year
##   <chr>             <chr>                      <dbl>      <dbl>
## 1 Ada Lovelace      Mathematician                 NA         NA
## 2 Marie Curie       Physicist and Chemist         NA         NA
## 3 Janaki Ammal      Botanist                    1897       1984
## 4 Chien-Shiung Wu   Physicist                   1912       1997
## 5 Katherine Johnson Mathematician               1918       2020
## 6 Rosalind Franklin Chemist                     1920       1958
## # ℹ 4 more rows
```

---

## `right_join()`

.pull-left[
<img src="img/right-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]
.pull-right[

```r
right_join(x, y)
```

```
## # A tibble: 3 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2     
## 3     4 <NA>    y4
```
]

---

## `right_join()`

```r
professions %>%
* right_join(dates)
```

```
## # A tibble: 8 × 4
##   name              profession    birth_year death_year
##   <chr>             <chr>              <dbl>      <dbl>
## 1 Janaki Ammal      Botanist            1897       1984
## 2 Chien-Shiung Wu   Physicist           1912       1997
## 3 Katherine Johnson Mathematician       1918       2020
## 4 Rosalind Franklin Chemist             1920       1958
## 5 Vera Rubin        Astronomer          1928       2016
## 6 Gladys West       Mathematician       1930         NA
## # ℹ 2 more rows
```

---

## `full_join()`

.pull-left[
<img src="img/full-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]
.pull-right[

```r
full_join(x, y)
```

```
## # A tibble: 4 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2     
## 3     3 x3      <NA>   
## 4     4 <NA>    y4
```
]

---

## `full_join()`

```r
dates %>%
* full_join(works)
```

```
## # A tibble: 10 × 4
##   name              birth_year death_year known_for              
##   <chr>                  <dbl>      <dbl> <chr>                  
## 1 Janaki Ammal            1897       1984 hybrid species, biodiv…
## 2 Chien-Shiung Wu         1912       1997 confim and refine theo…
## 3 Katherine Johnson       1918       2020 calculations of orbita…
## 4 Rosalind Franklin       1920       1958 <NA>                   
## 5 Vera Rubin              1928       2016 existence of dark matt…
## 6 Gladys West             1930         NA mathematical modeling …
## # ℹ 4 more rows
```

---

## `inner_join()`

.pull-left[
<img src="img/inner-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]
.pull-right[

```r
inner_join(x, y)
```

```
## # A tibble: 2 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2
```
]

---

## `inner_join()`

```r
dates %>%
* inner_join(works)
```

```
## # A tibble: 7 × 4
##   name               birth_year death_year known_for             
##   <chr>                   <dbl>      <dbl> <chr>                 
## 1 Janaki Ammal             1897       1984 hybrid species, biodi…
## 2 Chien-Shiung Wu          1912       1997 confim and refine the…
## 3 Katherine Johnson        1918       2020 calculations of orbit…
## 4 Vera Rubin               1928       2016 existence of dark mat…
## 5 Gladys West              1930         NA mathematical modeling…
## 6 Flossie Wong-Staal       1947         NA first scientist to cl…
## # ℹ 1 more row
```

---

## `semi_join()`

.pull-left[
<img src="img/semi-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]
.pull-right[

```r
semi_join(x, y)
```

```
## # A tibble: 2 × 2
##      id value_x
##   <dbl> <chr>  
## 1     1 x1     
## 2     2 x2
```
]

---

## `semi_join()`

```r
dates %>%
* semi_join(works)
```

```
## # A tibble: 7 × 3
##   name               birth_year death_year
##   <chr>                   <dbl>      <dbl>
## 1 Janaki Ammal             1897       1984
## 2 Chien-Shiung Wu          1912       1997
## 3 Katherine Johnson        1918       2020
## 4 Vera Rubin               1928       2016
## 5 Gladys West              1930         NA
## 6 Flossie Wong-Staal       1947         NA
## # ℹ 1 more row
```

---

## `anti_join()`

.pull-left[
<img src="img/anti-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]
.pull-right[

```r
anti_join(x, y)
```

```
## # A tibble: 1 × 2
##      id value_x
##   <dbl> <chr>  
## 1     3 x3
```
]

---

## `anti_join()`

```r
dates %>%
* anti_join(works)
```

```
## # A tibble: 1 × 3
##   name              birth_year death_year
##   <chr>                  <dbl>      <dbl>
## 1 Rosalind Franklin       1920       1958
```

---

## Joining the data

```r
professions %>%
  left_join(dates) %>%
  left_join(works)
```
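
The chained join can be tried on a reduced version of the inputs. The sketch below rebuilds only three of the rows shown on the earlier slides (it is not the full data set) and names the key with `by = "name"`, which is an addition to the call above.

```r
library(dplyr)

# Subset of the rows printed on the input slides
professions <- tibble(
  name       = c("Ada Lovelace", "Janaki Ammal", "Rosalind Franklin"),
  profession = c("Mathematician", "Botanist", "Chemist")
)
dates <- tibble(
  name       = c("Janaki Ammal", "Rosalind Franklin"),
  birth_year = c(1897, 1920),
  death_year = c(1984, 1958)
)
works <- tibble(
  name      = c("Ada Lovelace", "Janaki Ammal"),
  known_for = c("first computer algorithm",
                "hybrid species, biodiversity protection")
)

# Every row of professions is kept; missing dates or works become NA
scientists <- professions %>%
  left_join(dates, by = "name") %>%
  left_join(works, by = "name")
```

Ada Lovelace has no entry in `dates` and Rosalind Franklin has none in `works`, so those cells are `NA`, as in the desired output.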

---

# Case study: Student records

---

## Student records

- Have:
  - Enrolment: official university enrolment records
  - Survey: data collected from students about the courses they are taking
- Want: survey results that cover every enrolled student

.pull-left[

```r
enrolment
```

```
## # A tibble: 3 × 2
##      id name           
##   <dbl> <chr>          
## 1     1 Dave Friday    
## 2     2 Hermine        
## 3     3 Sura Selvarajah
```
]
.pull-right[

```r
survey
```

```
## # A tibble: 4 × 3
##      id name    username            
##   <dbl> <chr>   <chr>               
## 1     2 Hermine bakealongwithhermine
## 2     3 Sura    surasbakes          
## 3     4 Peter   peter_bakes         
## 4     5 Mark    thebakingbuddha
```
]

---

## Student records

### In class

```r
enrolment %>% 
* left_join(survey, by = "id")
```

```
## # A tibble: 3 × 4
##      id name.x          name.y  username            
##   <dbl> <chr>           <chr>   <chr>               
## 1     1 Dave Friday     <NA>    <NA>                
## 2     2 Hermine         Hermine bakealongwithhermine
## 3     3 Sura Selvarajah Sura    surasbakes
```
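
Both tables have a `name` column that is not part of the key, so the join disambiguates them as `name.x` and `name.y`. As a sketch, the `suffix` argument of `left_join()` can give those columns more descriptive names. The suffix strings below are arbitrary choices for illustration.

```r
library(dplyr)

# Data copied from the slides
enrolment <- tibble(
  id   = c(1, 2, 3),
  name = c("Dave Friday", "Hermine", "Sura Selvarajah")
)
survey <- tibble(
  id       = c(2, 3, 4, 5),
  name     = c("Hermine", "Sura", "Peter", "Mark"),
  username = c("bakealongwithhermine", "surasbakes",
               "peter_bakes", "thebakingbuddha")
)

# Label the clashing name columns by their source table
in_class <- enrolment %>%
  left_join(survey, by = "id", suffix = c("_enrolment", "_survey"))

names(in_class)  # "id" "name_enrolment" "name_survey" "username"
```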

---

## Student records

### Enrolled but missing from the survey

```r
enrolment %>% 
* anti_join(survey, by = "id")
```

```
## # A tibble: 1 × 2
##      id name       
##   <dbl> <chr>      
## 1     1 Dave Friday
```

---

## Student records

### Dropped the course

```r
survey %>% 
* anti_join(enrolment, by = "id")
```

```
## # A tibble: 2 × 3
##      id name  username       
##   <dbl> <chr> <chr>          
## 1     4 Peter peter_bakes    
## 2     5 Mark  thebakingbuddha
```

---

# Case study: Grocery sales

---

## Grocery sales

- Have:
  - Purchases: one row per customer per item
  - Prices: one row per item in the store
- Want: total revenue

.pull-left[

```r
purchases
```

```
## # A tibble: 5 × 2
##   customer_id item        
##         <dbl> <chr>       
## 1           1 bread       
## 2           1 milk        
## 3           1 banana      
## 4           2 milk        
## 5           2 toilet paper
```
]
.pull-right[

```r
prices
```

```
## # A tibble: 5 × 2
##   item         price
##   <chr>        <dbl>
## 1 avocado       0.5 
## 2 banana        0.15
## 3 bread         1   
## 4 milk          0.8 
## 5 toilet paper  3
```
]

---

## Grocery sales - total revenue

.pull-left[

```r
purchases %>% 
* left_join(prices)
```

```
## # A tibble: 5 × 3
##   customer_id item         price
##         <dbl> <chr>        <dbl>
## 1           1 bread         1   
## 2           1 milk          0.8 
## 3           1 banana        0.15
## 4           2 milk          0.8 
## 5           2 toilet paper  3
```
]
.pull-right[

```r
purchases %>% 
  left_join(prices) %>%
* summarise(total_revenue = sum(price))
```

```
## # A tibble: 1 × 1
##   total_revenue
##           <dbl>
## 1          5.75
```
]

---

## Revenue per customer

.pull-left[

```r
purchases %>%
  left_join(prices)
```
]
.pull-right[

```r
purchases %>% 
  left_join(prices) %>%
* group_by(customer_id) %>%
  summarise(total_revenue = sum(price))
```

```
## # A tibble: 2 × 2
##   customer_id total_revenue
##         <dbl>         <dbl>
## 1           1          1.95
## 2           2          3.8
```
]
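
One caveat is worth sketching: if a purchased item has no row in `prices`, the left join yields `NA` for its price and `sum()` returns `NA`. The example below adds a hypothetical third customer buying `"eggs"` (not in the original data) to show one way of detecting and handling that case.

```r
library(dplyr)

purchases <- tibble(
  customer_id = c(1, 1, 1, 2, 2, 3),
  # "eggs" is a hypothetical item with no price, added for illustration
  item = c("bread", "milk", "banana", "milk", "toilet paper", "eggs")
)
prices <- tibble(
  item  = c("avocado", "banana", "bread", "milk", "toilet paper"),
  price = c(0.5, 0.15, 1, 0.8, 3)
)

# Purchases whose item has no price: candidates for a data-quality check
unpriced <- purchases %>%
  anti_join(prices, by = "item")

# Revenue per customer, ignoring purchases without a known price
revenue <- purchases %>%
  left_join(prices, by = "item") %>%
  group_by(customer_id) %>%
  summarise(total_revenue = sum(price, na.rm = TRUE))
```

Whether unpriced purchases should be dropped, as here, or treated as an error depends on the analysis. `na.rm = TRUE` is only one option.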

---

# Questions?