Data wrangling - III

Lecture 8

Dr. Greg Chism

University of Arizona
INFO 526 - Spring 2024

Warm up

Announcements

RQ 3 is due Feb 19th
Project 1 reviews will be returned to you by Wednesday

Setup

# load packages
library(countdown)
library(tidyverse)
library(glue)
library(lubridate)
library(scales)
library(ggthemes)
library(gt)
library(palmerpenguins)
library(openintro)
library(ggrepel)

# set theme for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 14))

# set width of code output
options(width = 65)

# set figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 7,        # 7" width
  fig.asp = 0.618,      # the golden ratio
  fig.retina = 3,       # dpi multiplier for displaying HTML output on retina
  fig.align = "center", # center align figures
  dpi = 300             # higher dpi, sharper image
)

Missing values I

Is it ok to suppress the following warning? Or should you update your code to eliminate it?

df <- tibble(
  x = c(1, 2, 3, NA, 3),
  y = c(5, NA, 10, 0, 5)
)

ggplot(df, aes(x = x, y = y)) +
  geom_point(size = 3)

Warning: Removed 2 rows containing missing values
(`geom_point()`).
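
If dropping the incomplete rows is what you intend, one option is to make that choice explicit in the code rather than suppress the warning. A minimal sketch, assuming the same `df` as above; `df_complete` is a name introduced here for illustration:

```r
library(tidyverse)

df <- tibble(
  x = c(1, 2, 3, NA, 3),
  y = c(5, NA, 10, 0, 5)
)

# Option 1: drop incomplete rows yourself, so the plotted data is explicit
df_complete <- df |>
  drop_na(x, y)

ggplot(df_complete, aes(x = x, y = y)) +
  geom_point(size = 3)

# Option 2: keep df as is and tell the layer that removing NAs is intended
ggplot(df, aes(x = x, y = y)) +
  geom_point(size = 3, na.rm = TRUE)
```

Either way, the decision to leave out the two incomplete rows is recorded in the code instead of hidden behind a suppressed warning.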

Missing values II

set.seed(1234)
df <- tibble(x = rnorm(100))

p <- ggplot(df, aes(x = x)) +
  geom_boxplot()
p

df |>
  summarize(med_x = median(x))

# A tibble: 1 × 1
   med_x
   <dbl>
1 -0.385

Missing values II

Is it ok to suppress the following warning? Or should you update your code to eliminate it?

p + xlim(0, 2)

Warning: Removed 69 rows containing non-finite values
(`stat_boxplot()`).

Missing values II

Is it ok to suppress the following warning? Or should you update your code to eliminate it?

p + scale_x_continuous(limits = c(0, 2))

Warning: Removed 69 rows containing non-finite values
(`stat_boxplot()`).

Missing values II

Why doesn’t the following generate a warning?

p + coord_cartesian(xlim = c(0, 2))
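
The difference comes down to when the limits apply. Scale limits (`xlim()`, `scale_x_continuous(limits = ...)`) turn out-of-range values into `NA` before the box plot statistics are computed, while `coord_cartesian()` only zooms the view after the statistics have been computed on all of the data. The sketch below mimics the scale behavior with a `filter()`, which is an approximation used for illustration; `df_limited`, `med_all`, and `med_limited` are names introduced here:

```r
library(tidyverse)

set.seed(1234)
df <- tibble(x = rnorm(100))

# What the box plot summarizes with coord_cartesian(): all 100 values
med_all <- df |>
  summarize(med_x = median(x)) |>
  pull(med_x)

# Roughly what it summarizes with xlim(0, 2): only values inside the limits
df_limited <- df |>
  filter(between(x, 0, 2))

med_limited <- df_limited |>
  summarize(med_x = median(x)) |>
  pull(med_x)

# Number of rows excluded by the limits, and the two medians side by side
nrow(df) - nrow(df_limited)
c(all_data = med_all, within_limits = med_limited)
```

Suppressing the warning from `xlim()` would hide the fact that the box plot then describes a different subset of the data.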

Bringing together multiple data frames

Scenario 2

We…

have multiple data frames

want to bring them together so we can plot them

professions <- read_csv("data/professions.csv")
dates <- read_csv("data/dates.csv")
works <- read_csv("data/works.csv")

10 women in science who changed the world

name
Ada Lovelace
Marie Curie
Janaki Ammal
Chien-Shiung Wu
Katherine Johnson
Rosalind Franklin
Vera Rubin
Gladys West
Flossie Wong-Staal
Jennifer Doudna

professions

# A tibble: 10 × 2
   name               profession                        
   <chr>              <chr>                             
 1 Ada Lovelace       Mathematician                     
 2 Marie Curie        Physicist and Chemist             
 3 Janaki Ammal       Botanist                          
 4 Chien-Shiung Wu    Physicist                         
 5 Katherine Johnson  Mathematician                     
 6 Rosalind Franklin  Chemist                           
 7 Vera Rubin         Astronomer                        
 8 Gladys West        Mathematician                     
 9 Flossie Wong-Staal Virologist and Molecular Biologist
10 Jennifer Doudna    Biochemist

dates

# A tibble: 8 × 3
  name               birth_year death_year
  <chr>                   <dbl>      <dbl>
1 Janaki Ammal             1897       1984
2 Chien-Shiung Wu          1912       1997
3 Katherine Johnson        1918       2020
4 Rosalind Franklin        1920       1958
5 Vera Rubin               1928       2016
6 Gladys West              1930         NA
7 Flossie Wong-Staal       1947         NA
8 Jennifer Doudna          1964         NA

works

# A tibble: 9 × 2
  name               known_for                                   
  <chr>              <chr>                                       
1 Ada Lovelace       first computer algorithm                    
2 Marie Curie        theory of radioactivity,  first woman Nobel…
3 Janaki Ammal       hybrid species, biodiversity protection     
4 Chien-Shiung Wu    experiment overturning theory of parity     
5 Katherine Johnson  orbital mechanics critical to sending first…
6 Vera Rubin         existence of dark matter                    
7 Gladys West        mathematical modeling of the shape of the E…
8 Flossie Wong-Staal first to clone HIV and map its genes, which…
9 Jennifer Doudna    one of the primary developers of CRISPR

Desired output

# A tibble: 10 × 5
   name               profession  birth_year death_year known_for
   <chr>              <chr>            <dbl>      <dbl> <chr>    
 1 Ada Lovelace       Mathematic…         NA         NA first co…
 2 Marie Curie        Physicist …         NA         NA theory o…
 3 Janaki Ammal       Botanist          1897       1984 hybrid s…
 4 Chien-Shiung Wu    Physicist         1912       1997 experime…
 5 Katherine Johnson  Mathematic…       1918       2020 orbital …
 6 Rosalind Franklin  Chemist           1920       1958 <NA>     
 7 Vera Rubin         Astronomer        1928       2016 existenc…
 8 Gladys West        Mathematic…       1930         NA mathemat…
 9 Flossie Wong-Staal Virologist…       1947         NA first to…
10 Jennifer Doudna    Biochemist        1964         NA one of t…

Inputs, reminder

names(professions)

[1] "name"       "profession"

names(dates)

[1] "name"       "birth_year" "death_year"

names(works)

[1] "name"      "known_for"

nrow(professions)

[1] 10

nrow(dates)

[1] 8

nrow(works)

[1] 9

Joining data frames

something_join(x, y)

left_join(): all rows from x
right_join(): all rows from y
full_join(): all rows from both x and y
semi_join(): all rows from x where there are matching values in y, keeping just columns from x
inner_join(): all rows from x where there are matching values in y; if a row has multiple matches, all combinations of the matches are returned
anti_join(): all rows from x where there are no matching values in y, keeping just columns from x and never duplicating rows of x
…
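
How the joins treat multiple matches is easiest to see on a small case. A sketch with two made-up tibbles, `a` and `b` (hypothetical data, not part of the lecture datasets), where `id` 1 appears twice in `b`:

```r
library(tidyverse)

# Hypothetical data: id 1 appears twice in b
a <- tibble(id = c(1, 2), value_a = c("a1", "a2"))
b <- tibble(id = c(1, 1, 3), value_b = c("b1", "b1_again", "b3"))

# inner_join(): one output row per matching pair, so id 1 appears twice
inner <- inner_join(a, b, by = join_by(id))

# semi_join(): filters a, so id 1 appears once and no columns from b are added
semi <- semi_join(a, b, by = join_by(id))

# anti_join(): rows of a with no match in b
anti <- anti_join(a, b, by = join_by(id))

inner
semi
anti
```

The mutating join (`inner_join()`) can grow the number of rows, while the filtering joins (`semi_join()`, `anti_join()`) can only keep or drop rows of the first table.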

Setup

For the next few slides…

x <- tibble(
  id = c(1, 2, 3),
  value_x = c("x1", "x2", "x3")
  )

x

# A tibble: 3 × 2
     id value_x
  <dbl> <chr>  
1     1 x1     
2     2 x2     
3     3 x3

y <- tibble(
  id = c(1, 2, 4),
  value_y = c("y1", "y2", "y4")
  )

y

# A tibble: 3 × 2
     id value_y
  <dbl> <chr>  
1     1 y1     
2     2 y2     
3     4 y4

`left_join()`

left_join(x, y)

Joining with `by = join_by(id)`

# A tibble: 3 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     3 x3      <NA>

`left_join()`

professions |>
  left_join(dates)

Joining with `by = join_by(name)`

# A tibble: 10 × 4
   name               profession            birth_year death_year
   <chr>              <chr>                      <dbl>      <dbl>
 1 Ada Lovelace       Mathematician                 NA         NA
 2 Marie Curie        Physicist and Chemist         NA         NA
 3 Janaki Ammal       Botanist                    1897       1984
 4 Chien-Shiung Wu    Physicist                   1912       1997
 5 Katherine Johnson  Mathematician               1918       2020
 6 Rosalind Franklin  Chemist                     1920       1958
 7 Vera Rubin         Astronomer                  1928       2016
 8 Gladys West        Mathematician               1930         NA
 9 Flossie Wong-Staal Virologist and Molec…       1947         NA
10 Jennifer Doudna    Biochemist                  1964         NA

`right_join()`

right_join(x, y)

Joining with `by = join_by(id)`

# A tibble: 3 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     4 <NA>    y4

`right_join()`

professions |>
  right_join(dates)

Joining with `by = join_by(name)`

# A tibble: 8 × 4
  name               profession             birth_year death_year
  <chr>              <chr>                       <dbl>      <dbl>
1 Janaki Ammal       Botanist                     1897       1984
2 Chien-Shiung Wu    Physicist                    1912       1997
3 Katherine Johnson  Mathematician                1918       2020
4 Rosalind Franklin  Chemist                      1920       1958
5 Vera Rubin         Astronomer                   1928       2016
6 Gladys West        Mathematician                1930         NA
7 Flossie Wong-Staal Virologist and Molecu…       1947         NA
8 Jennifer Doudna    Biochemist                   1964         NA

`full_join()`

full_join(x, y)

Joining with `by = join_by(id)`

# A tibble: 4 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     3 x3      <NA>   
4     4 <NA>    y4

`full_join()`

dates |>
  full_join(works)

Joining with `by = join_by(name)`

# A tibble: 10 × 4
   name               birth_year death_year known_for            
   <chr>                   <dbl>      <dbl> <chr>                
 1 Janaki Ammal             1897       1984 hybrid species, biod…
 2 Chien-Shiung Wu          1912       1997 experiment overturni…
 3 Katherine Johnson        1918       2020 orbital mechanics cr…
 4 Rosalind Franklin        1920       1958 <NA>                 
 5 Vera Rubin               1928       2016 existence of dark ma…
 6 Gladys West              1930         NA mathematical modelin…
 7 Flossie Wong-Staal       1947         NA first to clone HIV a…
 8 Jennifer Doudna          1964         NA one of the primary d…
 9 Ada Lovelace               NA         NA first computer algor…
10 Marie Curie                NA         NA theory of radioactiv…

`inner_join()`

inner_join(x, y)

Joining with `by = join_by(id)`

# A tibble: 2 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2

`inner_join()`

dates |>
  inner_join(works)

Joining with `by = join_by(name)`

# A tibble: 7 × 4
  name               birth_year death_year known_for             
  <chr>                   <dbl>      <dbl> <chr>                 
1 Janaki Ammal             1897       1984 hybrid species, biodi…
2 Chien-Shiung Wu          1912       1997 experiment overturnin…
3 Katherine Johnson        1918       2020 orbital mechanics cri…
4 Vera Rubin               1928       2016 existence of dark mat…
5 Gladys West              1930         NA mathematical modeling…
6 Flossie Wong-Staal       1947         NA first to clone HIV an…
7 Jennifer Doudna          1964         NA one of the primary de…

`semi_join()`

semi_join(x, y)

Joining with `by = join_by(id)`

# A tibble: 2 × 2
     id value_x
  <dbl> <chr>  
1     1 x1     
2     2 x2

`semi_join()`

dates |>
  semi_join(works)

Joining with `by = join_by(name)`

# A tibble: 7 × 3
  name               birth_year death_year
  <chr>                   <dbl>      <dbl>
1 Janaki Ammal             1897       1984
2 Chien-Shiung Wu          1912       1997
3 Katherine Johnson        1918       2020
4 Vera Rubin               1928       2016
5 Gladys West              1930         NA
6 Flossie Wong-Staal       1947         NA
7 Jennifer Doudna          1964         NA

`anti_join()`

anti_join(x, y)

Joining with `by = join_by(id)`

# A tibble: 1 × 2
     id value_x
  <dbl> <chr>  
1     3 x3

`anti_join()`

dates |>
  anti_join(works)

Joining with `by = join_by(name)`

# A tibble: 1 × 3
  name              birth_year death_year
  <chr>                  <dbl>      <dbl>
1 Rosalind Franklin       1920       1958

Putting it all together

scientists <- professions |>
  left_join(dates) |>
  left_join(works)

Joining with `by = join_by(name)`
Joining with `by = join_by(name)`

scientists

# A tibble: 10 × 5
   name               profession  birth_year death_year known_for
   <chr>              <chr>            <dbl>      <dbl> <chr>    
 1 Ada Lovelace       Mathematic…         NA         NA first co…
 2 Marie Curie        Physicist …         NA         NA theory o…
 3 Janaki Ammal       Botanist          1897       1984 hybrid s…
 4 Chien-Shiung Wu    Physicist         1912       1997 experime…
 5 Katherine Johnson  Mathematic…       1918       2020 orbital …
 6 Rosalind Franklin  Chemist           1920       1958 <NA>     
 7 Vera Rubin         Astronomer        1928       2016 existenc…
 8 Gladys West        Mathematic…       1930         NA mathemat…
 9 Flossie Wong-Staal Virologist…       1947         NA first to…
10 Jennifer Doudna    Biochemist        1964         NA one of t…

`*_join()` functions

From dplyr
Incredibly useful for bringing datasets with common information (e.g., unique identifier) together
Use the by argument when the names of the columns containing the common information are not the same across datasets
Always check that the numbers of rows and columns of the resulting dataset make sense
Refer to two-table verbs vignette when needed
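
The advice about `by` and about checking dimensions can be sketched as follows. The tibbles `people` and `birth_years` and the column name `scientist` are made up for illustration, with values taken from the tables shown earlier; `join_by()` assumes dplyr 1.1.0 or later, which is consistent with the join messages printed above.

```r
library(tidyverse)

# Hypothetical data: the key is `name` in one table and `scientist` in the other
people <- tibble(
  name = c("Janaki Ammal", "Vera Rubin", "Ada Lovelace"),
  profession = c("Botanist", "Astronomer", "Mathematician")
)
birth_years <- tibble(
  scientist = c("Janaki Ammal", "Vera Rubin"),
  birth_year = c(1897, 1928)
)

# Name the key columns explicitly: left-hand side from x, right-hand side from y
joined <- people |>
  left_join(birth_years, by = join_by(name == scientist))

# A left join on a unique key should keep exactly the rows of x
# and add the non-key columns of y
stopifnot(nrow(joined) == nrow(people))
stopifnot(ncol(joined) == ncol(people) + ncol(birth_years) - 1)

joined
```

Supplying `by` explicitly also silences the "Joining with ..." message, which makes the choice of key visible in the code.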

Visualizing joined data

But first…

What is the plot in the previous slide called?

Livecoding

Reveal below for code developed during live coding session.

Transform

scientists_longer <- scientists |>
  mutate(
    birth_year = case_when(
      name == "Ada Lovelace" ~ 1815,
      name == "Marie Curie" ~ 1867,
      TRUE ~ birth_year
    ),
    death_year = case_when(
      name == "Ada Lovelace" ~ 1852,
      name == "Marie Curie" ~ 1934,
      name == "Flossie Wong-Staal" ~ 2020,
      TRUE ~ death_year
    ),
    status = if_else(is.na(death_year), "alive", "deceased"),
    death_year = if_else(is.na(death_year), 2021, death_year),
    known_for = if_else(name == "Rosalind Franklin", "understanding of the molecular structures of DNA", known_for)
  ) |>
  pivot_longer(
    cols = contains("year"),
    names_to = "year_type",
    values_to = "year"
  ) |>
  mutate(death_year_fake = if_else(year == 2021, TRUE, FALSE))

Plot

ggplot(scientists_longer, 
       aes(x = year, y = fct_reorder(name, as.numeric(factor(profession))), group = name, color = profession)) +
  geom_point(aes(shape = death_year_fake), show.legend = FALSE) +
  geom_line(aes(linetype = status), show.legend = FALSE) +
  scale_shape_manual(values = c("circle", NA)) +
  scale_linetype_manual(values = c("dashed", "solid")) +
  scale_color_colorblind() +
  scale_x_continuous(expand = c(0.01, 0), breaks = seq(1820, 2020, 50)) +
  geom_text(aes(y = name, label = known_for), x = 2030, show.legend = FALSE, hjust = 0) +
  geom_text(aes(label = profession), x = 1809, y = Inf, hjust = 1, vjust = 1, show.legend = FALSE) +
  coord_cartesian(clip = "off") +
  labs(
    x = "Year", y = NULL,
    title = "10 women in science who changed the world",
    caption = "Source: Discover magazine"
  ) +
  facet_grid(profession ~ ., scales = "free_y", space = "free_y", switch = "x") +
  theme(
    plot.margin = unit(c(1, 23, 1, 4), "lines"),
    plot.title.position = "plot",
    plot.caption.position = "plot",
    plot.caption = element_text(hjust = 2), # manual hack
    strip.background = element_blank(),
    strip.text = element_blank(),
    axis.title.x = element_text(hjust = 0),
    panel.background = element_rect(fill = "#f0f0f0", color = "white"),
    panel.grid.major = element_line(color = "white", linewidth = 0.5) # `size` is deprecated for lines as of ggplot2 3.4.0
  )
