Recap on summarization

  • summary(x): quantile information
  • summarize: creates a summary table of columns of interest
  • count(variable): how many of each unique value do you have
  • group_by(): changes all subsequent functions
    • combine with summarize() to get statistics per group

📃Cheatsheet

Data Cleaning

In general, data cleaning is a process of investigating your data for inaccuracies, or recoding it in a way that makes it more manageable.

⚠️ MOST IMPORTANT RULE - LOOK 👀 AT YOUR DATA! ⚠️

Dealing with Missing Data

Air quality data

The airquality dataset comes with R about air quality in New York in 1973.

?airquality # use this to find out more about the data

We can use count to see missing values

The will be at the bottom typically

Ozone_values <- count(airquality, Ozone)
tail(Ozone_values)
   Ozone  n
63   115  1
64   118  1
65   122  1
66   135  1
67   168  1
68    NA 37

Missing Data Issues

Recall that mathematical operations with NA often result in NAs.

sum(c(1,2,3,NA))
[1] NA
mean(c(1,2,3,NA))
[1] NA
median(c(1,2,3,NA))
[1] NA

filter() and missing data

Be careful with missing data using subsetting!

filter() removes missing values by default. Because R can’t tell for sure if an NA value meets the condition. To keep them need to add is.na() conditional.

Think about if this is OK or not - it depends on your data!

airquality %>% filter(Solar.R > 330 | is.na(Solar.R))
  Ozone Solar.R Wind Temp Month Day
1    NA      NA 14.3   56     5   5
2    28      NA 14.9   66     5   6
3     7      NA  6.9   74     5  11
4    14     334 11.5   64     5  16
5    NA      NA  8.0   57     5  27
6    NA     332 13.8   80     6  14
7    78      NA  6.9   86     8   4
8    35      NA  7.4   85     8   5
9    66      NA  4.6   87     8   6

To remove rows with NA values for a variable use drop_na()

A function from the tidyr package. (Need a data frame to start!)

Disclaimer: Don’t do this unless you have thought about if dropping NA values makes sense based on knowing what these values mean in your data.

airquality %>% drop_na(Ozone)
    Ozone Solar.R Wind Temp Month Day
1      41     190  7.4   67     5   1
2      36     118  8.0   72     5   2
3      12     149 12.6   74     5   3
4      18     313 11.5   62     5   4
5      28      NA 14.9   66     5   6
6      23     299  8.6   65     5   7
7      19      99 13.8   59     5   8
8       8      19 20.1   61     5   9
9       7      NA  6.9   74     5  11
10     16     256  9.7   69     5  12
11     11     290  9.2   66     5  13
12     14     274 10.9   68     5  14
13     18      65 13.2   58     5  15
14     14     334 11.5   64     5  16
15     34     307 12.0   66     5  17
16      6      78 18.4   57     5  18
17     30     322 11.5   68     5  19
18     11      44  9.7   62     5  20
19      1       8  9.7   59     5  21
20     11     320 16.6   73     5  22
21      4      25  9.7   61     5  23
22     32      92 12.0   61     5  24
23     23      13 12.0   67     5  28
24     45     252 14.9   81     5  29
25    115     223  5.7   79     5  30
26     37     279  7.4   76     5  31
27     29     127  9.7   82     6   7
28     71     291 13.8   90     6   9
29     39     323 11.5   87     6  10
30     23     148  8.0   82     6  13
31     21     191 14.9   77     6  16
32     37     284 20.7   72     6  17
33     20      37  9.2   65     6  18
34     12     120 11.5   73     6  19
35     13     137 10.3   76     6  20
36    135     269  4.1   84     7   1
37     49     248  9.2   85     7   2
38     32     236  9.2   81     7   3
39     64     175  4.6   83     7   5
40     40     314 10.9   83     7   6
41     77     276  5.1   88     7   7
42     97     267  6.3   92     7   8
43     97     272  5.7   92     7   9
44     85     175  7.4   89     7  10
45     10     264 14.3   73     7  12
46     27     175 14.9   81     7  13
47      7      48 14.3   80     7  15
48     48     260  6.9   81     7  16
49     35     274 10.3   82     7  17
50     61     285  6.3   84     7  18
51     79     187  5.1   87     7  19
52     63     220 11.5   85     7  20
53     16       7  6.9   74     7  21
54     80     294  8.6   86     7  24
55    108     223  8.0   85     7  25
56     20      81  8.6   82     7  26
57     52      82 12.0   86     7  27
58     82     213  7.4   88     7  28
59     50     275  7.4   86     7  29
60     64     253  7.4   83     7  30
61     59     254  9.2   81     7  31
62     39      83  6.9   81     8   1
63      9      24 13.8   81     8   2
64     16      77  7.4   82     8   3
65     78      NA  6.9   86     8   4
66     35      NA  7.4   85     8   5
67     66      NA  4.6   87     8   6
68    122     255  4.0   89     8   7
69     89     229 10.3   90     8   8
70    110     207  8.0   90     8   9
71     44     192 11.5   86     8  12
72     28     273 11.5   82     8  13
73     65     157  9.7   80     8  14
74     22      71 10.3   77     8  16
75     59      51  6.3   79     8  17
76     23     115  7.4   76     8  18
77     31     244 10.9   78     8  19
78     44     190 10.3   78     8  20
79     21     259 15.5   77     8  21
80      9      36 14.3   72     8  22
81     45     212  9.7   79     8  24
82    168     238  3.4   81     8  25
83     73     215  8.0   86     8  26
84     76     203  9.7   97     8  28
85    118     225  2.3   94     8  29
86     84     237  6.3   96     8  30
87     85     188  6.3   94     8  31
88     96     167  6.9   91     9   1
89     78     197  5.1   92     9   2
90     73     183  2.8   93     9   3
91     91     189  4.6   93     9   4
92     47      95  7.4   87     9   5
93     32      92 15.5   84     9   6
94     20     252 10.9   80     9   7
95     23     220 10.3   78     9   8
96     21     230 10.9   75     9   9
97     24     259  9.7   73     9  10
98     44     236 14.9   81     9  11
99     21     259 15.5   76     9  12
100    28     238  6.3   77     9  13
101     9      24 10.9   71     9  14
102    13     112 11.5   71     9  15
103    46     237  6.9   78     9  16
104    18     224 13.8   67     9  17
105    13      27 10.3   76     9  18
106    24     238 10.3   68     9  19
107    16     201  8.0   82     9  20
108    13     238 12.6   64     9  21
109    23      14  9.2   71     9  22
110    36     139 10.3   81     9  23
111     7      49 10.3   69     9  24
112    14      20 16.6   63     9  25
113    30     193  6.9   70     9  26
114    14     191 14.3   75     9  28
115    18     131  8.0   76     9  29
116    20     223 11.5   68     9  30

To remove rows with NA values for a data frame use drop_na()

This function of the tidyr package drops rows with any missing data in any column when used on a df.

airquality %>% drop_na()
    Ozone Solar.R Wind Temp Month Day
1      41     190  7.4   67     5   1
2      36     118  8.0   72     5   2
3      12     149 12.6   74     5   3
4      18     313 11.5   62     5   4
5      23     299  8.6   65     5   7
6      19      99 13.8   59     5   8
7       8      19 20.1   61     5   9
8      16     256  9.7   69     5  12
9      11     290  9.2   66     5  13
10     14     274 10.9   68     5  14
11     18      65 13.2   58     5  15
12     14     334 11.5   64     5  16
13     34     307 12.0   66     5  17
14      6      78 18.4   57     5  18
15     30     322 11.5   68     5  19
16     11      44  9.7   62     5  20
17      1       8  9.7   59     5  21
18     11     320 16.6   73     5  22
19      4      25  9.7   61     5  23
20     32      92 12.0   61     5  24
21     23      13 12.0   67     5  28
22     45     252 14.9   81     5  29
23    115     223  5.7   79     5  30
24     37     279  7.4   76     5  31
25     29     127  9.7   82     6   7
26     71     291 13.8   90     6   9
27     39     323 11.5   87     6  10
28     23     148  8.0   82     6  13
29     21     191 14.9   77     6  16
30     37     284 20.7   72     6  17
31     20      37  9.2   65     6  18
32     12     120 11.5   73     6  19
33     13     137 10.3   76     6  20
34    135     269  4.1   84     7   1
35     49     248  9.2   85     7   2
36     32     236  9.2   81     7   3
37     64     175  4.6   83     7   5
38     40     314 10.9   83     7   6
39     77     276  5.1   88     7   7
40     97     267  6.3   92     7   8
41     97     272  5.7   92     7   9
42     85     175  7.4   89     7  10
43     10     264 14.3   73     7  12
44     27     175 14.9   81     7  13
45      7      48 14.3   80     7  15
46     48     260  6.9   81     7  16
47     35     274 10.3   82     7  17
48     61     285  6.3   84     7  18
49     79     187  5.1   87     7  19
50     63     220 11.5   85     7  20
51     16       7  6.9   74     7  21
52     80     294  8.6   86     7  24
53    108     223  8.0   85     7  25
54     20      81  8.6   82     7  26
55     52      82 12.0   86     7  27
56     82     213  7.4   88     7  28
57     50     275  7.4   86     7  29
58     64     253  7.4   83     7  30
59     59     254  9.2   81     7  31
60     39      83  6.9   81     8   1
61      9      24 13.8   81     8   2
62     16      77  7.4   82     8   3
63    122     255  4.0   89     8   7
64     89     229 10.3   90     8   8
65    110     207  8.0   90     8   9
66     44     192 11.5   86     8  12
67     28     273 11.5   82     8  13
68     65     157  9.7   80     8  14
69     22      71 10.3   77     8  16
70     59      51  6.3   79     8  17
71     23     115  7.4   76     8  18
72     31     244 10.9   78     8  19
73     44     190 10.3   78     8  20
74     21     259 15.5   77     8  21
75      9      36 14.3   72     8  22
76     45     212  9.7   79     8  24
77    168     238  3.4   81     8  25
78     73     215  8.0   86     8  26
79     76     203  9.7   97     8  28
80    118     225  2.3   94     8  29
81     84     237  6.3   96     8  30
82     85     188  6.3   94     8  31
83     96     167  6.9   91     9   1
84     78     197  5.1   92     9   2
85     73     183  2.8   93     9   3
86     91     189  4.6   93     9   4
87     47      95  7.4   87     9   5
88     32      92 15.5   84     9   6
89     20     252 10.9   80     9   7
90     23     220 10.3   78     9   8
91     21     230 10.9   75     9   9
92     24     259  9.7   73     9  10
93     44     236 14.9   81     9  11
94     21     259 15.5   76     9  12
95     28     238  6.3   77     9  13
96      9      24 10.9   71     9  14
97     13     112 11.5   71     9  15
98     46     237  6.9   78     9  16
99     18     224 13.8   67     9  17
100    13      27 10.3   76     9  18
101    24     238 10.3   68     9  19
102    16     201  8.0   82     9  20
103    13     238 12.6   64     9  21
104    23      14  9.2   71     9  22
105    36     139 10.3   81     9  23
106     7      49 10.3   69     9  24
107    14      20 16.6   63     9  25
108    30     193  6.9   70     9  26
109    14     191 14.3   75     9  28
110    18     131  8.0   76     9  29
111    20     223 11.5   68     9  30

Summary

  • count() can help determine if we have NA values
  • filter() automatically removes NA values - can’t confirm or deny if condition is met (need | is.na() to keep them)
  • drop_na() can help you remove NA values from a variable or an entire data frame
  • NA values can change your calculation results
  • think about what NA values represent - don’t drop them if you shouldn’t

Practice

Practice

Use count() and tail() to determine the number of missing values in the airquality data for the Solar.R variable.

airquality %>% count(_______) %>% _____

Practice

Filter the rows of airquality to remove rows with NA values for Solar.R.

filt_airqual <- ________ %>% _______(_______)

Recoding Variables

Example of Recoding

#install.packages("catdata")
library(catdata)
?catdata::teratology
data(teratology)
rat <- teratology

The End

Note about select

Once loading this catdata package, you need to specify that you want to use the select from dplyr. It just happens to have a function that is the same name and we want the dplyr version.

select <- dplyr::select

Oh dear…

It’s not very easy to tell what is what.

head(rat)
   D L Grp
1  1 9   1
2  4 7   1
3  9 3   1
4  4 0   1
5 10 0   1
6  9 2   1

Grp variable

rat %>%
  count(Grp)

Changing the class

We can use as.character or as.numeric to change a variable to each class respectively.

Let’s change the group to be character, since it doesn’t actually have numeric significance.

rat <- rat %>% mutate(Grp = as.character(Grp))

dplyr can help!

In dplyr you can use the recode function to change each Grp value to be something more useful!

(need mutate for data frames/tibbles!)

# General Format - this is not code!
{data_input} %>%
  mutate({new variable} = recode({Variable_fixing}, {old_value} = {new_value},
                                            {another_old_value} = {new_value}))

recode() function

Need quotes for values!

rat <-rat %>% 
  mutate(Grp_recoded = recode(Grp, "1" = "Untreated", 
                                   "2" = "Inj. Day 7 and 10", 
                                   "3" = "Inj. Day 0 and 7",
                                   "4" = "Inj. Weekly"))
  rat %>% count(Grp_recoded)
        Grp_recoded  n
1  Inj. Day 0 and 7  5
2 Inj. Day 7 and 10 12
3       Inj. Weekly 10
4         Untreated 31

rename columns

Can use the rename() function.

# general format! not code!
{data you are creating or changing} <- {data you are using} %>% 
                            rename({New Name} = {Old name})
head(rat, 2)
  D L Grp Grp_recoded
1 1 9   1   Untreated
2 4 7   1   Untreated
rat <- rat %>% rename("num_dead_litter" = "D",
                      "num_living_litter" = "L")
head(rat, 2)
  num_dead_litter num_living_litter Grp Grp_recoded
1               1                 9   1   Untreated
2               4                 7   1   Untreated

Practice

Practice

First load some data.

#install.packages("catdata")
library(catdata)
data(teratology2)
rat2 <-teratology2 # assign it to a new name
head(rat2)
  y Rat Grp
1 1   1  G1
2 0   1  G1
3 0   1  G1
4 0   1  G1
5 0   1  G1
6 0   1  G1
?teratology2 #find out more about the data

Practice

Recode the data to create a new variable from the y variable to be values of dead (instead of 1) and living (instead of 0). Call the variable status.

First change the y variable to be character.

rat2 <- rat2 %>% ______(y = __________(y))

rat2_recoded <- rat2 %>% 
  _______(status = ______(_, ___ = ______,
                             ___ = ____))

Summary

  • recode() requires mutate() when working with dataframes/tibbles
  • recode() can help with simple recoding (an exact swap) for values
  • recode() has the opposite order as rename - use “old value” = “new value”
  • rename() helps us change column names - use new name = old name and it does not require mutate() 🏠 Workshop Website

Extra slides if there is time

case_when() helps make sophisticated new variables

Note that automatically values not reassigned explicitly by case_when() will be NA unless otherwise specified.

# General Format - this is not code!
{data_input} %>%
  mutate({new_variable} = case_when({Variable}   
             /some condition/ ~ {value_for_con},
                         TRUE ~ {value_for_not_meeting_condition})

{value_for_not_meeting_condition} could be something new or it can be the original values of a column

case_when()

case_when can do very sophisticated comparisons

rat <- rat %>%
  mutate("survival" = 
           case_when(num_living_litter > num_dead_litter ~ "well",
                     num_living_litter == num_dead_litter ~ "even",
                     num_living_litter < num_dead_litter ~ "poor"))

Now it is easier to see what is happening

rat%>% 
  count(Grp_recoded, survival)
        Grp_recoded survival  n
1  Inj. Day 0 and 7     well  5
2 Inj. Day 7 and 10     well 12
3       Inj. Weekly     well 10
4         Untreated     poor 26
5         Untreated     well  5

case_when will make NA values

If there is a condition not specified, NA values will be generated.

rat %>%
  mutate("survival" = 
           case_when(num_living_litter > num_dead_litter ~ "well")) %>% 
  count(Grp_recoded, survival)
        Grp_recoded survival  n
1  Inj. Day 0 and 7     well  5
2 Inj. Day 7 and 10     well 12
3       Inj. Weekly     well 10
4         Untreated     well  5
5         Untreated     <NA> 26

Summary

  • recode() and case_when() require mutate() when working with dataframes/tibbles
  • recode() can help with simple recoding (an exact swap)
  • case_when() can recode based on conditions (need quotes for conditions and new values)
    • remember case_when() will generate NA values for anything not specified

🏠 Workshop Website