Data analysis

Warm up!

We can use “describe” in psych package to see the number of participant, mean, std of a variable.

library(psych)
describe(penguins$body_mass_g)

result

vars   n    mean     sd median trimmed    mad  min  max range
X1    1 342 4201.75 801.95   4050 4154.01 889.56 2700 6300  3600
   skew kurtosis    se
X1 0.47    -0.74 43.36

What is TIDY DATA

  • Every column is a variable
  • Every row is an observation
  • Every cell has one value It will benefit a lot if we deal with tidy data, for example, easy for data sharing, reproducible, easy to automate…

Data cleaning

Remove data hierachically!

  1. remove variables you do not need
  2. remove/filter observations you do not need
  3. remove missing values
  4. add new variables

Tidyverse: A tidy universe

use %>% pipe

shortcut: ctrl+shift+M

  1. remove variables you do not need (and remain what you need)
# We remove flipper_length_mm and body_mass_g and remain the others
penguins_select <- penguins %>% 
  select(penguin, species, island, bill_length_mm, bill_depth_mm, sex, year)


## Or, if you just want to remove columns, you can just do this:
penguins_select2 <- penguins %>% 
  select(-flipper_length_mm,-body_mass_g)
  1. filter the rows (observations) you want
penguins_filter <- penguins %>% 
  filter(year == 2008 & year==2007 | species=="Adelie")

logical operators

  1. drop all rows with missing value
penguins_clean <- penguins %>% 
  drop_na()
  1. Build new variables to the dataframe mutate can do changes to the cells
penguins_clean <- penguins %>% 
  mutate(
    bill_sum = bill_length_mm+bill_depth_mm,
    species = factor(species),
    sex_recoded = case_when(sex == "female" ~ "f",
                            sex == "male" ~ "m"),
    sex_recoded = factor(sex_recoded)
  )
  • transform variables to factors
  • create new variables
    • numerical operation
    • if else condition

case_when case_when( variable == X ~ value, variable == y ~ value, )