Exercises
Let's draw a sample of size 2000, with replacement, from the colors red, blue and green. This is like reaching into a jar with three balls of each color, grabbing one, recording its color, placing it back into the jar, and repeating this 2000 times.
rbg_chr <- sample(c("red", "blue", "green"), 2000, replace = TRUE)
We add one last entry, 'pink', to the sample:
rbg_chr <- c(rbg_chr, "pink") # add a pink entry to the sample
We now turn `rbg_chr` (which is a character vector) into a factor and call it `rbg_fac`. We then drop the 'pink' entry from both vectors.
rbg_fac <- factor(rbg_chr) # turn `rbg_chr` into a `factor` `rbg_fac`
rbg_chr <- rbg_chr[1:(length(rbg_chr)-1)] # dropping the last entry from `rbg_chr`
rbg_fac <- rbg_fac[1:(length(rbg_fac)-1)] # dropping the last entry from `rbg_fac`
Note that `rbg_chr` and `rbg_fac` contain the same information, but are of different types. Discuss what differences you notice between `rbg_chr` and `rbg_fac` in each of the cases below:
(1) When we query the first few entries of each:
head(rbg_chr)
head(rbg_fac)
(2) When we compare the size of each in memory:
sprintf("Size as characters: %s. Size as factor: %s",
object.size(rbg_chr), object.size(rbg_fac))
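To interpret the sizes, it can help to look at how a factor is stored. The following is a minimal sketch on separate toy data; the names `fruit_chr` and `fruit_fac` are hypothetical and not part of the exercise.

```r
# Hypothetical toy data, separate from the exercise vectors
fruit_chr <- c("apple", "pear", "apple", "apple", "pear")
fruit_fac <- factor(fruit_chr)

typeof(fruit_chr)    # "character": each entry is a string
typeof(fruit_fac)    # "integer": each entry is an integer code
unclass(fruit_fac)   # the integer codes, plus a "levels" attribute with the labels
```

Each label is kept once in the levels attribute, and the entries themselves only point to it.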
(3) When we ask for counts within each category:
table(rbg_chr)
table(rbg_fac)
(4) When we try to replace an entry with something other than 'red', 'blue' or 'green':
rbg_chr[3] <- "yellow" # replaces the 3rd entry in `rbg_chr` with 'yellow'
rbg_fac[3] <- "yellow" # throws a warning, replaces the 3rd entry with NA
(5) Each category in a categorical column (formatted as `factor`) is called a factor level. We can look at factor levels using the `levels` function:
levels(rbg_fac)
We can relabel the factor levels directly with `levels`. Change the levels of `rbg_fac` so that the labels start with capital letters.
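As a rough illustration of relabeling, here is a sketch on a separate toy factor. The name `size_fac` and its labels are hypothetical, chosen only for this example.

```r
# Hypothetical toy factor, separate from the exercise vectors
size_fac <- factor(c("small", "large", "small", "medium"))
levels(size_fac)                      # "large" "medium" "small"

# Assignment is positional: the i-th new label replaces the i-th old level
levels(size_fac) <- c("L", "M", "S")
as.character(size_fac)                # "S" "L" "S" "M"
```

Because the replacement is positional, it is worth checking the current order with `levels` before assigning new labels.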
(6) We can add new factor levels to the existing ones. Add "Yellow" as a new level for `rbg_fac`.
(7) Once new factor levels have been created, we can have entries which match the new level. Change the third entry of `rbg_fac` to now be "Yellow".
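The two steps in (6) and (7) can be sketched on a separate toy factor. The name `pet_fac` and its labels are hypothetical; this is one possible approach, not the only one.

```r
# Hypothetical toy factor, separate from the exercise vectors
pet_fac <- factor(c("cat", "dog", "cat"))

# Append a new level; no entry uses it yet
levels(pet_fac) <- c(levels(pet_fac), "bird")

# The assignment is now valid and does not produce NA
pet_fac[2] <- "bird"
table(pet_fac)   # "dog" is still a level, now with a count of 0
```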
(8) Finally, we need to recreate the `factor` column if we want to drop a particular level or change the order of the levels.
table(rbg_chr) # what we see in the original `character` column
If we don't provide the `factor` function with levels (through the `levels` argument), it creates the factor by scanning the data to find all the distinct values and sorting them alphabetically to form the levels.
rbg_fac <- factor(rbg_chr)
table(rbg_fac) # the levels are just whatever was present in `rbg_chr`
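The default ordering can be seen in a small sketch on separate toy data. The name `rating_chr` and its values are hypothetical.

```r
# Hypothetical toy data, separate from the exercise vectors
rating_chr <- c("medium", "low", "high", "low")
rating_fac <- factor(rating_chr)

# Levels are sorted alphabetically, not by order of appearance or by meaning
levels(rating_fac)   # "high" "low" "medium"
```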
We can override that by explicitly passing factor levels to the `factor` function, in the order that we wish them to be. Recreate `rbg_fac` by passing `rbg_chr` to the `factor` function, but this time specify only "red", "green" and "blue" as the levels. Run `table` on both `rbg_chr` and `rbg_fac`. What differences do you see?
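For comparison, here is a sketch of explicit levels on separate toy data. The names `grade_chr` and `grade_fac` are hypothetical, and the values are chosen only to show what happens to an entry that is not among the listed levels.

```r
# Hypothetical toy data, separate from the exercise vectors
grade_chr <- c("low", "high", "medium", "low", "unknown")
grade_fac <- factor(grade_chr, levels = c("low", "medium", "high"))

levels(grade_fac)   # "low" "medium" "high", in the order given
is.na(grade_fac)    # TRUE only for the "unknown" entry, which is not a listed level
table(grade_fac)    # counts follow the level order; NA entries are not counted by default
```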
(9) What benefits do you see in being able to overwrite factor levels? Specifically, what could be useful about adding new factor levels? Removing certain existing factor levels? Reordering factor levels?