Exercises

Let's draw a sample of size 2000, with replacement, from the colors red, blue and green. This is like reaching into a jar that holds three balls of each color, grabbing one ball, recording its color, placing it back into the jar, and repeating this 2000 times.

rbg_chr <- sample(c("red", "blue", "green"), 2000, replace = TRUE)

We then append one last entry, 'pink', to the sample:

rbg_chr <- c(rbg_chr, "pink") # add a pink entry to the sample

We now turn rbg_chr (which is a character vector) into a factor and call it rbg_fac. We then drop the 'pink' entry from both vectors.

rbg_fac <- factor(rbg_chr) # turn `rbg_chr` into a `factor` `rbg_fac`
rbg_chr <- rbg_chr[1:(length(rbg_chr)-1)] # dropping the last entry from `rbg_chr`
rbg_fac <- rbg_fac[1:(length(rbg_fac)-1)] # dropping the last entry from `rbg_fac`

Note that rbg_chr and rbg_fac contain the same information but are of different types. Discuss what differences you notice between rbg_chr and rbg_fac in each of the cases below:

(1) When we query the first few entries of each:

head(rbg_chr)
head(rbg_fac)

(2) When we compare the size of each in memory:

sprintf("Size as characters: %s. Size as factor: %s", 
        object.size(rbg_chr), object.size(rbg_fac))

(3) When we ask for counts within each category:

table(rbg_chr)
table(rbg_fac)

(4) When we try to replace an entry with something other than 'red', 'blue' and 'green':

rbg_chr[3] <- "yellow" # replaces the 3rd entry in `rbg_chr` with 'yellow'
rbg_fac[3] <- "yellow" # throws a warning, replaces the 3rd entry with NA

(5) Each category in a categorical column (formatted as factor) is called a factor level. We can look at factor levels using the levels function:

levels(rbg_fac)

We can relabel the factor levels directly with levels. Change the levels of rbg_fac so that the labels start with capital letters.
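As a rough illustration of relabeling through levels, here is a minimal sketch on a separate toy factor. The name size_fac and its values are made up for this example and are not part of the exercise data. It assumes the new labels are supplied in the same order as the existing levels.

```r
# Hypothetical toy factor, unrelated to rbg_fac
size_fac <- factor(c("small", "large", "small", "medium"))

old_levels <- levels(size_fac)  # levels are stored in sorted order

# Build capitalized labels: upper-case first letter + rest of the string
new_levels <- paste0(toupper(substring(old_levels, 1, 1)),
                     substring(old_levels, 2))

# Assigning to levels() relabels every entry at once
levels(size_fac) <- new_levels
size_fac
```

Because only the labels change, the underlying integer codes of the factor stay the same; each entry simply picks up its new label.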

(6) We can add new factor levels to the existing ones. Add "Yellow" as a new level for rbg_fac.
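One possible way to add a level is to append it to the existing levels vector. The sketch below uses a hypothetical toy factor, fruit_fac, rather than the exercise data.

```r
# Hypothetical toy factor, unrelated to rbg_fac
fruit_fac <- factor(c("apple", "pear", "apple"))

# Append a new level; the existing entries are left untouched
levels(fruit_fac) <- c(levels(fruit_fac), "plum")

levels(fruit_fac)  # now includes "plum"
table(fruit_fac)   # "plum" is listed even though no entry has that value yet
```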

(7) Once new factor levels have been created, we can have entries which match the new level. Change the third entry of rbg_fac to now be "Yellow".
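The contrast between assigning a value before and after its level exists can be sketched as follows. The factor shade_fac is a made-up example, and suppressWarnings is used here only to keep the expected warning out of the output.

```r
# Hypothetical toy factor, unrelated to rbg_fac
shade_fac <- factor(c("dark", "light", "dark"))

# "medium" is not a level yet, so the entry becomes NA (R also warns)
suppressWarnings(shade_fac[2] <- "medium")
is.na(shade_fac[2])

# After registering the level, the same assignment is accepted
levels(shade_fac) <- c(levels(shade_fac), "medium")
shade_fac[2] <- "medium"
shade_fac
```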

(8) Finally, we need to recreate the factor column if we want to drop a particular level or change the order of the levels.

table(rbg_chr) # what we see in the original `character` column

If we don't provide the levels explicitly (through the levels argument), factor scans the data to find all distinct values and uses them, sorted alphabetically, as the levels.

rbg_fac <- factor(rbg_chr)
table(rbg_fac) # the levels are just whatever was present in `rbg_chr`

We can override that by explicitly passing the factor levels to the factor function, in the order in which we want them. Recreate rbg_fac by passing rbg_chr to the factor function, but this time specify only "red", "green" and "blue" as the levels. Run table on both rbg_chr and rbg_fac. What differences do you see?
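To illustrate the effect of the levels argument without using the exercise data, here is a small sketch on a hypothetical character vector, size_chr. Values that are not listed in levels become NA, and the levels keep the order in which they were given.

```r
# Hypothetical toy data, unrelated to rbg_chr
size_chr <- c("small", "large", "huge", "medium", "small")

# Only the listed values are kept as levels, in the order given
size_fac <- factor(size_chr, levels = c("small", "medium", "large"))

table(size_chr)       # every distinct string, in alphabetical order
table(size_fac)       # only the listed levels, in the order specified
sum(is.na(size_fac))  # "huge" was not a listed level, so it became NA
```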

(9) What benefits do you see in being able to overwrite factor levels? Specifically, what could be useful about adding new factor levels? Removing certain existing factor levels? Reordering factor levels?
