Solutions
(1) We see quotes around the entries of rbg_chr, but for rbg_fac we see no quotes and the factor levels are listed at the bottom.
head(rbg_chr) # we see quotes
[1] "green" "green" "yellow" "green" "green" "red"
head(rbg_fac) # we don't see quotes and we see the factor levels at the bottom
[1] blue green green green red green
Levels: blue green pink red
(2) A factor column tends to take up less space than a character column, the more so when the strings in the character column are longer. This is because a factor column stores the information as integers under the hood, with a mapping from each integer to the string it represents.
sprintf("Size as characters: %s. Size as factor: %s",
object.size(rbg_chr), object.size(rbg_fac))
[1] "Size as characters: 16184. Size as factor: 8624"
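To see the integer storage directly, here is a minimal sketch on a toy vector (x_fac is a made-up name for illustration, not part of the exercise data):

```r
# Toy factor; x_fac is a hypothetical name used only for this illustration
x_fac <- factor(c("green", "blue", "green", "red"))

typeof(x_fac)      # "integer": the underlying storage type
levels(x_fac)      # "blue" "green" "red": the lookup table, sorted alphabetically
as.integer(x_fac)  # 2 1 2 3: one integer code per element

# Rebuild the original strings from the codes and the lookup table
levels(x_fac)[as.integer(x_fac)]
```

Each element is stored as a position in levels(x_fac), so a long string appears only once in the lookup table no matter how often it occurs in the data.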
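(3) Both tables give the count for each color, but the table for the factor also lists levels that have no entries in the data.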
table(rbg_chr)
rbg_chr
blue green red
695 647 658
table(rbg_fac) # we can see a count of 0 for 'pink', because it's one of the factor levels
rbg_fac
blue green pink red
695 647 0 658
(4) Changing an entry in a factor column to a value other than one of its acceptable levels will result in an NA. Notice that R only issues a warning (invalid factor level, NA generated) and not an error, so the assignment still goes through.
head(rbg_chr) # the 3rd entry changed to 'yellow'
[1] "blue" "green" "yellow" "green" "red" "green"
head(rbg_fac) # we could not change the 3rd entry to 'yellow' because it's not one of the factor levels
[1] blue green <NA> green red green
Levels: blue green pink red
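A minimal sketch on a toy factor (x_fac and msg are hypothetical names, not part of the exercise) that makes an invalid assignment and records whatever warning R signals along the way:

```r
# Toy factor; "yellow" is not one of its levels
x_fac <- factor(c("blue", "green", "red"))

# withCallingHandlers records the warning message and lets the assignment finish
msg <- NULL
withCallingHandlers(
  x_fac[2] <- "yellow",
  warning = function(w) {
    msg <<- conditionMessage(w)     # keep the text of the warning
    invokeRestart("muffleWarning")  # continue without printing it
  }
)

msg            # the warning text raised by the assignment
x_fac          # the second entry is now NA
levels(x_fac)  # the levels themselves are unchanged
```

Checking for new NAs after an assignment like this is a cheap way to catch values that were silently dropped.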
(5) We simply re-assign the factor levels, but we must be careful to provide the new levels in the same order as the old ones.
levels(rbg_fac) <- c('Blue', 'Green', 'Pink', 'Red') # we capitalize the first letters
head(rbg_fac)
[1] Green Red <NA> Red Red Red
Levels: Blue Green Pink Red
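The reason the order matters is that levels<- replaces labels by position. A small sketch on a toy factor (x_fac, wrong and safe are made-up names for illustration) shows what goes wrong, along with one way to avoid typing the labels by hand:

```r
# Toy factor; levels are sorted alphabetically: "blue" "green" "red"
x_fac <- factor(c("blue", "green", "red", "green"))

# Wrong: labels supplied in a different order silently relabel the data
wrong <- x_fac
levels(wrong) <- c("Red", "Green", "Blue")
as.character(wrong)  # "Red" "Green" "Blue" "Green": blue and red are swapped

# Safer: derive the new labels from the old ones so positions always match
safe <- x_fac
old <- levels(safe)
levels(safe) <- paste0(toupper(substring(old, 1, 1)), substring(old, 2))
as.character(safe)   # "Blue" "Green" "Red" "Green"
```

Deriving the new labels from levels() is only one option; it suits cases like capitalization where the new label is a function of the old one.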
(6) We simply append "Yellow" to the old factor levels and assign this as the new factor levels.
levels(rbg_fac) <- c(levels(rbg_fac), "Yellow") # we add 'Yellow' as a new factor level
table(rbg_fac) # even though the data has no 'Yellow' entries, it's an acceptable value
rbg_fac
Blue Green Pink Red Yellow
656 682 0 661 0
(7) Since "Yellow" is one of the levels now, we can change any entry to "Yellow" and we won't get an NA anymore.
rbg_fac[3] <- "Yellow" # does not throw a warning anymore
head(rbg_fac) # now the data has one 'Yellow' entry
[1] Green Red Yellow Red Red Red
Levels: Blue Green Pink Red Yellow
(8) We use the levels argument in the factor function. Since "yellow" was one of the entries in rbg_chr and we are not specifying "yellow" as one of the factor levels we want, it will be turned into an NA.
table(rbg_chr)
rbg_chr
blue green red yellow
656 682 661 1
rbg_fac <- factor(rbg_chr, levels = c('red', 'green', 'blue')) # create a `factor`, with only the levels provided, in the order provided
table(rbg_fac) # notice how 'yellow' has disappeared
rbg_fac
red green blue
661 682 656
table(rbg_fac, useNA = "ifany") # 'yellow' was turned into an NA
rbg_fac
red green blue <NA>
661 682 656 1
(9) There are three important advantages to providing factor levels:
- We can reorder the levels to any order we want (instead of having them alphabetically ordered). This way related levels can appear next to each other in summaries and plots.
- The factor levels don't have to be limited to what's in the data: we can provide additional levels that are not part of the data if we expect them to be part of future data. This way levels that are not in the data can still be represented in summaries and plots.
- Factor levels that are in the data, but not relevant to the analysis can be ignored (replaced with NAs) by not including them in levels. Note that doing so results in information loss if we overwrite the original column.
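The three points above can be seen together in one short sketch on toy data (x_chr and x_fac are hypothetical names, and the values are invented for illustration):

```r
# Toy data: "yellow" is in the data, "pink" is not
x_chr <- c("red", "green", "blue", "yellow", "green")

# Custom order, one extra level ("pink"), and "yellow" deliberately left out
x_fac <- factor(x_chr, levels = c("red", "green", "blue", "pink"))

levels(x_fac)                  # "red" "green" "blue" "pink": our order, not alphabetical
table(x_fac, useNA = "ifany")  # pink has a count of 0; the "yellow" entry is counted under <NA>

# x_chr is left untouched, so the dropped "yellow" entry can still be recovered
x_chr[is.na(x_fac)]            # "yellow"
```

Storing the factor under a new name instead of overwriting x_chr is what keeps the original values available.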