Solutions

(1) The character vector rbg_chr prints with quotes around each value. The factor rbg_fac prints without quotes and lists its levels at the bottom.

head(rbg_chr) # we see quotes
[1] "green"  "green"  "yellow" "green"  "green"  "red"
head(rbg_fac) # we don't see quotes and we see the factor levels at the bottom
[1] blue  green green green red   green
Levels: blue green pink red

(2) A factor column tends to take up less space than a character column, the more so when the strings in the character column are longer. This is because a factor column stores the information as integers under the hood, with a mapping from each integer to the string it represents.

sprintf("Size as characters: %s. Size as factor: %s", 
        object.size(rbg_chr), object.size(rbg_fac))
[1] "Size as characters: 16184. Size as factor: 8624"
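
To make the integer storage concrete, here is a minimal sketch on a small made-up vector. The name colors_fac is hypothetical and is not part of the exercise data.

```r
# A small illustrative factor (hypothetical data, not the exercise's rbg_fac)
colors_fac <- factor(c("green", "blue", "green", "red"))

levels(colors_fac)      # the lookup table: "blue" "green" "red"
as.integer(colors_fac)  # the stored codes: 2 1 2 3
typeof(colors_fac)      # "integer"

# Indexing the levels by the codes recovers the original strings
recovered <- levels(colors_fac)[as.integer(colors_fac)]
recovered               # "green" "blue" "green" "red"
```

Each distinct string is stored once in the levels, and every entry only holds a small integer pointing at it.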

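(3) The table for rbg_fac includes a count of 0 for "pink", because it is one of the factor levels even though it never appears in the data. The table for rbg_chr only shows values that are present.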

table(rbg_chr)
rbg_chr
 blue green   red 
  695   647   658
table(rbg_fac) # we can see a count of 0 for 'pink', because it's one of the factor levels
rbg_fac
 blue green  pink   red 
  695   647     0   658

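(4) Changing an entry in a factor column to a value other than one of its acceptable levels will result in an NA. Notice that R only issues a warning ("invalid factor level, NA generated"), not an error, so the assignment goes through and the problem is easy to miss.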

head(rbg_chr) # the 3rd entry changed to 'yellow'
[1] "blue"   "green"  "yellow" "green"  "red"    "green"
head(rbg_fac) # we could not change the 3rd entry to 'yellow' because it's not one of the factor levels
[1] blue  green <NA>  green red   green
Levels: blue green pink red
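
To check what R reports when an invalid level is assigned, here is a minimal sketch that records any warning raised by the assignment. It uses a small made-up factor, and colors_fac and msg are hypothetical names.

```r
# Hypothetical factor with the same four levels as in the exercise
colors_fac <- factor(c("blue", "green", "red"),
                     levels = c("blue", "green", "pink", "red"))

# withCallingHandlers lets the assignment finish while we record the warning text
msg <- NULL
withCallingHandlers(
  {
    colors_fac[3] <- "yellow"  # 'yellow' is not one of the levels
  },
  warning = function(w) {
    msg <<- conditionMessage(w)     # keep the message for inspection
    invokeRestart("muffleWarning")  # suppress the console warning
  }
)

msg                    # the warning text, if one was raised
is.na(colors_fac[3])   # TRUE: the entry became NA
levels(colors_fac)     # the levels themselves are unchanged
```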

(5) We simply re-assign the factor levels, but we must be careful to provide the new levels in the same order as the old ones.

levels(rbg_fac) <- c('Blue', 'Green', 'Pink', 'Red') # we capitalize the first letters
head(rbg_fac)
[1] Green Red   <NA>  Red   Red   Red  
Levels: Blue Green Pink Red
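
The ordering caveat can be illustrated on a small made-up factor. The sketch below first shows how a positional assignment in the wrong order silently relabels the data, then one possible safeguard: assigning a named list of the form new = "old", which matches levels by name rather than by position. The names wrong_fac and safe_fac are hypothetical.

```r
# Levels are sorted alphabetically by default: "blue" "green" "red"
wrong_fac <- factor(c("green", "red", "blue", "red"))

# Positional assignment in the wrong order: 'blue' becomes 'Red', 'red' becomes 'Blue'
levels(wrong_fac) <- c("Red", "Green", "Blue")
as.character(wrong_fac)  # "Green" "Blue" "Red" "Blue"

# Named-list assignment: each element is new_name = "old_name",
# so the order in which we write the pairs cannot mislabel the data
safe_fac <- factor(c("green", "red", "blue", "red"))
levels(safe_fac) <- list(Red = "red", Blue = "blue", Green = "green")
as.character(safe_fac)   # "Green" "Red" "Blue" "Red"
```

With the list form, the new levels take the order in which the pairs are listed.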

(6) We simply append "Yellow" to the old factor levels and assign this as the new factor levels.

levels(rbg_fac) <- c(levels(rbg_fac), "Yellow") # we add 'Yellow' as a new factor level
table(rbg_fac) # even though the data has no 'Yellow' entries, it's an acceptable value
rbg_fac
  Blue  Green   Pink    Red Yellow 
   656    682      0    661      0

(7) Since "Yellow" is one of the levels now, we can change any entry to "Yellow" and we won't get an NA anymore.

rbg_fac[3] <- "Yellow" # does not throw a warning anymore
head(rbg_fac) # now the data has one 'Yellow' entry
[1] Green  Red    Yellow Red    Red    Red   
Levels: Blue Green Pink Red Yellow

(8) We use the levels argument of the factor function. Since "yellow" was one of the entries in rbg_chr and we are not specifying "yellow" as one of the factor levels we want, it will be turned into an NA.

table(rbg_chr)
rbg_chr
  blue  green    red yellow 
   656    682    661      1
rbg_fac <- factor(rbg_chr, levels = c('red', 'green', 'blue')) # create a `factor`, with only the levels provided, in the order provided
table(rbg_fac) # notice how 'yellow' has disappeared
rbg_fac
  red green  blue 
  661   682   656
table(rbg_fac, useNA = "ifany") # 'yellow' was turned into an NA
rbg_fac
  red green  blue  <NA> 
  661   682   656     1

(9) There are three important advantages to providing factor levels:

  1. We can reorder the levels to any order we want (instead of having them alphabetically ordered). This way related levels can appear next to each other in summaries and plots.
  2. The factor levels don't have to be limited to what's in the data: we can provide additional levels that are not part of the data if we expect them to be part of future data. This way levels that are not in the data can still be represented in summaries and plots.
  3. Factor levels that are in the data but not relevant to the analysis can be ignored (replaced with NAs) by leaving them out of levels. Note that doing so results in information loss if we overwrite the original column.
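
All three points can be seen together in a minimal sketch on made-up data. The vector sizes_chr and its values are hypothetical and are not part of the exercise.

```r
# Hypothetical character data with one value we do not care about ('unknown')
sizes_chr <- c("small", "large", "medium", "small", "unknown")

# 1. levels are given in a meaningful order rather than alphabetically
# 2. 'xlarge' is an allowed level even though it is absent from the data
# 3. 'unknown' is left out of levels, so it becomes NA
sizes_fac <- factor(sizes_chr, levels = c("small", "medium", "large", "xlarge"))

counts <- table(sizes_fac, useNA = "ifany")
counts  # small 2, medium 1, large 1, xlarge 0, <NA> 1
```

Storing the result under a new name (sizes_fac) rather than overwriting sizes_chr keeps the original values available, which avoids the information loss mentioned in point 3.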
