Dealing with factors
It's time to turn our attention to the categorical columns in the dataset. Ideally, categorical columns should be turned into factors (usually from character or integer), since factor is the appropriate data type for a categorical column. When we loaded the data in R using read.csv, we set stringsAsFactors = FALSE to prevent any character columns from being turned into factors. This is generally a good idea, because some character columns (such as columns with raw text in them or alphanumeric ID columns) are not appropriate for factors. Accidentally turning such columns into factors can add overhead, especially when the data is large, because R has to keep track of all the factor levels. We do not have any character columns in this dataset that need to be converted to factors, but we do have integer columns that represent categorical data. These are the columns with low cardinality, as can be seen here:
sapply(nyc_taxi, num.distinct)
             VendorID       pickup_datetime      dropoff_datetime
                    2               3337726               3338601
      passenger_count         trip_distance      pickup_longitude
                   10                  3632                 28324
      pickup_latitude          rate_code_id     dropoff_longitude
                52983                     7                 41876
     dropoff_latitude          payment_type           fare_amount
                73777                     4                  1162
                extra               mta_tax            tip_amount
                   31                     8                  2901
         tolls_amount improvement_surcharge          total_amount
                  649                     2                  9163
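The helper num.distinct is not defined in this section. As a minimal sketch, assuming it simply counts the unique values in a column, it could look like the following. The data frame toy_df is made up for illustration and only stands in for nyc_taxi.

```r
# Assumed definition (not shown in this section): number of distinct values
num.distinct <- function(x) length(unique(x))

# Small made-up data frame standing in for nyc_taxi
toy_df <- data.frame(
  payment_type = c(1L, 2L, 1L, 1L, 2L, 3L),
  fare_amount  = c(5.5, 12.0, 7.5, 5.5, 30.0, 9.0),
  stringsAsFactors = FALSE
)

# One count per column; a low count hints at a categorical column
distinct_counts <- sapply(toy_df, num.distinct)
print(distinct_counts)
```

A low count is only a hint: whether a column is really categorical still has to be confirmed against the data dictionary.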
Fortunately, the site that hosted the dataset also provides us with a data dictionary. Going over the document helps answer what the categorical columns are and what each category represents.
For example, for rate_code_id, the mapping is as follows:
- 1 = Standard rate
- 2 = JFK
- 3 = Newark
- 4 = Nassau or Westchester
- 5 = Negotiated fare
- 6 = Group ride
The above information helps us properly label the factor levels.
Notice how summary currently shows numeric summaries for the categorical columns.
summary(nyc_taxi[ , c('rate_code_id', 'payment_type')]) # shows numeric summaries for both columns
  rate_code_id  payment_type
 Min.   : 1     Min.   :1.00
 1st Qu.: 1     1st Qu.:1.00
 Median : 1     Median :1.00
 Mean   : 1     Mean   :1.38
 3rd Qu.: 1     3rd Qu.:2.00
 Max.   :99     Max.   :4.00
A quick glance at payment_type shows that two payment types are by far the most common. The data dictionary confirms that they correspond to card and cash payments.
table(nyc_taxi$payment_type)
      1       2       3       4
2417055 1419764   11998    3545
We now turn both rate_code_id and payment_type into factor columns. For rate_code_id we keep all the labels, but for payment_type we keep only the two most common categories and label them 'card' and 'cash'. We do so by specifying levels = 1:2 instead of levels = 1:6 and providing labels for only the first two categories. This means the other values of payment_type get lumped together and replaced with NAs, resulting in information loss (which we are comfortable with, for the sake of this analysis). The same happens to any rate_code_id value outside 1:6, such as the 99 that appears as the maximum in the numeric summary above.
nyc_taxi <- transform(nyc_taxi,
  rate_code_id = factor(rate_code_id,
    levels = 1:6, labels = c('standard', 'JFK', 'Newark', 'Nassau or Westchester', 'negotiated', 'group ride')),
  payment_type = factor(payment_type,
    levels = 1:2, labels = c('card', 'cash')
))
head(nyc_taxi[ , c('rate_code_id', 'payment_type')]) # now proper labels are showing in the data
  rate_code_id payment_type
1     standard         card
2     standard         card
3     standard         card
...
summary(nyc_taxi[ , c('rate_code_id', 'payment_type')]) # now counts are showing in the summary
                rate_code_id     payment_type
 standard             :3758230   card:2417055
 JFK                  :  75423   cash:1419764
 Newark               :   6243   NA's:  15543
 Nassau or Westchester:   1300
 negotiated           :  11039
 group ride           :     33
 NA's                 :     94
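To see why the unlisted categories end up as NAs, here is a small sketch on made-up payment codes (the vector codes is hypothetical and not taken from nyc_taxi). Any value that does not appear in levels is turned into NA by factor.

```r
# Made-up payment codes, including two codes (3 and 4) outside levels = 1:2
codes <- c(1L, 2L, 1L, 3L, 4L, 2L)

payment <- factor(codes, levels = 1:2, labels = c('card', 'cash'))

# useNA = 'ifany' makes table() also report the values that became NA
payment_counts <- table(payment, useNA = 'ifany')
print(payment_counts)

# The number of NAs equals the number of codes outside 1:2
n_dropped <- sum(is.na(payment))
```

Comparing the NA count with the counts of the dropped categories before the conversion is a quick sanity check that nothing else was lost.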
It is very important that the labels be in the same order as the levels they map into.
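The following sketch, on a made-up vector of codes, illustrates what can go wrong: factor pairs levels and labels purely by position, so swapping the labels mislabels every value without any warning.

```r
# Made-up codes: 1 is meant to be 'card', 2 is meant to be 'cash'
codes <- c(1L, 2L, 2L, 1L)

# Labels listed in the same order as the levels: 1 -> 'card', 2 -> 'cash'
right <- factor(codes, levels = 1:2, labels = c('card', 'cash'))

# Same levels, labels swapped: every value is now silently mislabeled
wrong <- factor(codes, levels = 1:2, labels = c('cash', 'card'))

print(data.frame(codes, right, wrong))
```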
What about passenger_count? Should it be treated as a factor or left as integer? The answer is that it depends on how it will be used, especially in the context of modeling. Most of the time, such a column is best left as integer in the data and converted into a factor on the fly when need be (such as when we want to see counts, or when we want a model to treat the column as a factor).
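As a rough sketch of what on-the-fly conversion can look like, the example below wraps the column in factor at the point of use and leaves the stored column untouched. The data frame toy_df and the choice of tip_amount as the response are assumptions made for illustration only.

```r
# Made-up data standing in for nyc_taxi
toy_df <- data.frame(
  passenger_count = c(1L, 2L, 1L, 5L, 1L, 2L, 3L, 1L),
  tip_amount      = c(1.0, 2.5, 0.0, 3.0, 1.5, 2.0, 2.2, 0.5)
)

# Counts per category, without changing the column's type
count_tab <- table(factor(toy_df$passenger_count))
print(count_tab)

# Inside a model formula, factor() makes lm() estimate one coefficient per
# level (beyond the baseline) instead of a single slope
fit <- lm(tip_amount ~ factor(passenger_count), data = toy_df)
print(coef(fit))

# The column stored in the data frame is still integer
print(class(toy_df$passenger_count))
```

Leaving the column as integer keeps numeric operations (such as averaging the passenger count) available, while the factor view is created only where it is needed.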
Our data cleaning is done for now. We are now ready to add new features to the data, but before we do so, let's briefly revisit what we have done so far and see if we could have taken any shortcuts. That is the subject of the next chapter.