Dealing with factors

It's time to turn our attention to the categorical columns in the dataset. Ideally, categorical columns should be turned into factors (usually from character or integer), because a factor is the appropriate data type for a categorical column. When we loaded the data into R using read.csv, we set stringsAsFactors = FALSE to prevent any character columns from being turned into factors. This is generally a good idea, because some character columns (such as columns containing raw text or alphanumeric IDs) are not appropriate for factors. Accidentally turning such columns into factors can add overhead, especially when the data is large, because R has to keep a tally of all the factor levels. We do not have any character columns in this dataset that need to be converted to factors, but we do have integer columns that represent categorical data. These are the columns with low cardinality, as can be seen here:

sapply(nyc_taxi, num.distinct)
             VendorID       pickup_datetime      dropoff_datetime 
                    2               3337726               3338601 
      passenger_count         trip_distance      pickup_longitude 
                   10                  3632                 28324 
      pickup_latitude          rate_code_id     dropoff_longitude 
                52983                     7                 41876 
     dropoff_latitude          payment_type           fare_amount 
                73777                     4                  1162 
                extra               mta_tax            tip_amount 
                   31                     8                  2901 
         tolls_amount improvement_surcharge          total_amount 
                  649                     2                  9163

Fortunately, the site that hosted the dataset also provides us with a data dictionary. Going over the document helps answer what the categorical columns are and what each category represents.

For example, for rate_code_id, the mapping is as follows:

  • 1 = Standard rate
  • 2 = JFK
  • 3 = Newark
  • 4 = Nassau or Westchester
  • 5 = Negotiated fare
  • 6 = Group ride

The above information helps us properly label the factor levels.

Notice that, because these columns are still stored as integers, summary currently treats them as numeric and reports quartiles and means rather than counts.

summary(nyc_taxi[ , c('rate_code_id', 'payment_type')]) # shows numeric summaries for both columns
  rate_code_id  payment_type 
 Min.   : 1    Min.   :1.00  
 1st Qu.: 1    1st Qu.:1.00  
 Median : 1    Median :1.00  
 Mean   : 1    Mean   :1.38  
 3rd Qu.: 1    3rd Qu.:2.00  
 Max.   :99    Max.   :4.00

A quick glance at payment_type shows that two payment types are by far the most common. The data dictionary confirms that they correspond to card and cash payments.

table(nyc_taxi$payment_type)
      1       2       3       4 
2417055 1419764   11998    3545
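
To put a number on "by far the most common", the counts can be converted into shares. The sketch below retypes the counts printed above into a small named vector so that it runs without the full dataset; payment_counts, payment_shares and top_two_share are names introduced here for illustration. On the real data, prop.table(table(nyc_taxi$payment_type)) should give the same shares.

```r
# Counts copied from the table() output above (illustrative stand-in for the real column)
payment_counts <- c(`1` = 2417055, `2` = 1419764, `3` = 11998, `4` = 3545)

# prop.table() divides each count by the total, giving each category's share
payment_shares <- prop.table(payment_counts)
round(payment_shares, 4)

# Combined share of the two most common payment types
top_two_share <- sum(payment_shares[c('1', '2')])
top_two_share
```

Knowing the combined share of the first two categories makes it easier to judge how much information is lost if the remaining categories are dropped.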

We now turn both rate_code_id and payment_type into factor columns. For rate_code_id we keep all six categories from the data dictionary, but for payment_type we keep only the two most common ones and label them 'card' and 'cash'. We do so by specifying levels = 1:2 instead of levels = 1:6 and providing labels for only the first two categories. Any value not listed in levels is replaced with NA, so the other values of payment_type get lumped together as NAs. This results in information loss, which we are comfortable with for the sake of this analysis.

nyc_taxi <- transform(nyc_taxi,
                      rate_code_id = factor(rate_code_id,
                                            levels = 1:6,
                                            labels = c('standard', 'JFK', 'Newark',
                                                       'Nassau or Westchester',
                                                       'negotiated', 'group ride')),
                      payment_type = factor(payment_type,
                                            levels = 1:2,
                                            labels = c('card', 'cash')))
head(nyc_taxi[ , c('rate_code_id', 'payment_type')]) # now proper labels are showing in the data
 rate_code_id payment_type
1     standard         card
2     standard         card
3     standard         card
...
summary(nyc_taxi[ , c('rate_code_id', 'payment_type')]) # now counts are showing in the summary
                rate_code_id     payment_type  
 standard             :3758230   card:2417055  
 JFK                  :  75423   cash:1419764  
 Newark               :   6243   NA's:  15543  
 Nassau or Westchester:   1300                 
 negotiated           :  11039                 
 group ride           :     33                 
 NA's                 :     94

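It is very important that the labels be in the same order as the levels they map to, because factor pairs them up by position: the first label goes with the first level, the second with the second, and so on.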
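
The following is a minimal sketch of what goes wrong when the order does not match, using a made-up toy vector rather than the taxi data. The names codes, right and wrong are hypothetical, and 99 simply stands in for any raw value that is not listed in levels.

```r
# Toy vector of raw integer codes; 99 is a value not listed in levels
codes <- c(1L, 2L, 2L, 3L, 99L)

# Labels are matched to levels by position: 1 -> 'standard', 2 -> 'JFK', 3 -> 'Newark'
right <- factor(codes, levels = 1:3, labels = c('standard', 'JFK', 'Newark'))

# Same levels, labels in a different order: code 1 is now silently labelled 'Newark'
wrong <- factor(codes, levels = 1:3, labels = c('Newark', 'JFK', 'standard'))

as.character(right)
as.character(wrong)

# Values outside levels become NA; listing them before converting is a cheap check
sum(is.na(right))
setdiff(unique(codes), 1:3)
```

R raises no error or warning in the mislabelled case, so a check like the last two lines, or a comparison of table() output before and after the conversion, is one way to catch mistakes early.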

What about passenger_count? Should it be treated as a factor or left as an integer? The answer depends on how it will be used, especially in the context of modeling. Most of the time, such a column is best left as an integer in the data and converted into a factor on the fly when needed (such as when we want to see counts, or when we want a model to treat the column as categorical).
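
As a rough illustration of on-the-fly conversion, the sketch below uses a small made-up data frame (toy, with invented values) in place of nyc_taxi. The model formulas are assumptions chosen only to show the mechanics and are not part of the analysis in this chapter.

```r
# Small made-up data frame standing in for nyc_taxi; values are illustrative only
toy <- data.frame(passenger_count = c(1L, 1L, 2L, 5L, 1L, 2L),
                  tip_amount = c(1.5, 2.0, 0.0, 3.0, 1.0, 2.5))

# Left as integer, summary() reports quartiles and the mean
summary(toy$passenger_count)

# Converted on the fly, summary() reports counts per category instead
summary(factor(toy$passenger_count))

# In a model formula, the integer column gets a single slope, while factor()
# gets one coefficient per level other than the reference level
fit_numeric <- lm(tip_amount ~ passenger_count, data = toy)
fit_factor  <- lm(tip_amount ~ factor(passenger_count), data = toy)
length(coef(fit_numeric))
length(coef(fit_factor))

# The stored column is unchanged by any of the calls above
class(toy$passenger_count)
```

Because factor() is applied inside the call, the column in the data stays an integer and can still be used for arithmetic elsewhere.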

Our data cleaning is done for now. We are ready to add new features to the data, but before we do so, let's briefly revisit what we have done so far and see if we could have taken any shortcuts. That is the subject of the next chapter.
