Solutions
Here are some of the ways we can clean the data:
- tpep_pickup_datetime and tpep_dropoff_datetime should be datetime columns, not character
- rate_code_id and payment_type should be factor columns, not character
- the geographical coordinates for pick-up and drop-off occasionally fall outside a reasonable bound (probably due to error)
- fare_amount is sometimes negative (could be refunds, could be errors, could be something else)
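The type conversions listed above could look roughly like the following in base R. This is only a sketch: the data frame name nyc_taxi, the toy values, the timestamp format, and the UTC time zone are all assumptions made for illustration, so adjust them to match the real data.

```r
# Hypothetical toy data standing in for the taxi data; all values are made up.
nyc_taxi <- data.frame(
  tpep_pickup_datetime  = c("2016-01-01 00:05:00", "2016-01-01 00:10:00", "2016-01-01 00:20:00"),
  tpep_dropoff_datetime = c("2016-01-01 00:15:00", "2016-01-01 00:30:00", "2016-01-01 00:45:00"),
  rate_code_id          = c("1", "2", "1"),
  payment_type          = c("1", "2", "1"),
  fare_amount           = c(8.5, -3.0, 12.0),
  stringsAsFactors      = FALSE
)

# Assumed timestamp layout and time zone; change these to match the actual file.
fmt <- "%Y-%m-%d %H:%M:%S"
nyc_taxi$tpep_pickup_datetime  <- as.POSIXct(nyc_taxi$tpep_pickup_datetime,  format = fmt, tz = "UTC")
nyc_taxi$tpep_dropoff_datetime <- as.POSIXct(nyc_taxi$tpep_dropoff_datetime, format = fmt, tz = "UTC")

# Convert the categorical codes from character to factor.
nyc_taxi$rate_code_id <- factor(nyc_taxi$rate_code_id)
nyc_taxi$payment_type <- factor(nyc_taxi$payment_type)

# For a factor, summary() reports a count per level instead of just the length.
print(summary(nyc_taxi$payment_type))

# Count the negative fares so we know how many rows need a closer look.
n_negative_fares <- sum(nyc_taxi$fare_amount < 0, na.rm = TRUE)
print(n_negative_fares)
```

Counting the negative fares before changing them keeps the decision about refunds versus errors separate from the type fixes.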
Some data-cleaning jobs depend on the analysis. For example, turning payment_type into a factor is unnecessary if we don't intend to use it as a categorical variable in the model. Even so, we might still benefit from turning it into a factor so that we can see counts for it when we run summary on the data, or have it show the proper labels when we use it in a plot.
Other data-cleaning jobs, on the other hand, relate to data quality issues. For example, unreasonable bounds for pick-up or drop-off coordinates can be due to error. In such cases, we must decide whether to handle the problem by
- removing rows that have incorrect information for some columns, even though other columns might still be correct
- replacing the incorrect information with NAs and deciding whether we should impute the missing values somehow
- leaving the data as is, but thinking about how doing so could skew some results from our analysis
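The three options above can be compared side by side on a small example. The sketch below is illustrative only: the data frame trips, the column names pickup_longitude and pickup_latitude, and the bounding box are assumptions rather than values taken from the data, and the code assumes the coordinates contain no missing values to begin with.

```r
# Hypothetical toy data; column names and values are assumptions.
trips <- data.frame(
  pickup_longitude = c(-73.98, 0, -73.95, -74.01),
  pickup_latitude  = c(40.75, 0, 40.78, 40.71),
  fare_amount      = c(10, 7.5, 12, 9)
)

# Illustrative bounding box (an assumption; pick bounds that suit the analysis).
lon_range <- c(-75, -73)
lat_range <- c(40, 41.5)

# Flag rows whose pick-up point falls outside the box.
out_of_bounds <-
  trips$pickup_longitude < lon_range[1] | trips$pickup_longitude > lon_range[2] |
  trips$pickup_latitude  < lat_range[1] | trips$pickup_latitude  > lat_range[2]

# Option 1: remove the offending rows entirely (their other columns are lost too).
trips_dropped <- trips[!out_of_bounds, ]

# Option 2: keep the rows but replace the bad coordinates with NA.
trips_na <- trips
trips_na$pickup_longitude[out_of_bounds] <- NA
trips_na$pickup_latitude[out_of_bounds]  <- NA

# Option 3: leave the data as is, but record how much of it is affected.
share_out_of_bounds <- mean(out_of_bounds)
print(share_out_of_bounds)
```

Option 2 keeps the fare for the flagged trip available to analyses that do not use location, which is the main reason to prefer it over dropping the row.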