Solutions

Here are some of the ways we can clean the data:

  • tpep_pickup_datetime and tpep_dropoff_datetime should be datetime columns, not character
  • rate_code_id and payment_type should be factors, not character columns
  • the geographical coordinates for pick-up and drop-off occasionally fall outside a reasonable bound (probably due to error)
  • fare_amount is sometimes negative (could be refunds, could be errors, could be something else)
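As a rough illustration of the checks listed above, here is a minimal sketch in plain Python. The three sample rows are made up for the example, the coordinate column names (pickup_longitude, pickup_latitude) are assumed, and so are the datetime format and the bounding box used to define "reasonable" coordinates. It is not the cleaning code used for the real data.

```python
from collections import Counter
from datetime import datetime

# Made-up rows for illustration only; not taken from the real dataset.
# The coordinate column names are assumptions.
rows = [
    {"tpep_pickup_datetime": "2015-01-15 19:05:39", "payment_type": "1",
     "pickup_longitude": -73.99, "pickup_latitude": 40.75, "fare_amount": 12.0},
    {"tpep_pickup_datetime": "2015-01-15 19:10:02", "payment_type": "2",
     "pickup_longitude": 0.0, "pickup_latitude": 0.0, "fare_amount": 7.5},
    {"tpep_pickup_datetime": "2015-01-15 19:12:44", "payment_type": "1",
     "pickup_longitude": -73.95, "pickup_latitude": 40.78, "fare_amount": -3.0},
]

# Assumed rough bounding box around the city; adjust to the analysis at hand.
LON_RANGE = (-75.0, -73.0)
LAT_RANGE = (40.0, 41.5)


def in_range(value, bounds):
    """Return True when value lies inside the inclusive (low, high) bounds."""
    low, high = bounds
    return low <= value <= high


# Convert the character timestamps into datetime objects (assumed format).
for row in rows:
    row["tpep_pickup_datetime"] = datetime.strptime(
        row["tpep_pickup_datetime"], "%Y-%m-%d %H:%M:%S"
    )

# Treat payment_type as categorical: count how often each level occurs.
payment_counts = Counter(row["payment_type"] for row in rows)

# Collect the indices of rows whose pick-up point is outside the bounding box.
bad_coords = [
    i for i, row in enumerate(rows)
    if not (in_range(row["pickup_longitude"], LON_RANGE)
            and in_range(row["pickup_latitude"], LAT_RANGE))
]

# Collect the indices of rows with a negative fare.
negative_fares = [i for i, row in enumerate(rows) if row["fare_amount"] < 0]

print(payment_counts, bad_coords, negative_fares)
```

The sketch only flags suspicious rows. What to do with them is a separate decision.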

Some data-cleaning jobs depend on the analysis. For example, turning payment_type into a factor is unnecessary if we don't intend to use it as a categorical variable in the model. Even so, we might still benefit from turning it into a factor, so that summary shows counts for it or so that it displays the proper labels when we use it in a plot.

Other data-cleaning jobs relate to data quality issues. For example, pick-up or drop-off coordinates that fall outside reasonable bounds can be due to error. In such cases, we must decide whether we should clean the data by

  • removing rows that have incorrect information in some columns, even though other columns might still be correct
  • replacing the incorrect information with NAs and deciding whether to impute the missing values somehow
  • leaving the data as is, but thinking about how doing so could skew some results of our analysis
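To make the trade-off concrete, the three options can be sketched in plain Python on a few made-up rows. The column names, the bounding box, and the use of None as a stand-in for NA are all assumptions made for this illustration.

```python
# Made-up rows for illustration only; the second row has bad coordinates.
rows = [
    {"pickup_longitude": -73.99, "pickup_latitude": 40.75, "fare_amount": 12.0},
    {"pickup_longitude": 0.0, "pickup_latitude": 0.0, "fare_amount": 7.5},
    {"pickup_longitude": -73.95, "pickup_latitude": 40.78, "fare_amount": 9.0},
]


def coords_ok(row):
    """Assumed bounding box for 'reasonable' pick-up coordinates."""
    return (-75.0 <= row["pickup_longitude"] <= -73.0
            and 40.0 <= row["pickup_latitude"] <= 41.5)


# Option 1: remove the whole row, losing its (possibly correct) fare_amount.
dropped = [row for row in rows if coords_ok(row)]

# Option 2: keep the row but blank out only the suspect values.
masked = []
for row in rows:
    new_row = dict(row)  # copy so the original rows stay untouched
    if not coords_ok(row):
        new_row["pickup_longitude"] = None
        new_row["pickup_latitude"] = None
    masked.append(new_row)

# Option 3: leave the data alone and check how much a summary is skewed.
all_lons = [row["pickup_longitude"] for row in rows]
valid_lons = [row["pickup_longitude"] for row in rows if coords_ok(row)]
mean_all = sum(all_lons) / len(all_lons)
mean_valid = sum(valid_lons) / len(valid_lons)

print(len(dropped), masked[1], mean_all, mean_valid)
```

In this toy example the single bad row pulls the mean longitude far away from the valid values, which is the kind of skew the third option asks us to think about. The second option keeps the fare for that row, which the first option throws away.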
