Solutions

Here are some of the ways we can clean the data:

tpep_pickup_datetime and tpep_dropoff_datetime should be datetime columns, not character columns
rate_code_id and payment_type should be factors, not character columns
the geographical coordinates for pick-up and drop-off occasionally fall outside reasonable bounds (probably due to error)
fare_amount is sometimes negative (could be refunds, could be errors, could be something else)
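
As a rough illustration of the first two items and of the negative-fare check, here is a minimal base R sketch. The small data frame is made up for the example, and the datetime format string and the UTC time zone are assumptions; the real data may need a different format or time zone.

```r
# Toy data frame standing in for the taxi data (all values are made up).
taxi <- data.frame(
  tpep_pickup_datetime  = c("2016-01-01 00:05:00", "2016-01-01 00:10:00", "2016-01-01 00:20:00"),
  tpep_dropoff_datetime = c("2016-01-01 00:15:00", "2016-01-01 00:32:00", "2016-01-01 00:25:00"),
  rate_code_id = c("1", "2", "1"),
  payment_type = c("1", "2", "1"),
  fare_amount  = c(8.5, 52.0, -4.0),
  stringsAsFactors = FALSE
)

# Character -> datetime. The format string and time zone are assumptions.
fmt <- "%Y-%m-%d %H:%M:%S"
taxi$tpep_pickup_datetime  <- as.POSIXct(taxi$tpep_pickup_datetime,  format = fmt, tz = "UTC")
taxi$tpep_dropoff_datetime <- as.POSIXct(taxi$tpep_dropoff_datetime, format = fmt, tz = "UTC")

# Character -> factor.
taxi$rate_code_id <- factor(taxi$rate_code_id)
taxi$payment_type <- factor(taxi$payment_type)

# summary() on a factor reports the count for each level.
payment_counts <- summary(taxi$payment_type)
print(payment_counts)

# Count negative fares before deciding how to treat them.
n_negative_fares <- sum(taxi$fare_amount < 0)
print(n_negative_fares)
```

Counting the negative fares first, rather than dropping them right away, keeps the question of whether they are refunds or errors open until we have looked at them.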

Some data-cleaning jobs depend on the analysis. For example, turning payment_type into a factor is unnecessary if we don't intend to use it as a categorical variable in the model. Even so, we might still benefit from turning it into a factor so that we can see counts for it when we run summary on the data, or have it show the proper labels when we use it in a plot. Other data-cleaning jobs, on the other hand, relate to data-quality issues. For example, pick-up or drop-off coordinates that fall outside reasonable bounds can be due to error. In such cases, we must decide whether we should

remove rows that have incorrect information for some columns, even though other columns might still be correct
replace the incorrect information with NAs and decide whether we should impute missing values somehow
leave the data as is, but think about how doing so could skew some results from our analysis
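
The three options above can be sketched in base R as follows. The column names pickup_longitude and pickup_latitude, the toy values, and the bounding box are all assumptions made for illustration; reasonable bounds for the real data have to be chosen by looking at it.

```r
# Toy data; column names and values are illustrative assumptions.
trips <- data.frame(
  pickup_longitude = c(-73.98, 0.00, -73.95, -74.01),
  pickup_latitude  = c(40.75, 0.00, 40.78, 40.71),
  fare_amount      = c(12.5, 9.0, 7.5, 20.0)
)

# Hypothetical bounding box for "reasonable" coordinates.
lon_range <- c(-75, -73)
lat_range <- c(38, 41)

# TRUE for rows whose pick-up point lies inside the box.
in_bounds <- with(trips,
  pickup_longitude >= lon_range[1] & pickup_longitude <= lon_range[2] &
  pickup_latitude  >= lat_range[1] & pickup_latitude  <= lat_range[2])

# Option 1: remove the whole row, losing its other (possibly correct) columns.
trips_dropped <- trips[in_bounds, ]

# Option 2: keep the row but replace the suspect coordinates with NA.
trips_na <- trips
trips_na$pickup_longitude[!in_bounds] <- NA
trips_na$pickup_latitude[!in_bounds]  <- NA

# Option 3: leave the data as is, but measure how much of it is affected.
share_out_of_bounds <- mean(!in_bounds)
print(share_out_of_bounds)
```

With the second option the fare_amount of the affected row is still available for analyses that do not use the coordinates, which is the main reason to prefer it over dropping rows.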
