Basic summaries

After str, summary is probably the most commonly used R function for a first look at data. It reports summary statistics for each column, and which statistics appear depends on the column's type. Just like str, summary gives clues about how we need to clean the data. For example:

  • tpep_pickup_datetime and tpep_dropoff_datetime should be datetime columns, not character
  • rate_code_id and payment_type should be factors, not character
  • the geographical coordinates for pick-up and drop-off occasionally fall outside a reasonable bound (probably due to error)
  • fare_amount is sometimes negative (could be refunds, could be errors, could be something else)

Once we clean the data (next chapter), we will rerun summary and see the appropriate summary statistics for the columns that have been converted to the right classes.
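
As a small illustration of how summary depends on column type, the sketch below uses a made-up data frame (toy and its values are assumptions, not the taxi data) and summarizes the same column first as character and then as a factor.

```r
# Hypothetical toy data standing in for the taxi data
toy <- data.frame(
  payment_type = c("1", "2", "1", "1"),
  fare_amount  = c(7.5, -3.0, 12.0, 9.5),
  stringsAsFactors = FALSE
)

summary(toy)   # payment_type is character: only length, class and mode are shown

# Convert the column to a factor and summarize again
toy$payment_type <- factor(toy$payment_type)
summary(toy)   # payment_type now shows a count for each level
```

The numeric column gets the usual minimum, quartiles, mean and maximum in both calls, which is also how a negative fare would show up.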

What if there are summaries we don't see? We can just write our own summary function, and here's an example. The num.distinct function will return the number of unique elements in a vector. Most of the work is done for us: the unique function returns the unique elements of a vector, and the length function counts how many there are. Notice how the function is commented with information about input types and output.

num.distinct <- function(x) {
  # returns the number of distinct values of a vector `x`
  # `x` can be numeric (floats are not recommended), character, logical, or factor
  # to see why floats are a bad idea, try this:
  # unique(c(.3, .4 - .1, .5 - .2, .6 - .3, .7 - .4))
  length(unique(x))
}
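
The floating-point caveat in the comments can be checked directly. In the sketch below, num.distinct.rounded is a hypothetical variant (not part of the original material) that rounds before counting; the choice of 10 digits is an arbitrary assumption and should be adapted to the precision of the data.

```r
x <- c(.3, .4 - .1, .5 - .2, .6 - .3, .7 - .4)

print(x)              # every element prints as 0.3 at default precision
print(x, digits = 17) # more digits reveal small representation differences
length(unique(x))     # more than 1, even though all values "should" be 0.3

# Hypothetical variant: round first so nearly equal floats collapse together
num.distinct.rounded <- function(x, digits = 10) {
  length(unique(round(x, digits)))
}
num.distinct.rounded(x)
```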

It's usually a good idea to test the function on a few small inputs before we run it on the larger data. We should also test the function on 'unusual' inputs to see if it does what we expect from it.

num.distinct(c(5, 6, 6, 9))
[1] 3
num.distinct(1) # test the function on a singleton (a vector of length 1)
[1] 1
num.distinct(c()) # test the function on an empty vector
[1] 0
num.distinct(c(23, 45, 45, NA, 11, 11)) # test the function on a vector with NAs (NA counts as a distinct value)
[1] 4
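
The same checks can be written as assertions so that a failure stops with an error instead of relying on us to read the output. This is only a sketch: the extra character, logical and factor inputs are illustrative choices, and the function is repeated so the block runs on its own.

```r
num.distinct <- function(x) length(unique(x))

# stopifnot() raises an error if any condition is not TRUE
stopifnot(
  num.distinct(c(5, 6, 6, 9)) == 3,
  num.distinct(c("a", "b", "a")) == 2,          # character input
  num.distinct(c(TRUE, FALSE, TRUE)) == 2,      # logical input
  num.distinct(factor(c("x", "y", "x"))) == 2   # factor input
)

# NA is counted as a distinct value; drop NAs first to exclude it
x <- c(23, 45, 45, NA, 11, 11)
num.distinct(x[!is.na(x)])
```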

Now we can test the function on the data, for example on pickup_longitude:

num.distinct(nyc_taxi$pickup_longitude) # check it on a single variable in our data
[1] 28392

But what if we wanted to run the function on all the columns in the data at once? We could write a loop, but instead we show you the sapply function, which accomplishes the same thing in a more succinct and R-like manner. With sapply, we pass the data as the first argument, and some function (usually a summary function) as the second argument: sapply runs the function on every column of the data and collects the results. For columns where the function does not apply, the result is typically NA or an error, depending on the function.

sapply(nyc_taxi, num.distinct) # apply it to each variable in the data
             VendorID  tpep_pickup_datetime tpep_dropoff_datetime 
                    2               3337727               3338601 
      passenger_count         trip_distance      pickup_longitude 
                   10                  3632                 28392 
      pickup_latitude          rate_code_id    store_and_fwd_flag 
                53199                     7                     2 
    dropoff_longitude      dropoff_latitude          payment_type 
                41964                 74370                     4 
          fare_amount                 extra               mta_tax 
                 1162                    31                     8 
           tip_amount          tolls_amount improvement_surcharge 
                 2901                   649                     2 
         total_amount                     u 
                 9163               3817983
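
To make the comparison with a loop concrete, here is a minimal sketch on a made-up data frame (toy is an assumption, standing in for nyc_taxi). Both versions produce the same named vector; sapply simply does the bookkeeping for us.

```r
num.distinct <- function(x) length(unique(x))

# Hypothetical toy data standing in for nyc_taxi
toy <- data.frame(
  a = c(1, 1, 2),
  b = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

# Loop version: pre-allocate a named result, then fill it column by column
loop_result <- integer(ncol(toy))
names(loop_result) <- names(toy)
for (col in names(toy)) {
  loop_result[col] <- num.distinct(toy[[col]])
}

# sapply version: one call, same named vector
sapply_result <- sapply(toy, num.distinct)
```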

Any additional argument to the summary function can be passed along through sapply. This feature makes sapply (and other similar functions) very powerful. For example, the mean function has an argument called na.rm for removing missing values. By default, na.rm is set to FALSE, so mean returns NA whenever the column contains a missing value; with na.rm = TRUE, missing values are dropped before averaging. Columns that are not numeric or logical, such as the character datetime columns, return NA (with a warning) regardless of na.rm.

sapply(nyc_taxi, mean) # returns the average of all columns in the data
             VendorID  tpep_pickup_datetime tpep_dropoff_datetime 
                1.521                    NA                    NA 
      passenger_count         trip_distance      pickup_longitude 
                1.677                15.305               -72.699 
      pickup_latitude          rate_code_id    store_and_fwd_flag 
               40.048                 1.038                    NA 
    dropoff_longitude      dropoff_latitude          payment_type 
              -72.739                40.071                 1.378 
          fare_amount                 extra               mta_tax 
               12.707                 0.315                 0.498 
           tip_amount          tolls_amount improvement_surcharge 
                1.677                 0.288                 0.297 
         total_amount                     u 
               15.786                 0.025
sapply(nyc_taxi, mean, na.rm = TRUE) # returns the average of all columns in the data after removing NAs
             VendorID  tpep_pickup_datetime tpep_dropoff_datetime 
                1.521                    NA                    NA 
      passenger_count         trip_distance      pickup_longitude 
                1.677                15.305               -72.699 
      pickup_latitude          rate_code_id    store_and_fwd_flag 
               40.048                 1.038                    NA 
    dropoff_longitude      dropoff_latitude          payment_type 
              -72.739                40.071                 1.378 
          fare_amount                 extra               mta_tax 
               12.707                 0.315                 0.498 
           tip_amount          tolls_amount improvement_surcharge 
                1.677                 0.288                 0.297 
         total_amount                     u 
               15.786                 0.025
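
In the taxi data the two runs give identical results, which suggests the numeric columns contain no missing values, so the effect of na.rm is easier to see on a small made-up example. The sketch below (toy and its values are assumptions) also restricts the call to numeric columns, which avoids the NA results that mean produces for character columns.

```r
# Hypothetical toy data with one missing value and one character column
toy <- data.frame(
  x = c(1, NA, 3),
  y = c(2, 4, 6),
  z = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns so mean() is never called on character data
numeric_cols <- sapply(toy, is.numeric)

means_default <- sapply(toy[numeric_cols], mean)               # x is NA
means_na_rm   <- sapply(toy[numeric_cols], mean, na.rm = TRUE) # NA dropped before averaging
```

Here na.rm = TRUE is not an argument of sapply itself; sapply forwards it to mean on every call.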
