Basic summaries
After str, summary is probably the most ubiquitous R function. It provides summary statistics for each column in the data, and the kind of statistics we see for a given column depends on the column type. Just like str, summary gives clues about how we need to clean the data. For example:
- tpep_pickup_datetime and tpep_dropoff_datetime should be datetime columns, not character
- rate_code_id and payment_type should be factor columns, not character
- the geographical coordinates for pick-up and drop-off occasionally fall outside a reasonable bound (probably due to error)
- fare_amount is sometimes negative (could be refunds, could be errors, could be something else)
Once we clean the data (next chapter), we will rerun summary and see the appropriate summary statistics now that the columns have been converted to the right classes.
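To see how the output of summary depends on column type, here is a minimal sketch on a small made-up data frame. The name toy and its values are hypothetical and are not taken from nyc_taxi. The same column is summarized first as character and then as a factor.

```r
# `toy` is a hypothetical data frame used only for illustration
toy <- data.frame(
  fare = c(5.5, 12.0, -3.0, 8.25),
  payment = c("card", "cash", "card", "card"),
  stringsAsFactors = FALSE
)

summary(toy)  # `fare` gets quartiles and a mean; `payment` only reports length, class and mode

toy$payment <- as.factor(toy$payment)
summary(toy)  # `payment` now shows a count for each level
```

The negative minimum of fare would stand out in the first summary, which is the kind of clue described in the list above.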
What if there are summaries we don't see? We can write our own summary function. As an example, the num.distinct function returns the number of unique elements in a vector. Most of the work is done for us: the unique function returns the unique elements of a vector, and the length function counts how many there are. Notice how the function is commented with information about input types and output.
num.distinct <- function(x) {
  # returns the number of distinct values of a vector `x`
  # `x` can be numeric (floats are not recommended), character, logical, factor
  # to see why floats are a bad idea try this:
  # unique(c(.3, .4 - .1, .5 - .2, .6 - .3, .7 - .4))
  length(unique(x))
}
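The comment about floats can be checked directly. The sketch below runs the suggested expression and then shows one possible workaround, rounding before counting. The choice of 8 decimal places is an arbitrary assumption for illustration, not a recommendation from the text.

```r
# the five values below are all nominally 0.3, but floating-point
# arithmetic does not produce exactly the same number for each
x <- c(.3, .4 - .1, .5 - .2, .6 - .3, .7 - .4)

length(unique(x))            # more than 1, although every value prints as 0.3
length(unique(round(x, 8)))  # rounding first collapses them to a single value
```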
It's usually a good idea to test the function with some random inputs before we test it on the larger data. We should also test the function on 'unusual' inputs to see if it does what we expect from it.
num.distinct(c(5, 6, 6, 9))
[1] 3
num.distinct(1) # test the function on a singleton (a vector of length 1)
[1] 1
num.distinct(c()) # test the function on an empty vector
[1] 0
num.distinct(c(23, 45, 45, NA, 11, 11)) # test the function on a vector with NAs
[1] 4
Now we can test the function on the data, for example on pickup_longitude:
num.distinct(nyc_taxi$pickup_longitude) # check it on a single variable in our data
[1] 28392
But what if we wanted to run the function on all the columns in the data at once? We could write a loop, but instead we show you the sapply function, which accomplishes the same thing in a more succinct and R-like manner. With sapply, we pass the data as the first argument and some function (usually a summary function) as the second argument: sapply will run the function on each column of the data, so it is up to us to pick a function that is meaningful for the columns we pass in.
sapply(nyc_taxi, num.distinct) # apply it to each variable in the data
             VendorID  tpep_pickup_datetime tpep_dropoff_datetime
                    2               3337727               3338601
      passenger_count         trip_distance      pickup_longitude
                   10                  3632                 28392
      pickup_latitude          rate_code_id    store_and_fwd_flag
                53199                     7                     2
    dropoff_longitude      dropoff_latitude          payment_type
                41964                 74370                     4
          fare_amount                 extra               mta_tax
                 1162                    31                     8
           tip_amount          tolls_amount improvement_surcharge
                 2901                   649                     2
         total_amount                     u
                 9163               3817983
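For comparison, here is a rough sketch of the loop that sapply replaces. It uses a small hypothetical data frame called toy rather than nyc_taxi, and an inline function equivalent to num.distinct so that the example runs on its own.

```r
# `toy` is a hypothetical data frame standing in for nyc_taxi
toy <- data.frame(
  a = c(1, 1, 2, 3),
  b = c("x", "y", "y", "y"),
  stringsAsFactors = FALSE
)

# explicit loop: one result per column, stored in a named vector
res_loop <- integer(ncol(toy))
names(res_loop) <- names(toy)
for (col in names(toy)) {
  res_loop[col] <- length(unique(toy[[col]]))
}

# sapply version: the same computation in a single call
res_sapply <- sapply(toy, function(x) length(unique(x)))

res_loop
res_sapply
```

The loop needs its own bookkeeping for the result vector and its names, while sapply handles both.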
Any additional argument to the summary function can be passed along to sapply. This feature makes sapply (and other similar functions) very powerful. For example, the mean function has an argument called na.rm for removing missing values. By default, na.rm is set to FALSE, and unless na.rm = TRUE the function will return NA if there is any missing value in the data.
sapply(nyc_taxi, mean) # returns the average of all columns in the data
             VendorID  tpep_pickup_datetime tpep_dropoff_datetime
                1.521                    NA                    NA
      passenger_count         trip_distance      pickup_longitude
                1.677                15.305               -72.699
      pickup_latitude          rate_code_id    store_and_fwd_flag
               40.048                 1.038                    NA
    dropoff_longitude      dropoff_latitude          payment_type
              -72.739                40.071                 1.378
          fare_amount                 extra               mta_tax
               12.707                 0.315                 0.498
           tip_amount          tolls_amount improvement_surcharge
                1.677                 0.288                 0.297
         total_amount                     u
               15.786                 0.025
sapply(nyc_taxi, mean, na.rm = TRUE) # returns the average of all columns in the data after removing NAs
             VendorID  tpep_pickup_datetime tpep_dropoff_datetime
                1.521                    NA                    NA
      passenger_count         trip_distance      pickup_longitude
                1.677                15.305               -72.699
      pickup_latitude          rate_code_id    store_and_fwd_flag
               40.048                 1.038                    NA
    dropoff_longitude      dropoff_latitude          payment_type
              -72.739                40.071                 1.378
          fare_amount                 extra               mta_tax
               12.707                 0.315                 0.498
           tip_amount          tolls_amount improvement_surcharge
                1.677                 0.288                 0.297
         total_amount                     u
               15.786                 0.025
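The two outputs above are identical, which is what we would expect if the numeric columns contain no missing values. The columns that still show NA are presumably the non-numeric ones, since mean returns NA (with a warning) for character input regardless of na.rm. To see na.rm actually change the result, here is a minimal sketch on a hypothetical data frame called toy, with missing values inserted on purpose and the summary restricted to numeric columns.

```r
# `toy` is a hypothetical data frame; the NA values are inserted on purpose
toy <- data.frame(
  tip = c(1.5, NA, 2.0, 0.5),
  tolls = c(0, 0, 5.5, NA),
  flag = c("N", "N", "Y", "N"),
  stringsAsFactors = FALSE
)

# keep only numeric columns, since mean is not defined for character data
is_num <- sapply(toy, is.numeric)

sapply(toy[is_num], mean)                # NA for any column containing a missing value
sapply(toy[is_num], mean, na.rm = TRUE)  # na.rm is forwarded to mean for every column
```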