Data summary and analysis

Let's recap where we are in the process:

load all the data (and combine them if necessary)
inspect the data in preparation for cleaning it
clean the data in preparation for analysis
add any interesting features or columns as far as they pertain to the analysis
find ways to analyze or summarize the data and report your findings

Of course, in practice a workflow is rarely as clean-cut as we have laid it out here. It tends to be circular: discovering certain quirks in the data forces us to go back and change the data-cleaning process, add other features, and so on.

We now have a data set that's more or less ready for analysis. In the next section we go over ways we can summarize the data and produce plots and tables. Let's run str(nyc_taxi) and head(nyc_taxi) again to review all the work we did so far.

str(nyc_taxi)

'data.frame':    3852362 obs. of  25 variables:
 $ pickup_datetime      : POSIXct, format: "2015-01-15 19:05:40" "2015-01-25 00:13:06" ...
 $ dropoff_datetime     : POSIXct, format: "2015-01-15 19:28:18" "2015-01-25 00:24:51" ...
 $ passenger_count      : int  5 1 1 1 1 1 1 1 1 1 ...
 $ trip_distance        : num  8.33 3.37 3.72 10.2 0.36 8.98 1.56 1.5 1.39 15.2 ...
 $ pickup_longitude     : num  -73.9 -73.9 -74 -74 -74 ...
 $ pickup_latitude      : num  40.8 40.8 40.8 40.8 40.8 ...
 $ rate_code_id         : Factor w/ 7 levels "standard","JFK",..: 1 1 1 1 1 1 1 1 1 2 ...
 $ dropoff_longitude    : num  -74 -74 -74 -73.9 -74 ...
 $ dropoff_latitude     : num  40.8 40.8 40.7 40.7 40.8 ...
 $ payment_type         : Factor w/ 2 levels "card","cash": 1 1 1 2 2 1 1 1 2 1 ...
 $ fare_amount          : num  26 12.5 16.5 39 3.5 27 7 8 7.5 52 ...
 $ extra                : num  1 0.5 0.5 0.5 0 0 0 0 0 0 ...
 $ mta_tax              : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
 $ tip_amount           : num  8.08 0 3.56 1 0.5 ...
 $ tolls_amount         : num  5.33 0 0 0 0 5.33 0 0 0 5.33 ...
 $ improvement_surcharge: num  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0 0.3 0.3 ...
 $ total_amount         : num  41.2 13.8 21.4 40.3 4.3 ...
 $ pickup_hour          : Factor w/ 7 levels "1AM-5AM","5AM-9AM",..: 6 7 7 7 4 4 4 4 4 4 ...
 $ pickup_dow           : Factor w/ 7 levels "Sun","Mon","Tue",..: 5 1 1 1 1 1 1 1 1 5 ...
 $ dropoff_hour         : Factor w/ 7 levels "1AM-5AM","5AM-9AM",..: 6 7 7 7 4 4 4 4 4 4 ...
 $ dropoff_dow          : Factor w/ 7 levels "Sun","Mon","Tue",..: 5 1 1 1 1 1 1 1 1 5 ...
 $ trip_duration        : int  1358 705 1309 2971 106 1171 393 522 501 3477 ...
 $ pickup_nhood         : Factor w/ 28 levels "West Village",..: NA 17 24 25 11 NA 17 24 2 NA ...
 $ dropoff_nhood        : Factor w/ 28 levels "West Village",..: 4 27 20 NA 11 17 11 9 20 25 ...
 $ tip_percent          : int  23 0 17 2 12 0 12 17 6 21 ...

head(nyc_taxi, 3)

      pickup_datetime    dropoff_datetime passenger_count trip_distance
1 2015-01-15 19:05:40 2015-01-15 19:28:18               5          8.33
2 2015-01-25 00:13:06 2015-01-25 00:24:51               1          3.37
3 2015-01-25 00:13:08 2015-01-25 00:34:57               1          3.72
  pickup_longitude pickup_latitude rate_code_id dropoff_longitude dropoff_latitude
1            -73.9            40.8     standard               -74             40.8
2            -73.9            40.8     standard               -74             40.8
3            -74.0            40.8     standard               -74             40.7
  payment_type fare_amount extra mta_tax tip_amount tolls_amount
1         card        26.0   1.0     0.5       8.08         5.33
2         card        12.5   0.5     0.5       0.00         0.00
3         card        16.5   0.5     0.5       3.56         0.00
  improvement_surcharge total_amount pickup_hour pickup_dow dropoff_hour
1                   0.3         41.2    6PM-10PM        Thu     6PM-10PM
2                   0.3         13.8    10PM-1AM        Sun     10PM-1AM
3                   0.3         21.4    10PM-1AM        Sun     10PM-1AM
  dropoff_dow trip_duration    pickup_nhood    dropoff_nhood tip_percent
1         Thu          1358            <NA>    Carnegie Hill          23
2         Sun           705 Upper East Side Garment District           0
3         Sun          1309 Upper West Side          Chelsea          17

We divide this chapter into three sections:

Overview of some important statistical summary functions: This is by no means a comprehensive glossary of statistical functions, but rather a sampling of the important ones and how to use them, how to modify them, and some common patterns among them.
Data summary with base R tools: The base R tools for summarizing data are a bit more tedious, and some have a different notation or way of passing arguments, but they are widely used and can be very efficient when used well.
Data summary with dplyr: dplyr offers a consistent and popular notation for processing and summarizing data, and one worth learning on top of base R.

To reiterate, the statistical summary functions we cover in section 1 can be used with either approach; what differs is the way we query the data using those functions. For the querying, we will review two (mostly alternative) ways: one using base R functions in section 2 and one using the dplyr library in section 3.
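To make this distinction concrete, here is a minimal sketch of one summary function, mean, queried both ways. The data frame taxi_toy and its values are hypothetical stand-ins for nyc_taxi; only the column names payment_type and tip_percent come from the data above. The dplyr part runs only if the package is installed.

```r
# Hypothetical toy data standing in for nyc_taxi (values are made up)
taxi_toy <- data.frame(
  payment_type = factor(c("card", "card", "cash", "cash", "card")),
  tip_percent  = c(23, 0, 2, 12, 17)
)

# The summary function on its own: one number for the whole column
overall_mean <- mean(taxi_toy$tip_percent, na.rm = TRUE)

# Base R query: apply mean() within each level of payment_type
base_result <- tapply(taxi_toy$tip_percent, taxi_toy$payment_type,
                      mean, na.rm = TRUE)
print(base_result)

# dplyr query: the same summary function, different notation
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr_result <- dplyr::summarize(
    dplyr::group_by(taxi_toy, payment_type),
    mean_tip = mean(tip_percent, na.rm = TRUE)
  )
  print(dplyr_result)
}
```

In both queries the summary function is mean with na.rm = TRUE; only the code wrapped around it changes.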
