Summary functions
We have already seen one all-encompassing summary function, namely summary:
summary(nyc_taxi) # summary of the whole data
pickup_datetime dropoff_datetime passenger_count
Min. :2015-01-01 00:00:08 Min. :2015-01-01 00:03:40 Min. :0.00
1st Qu.:2015-02-15 19:56:01 1st Qu.:2015-02-15 20:09:51 1st Qu.:1.00
Median :2015-03-31 23:24:13 Median :2015-03-31 23:38:33 Median :1.00
Mean :2015-04-01 05:59:25 Mean :2015-04-01 06:47:29 Mean :1.68
3rd Qu.:2015-05-15 06:35:17 3rd Qu.:2015-05-15 06:49:10 3rd Qu.:2.00
Max. :2015-06-30 23:59:56 Max. :2253-08-23 05:54:14 Max. :9.00
NA's :2 NA's :4
trip_distance pickup_longitude pickup_latitude
Min. : 0 Min. :-171.8 Min. :38
1st Qu.: 1 1st Qu.: -74.0 1st Qu.:41
Median : 2 Median : -74.0 Median :41
Mean : 15 Mean : -72.7 Mean :41
3rd Qu.: 3 3rd Qu.: -74.0 3rd Qu.:41
Max. :9083540 Max. : 0.0 Max. :41
NA's :66616
rate_code_id dropoff_longitude dropoff_latitude payment_type
standard :3758230 Min. :-75 Min. :39 card:2417055
JFK : 75423 1st Qu.:-74 1st Qu.:41 cash:1419764
Newark : 6243 Median :-74 Median :41 NA's: 15543
Nassau or Westchester: 1300 Mean :-74 Mean :41
negotiated : 11039 3rd Qu.:-74 3rd Qu.:41
group ride : 33 Max. :-73 Max. :41
n/a : 94 NA's :73062 NA's :64909
fare_amount extra mta_tax tip_amount tolls_amount
Min. : 0 Min. :-11 Min. :-1.7 Min. : 0 Min. : 0
1st Qu.: 6 1st Qu.: 0 1st Qu.: 0.5 1st Qu.: 1 1st Qu.: 0
Median : 10 Median : 0 Median : 0.5 Median : 2 Median : 0
Mean : 13 Mean : 0 Mean : 0.5 Mean : 2 Mean : 0
3rd Qu.: 14 3rd Qu.: 0 3rd Qu.: 0.5 3rd Qu.: 2 3rd Qu.: 0
Max. :3130 Max. :605 Max. :60.4 Max. :824 Max. :561
NA's :16002
improvement_surcharge total_amount pickup_hour pickup_dow
Min. :0.000 Min. : 0 1AM-5AM :220035 Sat : 610998
1st Qu.:0.300 1st Qu.: 8 5AM-9AM :585542 Fri : 594144
Median :0.300 Median : 12 9AM-12PM:548667 Thu : 575963
Mean :0.297 Mean : 16 12PM-4PM:729954 Wed : 537178
3rd Qu.:0.300 3rd Qu.: 18 4PM-6PM :422422 Tue : 523780
Max. :0.300 Max. :3219 6PM-10PM:903859 (Other):1010297
10PM-1AM:441883 NA's : 2
dropoff_hour dropoff_dow trip_duration pickup_nhood
1AM-5AM :230231 Sat : 610963 Min. : -49504 Midtown : 619229
5AM-9AM :552652 Fri : 592603 1st Qu.: 394 Upper East Side: 516998
9AM-12PM:545022 Thu : 574159 Median : 655 Upper West Side: 315193
12PM-4PM:734115 Wed : 536613 Mean : 930 Gramercy : 302670
4PM-6PM :407044 Sun : 527387 3rd Qu.: 1057 Chelsea : 257451
6PM-10PM:914624 (Other):1010633 Max. :32905773 (Other) :1477126
10PM-1AM:468674 NA's : 4 NA's :6 NA's : 363695
dropoff_nhood tip_percent
Midtown : 590646 Min. : 0
Upper East Side: 482809 1st Qu.: 9
Upper West Side: 308657 Median :16
Gramercy : 273631 Mean :14
Chelsea : 232022 3rd Qu.:18
(Other) :1442160 Max. :99
NA's : 522437 NA's :16002
We can use summary to run a sanity check on the data and find ways that the data might need to be cleaned in preparation for analysis, but we are now interested in individual summaries. For example, here's how we can find the average fare amount across the whole dataset.
mean(nyc_taxi$fare_amount) # the average of `fare_amount`
[1] 12.7
By specifying trim = .10 we can get a 10 percent trimmed average, i.e. the average after throwing out the top and bottom 10 percent of the data:
mean(nyc_taxi$fare_amount, trim = .10) # trimmed mean
[1] 10.6
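To see what trimming does on a small scale, here is a minimal sketch using a made-up vector (toy_fares is hypothetical and not part of nyc_taxi). With five values and trim = 0.2, one value is dropped from each end before averaging:

```r
# Toy data (hypothetical): four ordinary fares and one extreme value
toy_fares <- c(5, 8, 10, 12, 500)

plain_mean   <- mean(toy_fares)              # pulled up by the extreme value
trimmed_mean <- mean(toy_fares, trim = 0.2)  # drops the smallest and largest value first

# The trimmed mean matches the mean of the middle three values
manual_trim <- mean(sort(toy_fares)[2:4])

print(c(plain = plain_mean, trimmed = trimmed_mean, manual = manual_trim))
```

The gap between the plain and trimmed means is a quick signal that a few extreme values are driving the average.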
By default, the mean function returns NA if there is any NA in the data, but we can override that with na.rm = TRUE. This same argument shows up in almost all the statistical functions we encounter in this section.
mean(nyc_taxi$trip_duration) # NAs are not ignored by default
[1] NA
mean(nyc_taxi$trip_duration, na.rm = TRUE) # removes NAs before computing the average
[1] 930
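As a small illustration of what na.rm = TRUE does, the sketch below uses a made-up vector (toy_duration is hypothetical) and compares the result with dropping the missing values by hand:

```r
# Toy data (hypothetical): durations in seconds with one missing value
toy_duration <- c(300, 600, NA, 900)

with_na    <- mean(toy_duration)                # NA, because one value is missing
without_na <- mean(toy_duration, na.rm = TRUE)  # average of the three observed values

# na.rm = TRUE gives the same result as removing the NAs ourselves
by_hand <- mean(toy_duration[!is.na(toy_duration)])

# It is worth checking how many values were silently dropped
n_missing <- sum(is.na(toy_duration))

print(c(with_na = with_na, without_na = without_na, by_hand = by_hand, n_missing = n_missing))
```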
We can use weighted.mean to find a weighted average. The weights are specified as the second argument, and if we omit the weights, we just get a simple average.
weighted.mean(nyc_taxi$tip_percent, na.rm = TRUE) # simple average
[1] 13.9
weighted.mean(nyc_taxi$tip_percent, nyc_taxi$trip_distance, na.rm = TRUE) # weighted average
[1] 9.33
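A weighted average is the sum of each value times its weight, divided by the sum of the weights. The sketch below checks this by hand on made-up data (toy_tip and toy_dist are hypothetical stand-ins for tip percent and trip distance):

```r
# Toy data (hypothetical): tip percentages and trip distances for four trips
toy_tip  <- c(10, 20, NA, 30)
toy_dist <- c(1, 1, 5, 2)

# With na.rm = TRUE the trip with a missing tip is dropped, along with its weight
w_avg <- weighted.mean(toy_tip, toy_dist, na.rm = TRUE)

# Same calculation by hand: sum of value * weight over sum of weights
keep   <- !is.na(toy_tip)
manual <- sum(toy_tip[keep] * toy_dist[keep]) / sum(toy_dist[keep])

print(c(weighted = w_avg, manual = manual))
```

Weighting by distance means longer trips count for more, which is why the weighted and simple averages can differ noticeably.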
The sd function returns the standard deviation of the data, which is the same as the square root of its variance.
sd(nyc_taxi$trip_duration, na.rm = TRUE) # standard deviation
[1] 30309
sqrt(var(nyc_taxi$trip_duration, na.rm = TRUE)) # standard deviation == square root of variance
[1] 30309
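For reference, var computes the sample variance, which divides the sum of squared deviations by n - 1 rather than n. A minimal sketch on a made-up vector (x is hypothetical):

```r
# Toy data (hypothetical)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Sample variance by hand: squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)

print(c(var = var(x), manual_var = manual_var, sd = sd(x), sqrt_var = sqrt(var(x))))
```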
We can use range to get the minimum and maximum of the data at once, or use min and max individually.
range(nyc_taxi$trip_duration, na.rm = TRUE) # minimum and maximum
[1] -49504 32905773
c(min(nyc_taxi$trip_duration, na.rm = TRUE), max(nyc_taxi$trip_duration, na.rm = TRUE))
[1] -49504 32905773
We can use median to return the median of the data.
median(nyc_taxi$trip_duration, na.rm = TRUE) # median
[1] 655
The quantile function is used to get any percentile of the data, where the percentile is specified by the probs argument. For example, letting probs = .5 returns the median.
quantile(nyc_taxi$trip_duration, probs = .5, na.rm = TRUE) # median == 50th percentile
50%
655
We can specify a vector for probs to get multiple percentiles all at once. For example, setting probs = c(.25, .75) returns the 25th and 75th percentiles.
quantile(nyc_taxi$trip_duration, probs = c(.25, .75), na.rm = TRUE) # IQR == difference b/w 75th and 25th percentiles
25% 75%
394 1057
The difference between the 25th and 75th percentiles is called the interquartile range, which we can also get using the IQR function.
IQR(nyc_taxi$trip_duration, na.rm = TRUE) # interquartile range
[1] 663
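The relationship between quantile and IQR can be checked on a small made-up vector (x is hypothetical), using the default quantile settings:

```r
# Toy data (hypothetical): the integers 1 through 9
x <- 1:9

q <- quantile(x, probs = c(.25, .75))       # named vector with the 25th and 75th percentiles
iqr_from_quantiles <- unname(q[2] - q[1])   # 75th percentile minus 25th percentile
iqr_direct <- IQR(x)

print(q)
print(c(from_quantiles = iqr_from_quantiles, direct = iqr_direct))
```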
Let's look at a common bivariate summary statistic for numeric data: correlation.
cor(nyc_taxi$trip_distance, nyc_taxi$trip_duration, use = "complete.obs")
[1] 0.0000345
We can use the method argument to switch from Pearson's correlation to Spearman's rank correlation.
cor(nyc_taxi$trip_distance, nyc_taxi$trip_duration, use = "complete.obs", method = "spearman")
[1] 0.836
Why does the Spearman correlation coefficient take so much longer to compute?
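As a hint, it helps to see what the Spearman coefficient is: Pearson's correlation computed on the ranks of the data rather than on the raw values, so the data has to be ranked first. The sketch below illustrates this on made-up vectors (x and y are hypothetical):

```r
# Toy data (hypothetical): y increases with x, but not linearly
x <- 1:10
y <- x^3

pearson  <- cor(x, y)                       # below 1: the relationship is not a straight line
spearman <- cor(x, y, method = "spearman")  # 1: the ordering of y matches the ordering of x

# Spearman's coefficient is Pearson's coefficient computed on the ranks
on_ranks <- cor(rank(x), rank(y))

print(c(pearson = pearson, spearman = spearman, on_ranks = on_ranks))
```

Because it depends only on ranks, the Spearman coefficient is also much less sensitive to extreme values such as the outliers in trip_distance and trip_duration.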
So far we've examined functions for summarizing numeric data. Let's now shift our attention to categorical data. We already saw that we can use table to get counts for each level of a factor column.
table(nyc_taxi$pickup_nhood) # one-way table
West Village East Village Battery Park Carnegie Hill
94222 135597 33783 43896
Gramercy Soho Murray Hill Little Italy
302670 78188 127397 33254
Central Park Greenwich Village Midtown Morningside Heights
51726 174398 619229 19887
Harlem Hamilton Heights Tribeca North Sutton Area
19185 7634 63430 39633
Upper East Side Financial District Inwood Chelsea
516998 79908 399 257451
Lower East Side Chinatown Washington Heights Upper West Side
88674 12207 4857 315193
Clinton Yorkville Garment District East Harlem
119654 25272 211846 12079
When we pass more than one column to table, we get counts for each combination of the factor levels. For example, with two columns we get counts for each combination of the levels of the first factor and the second factor. In other words, we get a two-way table.
two_way <- with(nyc_taxi, table(pickup_nhood, dropoff_nhood)) # two-way table: an R `matrix`
two_way[1:5, 1:5]
dropoff_nhood
pickup_nhood West Village East Village Battery Park Carnegie Hill Gramercy
West Village 4942 4521 1933 283 7006
East Village 4473 11094 914 295 17178
Battery Park 1740 764 1105 68 1889
Carnegie Hill 198 189 139 1651 1082
Gramercy 7667 17681 2370 1324 36809
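To make the structure of a two-way table concrete, here is a minimal sketch on a made-up data set (toy, pickup and dropoff are hypothetical names, with two areas "A" and "B"):

```r
# Toy data (hypothetical): pick-up and drop-off areas for six trips
toy <- data.frame(
  pickup  = factor(c("A", "A", "B", "B", "B", "A"), levels = c("A", "B")),
  dropoff = factor(c("A", "B", "B", "A", "B", "B"), levels = c("A", "B"))
)

tab <- with(toy, table(pickup, dropoff))  # rows are pick-ups, columns are drop-offs
print(tab)

dim(tab)        # one entry per factor: 2 rows and 2 columns
tab["A", "B"]   # number of trips that started in A and ended in B
sum(tab)        # every trip is counted exactly once
```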
What about a three-way table? A three-way table (or n-way table where n is an integer) is represented in R by an object we call an array. A vector is a one-dimensional array, a matrix a two-dimensional array, and a three-way table is a kind of three-dimensional array.
arr_3d <- with(nyc_taxi, table(pickup_dow, pickup_hour, payment_type)) # a three-way table, an R 3D `array`
arr_3d
, , payment_type = card
pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
Sun 38825 23091 45959 64587 33087 58968 52616
Mon 9497 59577 42446 55701 36444 75973 23487
Tue 8604 65646 45244 58581 38099 91282 27787
Wed 9897 68757 46888 58648 37579 94052 32446
Thu 13230 70123 49428 61270 38504 98079 40043
Fri 16234 67487 48839 60353 39077 93822 48112
Sat 35901 30713 49983 67140 37521 83725 63702
, , payment_type = cash
pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
Sun 22586 18297 32073 44662 21843 35622 28009
Mon 7884 30328 29672 41496 21600 37315 14346
Tue 6699 31232 29275 40896 22274 41275 14956
Wed 7480 31837 29232 39364 20876 41754 16340
Thu 10007 32302 31033 41708 22599 44961 20382
Fri 11321 31846 31891 42771 24069 51271 24614
Sat 20196 22356 34874 49735 27156 52582 32866
Let's see how we query a 3-dimensional array. Because it has three dimensions, we need to index it along all three:
arr_3d[3, 2, 2] # give me the 3rd row, 2nd column, 2nd 'page'
[1] 31232
Just as with a data.frame, leaving out the index for one of the dimensions returns all the values for that dimension.
arr_3d[ , , 2]
pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
Sun 22586 18297 32073 44662 21843 35622 28009
Mon 7884 30328 29672 41496 21600 37315 14346
Tue 6699 31232 29275 40896 22274 41275 14956
Wed 7480 31837 29232 39364 20876 41754 16340
Thu 10007 32302 31033 41708 22599 44961 20382
Fri 11321 31846 31891 42771 24069 51271 24614
Sat 20196 22356 34874 49735 27156 52582 32866
We can also use the labels along each dimension instead of their numeric index:
arr_3d['Tue', '5AM-9AM', 'cash']
[1] 31232
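The two indexing styles can be compared side by side on a small made-up three-way table (toy, toy_arr and the column names dow, hour and pay are hypothetical):

```r
# Toy data (hypothetical): six trips described by three factors
toy <- data.frame(
  dow  = factor(c("Mon", "Mon", "Tue", "Tue", "Tue", "Mon"), levels = c("Mon", "Tue")),
  hour = factor(c("AM", "PM", "AM", "AM", "PM", "AM"), levels = c("AM", "PM")),
  pay  = factor(c("card", "cash", "card", "cash", "card", "card"), levels = c("card", "cash"))
)
toy_arr <- with(toy, table(dow, hour, pay))

by_index <- toy_arr[2, 1, 2]              # 2nd day, 1st hour bucket, 2nd payment type
by_name  <- toy_arr["Tue", "AM", "cash"]  # the same cell, addressed by its labels

cash_slice <- toy_arr[, , "cash"]         # leaving two indexes empty returns a 2D slice
print(cash_slice)
```

Indexing by label keeps working if the order of the factor levels changes, whereas numeric indexes would silently point at a different cell.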
We can turn the array representation into a data.frame representation:
df_arr_3d <- as.data.frame(arr_3d) # same information, formatted as data frame
head(df_arr_3d)
pickup_dow pickup_hour payment_type Freq
1 Sun 1AM-5AM card 38825
2 Mon 1AM-5AM card 9497
3 Tue 1AM-5AM card 8604
4 Wed 1AM-5AM card 9897
5 Thu 1AM-5AM card 13230
6 Fri 1AM-5AM card 16234
We can subset the data.frame using the subset function:
subset(df_arr_3d, pickup_dow == 'Tue' & pickup_hour == '5AM-9AM' & payment_type == 'cash')
pickup_dow pickup_hour payment_type Freq
59 Tue 5AM-9AM cash 31232
Notice how the array notation is more terse, but not as readable (because we need to remember the order of the dimensions).
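The two representations carry the same information, so we can move between them. The sketch below shows this on a small made-up table (toy, toy_arr and its column names are hypothetical), using xtabs as one possible way to go from the data.frame back to a table:

```r
# Toy data (hypothetical): six trips described by three factors
toy <- data.frame(
  dow  = factor(c("Mon", "Mon", "Tue", "Tue", "Tue", "Mon"), levels = c("Mon", "Tue")),
  hour = factor(c("AM", "PM", "AM", "AM", "PM", "AM"), levels = c("AM", "PM")),
  pay  = factor(c("card", "cash", "card", "cash", "card", "card"), levels = c("card", "cash"))
)
toy_arr <- with(toy, table(dow, hour, pay))

toy_df <- as.data.frame(toy_arr)  # one row per cell, with the count in the Freq column
print(toy_df)

# Look up a single cell in the data.frame form
cell <- subset(toy_df, dow == "Tue" & hour == "AM" & pay == "cash")

# Rebuild the three-way table from the data.frame form
back <- xtabs(Freq ~ dow + hour + pay, data = toy_df)
```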
We can use apply to get aggregates of a multidimensional array across some dimension(s).
dim(arr_3d)
[1] 7 7 2
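The second argument to apply specifies which dimension(s) to keep. The function is applied to the slice of the array for each value of those dimension(s), so all the other dimensions are aggregated away.
apply(arr_3d, 2, sum) # `pickup_hour` is the second dimension, so we get one total per pick-up hour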
1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
218361 583592 546837 726912 420728 900681 439706
Once again, when the dimensions have names it is better to use the names instead of the numeric index.
apply(arr_3d, "pickup_hour", sum) # same as above, but more readable notation
1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
218361 583592 546837 726912 420728 900681 439706
So in the above example, we used apply to collapse a 3D array into a 1D vector with one total per pick-up hour, by summing over the other two dimensions (pick-up day of week and payment type).
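The effect of the second argument is easiest to see on a small made-up table (toy, toy_arr and its column names are hypothetical). Keeping one dimension returns a vector, while keeping two returns a matrix:

```r
# Toy data (hypothetical): six trips described by three factors
toy <- data.frame(
  dow  = factor(c("Mon", "Mon", "Tue", "Tue", "Tue", "Mon"), levels = c("Mon", "Tue")),
  hour = factor(c("AM", "PM", "AM", "AM", "PM", "AM"), levels = c("AM", "PM")),
  pay  = factor(c("card", "cash", "card", "cash", "card", "card"), levels = c("card", "cash"))
)
toy_arr <- with(toy, table(dow, hour, pay))

# Keep `hour`: one total per hour bucket, summed over day and payment type
per_hour <- apply(toy_arr, "hour", sum)

# Keep `dow` and `hour`: a 2D result, with payment type summed out
per_dow_hour <- apply(toy_arr, c("dow", "hour"), sum)

# margin.table gives the same marginal totals for the sum case
per_hour_check <- margin.table(toy_arr, 2)

print(per_hour)
print(per_dow_hour)
```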
We can use prop.table to turn the counts returned by table into proportions. The prop.table function has a second argument. When we leave it out, we get proportions for the grand total of the table.
prop.table(arr_3d) # as a proportion of the grand total
, , payment_type = card
pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
Sun 0.01012 0.00602 0.01198 0.01683 0.00862 0.01537 0.01371
Mon 0.00248 0.01553 0.01106 0.01452 0.00950 0.01980 0.00612
Tue 0.00224 0.01711 0.01179 0.01527 0.00993 0.02379 0.00724
Wed 0.00258 0.01792 0.01222 0.01529 0.00979 0.02451 0.00846
Thu 0.00345 0.01828 0.01288 0.01597 0.01004 0.02556 0.01044
Fri 0.00423 0.01759 0.01273 0.01573 0.01018 0.02445 0.01254
Sat 0.00936 0.00800 0.01303 0.01750 0.00978 0.02182 0.01660
, , payment_type = cash
pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
Sun 0.00589 0.00477 0.00836 0.01164 0.00569 0.00928 0.00730
Mon 0.00205 0.00790 0.00773 0.01082 0.00563 0.00973 0.00374
Tue 0.00175 0.00814 0.00763 0.01066 0.00581 0.01076 0.00390
Wed 0.00195 0.00830 0.00762 0.01026 0.00544 0.01088 0.00426
Thu 0.00261 0.00842 0.00809 0.01087 0.00589 0.01172 0.00531
Fri 0.00295 0.00830 0.00831 0.01115 0.00627 0.01336 0.00642
Sat 0.00526 0.00583 0.00909 0.01296 0.00708 0.01370 0.00857
For proportions out of marginal totals, we provide the second argument to prop.table. For example, specifying 1 as the second argument gives us proportions out of "row" totals. Recall that in a 3D object, a "row" is a 2D object: for example, arr_3d[1, , ] is the first "row", arr_3d[2, , ] is the second "row", and so on.
prop.table(arr_3d, 1) # as a proportion of 'row' totals, or marginal totals for the first dimension
, , payment_type = card
pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
Sun 0.0746 0.0444 0.0883 0.1242 0.0636 0.1134 0.1011
Mon 0.0196 0.1226 0.0874 0.1147 0.0750 0.1564 0.0484
Tue 0.0165 0.1258 0.0867 0.1123 0.0730 0.1749 0.0532
Wed 0.0185 0.1285 0.0876 0.1096 0.0702 0.1757 0.0606
Thu 0.0231 0.1222 0.0862 0.1068 0.0671 0.1710 0.0698
Fri 0.0274 0.1141 0.0825 0.1020 0.0660 0.1586 0.0813
Sat 0.0590 0.0505 0.0821 0.1103 0.0617 0.1376 0.1047
, , payment_type = cash
pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
Sun 0.0434 0.0352 0.0617 0.0859 0.0420 0.0685 0.0538
Mon 0.0162 0.0624 0.0611 0.0854 0.0445 0.0768 0.0295
Tue 0.0128 0.0598 0.0561 0.0784 0.0427 0.0791 0.0287
Wed 0.0140 0.0595 0.0546 0.0736 0.0390 0.0780 0.0305
Thu 0.0174 0.0563 0.0541 0.0727 0.0394 0.0784 0.0355
Fri 0.0191 0.0538 0.0539 0.0723 0.0407 0.0866 0.0416
Sat 0.0332 0.0367 0.0573 0.0817 0.0446 0.0864 0.0540
We can confirm this by using apply to run the sum function across the first dimension to make sure that they all add up to 1.
apply(prop.table(arr_3d, 1), 1, sum) # check that across rows, proportions add to 1
Sun Mon Tue Wed Thu Fri Sat
1 1 1 1 1 1 1
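To make explicit what prop.table does with a margin, the sketch below reproduces it by hand on a small made-up table (toy, toy_arr and its column names are hypothetical), dividing every cell by the total for its "row" with sweep:

```r
# Toy data (hypothetical): six trips described by three factors
toy <- data.frame(
  dow  = factor(c("Mon", "Mon", "Tue", "Tue", "Tue", "Mon"), levels = c("Mon", "Tue")),
  hour = factor(c("AM", "PM", "AM", "AM", "PM", "AM"), levels = c("AM", "PM")),
  pay  = factor(c("card", "cash", "card", "cash", "card", "card"), levels = c("card", "cash"))
)
toy_arr <- with(toy, table(dow, hour, pay))

by_dow <- prop.table(toy_arr, 1)  # each day's cells divided by that day's total

# The same thing by hand: compute the total per day, then divide each cell by it
dow_totals <- apply(toy_arr, 1, sum)
manual     <- sweep(toy_arr, 1, dow_totals, "/")

row_sums <- apply(by_dow, 1, sum)  # should be 1 for every day
print(row_sums)
```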
Similarly, if the second argument to prop.table is 2, we get proportions that add up to 1 for each value of the 2nd dimension. Since the second dimension corresponds to pick-up hour, for each pick-up hour we get the proportion of observations that fall into each combination of pick-up day of week and payment type.
prop.table(arr_3d, 2) # as a proportion of column totals
, , payment_type = card
pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
Sun 0.1778 0.0396 0.0840 0.0889 0.0786 0.0655 0.1197
Mon 0.0435 0.1021 0.0776 0.0766 0.0866 0.0844 0.0534
Tue 0.0394 0.1125 0.0827 0.0806 0.0906 0.1013 0.0632
Wed 0.0453 0.1178 0.0857 0.0807 0.0893 0.1044 0.0738
Thu 0.0606 0.1202 0.0904 0.0843 0.0915 0.1089 0.0911
Fri 0.0743 0.1156 0.0893 0.0830 0.0929 0.1042 0.1094
Sat 0.1644 0.0526 0.0914 0.0924 0.0892 0.0930 0.1449
, , payment_type = cash
pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
Sun 0.1034 0.0314 0.0587 0.0614 0.0519 0.0396 0.0637
Mon 0.0361 0.0520 0.0543 0.0571 0.0513 0.0414 0.0326
Tue 0.0307 0.0535 0.0535 0.0563 0.0529 0.0458 0.0340
Wed 0.0343 0.0546 0.0535 0.0542 0.0496 0.0464 0.0372
Thu 0.0458 0.0554 0.0568 0.0574 0.0537 0.0499 0.0464
Fri 0.0518 0.0546 0.0583 0.0588 0.0572 0.0569 0.0560
Sat 0.0925 0.0383 0.0638 0.0684 0.0645 0.0584 0.0747
Once again, we can double-check this with apply:
apply(prop.table(arr_3d, 2), 2, sum) # check that across columns, proportions add to 1
1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
1 1 1 1 1 1 1
Finally, if the second argument to prop.table is 3, we get proportions that add up to 1 for each value of the 3rd dimension. So for each payment type, the proportions now add up to 1.
prop.table(arr_3d, 3) # as a proportion of totals across third dimension
, , payment_type = card
pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
Sun 0.01606 0.00955 0.01901 0.02672 0.01369 0.02440 0.02177
Mon 0.00393 0.02465 0.01756 0.02304 0.01508 0.03143 0.00972
Tue 0.00356 0.02716 0.01872 0.02424 0.01576 0.03777 0.01150
Wed 0.00409 0.02845 0.01940 0.02426 0.01555 0.03891 0.01342
Thu 0.00547 0.02901 0.02045 0.02535 0.01593 0.04058 0.01657
Fri 0.00672 0.02792 0.02021 0.02497 0.01617 0.03882 0.01991
Sat 0.01485 0.01271 0.02068 0.02778 0.01552 0.03464 0.02636
, , payment_type = cash
pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
Sun 0.01591 0.01289 0.02259 0.03146 0.01538 0.02509 0.01973
Mon 0.00555 0.02136 0.02090 0.02923 0.01521 0.02628 0.01010
Tue 0.00472 0.02200 0.02062 0.02880 0.01569 0.02907 0.01053
Wed 0.00527 0.02242 0.02059 0.02773 0.01470 0.02941 0.01151
Thu 0.00705 0.02275 0.02186 0.02938 0.01592 0.03167 0.01436
Fri 0.00797 0.02243 0.02246 0.03013 0.01695 0.03611 0.01734
Sat 0.01422 0.01575 0.02456 0.03503 0.01913 0.03704 0.02315
Both prop.table and apply also accept combinations of dimensions as the second argument. This makes them powerful tools for aggregation, as long as we're careful. For example, letting the second argument be c(1, 2) gives us proportions that add up to 1 for each combination of "row" and "column". In other words, we get the proportion of card vs. cash payments for each combination of pick-up day of week and hour.
prop.table(arr_3d, c(1, 2)) # as a proportion of totals for each combination of 1st and 2nd dimensions
, , payment_type = card
pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
Sun 0.632 0.558 0.589 0.591 0.602 0.623 0.653
Mon 0.546 0.663 0.589 0.573 0.628 0.671 0.621
Tue 0.562 0.678 0.607 0.589 0.631 0.689 0.650
Wed 0.570 0.684 0.616 0.598 0.643 0.693 0.665
Thu 0.569 0.685 0.614 0.595 0.630 0.686 0.663
Fri 0.589 0.679 0.605 0.585 0.619 0.647 0.662
Sat 0.640 0.579 0.589 0.574 0.580 0.614 0.660
, , payment_type = cash
pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
Sun 0.368 0.442 0.411 0.409 0.398 0.377 0.347
Mon 0.454 0.337 0.411 0.427 0.372 0.329 0.379
Tue 0.438 0.322 0.393 0.411 0.369 0.311 0.350
Wed 0.430 0.316 0.384 0.402 0.357 0.307 0.335
Thu 0.431 0.315 0.386 0.405 0.370 0.314 0.337
Fri 0.411 0.321 0.395 0.415 0.381 0.353 0.338
Sat 0.360 0.421 0.411 0.426 0.420 0.386 0.340
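The same kind of check we ran for single dimensions works for combinations. Here is a minimal sketch on a small made-up table (toy, toy_arr and its column names are hypothetical):

```r
# Toy data (hypothetical): six trips described by three factors
toy <- data.frame(
  dow  = factor(c("Mon", "Mon", "Tue", "Tue", "Tue", "Mon"), levels = c("Mon", "Tue")),
  hour = factor(c("AM", "PM", "AM", "AM", "PM", "AM"), levels = c("AM", "PM")),
  pay  = factor(c("card", "cash", "card", "cash", "card", "card"), levels = c("card", "cash"))
)
toy_arr <- with(toy, table(dow, hour, pay))

# Within each day/hour combination, the share of each payment type
share <- prop.table(toy_arr, c(1, 2))
print(share["Tue", "AM", ])  # card vs. cash split for Tuesday mornings

# Every day/hour combination should total 1 across payment types
check <- apply(share, c(1, 2), sum)
print(check)
```

One thing to be careful about: a combination with no observations at all has a total of zero, so its proportions come out as NaN (0 divided by 0) instead of adding up to 1.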