Summary functions

We already learned of one all-encompassing summary function, namely summary:

summary(nyc_taxi) # summary of the whole data
 pickup_datetime               dropoff_datetime              passenger_count
 Min.   :2015-01-01 00:00:08   Min.   :2015-01-01 00:03:40   Min.   :0.00   
 1st Qu.:2015-02-15 19:56:01   1st Qu.:2015-02-15 20:09:51   1st Qu.:1.00   
 Median :2015-03-31 23:24:13   Median :2015-03-31 23:38:33   Median :1.00   
 Mean   :2015-04-01 05:59:25   Mean   :2015-04-01 06:47:29   Mean   :1.68   
 3rd Qu.:2015-05-15 06:35:17   3rd Qu.:2015-05-15 06:49:10   3rd Qu.:2.00   
 Max.   :2015-06-30 23:59:56   Max.   :2253-08-23 05:54:14   Max.   :9.00   
 NA's   :2                     NA's   :4                                    
 trip_distance     pickup_longitude pickup_latitude
 Min.   :      0   Min.   :-171.8   Min.   :38     
 1st Qu.:      1   1st Qu.: -74.0   1st Qu.:41     
 Median :      2   Median : -74.0   Median :41     
 Mean   :     15   Mean   : -72.7   Mean   :41     
 3rd Qu.:      3   3rd Qu.: -74.0   3rd Qu.:41     
 Max.   :9083540   Max.   :   0.0   Max.   :41     
                                    NA's   :66616  
                rate_code_id     dropoff_longitude dropoff_latitude payment_type  
 standard             :3758230   Min.   :-75       Min.   :39       card:2417055  
 JFK                  :  75423   1st Qu.:-74       1st Qu.:41       cash:1419764  
 Newark               :   6243   Median :-74       Median :41       NA's:  15543  
 Nassau or Westchester:   1300   Mean   :-74       Mean   :41                     
 negotiated           :  11039   3rd Qu.:-74       3rd Qu.:41                     
 group ride           :     33   Max.   :-73       Max.   :41                     
 n/a                  :     94   NA's   :73062     NA's   :64909                  
  fare_amount       extra        mta_tax       tip_amount     tolls_amount
 Min.   :   0   Min.   :-11   Min.   :-1.7   Min.   :  0     Min.   :  0  
 1st Qu.:   6   1st Qu.:  0   1st Qu.: 0.5   1st Qu.:  1     1st Qu.:  0  
 Median :  10   Median :  0   Median : 0.5   Median :  2     Median :  0  
 Mean   :  13   Mean   :  0   Mean   : 0.5   Mean   :  2     Mean   :  0  
 3rd Qu.:  14   3rd Qu.:  0   3rd Qu.: 0.5   3rd Qu.:  2     3rd Qu.:  0  
 Max.   :3130   Max.   :605   Max.   :60.4   Max.   :824     Max.   :561  
                                             NA's   :16002                
 improvement_surcharge  total_amount    pickup_hour       pickup_dow     
 Min.   :0.000         Min.   :   0   1AM-5AM :220035   Sat    : 610998  
 1st Qu.:0.300         1st Qu.:   8   5AM-9AM :585542   Fri    : 594144  
 Median :0.300         Median :  12   9AM-12PM:548667   Thu    : 575963  
 Mean   :0.297         Mean   :  16   12PM-4PM:729954   Wed    : 537178  
 3rd Qu.:0.300         3rd Qu.:  18   4PM-6PM :422422   Tue    : 523780  
 Max.   :0.300         Max.   :3219   6PM-10PM:903859   (Other):1010297  
                                      10PM-1AM:441883   NA's   :      2  
   dropoff_hour     dropoff_dow      trip_duration               pickup_nhood    
 1AM-5AM :230231   Sat    : 610963   Min.   :  -49504   Midtown        : 619229  
 5AM-9AM :552652   Fri    : 592603   1st Qu.:     394   Upper East Side: 516998  
 9AM-12PM:545022   Thu    : 574159   Median :     655   Upper West Side: 315193  
 12PM-4PM:734115   Wed    : 536613   Mean   :     930   Gramercy       : 302670  
 4PM-6PM :407044   Sun    : 527387   3rd Qu.:    1057   Chelsea        : 257451  
 6PM-10PM:914624   (Other):1010633   Max.   :32905773   (Other)        :1477126  
 10PM-1AM:468674   NA's   :      4   NA's   :6          NA's           : 363695  
         dropoff_nhood      tip_percent   
 Midtown        : 590646   Min.   : 0     
 Upper East Side: 482809   1st Qu.: 9     
 Upper West Side: 308657   Median :16     
 Gramercy       : 273631   Mean   :14     
 Chelsea        : 232022   3rd Qu.:18     
 (Other)        :1442160   Max.   :99     
 NA's           : 522437   NA's   :16002

We can use summary to run a sanity check on the data and find ways in which the data might need to be cleaned before analysis, but we are now interested in individual summary statistics. For example, here's how we can find the average fare amount for the whole data:

mean(nyc_taxi$fare_amount) # the average of `fare_amount`
[1] 12.7

By specifying trim = .10 we can get a 10 percent trimmed average, i.e. the average after throwing out the top and bottom 10 percent of the data:

mean(nyc_taxi$fare_amount, trim = .10) # trimmed mean
[1] 10.6
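To see what trimming buys us, here is a small sketch on made-up numbers (not the taxi data): a single extreme value pulls the plain mean far away, while the trimmed mean simply discards it.

```r
x <- c(1:10, 1000)  # ten ordinary values plus one extreme outlier
mean(x)             # about 95.9, heavily inflated by the outlier
mean(x, trim = .10) # drops the lowest and highest 10% first: 6
```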

By default, the mean function returns NA if there are any NAs in the data, but we can override that behavior with na.rm = TRUE. The same argument shows up in almost all the statistical functions we encounter in this section.

mean(nyc_taxi$trip_duration) # NAs are not ignored by default
[1] NA
mean(nyc_taxi$trip_duration, na.rm = TRUE) # removes NAs before computing the average
[1] 930

We can use weighted.mean to find a weighted average. The weights are specified as the second argument, and if we fail to specify anything for weights, we just get a simple average.

weighted.mean(nyc_taxi$tip_percent, na.rm = TRUE) # simple average
[1] 13.9
weighted.mean(nyc_taxi$tip_percent, nyc_taxi$trip_distance, na.rm = TRUE) # weighted average
[1] 9.33
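Under the hood, weighted.mean(x, w) is just sum(w * x) / sum(w); a quick check on made-up numbers (not the taxi data):

```r
x <- c(10, 20, 30)
w <- c(1, 1, 2)      # the last observation counts double
weighted.mean(x, w)  # 22.5
sum(w * x) / sum(w)  # same thing, computed by hand
```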

The sd function returns the standard deviation of the data, which is the same as returning the square root of its variance.

sd(nyc_taxi$trip_duration, na.rm = TRUE) # standard deviation
[1] 30309
sqrt(var(nyc_taxi$trip_duration, na.rm = TRUE)) # standard deviation == square root of variance
[1] 30309

We can use range to get the minimum and maximum of the data at once, or use min and max individually.

range(nyc_taxi$trip_duration, na.rm = TRUE) # minimum and maximum
[1]   -49504 32905773
c(min(nyc_taxi$trip_duration, na.rm = TRUE), max(nyc_taxi$trip_duration, na.rm = TRUE))
[1]   -49504 32905773

We can use median to return the median of the data.

median(nyc_taxi$trip_duration, na.rm = TRUE) # median
[1] 655

The quantile function is used to get any percentile of the data, where the percentile is specified by the probs argument. For example, letting probs = .5 returns the median.

quantile(nyc_taxi$trip_duration, probs = .5, na.rm = TRUE) # median == 50th percentile
50% 
655

We can specify a vector for probs to get multiple percentiles all at once. For example setting probs = c(.25, .75) returns the 25th and 75th percentiles.

quantile(nyc_taxi$trip_duration, probs = c(.25, .75), na.rm = TRUE) # IQR == difference b/w 75th and 25th percentiles
 25%  75% 
 394 1057

The difference between the 25th and 75th percentiles is called the inter-quartile range, which we can also get using the IQR function.

IQR(nyc_taxi$trip_duration, na.rm = TRUE) # interquartile range
[1] 663
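As a sanity check, IQR is nothing more than the difference between the two quartiles returned by quantile; a small example on made-up data:

```r
x <- c(1, 2, 4, 7, 11)
IQR(x)                                 # 5
diff(quantile(x, probs = c(.25, .75))) # same: 7 - 2
```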

Let's look at a common bivariate summary statistic for numeric data: correlation.

cor(nyc_taxi$trip_distance, nyc_taxi$trip_duration, use = "complete.obs")
[1] 0.0000345

We can use the method argument to switch from the (default) Pearson correlation to the Spearman rank correlation.

cor(nyc_taxi$trip_distance, nyc_taxi$trip_duration, use = "complete.obs", method = "spearman")
[1] 0.836

Why does the Spearman correlation coefficient take so much longer to compute?
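A hint: Spearman is just Pearson computed on the ranks of the data, and ranking requires sorting, which costs extra time on a large vector. A rough sketch on simulated data (not the taxi columns):

```r
set.seed(1)
x <- rnorm(1e6)
y <- x + rnorm(1e6)
system.time(cor(x, y))                      # Pearson: a single pass over the data
system.time(cor(x, y, method = "spearman")) # ranks both vectors first, then correlates
# Spearman is equivalent to Pearson on the ranks:
all.equal(cor(x, y, method = "spearman"), cor(rank(x), rank(y)))
```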

So far we've examined functions for summarizing numeric data. Let's now shift our attention to categorical data. We already saw that we can use table to get counts for each level of a factor column.

table(nyc_taxi$pickup_nhood) # one-way table
       West Village        East Village        Battery Park       Carnegie Hill 
              94222              135597               33783               43896 
           Gramercy                Soho         Murray Hill        Little Italy 
             302670               78188              127397               33254 
       Central Park   Greenwich Village             Midtown Morningside Heights 
              51726              174398              619229               19887 
             Harlem    Hamilton Heights             Tribeca   North Sutton Area 
              19185                7634               63430               39633 
    Upper East Side  Financial District              Inwood             Chelsea 
             516998               79908                 399              257451 
    Lower East Side           Chinatown  Washington Heights     Upper West Side 
              88674               12207                4857              315193 
            Clinton           Yorkville    Garment District         East Harlem 
             119654               25272              211846               12079

When we pass more than one column to table, we get counts for each combination of the factor levels. For example, with two columns we get counts for each combination of the levels of the first factor and the second factor. In other words, we get a two-way table.

two_way <- with(nyc_taxi, table(pickup_nhood, dropoff_nhood)) # two-way table: an R `matrix`
two_way[1:5, 1:5]
               dropoff_nhood
pickup_nhood    West Village East Village Battery Park Carnegie Hill Gramercy
  West Village          4942         4521         1933           283     7006
  East Village          4473        11094          914           295    17178
  Battery Park          1740          764         1105            68     1889
  Carnegie Hill          198          189          139          1651     1082
  Gramercy              7667        17681         2370          1324    36809

What about a three-way table? A three-way table (or n-way table where n is an integer) is represented in R by an object we call an array. A vector is a one- dimensional array, a matrix a two-dimensional array, and a three-way table is a kind of three-dimensional array.

arr_3d <- with(nyc_taxi, table(pickup_dow, pickup_hour, payment_type)) # a three-way table, an R 3D `array`
arr_3d
, , payment_type = card

          pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
       Sun   38825   23091    45959    64587   33087    58968    52616
       Mon    9497   59577    42446    55701   36444    75973    23487
       Tue    8604   65646    45244    58581   38099    91282    27787
       Wed    9897   68757    46888    58648   37579    94052    32446
       Thu   13230   70123    49428    61270   38504    98079    40043
       Fri   16234   67487    48839    60353   39077    93822    48112
       Sat   35901   30713    49983    67140   37521    83725    63702

, , payment_type = cash

          pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
       Sun   22586   18297    32073    44662   21843    35622    28009
       Mon    7884   30328    29672    41496   21600    37315    14346
       Tue    6699   31232    29275    40896   22274    41275    14956
       Wed    7480   31837    29232    39364   20876    41754    16340
       Thu   10007   32302    31033    41708   22599    44961    20382
       Fri   11321   31846    31891    42771   24069    51271    24614
       Sat   20196   22356    34874    49735   27156    52582    32866

Let's see how we query a three-dimensional array: since it has three dimensions, we index it across all three:

arr_3d[3, 2, 2] # give me the 3rd row, 2nd column, 2nd 'page'
[1] 31232

Just as with a data.frame, leaving out the index for one of the dimensions returns all the values for that dimension.

arr_3d[ , , 2]
          pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
       Sun   22586   18297    32073    44662   21843    35622    28009
       Mon    7884   30328    29672    41496   21600    37315    14346
       Tue    6699   31232    29275    40896   22274    41275    14956
       Wed    7480   31837    29232    39364   20876    41754    16340
       Thu   10007   32302    31033    41708   22599    44961    20382
       Fri   11321   31846    31891    42771   24069    51271    24614
       Sat   20196   22356    34874    49735   27156    52582    32866

We can use the names of the dimensions instead of their numeric index:

arr_3d['Tue', '5AM-9AM', 'cash']
[1] 31232

We can turn the array representation into a data.frame representation:

df_arr_3d <- as.data.frame(arr_3d) # same information, formatted as data frame
head(df_arr_3d)
  pickup_dow pickup_hour payment_type  Freq
1        Sun     1AM-5AM         card 38825
2        Mon     1AM-5AM         card  9497
3        Tue     1AM-5AM         card  8604
4        Wed     1AM-5AM         card  9897
5        Thu     1AM-5AM         card 13230
6        Fri     1AM-5AM         card 16234

We can subset the data.frame using the subset function:

subset(df_arr_3d, pickup_dow == 'Tue' & pickup_hour == '5AM-9AM' & payment_type == 'cash')
   pickup_dow pickup_hour payment_type  Freq
59        Tue     5AM-9AM         cash 31232

Notice how the array notation is more terse, but not as readable (because we need to remember the order of the dimensions).

We can use apply to get aggregates of a multidimensional array across some dimension(s).

dim(arr_3d)
[1] 7 7 2

The second argument to apply (the margin) specifies which dimension(s) to preserve; the function is then applied over all the remaining dimensions.

apply(arr_3d, 2, sum) # totals per `pickup_hour`: preserve dimension 2, sum over the rest
 1AM-5AM  5AM-9AM 9AM-12PM 12PM-4PM  4PM-6PM 6PM-10PM 10PM-1AM 
  218361   583592   546837   726912   420728   900681   439706

Once again, when the dimensions have names it is better to use the names instead of the numeric index.

apply(arr_3d, "pickup_hour", sum) # same as above, but more readable notation
 1AM-5AM  5AM-9AM 9AM-12PM 12PM-4PM  4PM-6PM 6PM-10PM 10PM-1AM 
  218361   583592   546837   726912   420728   900681   439706

So in the above example, we used apply to collapse a 3D array into a vector of counts, one per pick-up hour, by summing over the other two dimensions (day of week and payment type).
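To make the margin logic concrete, here's a toy example on a made-up array (not the taxi data) showing how the choice of margin determines what survives the collapse:

```r
arr <- array(1:24, dim = c(2, 3, 4),
             dimnames = list(row = c("a", "b"),
                             col = c("x", "y", "z"),
                             page = paste0("p", 1:4)))
apply(arr, 1, sum)       # keep dimension 1: one total per row
apply(arr, c(1, 2), sum) # keep dimensions 1 and 2: a 2-by-3 matrix of totals
```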

We can use prop.table to turn the counts returned by table into proportions. prop.table takes an optional second argument (the margin); when we leave it out, proportions are computed out of the grand total of the table.

prop.table(arr_3d) # as a proportion of the grand total
, , payment_type = card

          pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
       Sun 0.01012 0.00602  0.01198  0.01683 0.00862  0.01537  0.01371
       Mon 0.00248 0.01553  0.01106  0.01452 0.00950  0.01980  0.00612
       Tue 0.00224 0.01711  0.01179  0.01527 0.00993  0.02379  0.00724
       Wed 0.00258 0.01792  0.01222  0.01529 0.00979  0.02451  0.00846
       Thu 0.00345 0.01828  0.01288  0.01597 0.01004  0.02556  0.01044
       Fri 0.00423 0.01759  0.01273  0.01573 0.01018  0.02445  0.01254
       Sat 0.00936 0.00800  0.01303  0.01750 0.00978  0.02182  0.01660

, , payment_type = cash

          pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
       Sun 0.00589 0.00477  0.00836  0.01164 0.00569  0.00928  0.00730
       Mon 0.00205 0.00790  0.00773  0.01082 0.00563  0.00973  0.00374
       Tue 0.00175 0.00814  0.00763  0.01066 0.00581  0.01076  0.00390
       Wed 0.00195 0.00830  0.00762  0.01026 0.00544  0.01088  0.00426
       Thu 0.00261 0.00842  0.00809  0.01087 0.00589  0.01172  0.00531
       Fri 0.00295 0.00830  0.00831  0.01115 0.00627  0.01336  0.00642
       Sat 0.00526 0.00583  0.00909  0.01296 0.00708  0.01370  0.00857

For proportions out of marginal totals, we provide the second argument to prop.table. For example, specifying 1 as the second argument gives us proportions out of "row" totals. Recall that in a 3D object, a "row" is a 2D object: for example, arr_3d[1, , ] is the first "row", arr_3d[2, , ] is the second "row", and so on.

prop.table(arr_3d, 1) # as a proportion of 'row' totals, or marginal totals for the first dimension
, , payment_type = card

          pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
       Sun  0.0746  0.0444   0.0883   0.1242  0.0636   0.1134   0.1011
       Mon  0.0196  0.1226   0.0874   0.1147  0.0750   0.1564   0.0484
       Tue  0.0165  0.1258   0.0867   0.1123  0.0730   0.1749   0.0532
       Wed  0.0185  0.1285   0.0876   0.1096  0.0702   0.1757   0.0606
       Thu  0.0231  0.1222   0.0862   0.1068  0.0671   0.1710   0.0698
       Fri  0.0274  0.1141   0.0825   0.1020  0.0660   0.1586   0.0813
       Sat  0.0590  0.0505   0.0821   0.1103  0.0617   0.1376   0.1047

, , payment_type = cash

          pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
       Sun  0.0434  0.0352   0.0617   0.0859  0.0420   0.0685   0.0538
       Mon  0.0162  0.0624   0.0611   0.0854  0.0445   0.0768   0.0295
       Tue  0.0128  0.0598   0.0561   0.0784  0.0427   0.0791   0.0287
       Wed  0.0140  0.0595   0.0546   0.0736  0.0390   0.0780   0.0305
       Thu  0.0174  0.0563   0.0541   0.0727  0.0394   0.0784   0.0355
       Fri  0.0191  0.0538   0.0539   0.0723  0.0407   0.0866   0.0416
       Sat  0.0332  0.0367   0.0573   0.0817  0.0446   0.0864   0.0540

We can confirm this by using apply to sum over everything but the first dimension, checking that the proportions for each "row" add up to 1.

apply(prop.table(arr_3d, 1), 1, sum) # check that across rows, proportions add to 1
Sun Mon Tue Wed Thu Fri Sat 
  1   1   1   1   1   1   1

Similarly, if the second argument to prop.table is 2, we get proportions that add up to 1 across the values of the second dimension. Since the second dimension corresponds to pick-up hour, for each pick-up hour we get the proportion of observations that fall into each pick-up day of week and payment type combination.

prop.table(arr_3d, 2) # as a proportion of column totals
, , payment_type = card

          pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
       Sun  0.1778  0.0396   0.0840   0.0889  0.0786   0.0655   0.1197
       Mon  0.0435  0.1021   0.0776   0.0766  0.0866   0.0844   0.0534
       Tue  0.0394  0.1125   0.0827   0.0806  0.0906   0.1013   0.0632
       Wed  0.0453  0.1178   0.0857   0.0807  0.0893   0.1044   0.0738
       Thu  0.0606  0.1202   0.0904   0.0843  0.0915   0.1089   0.0911
       Fri  0.0743  0.1156   0.0893   0.0830  0.0929   0.1042   0.1094
       Sat  0.1644  0.0526   0.0914   0.0924  0.0892   0.0930   0.1449

, , payment_type = cash

          pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
       Sun  0.1034  0.0314   0.0587   0.0614  0.0519   0.0396   0.0637
       Mon  0.0361  0.0520   0.0543   0.0571  0.0513   0.0414   0.0326
       Tue  0.0307  0.0535   0.0535   0.0563  0.0529   0.0458   0.0340
       Wed  0.0343  0.0546   0.0535   0.0542  0.0496   0.0464   0.0372
       Thu  0.0458  0.0554   0.0568   0.0574  0.0537   0.0499   0.0464
       Fri  0.0518  0.0546   0.0583   0.0588  0.0572   0.0569   0.0560
       Sat  0.0925  0.0383   0.0638   0.0684  0.0645   0.0584   0.0747

Once again, we can double-check this with apply:

apply(prop.table(arr_3d, 2), 2, sum) # check that across columns, proportions add to 1
 1AM-5AM  5AM-9AM 9AM-12PM 12PM-4PM  4PM-6PM 6PM-10PM 10PM-1AM 
       1        1        1        1        1        1        1

Finally, if the second argument to prop.table is 3, we get proportions that add up to 1 across the values of the 3rd dimension. So for each payment type, the proportions now add up to 1.

prop.table(arr_3d, 3) # as a proportion of totals across third dimension
, , payment_type = card

          pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
       Sun 0.01606 0.00955  0.01901  0.02672 0.01369  0.02440  0.02177
       Mon 0.00393 0.02465  0.01756  0.02304 0.01508  0.03143  0.00972
       Tue 0.00356 0.02716  0.01872  0.02424 0.01576  0.03777  0.01150
       Wed 0.00409 0.02845  0.01940  0.02426 0.01555  0.03891  0.01342
       Thu 0.00547 0.02901  0.02045  0.02535 0.01593  0.04058  0.01657
       Fri 0.00672 0.02792  0.02021  0.02497 0.01617  0.03882  0.01991
       Sat 0.01485 0.01271  0.02068  0.02778 0.01552  0.03464  0.02636

, , payment_type = cash

          pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
       Sun 0.01591 0.01289  0.02259  0.03146 0.01538  0.02509  0.01973
       Mon 0.00555 0.02136  0.02090  0.02923 0.01521  0.02628  0.01010
       Tue 0.00472 0.02200  0.02062  0.02880 0.01569  0.02907  0.01053
       Wed 0.00527 0.02242  0.02059  0.02773 0.01470  0.02941  0.01151
       Thu 0.00705 0.02275  0.02186  0.02938 0.01592  0.03167  0.01436
       Fri 0.00797 0.02243  0.02246  0.03013 0.01695  0.03611  0.01734
       Sat 0.01422 0.01575  0.02456  0.03503 0.01913  0.03704  0.02315

Both prop.table and apply also accept combinations of dimensions as the second argument. This makes them powerful aggregation tools, as long as we're careful. For example, letting the second argument be c(1, 2) gives us proportions that add up to 1 for each combination of "row" and "column". In other words, we get the percentage of card vs. cash payments for each pick-up day of week and hour combination.

prop.table(arr_3d, c(1, 2)) # as a proportion of totals for each combination of 1st and 2nd dimensions
, , payment_type = card

          pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
       Sun   0.632   0.558    0.589    0.591   0.602    0.623    0.653
       Mon   0.546   0.663    0.589    0.573   0.628    0.671    0.621
       Tue   0.562   0.678    0.607    0.589   0.631    0.689    0.650
       Wed   0.570   0.684    0.616    0.598   0.643    0.693    0.665
       Thu   0.569   0.685    0.614    0.595   0.630    0.686    0.663
       Fri   0.589   0.679    0.605    0.585   0.619    0.647    0.662
       Sat   0.640   0.579    0.589    0.574   0.580    0.614    0.660

, , payment_type = cash

          pickup_hour
pickup_dow 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
       Sun   0.368   0.442    0.411    0.409   0.398    0.377    0.347
       Mon   0.454   0.337    0.411    0.427   0.372    0.329    0.379
       Tue   0.438   0.322    0.393    0.411   0.369    0.311    0.350
       Wed   0.430   0.316    0.384    0.402   0.357    0.307    0.335
       Thu   0.431   0.315    0.386    0.405   0.370    0.314    0.337
       Fri   0.411   0.321    0.395    0.415   0.381    0.353    0.338
       Sat   0.360   0.421    0.411    0.426   0.420    0.386    0.340
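We can run the same kind of sanity check as before: with c(1, 2) as the margin, the proportions should add up to 1 across the remaining (third) dimension, meaning the card and cash shares sum to 1 in every day-hour cell. A small sketch on a made-up array (not the taxi data):

```r
set.seed(1)
counts <- array(rpois(12, lambda = 50) + 1, dim = c(2, 3, 2)) # fake counts
p <- prop.table(counts, margin = c(1, 2))
apply(p, c(1, 2), sum) # a 2-by-3 matrix of 1s
```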
