Exercises
In this exercise, we will be using the `nyc_jan_xdf` data from prior exercises. If you need to re-load the data, run the following code:
```r
library(RevoScaleR) # attached by default in Microsoft R Server, shown here for completeness

input_csv <- 'yellow_tripsample_2016-01.csv'
input_xdf <- 'yellow_tripsample_2016-01.xdf'
rxImport(input_csv, input_xdf, overwrite = TRUE) # convert the CSV to the XDF format
nyc_jan_xdf <- RxXdfData(input_xdf)              # create a pointer to the XDF file
```
In the last section, we used the `kmeans` function to build clusters on the sample data, and used the centroids it returned to initialize the `rxKmeans` function, which builds the clusters on the whole data. One thing we took for granted is the number of clusters to build. Our decision to build 300 clusters was somewhat arbitrary, based on the gut feeling that we expect about 300 "drop-off hubs" in Manhattan, i.e. points where taxis often drop passengers off. In this exercise, we provide a little more backing for our choice of the number of clusters.
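As a quick reminder, the pattern looked roughly like this. This is a minimal sketch, not the exact code from the last section: it assumes the `xydropoff` sample built in the code below, and that `rxKmeans` recreates the rescaled coordinates on the full data via its `transforms` argument.

```r
# A sketch of the two-step pattern: kmeans finds centroids on the sample,
# and those centroids seed rxKmeans on the full XDF data.
kmeans_sample <- kmeans(xydropoff, centers = 300, iter.max = 2000, nstart = 10)
rxkm_full <- rxKmeans(~ long_std + lat_std, data = nyc_jan_xdf,
                      centers = kmeans_sample$centers,
                      transforms = list(long_std = dropoff_longitude / -74,
                                        lat_std = dropoff_latitude / 40))
```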
Let's go back to the sample data and the `kmeans` function, as shown here. We made two changes to the function call:

- we let the number of clusters vary based on what we pick for `nclus`
- by setting `nstart = 1`, we initialize the clusters only once, which makes the call run much faster at the expense of less "stable" clusters (which we don't care about in this case)
```r
# take a roughly 10 percent random sample of drop-off coordinates,
# rescaled so that longitude and latitude are on comparable scales
xydropoff <- rxDataStep(nyc_jan_xdf,
                        rowSelection = (u < .1),
                        transforms = list(u = runif(.rxNumRows),
                                          long_std = dropoff_longitude / -74,
                                          lat_std = dropoff_latitude / 40),
                        varsToKeep = c("dropoff_longitude", "dropoff_latitude"))
xydropoff <- xydropoff[ , c("long_std", "lat_std")] # keep only the rescaled columns

nclus <- 50
kmeans_nclus <- kmeans(xydropoff, centers = nclus, iter.max = 2000, nstart = 1)
sum(kmeans_nclus$withinss) # total within-cluster sum of squares
```

```
[1] 0.0005410579
```
The number we extracted is the sum of the within-cluster sums of squares over all the clusters (the `kmeans` object also stores it directly as `tot.withinss`). The within-cluster sum of squares (WSS for short) is a measure of how much variability there is within each cluster: a lower WSS indicates a more homogeneous cluster. However, we don't care about this metric per cluster; we simply sum it over all the clusters. When the number of clusters we build is small, individual clusters are less homogeneous, making the total WSS larger. When we build a large number of clusters, the opposite is true. Therefore, total WSS generally drops as `nclus` increases, but there is a point beyond which increasing `nclus` results in smaller and smaller drops in total WSS. In other words, beyond that point building more clusters is not worth the cost of the added complexity (having more clusters makes it harder to tell them apart).
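To see this pattern on a small scale before trying it on the taxi data, here is a quick illustration using R's built-in `faithful` data set (an aside, not part of the exercise):

```r
# total WSS drops steeply for the first few clusters, then levels off
set.seed(42)
toy_wss <- sapply(1:10, function(k) {
  kmeans(faithful, centers = k, nstart = 5)$tot.withinss
})
round(toy_wss)
```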
(1) Write a function called `find_wss` that takes as input the number of clusters we want to build, represented by `nclus` in the above code, and returns the total WSS. Additionally, your function should print (but not return) the input `nclus` as well as how long it takes to run. (One possible outline is sketched after exercise (2) below.)
(2) Demonstrate the concept of decreasing total WSS as we assign larger and larger values to `nclus` by letting `nclus` loop through the increasing sequence of numbers given by `nclus_seq <- seq(20, 1000, by = 20)`, and running `find_wss` in each case. You can do so by plotting the results (see the sketch below).
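Here is one possible outline for exercise (1). It is a sketch rather than the definitive solution; it assumes the `xydropoff` sample built above and uses `system.time` to measure how long each call takes:

```r
# one possible find_wss: prints nclus and the elapsed time,
# and returns the total WSS for that number of clusters
find_wss <- function(nclus) {
  print(nclus)
  elapsed <- system.time(
    km <- kmeans(xydropoff, centers = nclus, iter.max = 2000, nstart = 1)
  )
  print(elapsed)
  sum(km$withinss) # the return value: total WSS
}
```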
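And a sketch for exercise (2), assuming the `find_wss` above:

```r
# run find_wss over an increasing sequence of cluster counts and plot
# total WSS against nclus to look for the point of diminishing returns
nclus_seq <- seq(20, 1000, by = 20)
wss_seq <- sapply(nclus_seq, find_wss)
plot(nclus_seq, wss_seq, type = "b",
     xlab = "number of clusters (nclus)", ylab = "total WSS")
```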