Solutions

(1) The function shown here has nclus as its only named argument, but we also use ... so that any argument kmeans accepts can be passed through find_wss. This is helpful if we want to change nstart or the data itself.

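find_wss <- function(nclus, ...) {
  st <- Sys.time()
  # Arguments in ... (e.g. x, iter.max, nstart) are forwarded to kmeans()
  res <- sum(kmeans(centers = nclus, ...)$withinss)
  # Force seconds: a bare Sys.time() - st picks its own units (secs, mins, ...)
  elapsed <- as.numeric(difftime(Sys.time(), st, units = "secs"))
  print(sprintf("nclus = %d, runtime = %3.2f seconds", nclus, elapsed))
  res
}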

find_wss(nclus = 10, x = xydata, iter.max = 500, nstart = 1)
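To see the ... forwarding in isolation, here is a minimal sketch on synthetic data. The names toy_xy and wss_for are hypothetical and only stand in for xydata and find_wss; the sketch also reads tot.withinss, which kmeans reports as the sum of withinss.

```r
set.seed(42)

# Hypothetical toy data: two well-separated 2-D blobs (a stand-in for xydata)
toy_xy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2)
)

# Same pattern as find_wss without the timing:
# everything captured by ... is handed to kmeans() unchanged
wss_for <- function(nclus, ...) {
  kmeans(centers = nclus, ...)$tot.withinss
}

# Only the data is passed through ...
wss_1 <- wss_for(nclus = 1, x = toy_xy)

# Data plus extra kmeans arguments are passed through ...
wss_2 <- wss_for(nclus = 2, x = toy_xy, nstart = 5, iter.max = 100)
```

Because wss_for never names x, nstart or iter.max itself, any of them can be changed at the call site without touching the function.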

(2) We use sapply to run the above function in a loop, which keeps the notation clean and easy to modify. We then use ggplot2 to plot the results. Another interesting thing to notice is that as nclus goes up, the function takes longer and longer to run. This has implications for building the clusters on the whole data: the number of clusters we want to build can add significantly to the runtime.

wss <- sapply(nclus_seq, find_wss, x = xydata, iter.max = 500, nstart = 1)

ggplot(aes(x = x, y = y), data = data.frame(x = nclus_seq, y = wss)) +
  geom_line() +
  xlab("number of clusters") +
  ylab("within clusters sum of squares")
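If we want the runtimes as data rather than printed messages, one option is to collect them with system.time. This is only a sketch under assumptions: toy_xy and toy_seq are hypothetical stand-ins for xydata and nclus_seq, and the timings will differ from machine to machine.

```r
set.seed(1)

# Hypothetical stand-ins for xydata and nclus_seq
toy_xy  <- matrix(rnorm(4000), ncol = 2)
toy_seq <- c(2, 10, 50)

# Elapsed wall-clock seconds for one kmeans run per cluster count
timings <- sapply(toy_seq, function(k) {
  system.time(
    kmeans(toy_xy, centers = k, iter.max = 500, nstart = 1)
  )[["elapsed"]]
})

# One row per cluster count, ready to plot or inspect
runtime_df <- data.frame(nclus = toy_seq, seconds = timings)
print(runtime_df)
```

The resulting data frame can be plotted the same way as wss to see how runtime grows with the number of clusters.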

WSS decline

As the plot shows, beyond about 250 clusters the total WSS starts to decrease very slowly.
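The slowdown can also be read off numerically by looking at the fraction of WSS removed by each step up in the number of clusters. The sketch below does this on hypothetical toy data (toy_xy, toy_seq and toy_wss are not part of the original exercise); it illustrates the idea and is not a rule for choosing the number of clusters.

```r
set.seed(7)

# Hypothetical toy data: three well-separated 2-D blobs
toy_xy <- rbind(
  matrix(rnorm(200, mean = 0),  ncol = 2),
  matrix(rnorm(200, mean = 6),  ncol = 2),
  matrix(rnorm(200, mean = 12), ncol = 2)
)

# Total WSS for each candidate number of clusters
toy_seq <- 1:8
toy_wss <- sapply(toy_seq, function(k) {
  kmeans(toy_xy, centers = k, nstart = 10)$tot.withinss
})

# Fraction of the previous WSS removed by each step in toy_seq
rel_drop <- -diff(toy_wss) / head(toy_wss, -1)

elbow_df <- data.frame(
  from     = head(toy_seq, -1),
  to       = tail(toy_seq, -1),
  rel_drop = round(rel_drop, 3)
)
print(elbow_df)
```

Steps where rel_drop becomes small correspond to the flat part of the curve in the plot.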
