Getting started

Before we begin any analysis, we need to decide which tools to use. For exploratory data analysis, R can be a great choice: its syntax is succinct and readable, and its large ecosystem of packages works to our advantage. However, R is not without shortcomings. A data.frame, R's representation of tabular data, must, like any other R object, be loaded into memory, so when datasets get large we quickly run out of available memory. Moreover, most analytics and modeling algorithms in R work only with a data.frame, but very large datasets are often stored in a distributed environment such as a Hadoop or Spark cluster, or on SQL Server. In such cases, we need algorithms that can work directly with data on disk or data distributed across a cluster, and that scale well as data sizes grow. In this module, we will see how the RevoScaleR package fills this need.
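As a first taste of what this looks like, here is a minimal sketch of the RevoScaleR workflow. It assumes RevoScaleR is installed (it ships with Microsoft R Client / Machine Learning Server rather than CRAN), and the file name and column names (`flights.csv`, `dep_delay`, `arr_delay`, `distance`) are hypothetical placeholders. The data is converted to the on-disk XDF format and then analyzed chunk by chunk, without ever loading the whole dataset into memory.

```r
library(RevoScaleR)  # bundled with Microsoft R Client / Machine Learning Server

# Import a (hypothetical) CSV into an XDF file; rows are read and written in chunks,
# so the full dataset never has to fit in memory at once.
flights_xdf <- rxImport(inData = "flights.csv",   # hypothetical input file
                        outFile = "flights.xdf",
                        overwrite = TRUE)

# Summary statistics computed chunk by chunk, directly against the XDF file on disk.
rxSummary(~ dep_delay + arr_delay, data = flights_xdf)

# Scalable modeling functions accept the same on-disk data source.
rxLinMod(arr_delay ~ dep_delay + distance, data = flights_xdf)
```

The same `rx*` functions can point at other data sources (for example a SQL Server table or a Spark cluster) by swapping the data source object, which is what makes the approach scale beyond a single machine's memory.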
