Deploying to SQL Server (demo)

A basic overview of the SQL Server R Services architecture can be found here. Let's point to a SQL table containing a copy of the NYC Taxi dataset. The first thing we need to do is set up a SQL Server connection string, which contains our SQL login credentials. Since the connection string contains sensitive information, it is usually stored in a file in a restricted location and read from R, but in our example we will simply hard-code the connection string and store it in sqlConnString. Assume, the NYC Taxi dataset is stored in a table called NYCTaxiSmall inside the RDB database that the connection string points to. The last thing left for us to do is to point to the table, which we do with the RxSqlServerData function. This is the equivalent of RxXdfData when pointing to an XDF file stored on disk.

input_xdf <- 'yellow_tripdata_2016_manhattan.xdf'
nyc_xdf <- RxXdfData(input_xdf)

sqlConnString <- "Driver=SQL Server;Server=SERVERNAME;Database=RDB;Uid=ruser;Pwd=ruser"
sqlRowsPerRead <- 100000
sqlTable <- "NYCTaxiSmall"

nyc_sql <- RxSqlServerData(connectionString = sqlConnString, rowsPerRead = sqlRowsPerRead, 
                           table = sqlTable)

For the sake of illustration, we now dump the content of nyc_xdf into the SQL table represented by nyc_sql (which in called NYCTaxiSmall in the SQL database). If the XDF file in question is large, this can take a while.

  rxDataStep(nyc_xdf, nyc_sql, overwrite = TRUE)

That's it. We can now use nyc_sql the same way we used nyc_xdf before. There is however something missing: we did not specify what the column types were. In this case, RxSqlServerData will try as best it can to convert a SQL Server column type to an R column type. This can cause problems though. First of all, SQL Server has a richer variety of column types than R. Second, some SQL Server column types like datetime for example don't always successfully transfer to their corresponding R column type. Third, the R column type factor does not really have a good equivalent in SQL Server, so in order for a column to be brought in as factor we must manually specify it. Doing so however gives us the advantage that we can also specify the levels and labels for it, and as we saw they don't always have to be the exact levels we see in the data. For example, if payment_type is represented by the integers 1 through 5 in the data, but we only care about 1 and 2 and want them labeled card and cash respectively, we can do that here without needing to do it later as a separate transformation. To deal with column types we create an object that stores the information about the columns and pass it to the colInfo argument in RxSqlServerData. Here's the example for nyc_sql:

ccColInfo <- list(
  tpep_pickup_datetime = list(type = "character"),
  tpep_dropoff_datetime = list(type = "character"),
  passenger_count = list(type = "integer"),
  trip_distance = list(type = "numeric"),
  pickup_longitude = list(type = "numeric"),
  pickup_latitude = list(type = "numeric"),
  dropoff_longitude = list(type = "numeric"),
  dropoff_latitude = list(type = "numeric"),
  RateCodeID = list(type = "factor", levels = as.character(1:6), newLevels = c("standard", "JFK", "Newark", "Nassau or Westchester", "negotiated", "group ride")),
  store_and_fwd_flag = list(type = "factor", levels = c("Y", "N")),
  payment_type = list(type = "factor", levels = as.character(1:2), newLevels = c("card", "cash")),
  fare_amount = list(type = "numeric"),
  tip_amount = list(type = "numeric"),
  total_amount = list(type = "numeric")

Notice that the object above does not necessarily have to specify the types for each column in the data. We can limit it only to the columns of interest, and even then only the ones that need to be explicitly overwritten. However, since we tend to be more conservative in a production environment, it's best to be more explicit. After all, certain numeric columns in SQL Server could be stored as something different (VARCHAR for example) which would turn into a character column in R.

In addition to the columns that were in the original data, we also need to specify the column types for columns that we added to the data throughout the analysis. Let's begin with the date time columns:

weekday_labels <- c('Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat')
hour_labels <- c('1AM-5AM', '5AM-9AM', '9AM-12PM', '12PM-4PM', '4PM-6PM', '6PM-10PM', '10PM-1AM')

ccColInfo$pickup_dow <- list(type = "factor", levels = weekday_labels)
ccColInfo$pickup_hour <- list(type = "factor", levels = hour_labels)
ccColInfo$dropoff_dow <- list(type = "factor", levels = weekday_labels)
ccColInfo$dropoff_hour <- list(type = "factor", levels = hour_labels)

When working with the XDF file in the previous weeks, we went back and forth quite a few times to get the data in the right format, especially where factor columns were concerned. This is because we were learning about the data as we went and gradually preparing it for analysis. When working in a production environment however, the assumption is that we have our EDA behind us and know quite a bit about the data already. If not, then a recommended approach would be to take a sample of the data first and run some EDA on it. So some of the steps that we took with the XDF file in the prior weeks may have contained some redundancy or inneficiencies, which we never bothered to go back and revise. But when deploying code in production it's a good idea to make a second pass at the code and simplify things wherever it's warranted. As an example, when working with the XDF file, we first wrote a function to extract the pickup_nhood and dropoff_nhood columns from the pick-up and drop-off coordinates. We then noticed that those columns contain neighborhoods outside of Manhattan limits (our area of interest), so we made a second pass through the data to remove the factor levels for the irrelevant neighborhoods. With nyc_sql, we could take a similar approach: read those columns as factors with levels as is, and then use rxDataStep to perform a transformation that removes unwanted factor levels. But doing so is inefficient. The better approach is to find all the relevant factor levels (Manhattan neighborhoods, which we can get directly from the shapefile) and in the ccColInfo object only specify those as levels for those columns. Here's how:

nyc_shapefile <- readShapePoly('ZillowNeighborhoods-NY/ZillowNeighborhoods-NY.shp')
mht_shapefile <- subset(nyc_shapefile, str_detect(CITY, 'New York City-Manhattan'))
manhattan_nhoods <- as.character([email protected]$NAME)

ccColInfo$pickup_nhood <- list(type = "factor", levels = manhattan_nhoods)
ccColInfo$dropoff_nhood <- list(type = "factor", levels = manhattan_nhoods)

We are now ready to point to the SQL table a second time, but this time specify how columns should be treated in R using the colInfo argument.

nyc_sql <- RxSqlServerData(connectionString = sqlConnString, table = sqlTable, rowsPerRead = sqlRowsPerRead, colInfo = ccColInfo)

Just recall that every time we make a change to ccColInfo, we need to rerun the above line so that the change is reflected. For example, later (after running the seriate function), we can reorder the factor levels so that instead of being alphabetically ordered as they are now, they can follow a more natural ordering based on proximity to each other.

At this point, the rest of the analysis is no different from what it was with the XDF file, so we can change nyc_xdf into nyc_sql and run the remaining code just like before. For example, we can start with rxGetInfo to double check the column types.

rxGetInfo(nyc_sql, getVarInfo = TRUE, numRows = 3) # show column types and the first 10 rows

Let's now run rxSummary on all the columns:

  rxsum_sql <- rxSummary( ~ ., nyc_sql) # provide statistical summaries for all the columns

We get our summary back, but something important is missing. We have not yet set the compute context to the remote SQL Server session. Although we got our summary back, because the compute context was set to the local R session (the default) rxSummary had to download the data (using an ODBC connection) to the local R session so that it could summarize it. In-database analytics however is about taking the data to the computation, not the other way around. So let's now set the compute context to the remote SQL Server session (using rxSetComputeContext) and rerun rxSummary.

# Set ComputeContext. Needs a temp directory path to serialize R objects back and forth
sqlShareDir <- paste("C:/AllShare/", Sys.getenv("USERNAME"), sep = "")
sqlWait <- TRUE
sqlConsoleOutput <- TRUE
sqlCC <- RxInSqlServer(connectionString = sqlConnString, shareDir = sqlShareDir,
                    wait = sqlWait, consoleOutput = sqlConsoleOutput)

  rxsum_sql <- rxSummary( ~ ., nyc_sql) # provide statistical summaries for all the columns

We can set the compute context back to local anytime we need to by running rxSetComputeContext(RxLocalSeq()). We can also run rxGetComputeContext() to see what the current compute context is. The difference in run time between rxSummary when the compute context is set to local and when it is set to SQL depends mostly on the size of the data and the speed at which it can travel over the network to reach the local R session (when the compute context is set to local). For large enough data sizes, this difference can be dramatic. By avoiding this cost, in-database analytics means we can greatly reduce our runtime.

There are however some limitations that we need to be aware of. Certain functions, such as rxMerge, rxSort or rxSplit only work with XDF files on the local file system, not with data sitting in Spark or SQL Server. This is because the common data processing functions already have their (probably more efficient) implementation, so we can simply defer to the SQL language if we need to join tables, sort tables, or split tables instead of the above-mentioned RevoScaleR functions.

As an example, let's run the same linear model we build on the XDF file now using the SQL table. We're going to build the model on 75 percent of the data (the training data) by creating a column u of random uniform numbers and using rowSelection only picking rows where u < .75.

system.time(linmod <- rxLinMod(tip_percent ~ pickup_nhood:dropoff_nhood + pickup_dow:pickup_hour, 
                               data = nyc_sql, reportProgress = 0,
                               rowSelection = (split == "train")))

We now point to a new SQL table called NYCTaxiScore (our pointer to it in R will be called nyc_score). We then use rxPredict to score the data.

sqlTable <- "NYCTaxiScore"
nyc_score <- RxSqlServerData(connectionString = sqlConnString, rowsPerRead = sqlRowsPerRead, 
                           table = sqlTable)

rxPredict(trained.models$linmod, data = nyc_sql, outData = nyc_score, predVarNames = "tip_percent_pred_linmod", overwrite = TRUE)

Since building models can be a tedious and sometimes time-consuming process, we usually save models that we build so that we can use them for subsequent scoring (using rxPredict). Saving models is important because we often score new data on a regular basis using the same model. Saving models in R is easy: we simply use the save function, but saving models in SQL using the save function may break the rules about saving external objects on the server. Foretunately, there's a way we can save R objects in R (not just models but images or any other objects) by inserting them into the database. In the following example, we begin by creating a table called models with a single column called model whose type is varbinary(max). Moreover, we create a stored procedure in SQL for inserting a model object into this table.

(model varbinary(max))

CREATE PROCEDURE [dbo].[PersistModel]
@m nvarchar(max)
-- SET NOCOUNT ON added to prevent extra result sets from interfering with SELECT statements.
insert into models (model) values (convert(varbinary(max),@m,2))

From R, we can now insert a serialized model object into the above SQL table. To serialize an R object we use the serialize function. We then concatenate the serial object into a single (usually long) character vector (of length one) which we can insert into the above table. To run any random SQL query in R we can use the sqlQuery function as shown here. In this case, the query simply executes the stored procedure to store the model object in the table we created in the last step.

modelbin <- serialize(linmod, NULL)
modelbinstr <- paste(modelbin, collapse = "")

# Persist model by calling a stored procedure from SQL
conn <- odbcDriverConnect(sqlConnString)
q <- paste("EXEC PersistModel @m='", modelbinstr,"'", sep = "")
sqlQuery(conn, q)

Once the model object is stored in SQL, to open it in R, we must unserialize the object (using the unserialize R function). After unserializing it, we obtain the original R object back which can be used to score. So if serializing is the SQL equivalent to the save function, unserializing is the equivalent to load. We've already seen an example of scoring in dataset using rxPredict. Here's an example of doing the same, but by invoking a SQL stored procedure instead of running it directly within R. The stored procedure PredictTipBatchMode will score the NYCTaxiSmall dataset.

-- Create prediction stored procedure
CREATE PROCEDURE [dbo].[PredictTipBatchMode] @inquery nvarchar(max)
DECLARE @lmodel2 varbinary(max) = (SELECT TOP 1
FROM models);
EXEC sp_execute_external_script @language = N'R',
@script = N'
mod <- unserialize(as.raw(model))
OutputDataSet <- rxPredict(modelObject = mod, data = InputDataSet, outData = NULL,
  predVarNames = "Score", type = "response", writeModelVars = FALSE,
  overwrite = TRUE)
@input_data_1 = @inquery,
@params = N'@model varbinary(max)',
@model = @lmodel2
WITH RESULT SETS ((Score float));

As a SQL stored procedure, it can be invoked by an application that talks to SQL. Here's how we can run the stored procedure from T-SQL:

DECLARE @query_string nvarchar(max)
SET @query_string='SELECT *
FROM NYCTaxiSmall'
-- Call stored procedure for scoring
EXEC [dbo].[PredictTipBatchMode] @inquery = @query_string;

Of course the same stored procedure can be invokes directly by R, as such:

# Define input data using a SQL query. Here we use the table, NYCTaxiDirectDistFeatures,
# that was created earlier to store the engineered feature, direct_distance.
input = "N' SELECT * FROM NYCTaxiSmall'"
# The query is sent to be executed with input data, scored data frame is returned
q <- paste("EXEC PredictTipBatchMode @inquery = ", input, sep = "")
scoredData <- sqlQuery(conn, q)

results matching ""

    No results matching ""