Bash for Windows 10 and Microsoft R Open

Microsoft R Open
There are currently four variations of R to consider:

• Microsoft R Server
• Microsoft R Open
• SQL Server R Services
• Microsoft R Client

This article, derived from a narrative by David Salgado (see the References section below), uses the Bash environment in Windows 10 to examine the elementary steps of predictive modeling (a.k.a. machine learning) with Microsoft R Open. Another variation, Microsoft R Client, will be examined at a later date to illustrate ScaleR technology with suitable data sets.

Install the tools
The prerequisite for this exercise is the installation of Bash on Windows, which provides the command-line utilities of a basic Ubuntu environment. Obviously, Microsoft R Open is not dependent on this environment; this exercise merely illustrates the portability of R and the new Windows 10 feature that supports Linux natively. The installation guide for this environment is noted in the References section below. After completing the installation, open a command prompt window and enter bash to launch the environment:
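
bash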

Obtain the installation file, microsoft-r-open-3.3.1.tar.gz, from the Microsoft site at mran.revolutionanalytics.com as illustrated below. As a practical measure, you should change the current directory to a work folder first. The version number of the archive shown below is 3.3.1; this version may be superseded by the time you read this note.
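
A typical retrieval with wget might look like the following; note that the exact download path on MRAN is illustrative and may have changed since this was written:

wget https://mran.revolutionanalytics.com/install/mro/3.3.1/microsoft-r-open-3.3.1.tar.gz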

It is prudent to check that the download was successful by displaying the contents of the folder with the ls command, which should show the installation file, microsoft-r-open-3.3.1.tar.gz:
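
ls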

The command to extract the files from the installation archive is:

tar -xf microsoft-r-open-3.3.1.tar.gz

The example assumes that the installation archive resides in the current folder. After the command completes without any error messages, you can examine the contents of the folder once again using the ls command. Among the extracted files there will be a new script, install.sh.
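
Run the installation script; assuming elevated privileges are required for a system-wide installation, a typical invocation is:

sudo bash install.sh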

Accept the Microsoft license agreement to continue with the installation:

Accept the request to install the Microsoft R Open package as well as the Intel Math Kernel Library (MKL) package:

The following figure illustrates the messages that confirm the installation of the packages:

Launch Microsoft R Open using the standard command, R, as shown below. The preamble will be repeated every time the package is launched:
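
R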

Frame the problem
This introductory exercise illustrates the development of a very simple prediction model to assist the business with demand estimation based on independent variables (e.g. snow conditions, day of the week). There are many, many models to choose from for the prediction, but for the limited purpose of an introductory exercise only two basic models are shown. There are certainly other models that can provide more practical predictions.

Extract, Transform and Load Data
Extract, Transform and Load (ETL) is the classic paradigm for analysis of structured data. Of course, the concepts have been applied to unstructured data too, but that topic is beyond the scope of the current article. The sample data for this exercise is available at:

raw.githubusercontent.com/davidsalgado/BlogSamples/master/FirstPredictiveModelWithR/RentalFeatures.txt

The steps below illustrate the use of the conventional wget command to retrieve the dataset to a local folder named labs, and then list the files in the folder after the download. You can use any alternate method that you prefer to download the data set to the current folder.
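
Assuming the labs folder does not already exist, the sequence might look like this, using the raw GitHub URL quoted above:

mkdir labs
cd labs
wget https://raw.githubusercontent.com/davidsalgado/BlogSamples/master/FirstPredictiveModelWithR/RentalFeatures.txt
ls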

The sample data set has been graciously made available at the URL above. I chose the word graciously deliberately. Nearly ten years ago, I wanted to emulate a market basket analysis case study from a textbook using Microsoft SSAS 2005 with the Naïve Bayes model. I asked the author to clarify some key omissions in the textbook data. He declined, stating that it would be unfair for me to audit his work!

Assign the dataset that resides in the current folder to a local variable, mydata, by loading it with the read.table method and specifying the presence of column names in the first row of the data set with the header parameter set to TRUE. If the data set resides in another folder, you will have to use a fully qualified path with Linux conventions.

mydata = read.table("RentalFeatures.txt", header=TRUE)

Listing a few rows of data with the head command makes it apparent that some of the column values ought to represent categories rather than numerical values. The simplest approach is to append additional columns for category analysis by converting some of the numerical columns to factors. The three key columns for this exercise are:

• Holiday
• Snow
• WeekDay

The command to convert the numerical values of each of these columns to categories is factor, as shown below:

mydata$FHoliday = factor(mydata$Holiday)
mydata$FSnow = factor(mydata$Snow)
mydata$FWeekDay = factor(mydata$WeekDay)

You can view the levels for each factor assignment using the levels command as shown below:

levels(mydata$FHoliday)
levels(mydata$FSnow)
levels(mydata$FWeekDay)
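
Based on the data description in this exercise, the output should resemble the following (Holiday and Snow are binary indicators, and WeekDay runs from Sunday through Saturday):

[1] "0" "1"
[1] "0" "1"
[1] "1" "2" "3" "4" "5" "6" "7"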

Split the data into two subsets: one for training the model to estimate its parameters and the other to validate it, using your preferred nomenclature. In this exercise the names are train_data and test_data, respectively:

train_data = mydata[mydata$Year < 2015,]
test_data = mydata[mydata$Year == 2015,]
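
A quick row count verifies the split; this is a simple sanity check added here for illustration, not part of the original walkthrough:

nrow(train_data)
nrow(test_data)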

The data in test_data is expressly omitted during model training; it is used only to assess the model.

The application of a very simple selection criterion (e.g. Year) is sufficient for the purposes of an introductory illustration in this exercise. Practical uses of the splitting process require more complex filtering criteria to reduce intrinsic bias.

In this exercise, the data prior to year 2015 serves as the training set for the model and the data for year 2015 is used to test it. It is also helpful to retain the actual rental counts from the test set for assessing the quality of the predictions; the variable test_counts assigned below is used in the modeling steps:
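
test_counts = test_data$RentalCount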

Prepare the data
While ETL operations performed using a systematic workflow may be sufficient for introductory exercises, there is often a need for further operations on the data before the analysis can start. You may encounter discussions on schemata or data normalization. This introductory exercise assumes that the ETL operations conducted so far are sufficient to proceed with the subsequent steps.

A summary of the data is available with the summary method as shown below:
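
summary(mydata)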

The summary information provides basic insight into the range of the data for subsequent analysis. You can list a few rows of the data set using the head command as shown below:
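
head(mydata)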

Exploratory Data Analysis
Assuming that the current data set is ready for modeling exercises, data visualization is a key step in examining the important aspects of a data model. The simplest visualization is to plot two columns from the data set as follows:

plot(mydata$FSnow, mydata$RentalCount)
title("Snow versus Rental Count")

The plots are saved in the file, Rplots.pdf, in the current folder.

The title command illustrates one of the many options for embellishing the plot. The snow values were factored into the FSnow category in a previous step. The corresponding chart, shown below, illustrates the higher demand for rentals on snow days (x-axis: 0 = no snow, 1 = snow present):

In order to visualize the rental trends on days of the week, the following command may be used:

plot(mydata$FWeekDay, mydata$RentalCount)
title("Week Day versus Rental Count")

The initial examination of the rental count for each week day (factored as FWeekDay in a previous step) shows perceptibly higher demand for rentals on Sunday (factor category "1") and Saturday (factor category "7") than on other days of the week in the chart below:

Another useful chart shows the demand for rentals over a time period, which requires converting the calendar columns to an ISO date format while plotting. Of course, there are many other parametric variations that you may want to consider for visualization purposes. The basic example is:

plot(ISOdate(mydata$Year, mydata$Month, mydata$Day), mydata$RentalCount)
title("Rental Count for a time period")

The seasonality of the rentals is evident in the corresponding chart shown below:

Modeling
The simplest model for examining the relationship between the variables is linear regression. The commands below use the training data set to fit the model and then use the test data set to validate the fit, which is subsequently represented in chart format for quick visualization:

model1 = lm(RentalCount ~ Month + Day + FWeekDay + FSnow + FHoliday, train_data)
p1 = predict(model1, test_data)
plot(p1 - test_counts)
title("Linear regression model")

The syntax of the model formula is thoroughly covered in the R documentation, which should be consulted for the details. For the purposes of this exercise, only the basic formulation of the linear regression model is used. The corresponding chart is shown below:

Viewing the chart, it is readily apparent that there are noticeably large errors for many records. This observation should serve as a cue to evaluate alternate models. A basic approach in data mining is recursive partitioning, which explores the structure of data sets. The rpart method will be used for the limited purposes of this exercise to illustrate the classification and regression trees that support the model through step-wise refinement. The rpart package ships with standard distributions of R; load it into the session with the following command:

require(rpart)

The commands for the model using classification trees are as follows:

model2 = rpart(RentalCount ~ Month + Day + FWeekDay + FSnow + FHoliday, train_data)
p2 = predict(model2, test_data)
plot(p2 - test_counts)
title("Rpart (Classification Tree) model")

The ensuing chart illustrates that classification trees produce a better model than vanilla linear regression: the error residuals are much smaller in magnitude.

Forecasting
The preferred model can be used for predictions using the predict command as shown below:

predict(model2, data.frame(Month=1, Day=1, FWeekDay=factor(1), FSnow=factor(0), FHoliday=factor(0)))

Some examples are shown in the figure below; the rentals are clearly higher during holidays:

For practical purposes, the preferred usage would be to pass the independent parameters to a custom method that returns the computed value, as sketched below. Before committing to this approach, it would be prudent to refine the model for more robust performance.
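
A minimal sketch of such a wrapper is shown below; the function name predict_rentals and the alignment of the factor levels with those in mydata are illustrative assumptions, not part of the original exercise:

# Hypothetical convenience wrapper around the preferred model (model2).
# Factor levels are aligned with those of the training data so that the
# inputs are encoded consistently with the model.
predict_rentals = function(month, day, weekday, snow, holiday) {
  newdata = data.frame(
    Month = month,
    Day = day,
    FWeekDay = factor(weekday, levels = levels(mydata$FWeekDay)),
    FSnow = factor(snow, levels = levels(mydata$FSnow)),
    FHoliday = factor(holiday, levels = levels(mydata$FHoliday))
  )
  predict(model2, newdata)
}

# Example: estimate rentals for a snowy holiday Saturday in December
predict_rentals(month = 12, day = 26, weekday = 7, snow = 1, holiday = 1)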

References
• Data Science for Developers: Build your first predictive model with R, David Salgado
• Getting Started with R, MRAN
• An Introduction to R, W N Venables, D M Smith and the R Core Team
• Applied Predictive Modeling in R, Max Kuhn
• Bash on Windows Installation Guide, Jack Hammons
