In this introduction we will explore some useful R functions for data preparation, then look briefly at clustering and classification.

In [1]:
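#1.  First read the data from a CSV file into an R dataframe
#    (weather_csv is a placeholder: the file path is not shown in this listing)
W_df_orig = read.csv(weather_csv,
                     sep=",",
                     stringsAsFactors = TRUE)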
dim(W_df_orig)

1. 366
2. 24

Quick way to view some rows:

In [2]:
head(W_df_orig)

Date        Location  MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine  WindGustDir  WindGustSpeed  WindDir9am  Humidity3pm  Pressure9am  Pressure3pm  Cloud9am  Cloud3pm  Temp9am  Temp3pm  RainToday  RISK_MM  RainTomorrow
2007-11-01  Canberra  8.0      24.3     0.0       3.4          6.3       NW           30             SW          29           1019.7       1015.0       7         7         14.4     23.6     No         3.6      Yes
2007-11-02  Canberra  14.0     26.9     3.6       4.4          9.7       ENE          39             E           36           1012.4       1008.4       5         3         17.5     25.7     Yes        3.6      Yes
2007-11-03  Canberra  13.7     23.4     3.6       5.8          3.3       NW           85             N           69           1009.5       1007.2       8         7         15.4     20.2     Yes        39.8     Yes
2007-11-04  Canberra  13.3     15.5     39.8      7.2          9.1       NW           54             WNW         56           1005.5       1007.0       2         7         13.5     14.1     Yes        2.8      Yes
2007-11-05  Canberra  7.6      16.1     2.8       5.6          10.6      SSE          50             SSE         49           1018.3       1018.5       7         7         11.1     15.4     Yes        0.0      No
2007-11-06  Canberra  6.2      16.9     0.0       5.8          8.2       SE           44             SE          57           1023.8       1021.7       7         5         10.9     14.8     No         0.2      No
In [3]:
tail(W_df_orig)      #W_df is not created until the subsetting step, so view the original data frame here

In [ ]:
str(W_df_orig)      #Quick view of the basic 'structure' of the data frame


Subsetting rows and columns

In [4]:
#1 Keep rows that are NOT missing data
keep_ind  = complete.cases(W_df_orig)
W_df      = W_df_orig[keep_ind,]

Y=as.numeric(W_df[,'RainTomorrow'])  #save this for later: factor levels become 1 = 'No', 2 = 'Yes'

#2 subset with select is a good way to remove columns
W_df = subset(W_df, select=-c(RISK_MM,RainTomorrow))

dim(W_df)

1. 328
2. 22
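To make the two subsetting steps concrete, here is a minimal sketch on a made-up data frame. The name `toy` and its columns are hypothetical and are not part of the weather data.

```r
# Hypothetical toy data frame, for illustration only
toy <- data.frame(id   = 1:4,
                  temp = c(20.1, NA, 18.3, 22.0),
                  rain = c("No", "Yes", NA, "No"))

keep <- complete.cases(toy)      # TRUE where a row has no NA in any column
toy_complete <- toy[keep, ]      # keeps rows 1 and 4 only

# select = -c(...) drops the named columns and keeps the rest
toy_small <- subset(toy_complete, select = -c(rain))

dim(toy_small)                   # 2 rows, 2 columns
```

Note that `complete.cases` drops a row if any single column is missing, which is why the weather data shrinks from 366 to 328 rows.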
In [5]:
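# use as.Date( ) to convert strings to dates
W_df_date <- as.Date(W_df[,'Date'])
# W_df has only 328 rows after removing incomplete cases, so index 366 would give NA;
# take the last element instead
W_df_date[length(W_df_date)]-W_df_date[1]    #now you can subtract dates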
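The following sketch shows date subtraction on a few made-up date strings (the vector `d` is hypothetical). It also shows that indexing a vector past its last element returns NA, and so does any date arithmetic built on it.

```r
# Hypothetical date strings in the same YYYY-MM-DD format as the Date column
d <- as.Date(c("2007-11-01", "2007-11-06", "2008-01-15"))

span <- d[length(d)] - d[1]      # difftime between last and first date
span                             # Time difference of 75 days
as.numeric(span)                 # plain number of days: 75

past_end <- d[4] - d[1]          # d has only 3 elements, so this is NA
gaps <- diff(d)                  # gaps between consecutive dates
```

Using `length(d)` rather than a hard-coded index keeps the calculation valid after rows have been dropped.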

Data preparation snippet: reshaping a dataframe with the 'dcast' command

Now, imagine that for each day we want to list a measurement for each wind direction, all in the same row. You might think of it as setting up an ANOVA where each factor level is its own variable.

Run this section and notice what the new rows look like. Where are the new columns?

In [6]:
library(reshape2)

# long to wide: ie 'cast' repeated measures into a wide table
W_long   =dcast(W_df,
                formula=Date+Location+ ...~ WindGustDir,
                # the ' ... '  means take all the rest of the columns as the index
                # WindGustDir entries are labels for the repeated measures
                fill=0,     #fill in empty cells; this could be 0 or NA, for example.
                value.var="WindGustSpeed")
                #this variable has the repeated measurement values

head(W_long)

#optional: write.csv(W_long,file='Weather_castwide.csv')

Date        Location  MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine  WindDir9am  WindDir3pm  WindSpeed9am  NNW  NW   S    SE   SSE  SSW  SW   W    WNW  WSW
2007-11-01  Canberra  8.0      24.3     0.0       3.4          6.3       SW          NW          6             0    30   0    0    0    0    0    0    0    0
2007-11-02  Canberra  14.0     26.9     3.6       4.4          9.7       E           W           4             0    0    0    0    0    0    0    0    0    0
2007-11-03  Canberra  13.7     23.4     3.6       5.8          3.3       N           NNE         6             0    85   0    0    0    0    0    0    0    0
2007-11-04  Canberra  13.3     15.5     39.8      7.2          9.1       WNW         W           30            0    54   0    0    0    0    0    0    0    0
2007-11-05  Canberra  7.6      16.1     2.8       5.6          10.6      SSE         ESE         20            0    0    0    0    50   0    0    0    0    0
2007-11-06  Canberra  6.2      16.9     0.0       5.8          8.2       SE          E           20            0    0    0    44   0    0    0    0    0    0
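The same long-to-wide idea is easier to see on a tiny example. This is a sketch on a made-up data frame (`toy`, `day`, `dir` and `speed` are hypothetical names), using the same `dcast` arguments as above.

```r
library(reshape2)

# Hypothetical toy data: one gust reading per day, labelled by direction
toy <- data.frame(day   = c("d1", "d2", "d3"),
                  dir   = c("NW", "SE", "NW"),
                  speed = c(30, 44, 85))

# One row per day, one column per direction; cells with no reading get 0
toy_wide <- dcast(toy, day ~ dir, value.var = "speed", fill = 0)
toy_wide
```

Each day keeps a single row, and its speed appears only under the column for its own direction. Every other direction column holds the fill value.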

Get factors using SVD

In [7]:
#1 Get numeric columns only
cols_numeric = sapply(W_df,is.numeric)   #named logical vector: TRUE for each numeric column
W_dfnum       = W_df[,which(cols_numeric)]
dim(W_dfnum)

1. 328
2. 16
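As a small illustration of how the numeric-column filter behaves, here is a sketch on a made-up data frame with one column of each kind (`toy` and its column names are hypothetical).

```r
# Hypothetical toy data frame with a double, a factor and an integer column
toy <- data.frame(n = c(1.5, 2.5),
                  f = factor(c("a", "b")),
                  i = 1:2)

is_num <- sapply(toy, is.numeric)      # named logical vector, one entry per column
is_num                                 # factors are not numeric; integers are

# Logical indexing works directly, so which() is optional;
# drop = FALSE keeps a data frame even if only one column survives
toy_num <- toy[, is_num, drop = FALSE]
names(toy_num)
```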
In [8]:
#2 turn it into a matrix
W_matrix = as.matrix(W_dfnum)

In [9]:
#3 mean center data
W_mncntr=scale(W_dfnum,center=TRUE,scale=FALSE)

In [10]:
#4 run SVD command
Wsvd=svd(W_mncntr)
str(Wsvd)

List of 3
 $ d: num [1:16] 367 307 215 164 109 ...
 $ u: num [1:328, 1:16] -0.03129 -0.01506 0.03569 0.00638 0.00357 ...
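To see what the three pieces returned by `svd` mean, here is a minimal sketch on a small random matrix. The data and all variable names are made up for illustration. It checks that the factors reproduce the centered data, and that projecting onto the columns of `v` gives the columns of `u` scaled by `d`.

```r
set.seed(1)                                   # reproducible toy data
X  <- matrix(rnorm(20 * 3), nrow = 20)        # hypothetical 20 x 3 data matrix
Xc <- scale(X, center = TRUE, scale = FALSE)  # mean-center each column

s <- svd(Xc)                                  # Xc = U diag(d) V^T

# The factorisation reproduces the centered data (error close to 0)
recon_err <- max(abs(Xc - s$u %*% diag(s$d) %*% t(s$v)))

# Projecting onto the first 2 right singular vectors equals U[, 1:2] scaled by d[1:2]
proj     <- Xc %*% s$v[, 1:2]
proj_err <- max(abs(proj - s$u[, 1:2] %*% diag(s$d[1:2])))

# Share of the total variance carried by each component
var_share <- s$d^2 / sum(s$d^2)
var_share
```

The squared singular values are proportional to the variance along each component, so `var_share` is a quick guide to how many components are worth keeping.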

W_proj = as.matrix(W_mncntr) %*% Wsvd$v[,1:3]    #project data onto 3 components

plot(W_proj[,1],W_proj[,2],col=colassignments,main='data pts project to 1,2 SVD components, colored by kmeans')

# to plot center points, first project them into components
c3 = k4$centers%*% Wsvd$v[,1:3]
points(c3[,1],c3[,2],pch=8,cex=2)

In [12]:
#Y was created above, use it to select 2 colors
#get class assignment in colors
colassignments = col2use[Y]
plot(W_proj[,1],W_proj[,2],col=colassignments,main='data pts project to 1,2 SVD components, colored by class')

Get simple classification model results

Let's run a linear model and make up a simple decision threshold, giving a classification model of whether it will 'Rain Tomorrow'. Recall that the variable Y holds the 'Rain Tomorrow' outcomes.

In [13]:
linmodel_result = lm(Y~.,data=W_dfnum)
#Y is either 1 or 2, so let 1.5 be the cutoff of 'NO' rain prediction or 'YES'
#Note the 'Y~.' is R's formula notation for Y=f( everything else)

In [14]:
summary(Y)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.000   1.000   1.183   1.000   2.000
In [15]:
#get model predictions (more generally we would use a test set to get prediction accuracy estimates)
Y_pred=linmodel_result$fitted.values

#get the indices of predictions NO vs YES
Y_pred1_indices   =which(Y_pred<1.5)
Y_pred2_indices   =which(Y_pred>=1.5)

#set up No,Yes predictions
Y_pred_class                 =matrix(1,length(Y),1)
Y_pred_class[Y_pred2_indices]=2

#show a confusion matrix
table(Y,Y_pred_class)

   Y_pred_class
Y     1   2
  1 259   9
  2  31  29
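A confusion matrix is usually summarised with a few rates. The sketch below copies the four counts from the table above into a matrix (the names `cm`, `accuracy`, `sensitivity` and `specificity` are introduced here for illustration) and assumes, as in the cutoff comment, that class 1 means 'No' and class 2 means 'Yes'.

```r
# Counts copied from the confusion matrix above
# rows = actual Y, columns = predicted class (1 = No, 2 = Yes)
cm <- matrix(c(259, 31, 9, 29), nrow = 2,
             dimnames = list(Y = c("1", "2"), Y_pred_class = c("1", "2")))

accuracy    <- sum(diag(cm)) / sum(cm)        # (259 + 29) / 328
sensitivity <- cm["2", "2"] / sum(cm["2", ])  # rainy days correctly predicted
specificity <- cm["1", "1"] / sum(cm["1", ])  # dry days correctly predicted

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

Accuracy comes out at about 0.88, but only about 48% of the rainy days are caught (29 of 60). With roughly 18% rainy days in the data, accuracy alone flatters the model.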
In [16]:
#Now color the predictions onto the 2 SVD dimensions

colassignments = col2use[Y_pred_class]

plot(W_proj[,1],W_proj[,2],col=colassignments,main='data pts project to 1,2 SVD components, colored by PREDICTED class')
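
The prediction cell notes that a test set would normally be used to estimate accuracy. A minimal sketch of that idea, on simulated data rather than the weather data (every name and number here is made up), might look like this:

```r
set.seed(42)                                   # reproducible simulated data
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
# Hypothetical outcome coded 1/2, like Y above
y  <- ifelse(x1 + 0.5 * x2 + rnorm(n, sd = 0.5) > 0.8, 2, 1)
toy <- data.frame(y, x1, x2)

test_idx <- sample(n, size = 60)               # hold out 60 rows for testing
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

fit  <- lm(y ~ ., data = train)                # fit on the training rows only
pred <- ifelse(predict(fit, newdata = test) >= 1.5, 2, 1)   # same 1.5 cutoff

table(actual = test$y, predicted = pred)       # confusion matrix on unseen rows
mean(pred == test$y)                           # held-out accuracy
```

Because the model never sees the held-out rows while fitting, the resulting accuracy is a fairer estimate than one computed from `fitted.values`.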