In this tutorial I will explain how to build a logistic regression model and test it. First we'll load a data set; it comes from Kaggle. The data set is compressed and we are going to read it directly from the zip file (the zip is 4 MB versus 24 MB uncompressed, so this saves download time).
temp <- tempfile()
download.file("https://github.com/minorsmart/FEB2017/blob/master/Witek/doctor/No-show-Issue-Comma-300k.csv.zip?raw=true",temp)
trying URL 'https://github.com/minorsmart/FEB2017/blob/master/Witek/doctor/No-show-Issue-Comma-300k.csv.zip?raw=true'
Content type 'application/zip' length 4368493 bytes (4.2 MB)
==================================================
downloaded 4.2 MB
data <- read.csv(unz(temp, "No-show-Issue-Comma-300k.csv"))
unlink(temp)
Now that we have the data set in a data frame, we can start preparing it for the model. We'll first load some useful packages and then look at the structure and quality of the data.
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(tidyr)
library(magrittr)
Attaching package: ‘magrittr’
The following object is masked from ‘package:tidyr’:
extract
library(caTools)
library(biglm)
Loading required package: DBI
library(caret)
Loading required package: lattice
Loading required package: ggplot2
Error in names(frame)[names(frame) == "x"] <- name :
names() applied to a non-vector
str(data)
'data.frame': 300000 obs. of 15 variables:
$ Age : int 19 24 4 5 38 5 46 4 20 51 ...
$ Gender : Factor w/ 2 levels "F","M": 2 1 1 2 2 1 1 1 1 1 ...
$ AppointmentRegistration: Factor w/ 295425 levels "2013-05-29T15:14:11Z",..: 152112 248200 23071 87268 274612 71585 225305 244749 125184 70354 ...
$ ApointmentData : Factor w/ 534 levels "2014-01-02T00:00:00Z",..: 290 440 36 164 490 152 408 442 232 125 ...
$ DayOfTheWeek : Factor w/ 7 levels "Friday","Monday",..: 7 7 6 5 6 6 6 1 6 6 ...
$ Status : Factor w/ 2 levels "No-Show","Show-Up": 2 2 2 2 2 1 2 2 2 2 ...
$ Diabetes : int 0 0 0 0 0 0 0 0 0 1 ...
$ Alcoolism : int 0 0 0 0 0 0 0 0 0 0 ...
$ HiperTension : int 0 0 0 0 0 0 0 0 0 1 ...
$ Handcap : int 0 0 0 0 0 0 0 0 0 0 ...
$ Smokes : int 0 0 0 0 0 0 0 0 0 0 ...
$ Scholarship : int 0 0 0 0 0 0 0 1 0 0 ...
$ Tuberculosis : int 0 0 0 0 0 0 0 0 0 0 ...
$ Sms_Reminder : int 0 0 0 1 1 1 1 1 0 1 ...
$ AwaitingTime : int -29 -1 -1 -15 -6 -35 -18 -14 -14 -4 ...
summary(data)
Age Gender AppointmentRegistration ApointmentData
Min. : -2.00 F:200505 2015-04-15T14:19:34Z: 9 2014-10-22T00:00:00Z: 759
1st Qu.: 19.00 M: 99495 2014-03-19T17:10:21Z: 8 2014-09-03T00:00:00Z: 756
Median : 38.00 2014-07-03T08:50:50Z: 8 2014-11-24T00:00:00Z: 752
Mean : 37.81 2015-05-27T13:57:09Z: 8 2014-11-19T00:00:00Z: 742
3rd Qu.: 56.00 2013-12-23T12:40:00Z: 7 2015-05-25T00:00:00Z: 741
Max. :113.00 2014-01-07T15:42:57Z: 7 2014-09-29T00:00:00Z: 739
(Other) :299953 (Other) :295511
DayOfTheWeek Status Diabetes Alcoolism HiperTension Handcap
Friday :52771 No-Show: 90731 Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.00000
Monday :59298 Show-Up:209269 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
Saturday : 1393 Median :0.00000 Median :0.00000 Median :0.0000 Median :0.00000
Sunday : 6 Mean :0.07797 Mean :0.02501 Mean :0.2159 Mean :0.02052
Thursday :60262 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
Tuesday :62775 Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :4.00000
Wednesday:63495
Smokes Scholarship Tuberculosis Sms_Reminder AwaitingTime
Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :-398.00
1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.: -20.00
Median :0.00000 Median :0.0000 Median :0.00000 Median :1.0000 Median : -8.00
Mean :0.05237 Mean :0.0969 Mean :0.00045 Mean :0.5742 Mean : -13.84
3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.: -4.00
Max. :1.00000 Max. :1.0000 Max. :1.00000 Max. :2.0000 Max. : -1.00
The data set comprises 300,000 recorded doctor's appointments with 15 variables. There are many categorical variables, most of them boolean (0 or 1). We want to predict whether a patient will show up for a doctor's appointment (the Status variable).
We see some peculiar ages, such as 113 and -2 years. We'll keep only records with an age above 5 and below 95.
Next we will recode the Status variable as a 0/1 indicator (1 = Show-Up). This makes the predicted value of the logistic regression easy to interpret as a show-up probability.
Finally we will convert all categorical variables into factors and do a visual inspection of the data.
data <- data[,-3]                                   # drop AppointmentRegistration
data <- separate(data, ApointmentData, c("Date", "Time"), sep = "T")
data <- separate(data, Date, c("Year", "Month", "Day"), sep = "-")
data$Year <- as.factor(data$Year)
data <- data[,-6]                                   # drop the (constant) Time part
data <- filter(data, Age > 5, Age < 95)
data$Status <- (data$Status == "Show-Up")*1         # 1 = Show-Up, 0 = No-Show
cols <- names(data)[4:15]
data %<>% mutate(across(all_of(cols), factor))      # categorical columns to factors
str(data)
'data.frame': 272070 obs. of 16 variables:
$ Age : int 19 24 38 46 20 51 33 58 62 62 ...
$ Gender : Factor w/ 2 levels "F","M": 2 1 2 1 1 1 2 1 1 1 ...
$ Year : Factor w/ 2 levels "2014","2015": 2 2 2 2 1 1 2 1 1 1 ...
$ Month : Factor w/ 12 levels "01","02","03",..: 1 8 10 7 10 6 11 5 8 7 ...
$ Day : Factor w/ 31 levels "01","02","03",..: 14 19 27 7 28 17 9 22 11 14 ...
$ DayOfTheWeek: Factor w/ 7 levels "Friday","Monday",..: 7 7 6 6 6 6 2 5 2 2 ...
$ Status : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 1 2 ...
$ Diabetes : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 2 1 ...
$ Alcoolism : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ HiperTension: Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 2 2 1 ...
$ Handcap : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Smokes : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Scholarship : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
$ Tuberculosis: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Sms_Reminder: Factor w/ 3 levels "0","1","2": 1 1 2 2 1 2 1 2 1 2 ...
$ AwaitingTime: int -29 -1 -6 -18 -14 -4 -39 -22 -17 -6 ...
summary(data)
Age Gender Year Month Day DayOfTheWeek Status
Min. : 6.00 F:186972 2014:139462 07 : 24847 10 : 10922 Friday :48050 0: 81684
1st Qu.:24.00 M: 85098 2015:132608 10 : 24679 17 : 10005 Monday :53796 1:190386
Median :42.00 05 : 23606 09 : 9981 Saturday : 1352
Mean :41.41 08 : 23547 24 : 9909 Sunday : 6
3rd Qu.:57.00 09 : 23430 14 : 9745 Thursday :54259
Max. :94.00 06 : 23217 16 : 9702 Tuesday :57042
(Other):128744 (Other):211806 Wednesday:57565
Diabetes Alcoolism HiperTension Handcap Smokes Scholarship Tuberculosis Sms_Reminder
0:248732 0:264576 0:207456 0:266572 0:256366 0:245217 0:271940 0:116897
1: 23338 1: 7494 1: 64614 1: 5023 1: 15704 1: 26853 1: 130 1:154437
2: 435 2: 736
3: 34
4: 6
AwaitingTime
Min. :-398.0
1st Qu.: -20.0
Median : -8.0
Mean : -13.7
3rd Qu.: -4.0
Max. : -1.0
barplot(sum(is.na(data)) / nrow(data))  # overall fraction of missing values
selCols <- c("Age", "Gender", "Year", "Month", "DayOfTheWeek", "Diabetes", "Alcoolism", "HiperTension", "Handcap", "Smokes", "Scholarship", "Tuberculosis", "Sms_Reminder", "AwaitingTime")
for(i in selCols) {
counts <- table(data$Status, data[, i])
barplot(counts, col=c("red","darkblue"), xlab = i, legend = rownames(counts))
}
The cleaning deleted around 28,000 records. We will now split the data frame into two sets: one for training the model and one for testing it. When we split, we need to make sure each set has enough variability in the predicted variable (Status); otherwise the model either cannot learn properly or cannot be tested meaningfully (imagine all values of Status in one of the subsets were 1 or 0). This is where the function sample.split() from the caTools package comes in: it performs a stratified split that preserves the class proportions in both subsets.
data$spl <- sample.split(data$Status, SplitRatio=0.70)
train <- subset(data, data$spl==TRUE)
test <- subset(data, data$spl==FALSE)
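To see what the stratified split buys us, here is a small base-R sketch (synthetic data, not the appointment set) of the idea behind sample.split(): sampling within each outcome class separately preserves the class proportions in both subsets.

```r
# Base-R sketch of a stratified 70/30 split on a synthetic 30/70 outcome,
# mimicking the Status variable. caTools::sample.split works similarly.
set.seed(42)
status <- rep(c(0, 1), times = c(300, 700))
idx <- unlist(lapply(split(seq_along(status), status),
                     function(i) sample(i, round(0.7 * length(i)))))
# Both parts keep roughly 30% zeros and 70% ones:
prop.table(table(status[idx]))    # training part
prop.table(table(status[-idx]))   # test part
```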
Now we can train a model using a logistic regression and have a look at the parameters.
reg.model <- glm(Status ~ . - spl, data = train, family = binomial)
summary(reg.model)
Call:
glm(formula = Status ~ . - spl, family = binomial, data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9676 -1.3898 0.7560 0.8718 2.3050
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.5609091 0.0416755 13.459 < 2e-16 ***
Age 0.0119706 0.0002943 40.672 < 2e-16 ***
GenderM -0.0216609 0.0110916 -1.953 0.05083 .
Year2015 -0.0201924 0.0102738 -1.965 0.04936 *
Month02 -0.1202732 0.0267946 -4.489 7.17e-06 ***
Month03 -0.0806014 0.0262810 -3.067 0.00216 **
Month04 -0.1485641 0.0262134 -5.667 1.45e-08 ***
Month05 -0.0598902 0.0256415 -2.336 0.01951 *
Month06 -0.1111399 0.0259940 -4.276 1.91e-05 ***
Month07 -0.1612490 0.0251920 -6.401 1.55e-10 ***
Month08 -0.0560830 0.0259257 -2.163 0.03052 *
Month09 -0.0448688 0.0261151 -1.718 0.08578 .
Month10 -0.1147210 0.0252488 -4.544 5.53e-06 ***
Month11 -0.0713707 0.0262503 -2.719 0.00655 **
Month12 -0.2633343 0.0264240 -9.966 < 2e-16 ***
Day02 0.0351254 0.0428283 0.820 0.41213
Day03 0.0095578 0.0419918 0.228 0.81995
Day04 0.0797694 0.0435459 1.832 0.06697 .
Day05 0.0783902 0.0442724 1.771 0.07662 .
Day06 -0.0648963 0.0422072 -1.538 0.12415
Day07 0.0128978 0.0426370 0.303 0.76227
Day08 -0.0318983 0.0437601 -0.729 0.46604
Day09 0.0247753 0.0412723 0.600 0.54831
Day10 0.0109201 0.0406658 0.269 0.78829
Day11 0.0403618 0.0420872 0.959 0.33756
Day12 0.0709767 0.0437583 1.622 0.10480
Day13 0.0827147 0.0431609 1.916 0.05531 .
Day14 0.0536390 0.0416005 1.289 0.19726
Day15 -0.0137989 0.0416458 -0.331 0.74039
Day16 0.1265090 0.0418264 3.025 0.00249 **
Day17 0.0489550 0.0414607 1.181 0.23770
Day18 0.1293072 0.0433217 2.985 0.00284 **
Day19 0.0149361 0.0433188 0.345 0.73025
Day20 0.0470219 0.0430391 1.093 0.27460
Day21 0.0818781 0.0435708 1.879 0.06022 .
Day22 0.0652958 0.0421746 1.548 0.12157
Day23 0.0945742 0.0421226 2.245 0.02475 *
Day24 0.0488648 0.0416365 1.174 0.24055
Day25 0.0563634 0.0432216 1.304 0.19221
Day26 0.0664785 0.0434105 1.531 0.12567
Day27 0.0472284 0.0428468 1.102 0.27035
Day28 0.0288835 0.0424642 0.680 0.49639
Day29 -0.1729751 0.0415410 -4.164 3.13e-05 ***
Day30 -0.1352923 0.0421204 -3.212 0.00132 **
Day31 -0.0383209 0.0531132 -0.721 0.47061
DayOfTheWeekMonday -0.0672555 0.0167797 -4.008 6.12e-05 ***
DayOfTheWeekSaturday -0.4599917 0.0692925 -6.638 3.17e-11 ***
DayOfTheWeekSunday 0.6566065 1.0992512 0.597 0.55029
DayOfTheWeekThursday 0.0547268 0.0166960 3.278 0.00105 **
DayOfTheWeekTuesday 0.0676607 0.0166461 4.065 4.81e-05 ***
DayOfTheWeekWednesday 0.0285228 0.0165414 1.724 0.08465 .
Diabetes1 -0.0130508 0.0209318 -0.623 0.53296
Alcoolism1 -0.3469510 0.0321346 -10.797 < 2e-16 ***
HiperTension1 0.0234370 0.0150994 1.552 0.12062
Handcap1 -0.0533033 0.0384266 -1.387 0.16540
Handcap2 0.0769533 0.1304250 0.590 0.55518
Handcap3 0.8397667 0.5556436 1.511 0.13070
Handcap4 0.6974461 1.1247666 0.620 0.53520
Smokes1 -0.2223364 0.0229172 -9.702 < 2e-16 ***
Scholarship1 -0.1800135 0.0166321 -10.823 < 2e-16 ***
Tuberculosis1 -0.1715891 0.2141520 -0.801 0.42299
Sms_Reminder1 0.1054209 0.0110856 9.510 < 2e-16 ***
Sms_Reminder2 0.0039772 0.0976678 0.041 0.96752
AwaitingTime 0.0102560 0.0003296 31.115 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 232752 on 190448 degrees of freedom
Residual deviance: 228520 on 190385 degrees of freedom
AIC: 228648
Number of Fisher Scoring iterations: 4
We can test how well the model performs by comparing its predictions with the actual values. predict() with type = 'response' returns a predicted probability, which we turn into a 0/1 prediction using a cutoff of a = 0.5.
a <- 0.5
test$predict <- predict(reg.model, newdata=test, type='response')
test$predictb <- ifelse(test$predict > a,1,0)
misClassError <- mean(test$predictb != test$Status)
print(paste('Accuracy', 1 - misClassError))
[1] "Accuracy 0.699268570588451"
cfnMat <- confusionMatrix(factor(test$predictb, levels = levels(test$Status)), test$Status)
print(cfnMat)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 189 230
1 24316 56886
Accuracy : 0.6993
95% CI : (0.6961, 0.7024)
No Information Rate : 0.6998
P-Value [Acc > NIR] : 0.6245
Kappa : 0.0051
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.007713
Specificity : 0.995973
Pos Pred Value : 0.451074
Neg Pred Value : 0.700549
Prevalence : 0.300229
Detection Rate : 0.002316
Detection Prevalence : 0.005133
Balanced Accuracy : 0.501843
'Positive' Class : 0
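The headline percentages can be read straight off the 2×2 table; recomputing them by hand from the counts above:

```r
# Counts from the confusion matrix above (rows = prediction, cols = reference).
tp0 <- 189;   fp0 <- 230       # predicted 0: right / wrong
fp1 <- 24316; tp1 <- 56886     # predicted 1: wrong / right
ppv0 <- tp0 / (tp0 + fp0)              # correct when predicting 0
npv  <- tp1 / (fp1 + tp1)              # correct when predicting 1
acc  <- (tp0 + tp1) / (tp0 + fp0 + fp1 + tp1)
round(c(ppv0, npv, acc), 3)            # 0.451 0.701 0.699
```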
Unfortunately the model has hardly any predictive power. We can see from the confusion matrix that when it predicts 0 (= No-Show) it is right only about 45% of the time, and when it predicts 1 (= Show-Up) it is about 70% correct.
The No Information Rate equals the proportion of the majority class of the Status variable. So a trivial model that always predicts 1, regardless of any input, would already reach an accuracy of 70%, which is almost equal to the accuracy of the logistic model.
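To make the No Information Rate concrete, here is a tiny synthetic sketch (hypothetical numbers, not the appointment data) of such a constant model:

```r
# A degenerate "model" that always predicts 1 scores exactly the
# prevalence of 1s as its accuracy -- the No Information Rate.
set.seed(1)
status <- rbinom(10000, 1, 0.70)      # ~70% show-ups, like the data
always_one <- rep(1, length(status))
mean(always_one == status)            # close to 0.70
```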
Apparently the process behind attending (or missing) a doctor's appointment is not captured well by a linear logistic model. We should try a different approach, such as decision trees or neural networks.
First we'll try a simple tree model to see whether that improves accuracy. We'll use the rpart package for this. After that we will see how a basic neural network performs on the data.
library(rpart)
tree.model <- rpart(Status ~ . -spl, data = train, method = 'class', minsplit = 2, minbucket = 1)
summary(tree.model)
Call:
rpart(formula = Status ~ . - spl, data = train, method = "class",
minsplit = 2, minbucket = 1)
n= 190449
CP nsplit rel error xerror xstd
1 0 0 1 0 0
Node number 1: 190449 observations
predicted class=1 expected loss=0.3002326 P(node) =1
class counts: 57179 133270
probabilities: 0.300 0.700
bestcp <- tree.model$cptable[which.min(tree.model$cptable[,"xerror"]),"CP"]
tree.pruned <- prune(tree.model, cp = bestcp)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
fancyRpartPlot(tree.model)
Error in apply(model$frame$yval2[, yval2per], 1, function(x) x[1 + x[1]]) :
dim(X) must have a positive length
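The plotting error occurs because the fitted tree has no splits at all (note nsplit = 0 in the CP table above): with the default complexity parameter cp = 0.01, no single split improves the fit enough, so rpart returns a root-only tree and there is nothing to draw. A synthetic sketch (hypothetical data and parameter values) of how rpart.control can force a tree to grow:

```r
library(rpart)

# Synthetic two-class data with an obvious split at x = 0.5.
set.seed(7)
df <- data.frame(x = runif(500))
df$y <- factor(ifelse(df$x > 0.5, rbinom(500, 1, 0.8), rbinom(500, 1, 0.3)))

# Lowering cp lets rpart accept smaller improvements per split.
fit <- rpart(y ~ x, data = df, method = "class",
             control = rpart.control(cp = 0.001, minsplit = 2))
nrow(fit$frame) > 1   # TRUE: the tree now contains at least one split
```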
library(nnet)
train$Status <- as.factor(train$Status)
nn.model <- nnet(Status ~ . -spl, data=train, size=10, maxit=600)
# weights: 651
initial value 144749.304097
iter 10 value 115303.236946
iter 20 value 114832.634776
iter 30 value 114693.217041
iter 40 value 114286.327211
iter 50 value 113993.801649
iter 60 value 113662.052125
iter 70 value 113526.170687
iter 80 value 113464.557006
iter 90 value 113325.130883
iter 100 value 113172.711761
iter 110 value 112946.440811
iter 120 value 112842.026794
iter 130 value 112807.059693
iter 140 value 112781.404325
iter 150 value 112751.101211
iter 160 value 112732.559565
iter 170 value 112718.455782
iter 180 value 112705.196411
iter 190 value 112689.751047
iter 200 value 112672.035171
iter 210 value 112657.599566
iter 220 value 112639.365958
iter 230 value 112554.401492
iter 240 value 112524.463938
iter 250 value 112498.436166
iter 260 value 112479.519779
iter 270 value 112463.332356
iter 280 value 112449.571003
iter 290 value 112441.379271
iter 300 value 112435.902671
iter 310 value 112431.932632
iter 320 value 112429.265755
iter 330 value 112426.945935
iter 340 value 112426.219597
iter 350 value 112425.753640
iter 360 value 112425.362640
iter 370 value 112424.855230
iter 380 value 112422.017245
iter 390 value 112420.656325
iter 400 value 112420.240834
final value 112420.186441
converged
library(devtools)
source_url('https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r')
SHA-1 hash of file is 74c80bd5ddbc17ab3ae5ece9c0ed9beb612e87ef
#plot the model
plot.nnet(nn.model)
Loading required package: scales
Loading required package: reshape
Attaching package: ‘reshape’
The following object is masked from ‘package:tidyr’:
expand
The following object is masked from ‘package:dplyr’:
rename
test$predict <- NULL
test$predictb <- NULL
test$Status <- as.factor(test$Status)
test$predict <- predict(nn.model, newdata=test, type='class')
misClassError <- mean(test$predict != test$Status)
print(paste('Accuracy', 1 - misClassError))
[1] "Accuracy 0.702245745580182"
cfnMat <- confusionMatrix(factor(test$predict, levels = levels(test$Status)), test$Status)
print(cfnMat)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 737 535
1 23768 56581
Accuracy : 0.7022
95% CI : (0.6991, 0.7054)
No Information Rate : 0.6998
P-Value [Acc > NIR] : 0.06185
Kappa : 0.0284
Mcnemar's Test P-Value : < 2e-16
Sensitivity : 0.03008
Specificity : 0.99063
Pos Pred Value : 0.57940
Neg Pred Value : 0.70419
Prevalence : 0.30023
Detection Rate : 0.00903
Detection Prevalence : 0.01558
Balanced Accuracy : 0.51035
'Positive' Class : 0