In this tutorial I will explain how you can build a logistic regression model and test it. First we'll load a data set; it comes from [Kaggle](https://www.kaggle.com/joniarroba/noshowappointments) (Medical Appointment No Shows). The data set is compressed, and we will read it directly from the zip file (the zip is 4 MB versus 24 MB uncompressed, so this saves considerable download time).

temp <- tempfile()
download.file("https://github.com/minorsmart/FEB2017/blob/master/Witek/doctor/No-show-Issue-Comma-300k.csv.zip?raw=true",temp)
trying URL 'https://github.com/minorsmart/FEB2017/blob/master/Witek/doctor/No-show-Issue-Comma-300k.csv.zip?raw=true'
Content type 'application/zip' length 4368493 bytes (4.2 MB)
==================================================
downloaded 4.2 MB
data <- read.csv(unz(temp, "No-show-Issue-Comma-300k.csv"))
unlink(temp)
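
If you run the notebook more than once, a small guard avoids re-downloading the file every time. A minimal sketch (the local file name is arbitrary and hypothetical):

zipfile <- "no-show-300k.csv.zip"  # hypothetical local cache file
if (!file.exists(zipfile)) {
  download.file("https://github.com/minorsmart/FEB2017/blob/master/Witek/doctor/No-show-Issue-Comma-300k.csv.zip?raw=true", zipfile)
}
data <- read.csv(unz(zipfile, "No-show-Issue-Comma-300k.csv"))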

Now that we have the data set in a data frame, we can start preparing it for the model. We'll first load some useful packages and then look at the structure and quality of the data.

library(dplyr)     # data manipulation
library(tidyr)     # separate()
library(magrittr)  # the %<>% assignment pipe
library(caTools)   # sample.split()
library(biglm)     # (loaded but not used below)
library(caret)     # confusionMatrix()
str(data)
'data.frame':   300000 obs. of  15 variables:
 $ Age                    : int  19 24 4 5 38 5 46 4 20 51 ...
 $ Gender                 : Factor w/ 2 levels "F","M": 2 1 1 2 2 1 1 1 1 1 ...
 $ AppointmentRegistration: Factor w/ 295425 levels "2013-05-29T15:14:11Z",..: 152112 248200 23071 87268 274612 71585 225305 244749 125184 70354 ...
 $ ApointmentData         : Factor w/ 534 levels "2014-01-02T00:00:00Z",..: 290 440 36 164 490 152 408 442 232 125 ...
 $ DayOfTheWeek           : Factor w/ 7 levels "Friday","Monday",..: 7 7 6 5 6 6 6 1 6 6 ...
 $ Status                 : Factor w/ 2 levels "No-Show","Show-Up": 2 2 2 2 2 1 2 2 2 2 ...
 $ Diabetes               : int  0 0 0 0 0 0 0 0 0 1 ...
 $ Alcoolism              : int  0 0 0 0 0 0 0 0 0 0 ...
 $ HiperTension           : int  0 0 0 0 0 0 0 0 0 1 ...
 $ Handcap                : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Smokes                 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Scholarship            : int  0 0 0 0 0 0 0 1 0 0 ...
 $ Tuberculosis           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Sms_Reminder           : int  0 0 0 1 1 1 1 1 0 1 ...
 $ AwaitingTime           : int  -29 -1 -1 -15 -6 -35 -18 -14 -14 -4 ...
summary(data)
      Age         Gender             AppointmentRegistration              ApointmentData  
 Min.   : -2.00   F:200505   2015-04-15T14:19:34Z:     9     2014-10-22T00:00:00Z:   759  
 1st Qu.: 19.00   M: 99495   2014-03-19T17:10:21Z:     8     2014-09-03T00:00:00Z:   756  
 Median : 38.00              2014-07-03T08:50:50Z:     8     2014-11-24T00:00:00Z:   752  
 Mean   : 37.81              2015-05-27T13:57:09Z:     8     2014-11-19T00:00:00Z:   742  
 3rd Qu.: 56.00              2013-12-23T12:40:00Z:     7     2015-05-25T00:00:00Z:   741  
 Max.   :113.00              2014-01-07T15:42:57Z:     7     2014-09-29T00:00:00Z:   739  
                             (Other)             :299953     (Other)             :295511  
    DayOfTheWeek       Status          Diabetes         Alcoolism        HiperTension       Handcap       
 Friday   :52771   No-Show: 90731   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
 Monday   :59298   Show-Up:209269   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
 Saturday : 1393                    Median :0.00000   Median :0.00000   Median :0.0000   Median :0.00000  
 Sunday   :    6                    Mean   :0.07797   Mean   :0.02501   Mean   :0.2159   Mean   :0.02052  
 Thursday :60262                    3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000  
 Tuesday  :62775                    Max.   :1.00000   Max.   :1.00000   Max.   :1.0000   Max.   :4.00000  
 Wednesday:63495                                                                                          
     Smokes         Scholarship      Tuberculosis      Sms_Reminder     AwaitingTime    
 Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :-398.00  
 1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.: -20.00  
 Median :0.00000   Median :0.0000   Median :0.00000   Median :1.0000   Median :  -8.00  
 Mean   :0.05237   Mean   :0.0969   Mean   :0.00045   Mean   :0.5742   Mean   : -13.84  
 3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:  -4.00  
 Max.   :1.00000   Max.   :1.0000   Max.   :1.00000   Max.   :2.0000   Max.   :  -1.00  
                                                                                        

The data set comprises 300,000 recorded doctor's appointments with 15 variables. There are many categorical variables, several of which are boolean (0 or 1). We want to predict whether a patient will show up for a doctor's appointment (variable Status).

We see some implausible ages, such as 113 and -2 years. We'll keep only records with an age above 5 and below 95.

Next we will recode the Status variable as a 0/1 boolean (1 = Show-Up). This makes the predicted value of the logistic regression directly interpretable as the probability of showing up.

Finally we will change all boolean variables into factors and do a visual inspection of the data.

data <- data[,-3]                              # drop AppointmentRegistration
data <- separate(data, ApointmentData, c("Date", "Time"), sep = "T")
data <- separate(data, Date, c("Year", "Month", "Day"), sep = "-")
data$Year <- as.factor(data$Year)
data <- data[,-6]                              # drop the Time column (always midnight)
data <- filter(data, Age > 5, Age < 95)        # keep plausible ages only
data$Status <- (data$Status == "Show-Up")*1    # 1 = Show-Up, 0 = No-Show
cols <- names(data[4:15])                      # Month ... Sms_Reminder (incl. Status)
data %<>% mutate_each_(funs(factor(.)),cols)   # deprecated in current dplyr; mutate(across(all_of(cols), factor)) is the modern equivalent
str(data)
'data.frame':   272070 obs. of  16 variables:
 $ Age         : int  19 24 38 46 20 51 33 58 62 62 ...
 $ Gender      : Factor w/ 2 levels "F","M": 2 1 2 1 1 1 2 1 1 1 ...
 $ Year        : Factor w/ 2 levels "2014","2015": 2 2 2 2 1 1 2 1 1 1 ...
 $ Month       : Factor w/ 12 levels "01","02","03",..: 1 8 10 7 10 6 11 5 8 7 ...
 $ Day         : Factor w/ 31 levels "01","02","03",..: 14 19 27 7 28 17 9 22 11 14 ...
 $ DayOfTheWeek: Factor w/ 7 levels "Friday","Monday",..: 7 7 6 6 6 6 2 5 2 2 ...
 $ Status      : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 1 2 ...
 $ Diabetes    : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 2 1 ...
 $ Alcoolism   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ HiperTension: Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 2 2 1 ...
 $ Handcap     : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Smokes      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ Scholarship : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
 $ Tuberculosis: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ Sms_Reminder: Factor w/ 3 levels "0","1","2": 1 1 2 2 1 2 1 2 1 2 ...
 $ AwaitingTime: int  -29 -1 -6 -18 -14 -4 -39 -22 -17 -6 ...
summary(data)
      Age        Gender       Year            Month             Day            DayOfTheWeek   Status    
 Min.   : 6.00   F:186972   2014:139462   07     : 24847   10     : 10922   Friday   :48050   0: 81684  
 1st Qu.:24.00   M: 85098   2015:132608   10     : 24679   17     : 10005   Monday   :53796   1:190386  
 Median :42.00                            05     : 23606   09     :  9981   Saturday : 1352             
 Mean   :41.41                            08     : 23547   24     :  9909   Sunday   :    6             
 3rd Qu.:57.00                            09     : 23430   14     :  9745   Thursday :54259             
 Max.   :94.00                            06     : 23217   16     :  9702   Tuesday  :57042             
                                          (Other):128744   (Other):211806   Wednesday:57565             
 Diabetes   Alcoolism  HiperTension Handcap    Smokes     Scholarship Tuberculosis Sms_Reminder
 0:248732   0:264576   0:207456     0:266572   0:256366   0:245217    0:271940     0:116897    
 1: 23338   1:  7494   1: 64614     1:  5023   1: 15704   1: 26853    1:   130     1:154437    
                                    2:   435                                       2:   736    
                                    3:    34                                                   
                                    4:     6                                                   
                                                                                               
                                                                                               
  AwaitingTime   
 Min.   :-398.0  
 1st Qu.: -20.0  
 Median :  -8.0  
 Mean   : -13.7  
 3rd Qu.:  -4.0  
 Max.   :  -1.0  
                 
# Fraction of missing values across the whole data frame (a single bar; zero here)
barplot(sum(is.na(data)*1)/dim(data)[1])

selCols <- c("Age", "Gender", "Year", "Month", "DayOfTheWeek", "Diabetes", "Alcoolism", "HiperTension", "Handcap", "Smokes", "Scholarship", "Tuberculosis", "Sms_Reminder", "AwaitingTime")

# Stacked barplots of Status (red = 0 = No-Show, blue = 1 = Show-Up) against each candidate predictor
for(i in selCols) {
  counts <- table(data$Status, data[, i])
  barplot(counts, col=c("red","darkblue"), xlab = i, legend = rownames(counts))
}

The cleaning deleted around 28,000 records. We will now split the data frame into two sets: one for training the model and one for testing it. When we split, we need to make sure each set has enough variability in the predicted variable (Status); otherwise the model either cannot learn properly or cannot be tested meaningfully (imagine a subset in which every value of Status were 1 or 0). This is where the function sample.split() from the caTools package comes in.

set.seed(101)  # no seed was set in the original run; added so the split is reproducible
data$spl <- sample.split(data$Status, SplitRatio=0.70)
train <- subset(data, data$spl==TRUE)
test <- subset(data, data$spl==FALSE)
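
A quick sanity check (base R, a minimal sketch) confirms that both subsets keep the roughly 70/30 Show-Up/No-Show balance:

prop.table(table(train$Status))
prop.table(table(test$Status))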

Now we can train a model using a logistic regression and have a look at the parameters.

reg.model <- glm(Status ~ . - spl, data = train, family = binomial)
summary(reg.model)

Call:
glm(formula = Status ~ . - spl, family = binomial, data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9676  -1.3898   0.7560   0.8718   2.3050  

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)            0.5609091  0.0416755  13.459  < 2e-16 ***
Age                    0.0119706  0.0002943  40.672  < 2e-16 ***
GenderM               -0.0216609  0.0110916  -1.953  0.05083 .  
Year2015              -0.0201924  0.0102738  -1.965  0.04936 *  
Month02               -0.1202732  0.0267946  -4.489 7.17e-06 ***
Month03               -0.0806014  0.0262810  -3.067  0.00216 ** 
Month04               -0.1485641  0.0262134  -5.667 1.45e-08 ***
Month05               -0.0598902  0.0256415  -2.336  0.01951 *  
Month06               -0.1111399  0.0259940  -4.276 1.91e-05 ***
Month07               -0.1612490  0.0251920  -6.401 1.55e-10 ***
Month08               -0.0560830  0.0259257  -2.163  0.03052 *  
Month09               -0.0448688  0.0261151  -1.718  0.08578 .  
Month10               -0.1147210  0.0252488  -4.544 5.53e-06 ***
Month11               -0.0713707  0.0262503  -2.719  0.00655 ** 
Month12               -0.2633343  0.0264240  -9.966  < 2e-16 ***
Day02                  0.0351254  0.0428283   0.820  0.41213    
Day03                  0.0095578  0.0419918   0.228  0.81995    
Day04                  0.0797694  0.0435459   1.832  0.06697 .  
Day05                  0.0783902  0.0442724   1.771  0.07662 .  
Day06                 -0.0648963  0.0422072  -1.538  0.12415    
Day07                  0.0128978  0.0426370   0.303  0.76227    
Day08                 -0.0318983  0.0437601  -0.729  0.46604    
Day09                  0.0247753  0.0412723   0.600  0.54831    
Day10                  0.0109201  0.0406658   0.269  0.78829    
Day11                  0.0403618  0.0420872   0.959  0.33756    
Day12                  0.0709767  0.0437583   1.622  0.10480    
Day13                  0.0827147  0.0431609   1.916  0.05531 .  
Day14                  0.0536390  0.0416005   1.289  0.19726    
Day15                 -0.0137989  0.0416458  -0.331  0.74039    
Day16                  0.1265090  0.0418264   3.025  0.00249 ** 
Day17                  0.0489550  0.0414607   1.181  0.23770    
Day18                  0.1293072  0.0433217   2.985  0.00284 ** 
Day19                  0.0149361  0.0433188   0.345  0.73025    
Day20                  0.0470219  0.0430391   1.093  0.27460    
Day21                  0.0818781  0.0435708   1.879  0.06022 .  
Day22                  0.0652958  0.0421746   1.548  0.12157    
Day23                  0.0945742  0.0421226   2.245  0.02475 *  
Day24                  0.0488648  0.0416365   1.174  0.24055    
Day25                  0.0563634  0.0432216   1.304  0.19221    
Day26                  0.0664785  0.0434105   1.531  0.12567    
Day27                  0.0472284  0.0428468   1.102  0.27035    
Day28                  0.0288835  0.0424642   0.680  0.49639    
Day29                 -0.1729751  0.0415410  -4.164 3.13e-05 ***
Day30                 -0.1352923  0.0421204  -3.212  0.00132 ** 
Day31                 -0.0383209  0.0531132  -0.721  0.47061    
DayOfTheWeekMonday    -0.0672555  0.0167797  -4.008 6.12e-05 ***
DayOfTheWeekSaturday  -0.4599917  0.0692925  -6.638 3.17e-11 ***
DayOfTheWeekSunday     0.6566065  1.0992512   0.597  0.55029    
DayOfTheWeekThursday   0.0547268  0.0166960   3.278  0.00105 ** 
DayOfTheWeekTuesday    0.0676607  0.0166461   4.065 4.81e-05 ***
DayOfTheWeekWednesday  0.0285228  0.0165414   1.724  0.08465 .  
Diabetes1             -0.0130508  0.0209318  -0.623  0.53296    
Alcoolism1            -0.3469510  0.0321346 -10.797  < 2e-16 ***
HiperTension1          0.0234370  0.0150994   1.552  0.12062    
Handcap1              -0.0533033  0.0384266  -1.387  0.16540    
Handcap2               0.0769533  0.1304250   0.590  0.55518    
Handcap3               0.8397667  0.5556436   1.511  0.13070    
Handcap4               0.6974461  1.1247666   0.620  0.53520    
Smokes1               -0.2223364  0.0229172  -9.702  < 2e-16 ***
Scholarship1          -0.1800135  0.0166321 -10.823  < 2e-16 ***
Tuberculosis1         -0.1715891  0.2141520  -0.801  0.42299    
Sms_Reminder1          0.1054209  0.0110856   9.510  < 2e-16 ***
Sms_Reminder2          0.0039772  0.0976678   0.041  0.96752    
AwaitingTime           0.0102560  0.0003296  31.115  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 232752  on 190448  degrees of freedom
Residual deviance: 228520  on 190385  degrees of freedom
AIC: 228648

Number of Fisher Scoring iterations: 4
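
The coefficients are on the log-odds scale; exponentiating a coefficient turns it into an odds ratio, which is easier to read. A quick sketch using the Alcoolism coefficient from the table above:

exp(coef(reg.model)["Alcoolism1"])  # exp(-0.347) ≈ 0.71: alcoholism lowers the odds of showing up by roughly 30%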

We can test how well the model performs by comparing its predictions with the actual values. The model returns a predicted probability, so we need a cutoff to turn it into a 0/1 prediction; we set the cutoff at a = 0.5.

a <- 0.5
test$predict <- predict(reg.model, newdata=test, type='response')
test$predictb <- ifelse(test$predict > a,1,0)
misClasificError <- mean(test$predictb != test$Status)
print(paste('Accuracy',1-misClasificError))
[1] "Accuracy 0.699268570588451"
cfnMat <- confusionMatrix(test$predictb, test$Status)  # recent caret versions require a factor here: factor(test$predictb, levels = levels(test$Status))
print(cfnMat)
Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0   189   230
         1 24316 56886
                                          
               Accuracy : 0.6993          
                 95% CI : (0.6961, 0.7024)
    No Information Rate : 0.6998          
    P-Value [Acc > NIR] : 0.6245          
                                          
                  Kappa : 0.0051          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.007713        
            Specificity : 0.995973        
         Pos Pred Value : 0.451074        
         Neg Pred Value : 0.700549        
             Prevalence : 0.300229        
         Detection Rate : 0.002316        
   Detection Prevalence : 0.005133        
      Balanced Accuracy : 0.501843        
                                          
       'Positive' Class : 0               
                                          

Unfortunately the model has hardly any predictive power. The confusion matrix shows that when it predicts 0 it is right only 45% of the time (189 of 419 predictions), and when it predicts 1 (= Show-Up) it is right 70% of the time.
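
Accuracy at a single cutoff can also hide whether the model at least ranks patients sensibly, so it is worth checking the AUC. caTools, which we already loaded, provides colAUC(); a minimal sketch (the exact value depends on the random split, and given the near-zero Kappa we expect it to be only modestly above 0.5):

colAUC(test$predict, test$Status, plotROC = TRUE)  # area under the ROC curve; 0.5 = no discrimination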

The No Information Rate equals the share of the majority class (Show-Up). So a model that always predicted 1, regardless of any input, would reach an accuracy of about 70%, which is almost equal to the accuracy of the logistic model.
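
We can verify this with a one-line baseline:

mean(test$Status == "1")  # accuracy of always predicting Show-Up; equals the No Information Rate of ~0.70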

Apparently attendance at doctor's appointments does not follow a process that this linear logistic model can capture. We should try a different approach, such as decision trees or neural networks.

First we'll try a simple tree model to see whether that improves accuracy; we'll use the rpart package for this. After that we will see how a basic neural network performs on the data.

library(rpart)
tree.model <- rpart(Status ~ . -spl, data = train, method = 'class', minsplit = 2, minbucket = 1)  # minsplit/minbucket are passed on to rpart.control()
summary(tree.model)
Call:
rpart(formula = Status ~ . - spl, data = train, method = "class", 
    minsplit = 2, minbucket = 1)
  n= 190449 

  CP nsplit rel error xerror xstd
1  0      0         1      0    0

Node number 1: 190449 observations
  predicted class=1  expected loss=0.3002326  P(node) =1
    class counts: 57179 133270
   probabilities: 0.300 0.700 
bestcp <- tree.model$cptable[which.min(tree.model$cptable[,"xerror"]),"CP"]  # cptable has a single row here, so bestcp = 0
tree.pruned <- prune(tree.model, cp = bestcp)                                # pruning therefore changes nothing
library(rattle)
library(rpart.plot)
library(RColorBrewer)
fancyRpartPlot(tree.model)
Error in apply(model$frame$yval2[, yval2per], 1, function(x) x[1 + x[1]]) : 
  dim(X) must have a positive length
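
The error is actually informative: the cptable above shows nsplit = 0, i.e. the tree never splits and simply predicts the majority class (Show-Up), so there is no tree to draw. Scoring it on the test set makes the point; the accuracy will simply match the share of Show-Ups, about 70%:

test$predict <- NULL
test$predictb <- NULL
test$Status <- as.factor(test$Status)  # already a factor; kept for safety
test$predict <- predict(tree.model, newdata=test, type='class')
misClasificError <- mean(test$predict != test$Status)
print(paste('Accuracy',1-misClasificError))
cfnMat <- confusionMatrix(test$predict, test$Status)
print(cfnMat)
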
library(nnet)
train$Status <- as.factor(train$Status)  # already a factor after the earlier conversion; harmless
nn.model <- nnet(Status ~ . -spl, data=train, size=10, maxit=600)  # one hidden layer with 10 units
# weights:  651
initial  value 144749.304097 
iter  10 value 115303.236946
iter  20 value 114832.634776
[ ... 36 more reports, decreasing steadily ... ]
iter 390 value 112420.656325
iter 400 value 112420.240834
final  value 112420.186441 
converged
library(devtools)
source_url('https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r')
SHA-1 hash of file is 74c80bd5ddbc17ab3ae5ece9c0ed9beb612e87ef
 
#plot the model
plot.nnet(nn.model)

test$predict <- NULL
test$predictb <- NULL
test$Status <- as.factor(test$Status)
test$predict <- predict(nn.model, newdata=test, type='class')
misClasificError <- mean(test$predict != test$Status)
print(paste('Accuracy',1-misClasificError))
[1] "Accuracy 0.702245745580182"
cfnMat <- confusionMatrix(test$predict, test$Status)
print(cfnMat)
Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0   737   535
         1 23768 56581
                                          
               Accuracy : 0.7022          
                 95% CI : (0.6991, 0.7054)
    No Information Rate : 0.6998          
    P-Value [Acc > NIR] : 0.06185         
                                          
                  Kappa : 0.0284          
 Mcnemar's Test P-Value : < 2e-16         
                                          
            Sensitivity : 0.03008         
            Specificity : 0.99063         
         Pos Pred Value : 0.57940         
         Neg Pred Value : 0.70419         
             Prevalence : 0.30023         
         Detection Rate : 0.00903         
   Detection Prevalence : 0.01558         
      Balanced Accuracy : 0.51035         
                                          
       'Positive' Class : 0               
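
The neural network does only marginally better: an accuracy of 0.7022 against a No Information Rate of 0.6998, with a Kappa of 0.028. Like the logistic regression and the tree, it ends up predicting Show-Up for nearly every appointment, so none of these models extracts much signal from the available variables.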
                                          