Review

Logistic regression is used to study or model the association between a binary response variable (\(y\)) and a set of explanatory variables.

Simple Logistic Regression Model: \[\log\left(\frac{\pi(x)}{1-\pi(x)}\right)=\beta_0 + \beta_1x \hskip.5in \Longrightarrow \hskip.2in \frac{\pi(x)}{1-\pi(x)}=e^{(\beta_0 + \beta_1x)}\] where \(\pi(x)\) is the probability that \(y\) equals 1 (success) when the independent variable is \(x\).
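Solving for \(\pi(x)\) gives the inverse (logistic) form of the model, which is what we use below to turn a fitted log-odds into a probability:

\[\pi(x) = \frac{e^{\beta_0 + \beta_1x}}{1+e^{\beta_0 + \beta_1x}}\]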

Example 1: Predicting Heart Attacks Based on Creatine Kinase (CK)
rm(list=ls())   # Clears all objects from the workspace

data1=read.csv("data_example12_22.csv",header=T)
head(data1)
##   y ck
## 1 1 20
## 2 1 20
## 3 0 20
## 4 0 20
## 5 0 20
## 6 0 20
attach(data1)

result.ex1=glm(y~ck, family=binomial(link="logit"))
summary(result.ex1)$coef
##                Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept) -3.02835960 0.366971041 -8.252312 1.553589e-16
## ck           0.03510439 0.004080992  8.601927 7.838875e-18

Hence, for each 1-unit increase in \(CK\), the odds of a heart attack are multiplied by \(e^{0.0351} \approx 1.036\). That is, as \(CK\) increases by 1 unit, the odds that the person had a heart attack increase by about 3.6%.
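
This multiplier can be pulled directly from the fitted model; a quick check (not part of the original output):

exp(coef(result.ex1)["ck"])   # estimated odds ratio per 1-unit increase in CK, about 1.036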

confint(result.ex1,level=.90)
## Waiting for profiling to be done...
##                     5 %        95 %
## (Intercept) -3.66938129 -2.45759221
## ck           0.02881989  0.04227805
x=seq(20,500,by=40)             # grid of CK values matching the distinct ck levels
proportions=tapply(y,ck,mean)   # observed proportion of heart attacks at each CK level
plot(x,proportions)

curve(exp(-3.0284+0.035*x)/(1+exp(-3.0284+0.035*x)),20,500,lwd=2,col="blue",add=T)   # fitted logistic curve

points(x,exp(-3.0284+0.035*x)/(1+exp(-3.0284+0.035*x)),pch=20)   # fitted probabilities at the CK levels

Suppose a patient comes in with \(CK = 140\).

predict(result.ex1,newdata=data.frame(ck=140))
##        1 
## 1.886255

By default, this command returns \(\hat{\beta}_0+\hat{\beta}_1x\), the estimated log-odds. Therefore, \(\hat{\pi}(140) = e^{1.886}/(1+e^{1.886}) \approx 0.8683\).
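
Alternatively, predict() can return the probability directly if we set type="response"; a one-line check:

predict(result.ex1,newdata=data.frame(ck=140),type="response")   # about 0.8683, matching the fitted value below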

temp=which(ck==140)        # rows with CK = 140
fitted=result.ex1$fitted   # fitted probabilities from the model
fitted[temp[1]]
##      168 
## 0.868328

After we obtain the probability estimate that a patient had a heart attack (‘success’), we still have to decide whether to classify that patient as having had a heart attack or not. Suppose we decide to label \(y=1\) when the estimated probability is higher than 0.7. Below is the R code for computing the error rate using this cut-off value.

prediction=as.numeric(result.ex1$fitted>.7)   # label y=1 when the fitted probability exceeds 0.7
error=y-prediction                            # 1 = false negative, -1 = false positive, 0 = correct
data.frame(y,prob=result.ex1$fitted,prediction,error)[1:3,]
##   y       prob prediction error
## 1 1 0.08897039          0     1
## 2 1 0.08897039          0     1
## 3 0 0.08897039          0     0
percent.error=sum(abs(error))/length(error)
percent.error       # 0.1472222
## [1] 0.1472222
false.pos=sum((y==0)*(prediction==1))   # 8
false.neg=sum((y==1)*(prediction==0))   # 45
true.pos=sum((y==1)*(prediction==1))    # 185
true.neg=sum((y==0)*(prediction==0))    # 122
data.frame(FalsePos=false.pos,FalseNeg=false.neg,TruePos=true.pos,TrueNeg=true.neg)
##   FalsePos FalseNeg TruePos TrueNeg
## 1        8       45     185     122
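
From these four counts we can also compute the model's sensitivity (true positive rate) and specificity (true negative rate) at this cut-off; a short sketch, not part of the original code:

sensitivity=true.pos/(true.pos+false.neg)   # 185/230, about 0.804
specificity=true.neg/(true.neg+false.pos)   # 122/130, about 0.938
data.frame(Sensitivity=sensitivity,Specificity=specificity)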

Find the optimal cut-off value that will minimize the error rate:

results=result.ex1   # the fitted glm object
success=y            # the observed binary responses

# Finding the best cut-off value
cutoff=seq(0,1,by=.05)
m=length(cutoff)
False.pos=array(99,m)   # placeholders, filled in by the loop
False.neg=array(99,m)
Errors=array(99,m)
for(i in 1:m)
{
  prediction=as.numeric(results$fitted>cutoff[i])
  error=success-prediction
  False.pos[i]=sum((success==0)*(prediction==1))
  False.neg[i]=sum((success==1)*(prediction==0))
  Errors[i]=False.pos[i]+False.neg[i]
}
Percent.error=Errors/length(success)
data.frame(CutOff=cutoff,FalsePositive=False.pos,FalseNegative=False.neg,ErrorRate=Percent.error)
##    CutOff FalsePositive FalseNegative  ErrorRate
## 1    0.00           130             0 0.36111111
## 2    0.05           130             0 0.36111111
## 3    0.10            42             2 0.12222222
## 4    0.15            42             2 0.12222222
## 5    0.20            42             2 0.12222222
## 6    0.25            42             2 0.12222222
## 7    0.30            16            15 0.08611111
## 8    0.35            16            15 0.08611111
## 9    0.40            16            15 0.08611111
## 10   0.45            16            15 0.08611111
## 11   0.50            16            15 0.08611111
## 12   0.55            16            15 0.08611111
## 13   0.60            16            15 0.08611111
## 14   0.65             8            45 0.14722222
## 15   0.70             8            45 0.14722222
## 16   0.75             8            45 0.14722222
## 17   0.80             8            45 0.14722222
## 18   0.85             8            45 0.14722222
## 19   0.90             3            75 0.21666667
## 20   0.95             3            75 0.21666667
## 21   1.00             0           230 0.63888889

Based on the above table, every cut-off value between 0.30 and 0.60 minimizes the error rate. A good choice is 0.45, which yields 16 false positives, 15 false negatives, and an error rate of about 8.6%.

You can also figure out the best cut-off value using the following commands:

cutoff[which(Percent.error==min(Percent.error))]; min(Percent.error)
## [1] 0.30 0.35 0.40 0.45 0.50 0.55 0.60
## [1] 0.08611111
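
A quick plot of the error rate against the cut-off (a sketch, not part of the original code) makes the flat minimum between 0.30 and 0.60 easy to see:

plot(cutoff,Percent.error,type="b",pch=20,xlab="Cut-off value",ylab="Error rate")
abline(h=min(Percent.error),lty=2)   # dashed line at the minimum error rate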

Goodness-of-fit of the Logistic Model

The usual QQ-plot of the residuals is not very useful in assessing the fit of a logistic model.

residuals=result.ex1$residuals   # note: these are working residuals, not deviance residuals
qqnorm(residuals)

We use the Hosmer-Lemeshow test to check the goodness of fit of the model.

library(ResourceSelection)
## Warning: package 'ResourceSelection' was built under R version 3.5.2
## ResourceSelection 0.3-5   2019-07-22
hoslem.test(y,fitted)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  y, fitted
## X-squared = 115.99, df = 8, p-value < 2.2e-16
The very small p-value means the Hosmer-Lemeshow test rejects the hypothesis of an adequate fit, so this single-covariate model does not fit the data well.

When the covariate is also binary
Example 2 (Example 7.1): Coronary Heart Disease in the Framingham Study

Cornfield (1962) presents preliminary results from the Framingham study of the association of cholesterol levels and blood pressure with the risk of developing coronary heart disease (CHD) in the population of the town of Framingham, Massachusetts, over a six-year period. Cholesterol levels and blood pressure were measured at the beginning of the study and the cohort was then followed over time. These predictor variables have been categorized as high versus low values of serum cholesterol (\(X_1 : hichol = 1\) if \(>\) 200 mg/dL, 0 if \(\leq\) 200 mg/dL) and as high versus low values of systolic blood pressure (\(X_2: hisbp = 1\) if \(>\) 147 mmHg, 0 if \(\leq\) 147 mmHg). From the sample of \(N = 1329\) subjects, the data in each category of the covariates are presented in the table below.

Category (\(j\))   CHD Count (\(a_j\))   Size (\(n_j\))   \(X_1\)   \(X_2\)
       1                  10                 431             0         0
       2                  10                 142             0         1
       3                  38                 532             1         0
       4                  34                 224             1         1
# Each "type" vector is one subject's record: (y, intercept, x1, x2);
# the "a" vectors are CHD cases (y=1), the "b" vectors are non-cases (y=0)
type1a=c(1,1,0,0); type1b=c(0,1,0,0)   # category 1: x1=0, x2=0
type2a=c(1,1,0,1); type2b=c(0,1,0,1)   # category 2: x1=0, x2=1
type3a=c(1,1,1,0); type3b=c(0,1,1,0)   # category 3: x1=1, x2=0
type4a=c(1,1,1,1); type4b=c(0,1,1,1)   # category 4: x1=1, x2=1

A=c(rep(type1a,10),rep(type2a,10),rep(type3a,38),rep(type4a,34))                    # the CHD cases
B=c(rep(type1b,431-10),rep(type2b,142-10),rep(type3b,532-38),rep(type4b,224-34))    # the non-cases
data=matrix(c(A,B),ncol=4,byrow=T)   # one row per subject

y=data[,1]    # CHD indicator
x0=data[,2]   # intercept column (not used below; glm adds its own intercept)
x1=data[,3]   # high cholesterol
x2=data[,4]   # high SBP

table(data.frame(Response=y,Covariate=x1))
##         Covariate
## Response   0   1
##        0 553 684
##        1  20  72
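
The unadjusted odds ratio can be computed by hand from this \(2\times 2\) table as the cross-product ratio; it matches the my.rrci output below:

(72*553)/(20*684)   # OR = (a*d)/(b*c), about 2.9105
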
source('~/Dropbox/Teaching/STAT766/my_rrci.R', encoding = 'UTF-8')
my.rrci(72,20,684,553,0.05)[,c(1,4)]
##         Value     OR
## 1    Estimate 2.9105
## 2 Lower Limit 1.7514
## 3 Upper Limit 4.8368
output=glm(y~x1, family=binomial(link="logit"))
summary(output)$coef
##              Estimate Std. Error    z value     Pr(>|z|)
## (Intercept) -3.319626  0.2276143 -14.584433 3.528436e-48
## x1           1.068334  0.2591508   4.122441 3.748793e-05

Hence, the effect of high versus low cholesterol is \(\beta_1= 1.068\) with an estimated odds ratio of \(\widehat{OR}_{HvL} = e^{1.068} = 2.91\). This indicates that high cholesterol is associated with a 191% increase [\((2.91 - 1 ) \times 100\)] in the odds of CHD.

confint(output)
## Waiting for profiling to be done...
##                  2.5 %    97.5 %
## (Intercept) -3.7991625 -2.902151
## x1           0.5798444  1.601100
exp(c(0.58,1.60))   # 95% CI for the odds ratio of x1
## [1] 1.786038 4.953032
Using both covariates
output2=glm(y~x1+x2, family=binomial(link="logit"))
summary(output2)$coef
##               Estimate Std. Error    z value     Pr(>|z|)
## (Intercept) -3.6282166  0.2475714 -14.655236 1.247232e-48
## x1           1.0301350  0.2605353   3.953916 7.688225e-05
## x2           0.9168417  0.2203159   4.161487 3.161817e-05

The effect of high versus low cholesterol, adjusting for blood pressure, is \(\beta_1 = 1.0301\) with an estimated odds ratio of \(\widehat{OR}_{HvL} = e^{1.0301} = 2.801\). This indicates that high cholesterol is associated with a 180% increase in the odds of CHD (adjusted for the effects of blood pressure).

The inverse provides the decrease in odds associated with low versus high cholesterol, or \(\widehat{OR}_{LvH} = e^{-1.0301} = 0.357\), which indicates a 64.3% decrease in the odds of CHD [\((1 - 0.357) \times 100\)].

Similarly, \(\beta_2 = 0.9168\) is the effect of high versus low systolic blood pressure adjusted for serum cholesterol, which yields an odds ratio of \(\widehat{OR}_{HvL} = 2.501\), indicating a 150% increase in the odds among those with high SBP.

Note that both p-values (7.688225e-05 and 3.161817e-05) from the Wald Z tests are extremely small. Therefore, both of these covariates have a statistically significant effect on the response variable.

The 95% confidence intervals can be obtained as before using the following command:

confint(output2)
## Waiting for profiling to be done...
##                  2.5 %    97.5 %
## (Intercept) -4.1452660 -3.170884
## x1           0.5386424  1.565314
## x2           0.4827586  1.348845
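
As a side note, the same model can be fitted directly from the grouped counts in the table above, without expanding the data to one record per subject. A sketch (the data frame name and layout are my own, not from the original notes):

chd=data.frame(a=c(10,10,38,34),       # CHD cases in each category
               n=c(431,142,532,224),   # category sizes
               x1=c(0,0,1,1),          # high cholesterol indicator
               x2=c(0,1,0,1))          # high SBP indicator
fit.grouped=glm(cbind(a,n-a)~x1+x2,family=binomial(link="logit"),data=chd)
summary(fit.grouped)$coef              # coefficients match output2 above
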
Stratified \(2\times 2\) Table
Example 3: Clinical Trial in Duodenal Ulcers (Data from Example 4.1)

In this example we want to assess the effectiveness of a new drug for the treatment of duodenal ulcers, where the drug is expected to promote healing by retarding the excretion of gastric juices that lead to ulceration of the duodenum.

Stratum 1: Acid-Dependent

Response    Drug (\(D\))    Placebo (\(P\))    Total
Positive         16               20             36
Negative         26               27             53
Total            42               47             89

Stratum 2: Drug-Dependent

Response    Drug (\(D\))    Placebo (\(P\))    Total
Positive          9                4             13
Negative          3                5              8
Total            12                9             21

Stratum 3: Intermediate

Response    Drug (\(D\))    Placebo (\(P\))    Total
Positive         28               16             44
Negative         18               28             46
Total            46               44             90
positive=c(rep(1,36),rep(0,53),rep(1,13),rep(0,8),rep(1,44),rep(0,46))   # responses, stratum by stratum
drug=c(rep(1,16),rep(0,20),rep(1,26),rep(0,27),rep(1,9),rep(0,4),rep(1,3),
       rep(0,5),rep(1,28),rep(0,16),rep(1,18),rep(0,28))                 # 1 = drug, 0 = placebo
strata=c(rep(1,89),rep(2,21),rep(3,90))                                  # stratum sizes: 89, 21, 90
strata=factor(strata)

output4.1=glm(positive~drug+strata, family=binomial(link="logit"))
summary(output4.1)$coef
##               Estimate Std. Error   z value   Pr(>|z|)
## (Intercept) -0.6297356  0.2603187 -2.419095 0.01555917
## drug         0.5023434  0.2884458  1.741552 0.08158682
## strata2      0.8352668  0.5024711  1.662318 0.09644903
## strata3      0.3277676  0.3042355  1.077348 0.28132477

Based on the p-values of the Wald Z tests, neither stratum 2 nor stratum 3 has a baseline risk significantly different from that of the first stratum, which is consistent with the results we got from Chapter 4. The same can be said about the drug effect (p-value = 0.08158682).

The estimated odds ratio is \(e^{0.5023} \approx 1.65\), which is very close to what we obtained using the Mantel-Haenszel procedure. This implies that the odds of a positive result increase by about 65% when the drug is used. However, since the p-value is about 0.08, this increase is not statistically significant at the usual 0.05 level.

Data <- as.table(array(c(16,26,20,27,9,3,4,5,28,18,16,28),
                   dim = c(2, 2, 3),
                   dimnames =
                   list(Response = c("+", "-"), Treatment=c("D", "P"),
                        Ulcer = c("Acid", "Drug", "Intermediate"))))
mantelhaen.test(Data, correct=F)
## 
##  Mantel-Haenszel chi-squared test without continuity correction
## 
## data:  Data
## Mantel-Haenszel X-squared = 3.0045, df = 1, p-value = 0.08303
## alternative hypothesis: true common odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.934329 2.857044
## sample estimates:
## common odds ratio 
##          1.633836

Based on parameter estimates in output4.1, we have the following logistic models:

For stratum 1: \(\log \left(\frac{\pi}{1-\pi}\right) = -0.63 + 0.50(drug)\)

For stratum 2: \(\log \left(\frac{\pi}{1-\pi}\right) = (-0.63+0.84) + 0.50(drug)\)

For stratum 3: \(\log \left(\frac{\pi}{1-\pi}\right) = (-0.63+0.33) + 0.50(drug)\)
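
These stratum-specific equations can be evaluated in R straight from the fitted coefficients; for example, the predicted probability of a positive response on the drug in each stratum (a sketch, not in the original notes):

b=coef(output4.1)
log.odds=c(b["(Intercept)"],
           b["(Intercept)"]+b["strata2"],
           b["(Intercept)"]+b["strata3"])+b["drug"]   # drug = 1 in every stratum
exp(log.odds)/(1+exp(log.odds))                       # predicted probabilities for strata 1-3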

Logistic Model Building

To illustrate this process, we consider the data set ‘Mroz’ from the R package ‘car’.

# Load the R package 'car' (install it first with install.packages("car") if needed)
library(car)
## Warning: package 'car' was built under R version 3.5.2
## Loading required package: carData
data(Mroz)
some(Mroz)      # Prints a few randomly chosen rows of the data
##     lfp k5 k618 age  wc  hc        lwg    inc
## 46  yes  0    2  34  no  no  1.2447947  5.000
## 144 yes  0    1  51  no  no  0.8943118 13.732
## 186 yes  1    5  39  no yes  0.9681486 33.970
## 243 yes  0    1  32  no yes  1.6267294 17.400
## 264 yes  0    1  47 yes yes  1.4105872 41.400
## 358 yes  0    1  32  no yes -0.7419373 17.150
## 413 yes  0    1  31  no  no  1.1516048 14.600
## 514  no  0    0  53  no  no  0.2062952 10.700
## 551  no  0    1  50  no  no  0.9058391 29.400
## 696  no  1    3  33  no yes  0.7931218 15.000
help(Mroz)      # Will show the description of the data set.

There are 7 possible predictors (k5, k618, age, wc, hc, lwg, and inc) for the response variable ‘lfp’. One easy way to check which predictors have a significant effect on the response variable is simply to run the model with all of them in it.

mroz.mod.all=glm(lfp~.,family=binomial(link=logit),data=Mroz)
summary(mroz.mod.all)$coef
##                Estimate  Std. Error    z value     Pr(>|z|)
## (Intercept)  3.18214046 0.644375092  4.9383356 7.879216e-07
## k5          -1.46291304 0.197000605 -7.4259317 1.119888e-13
## k618        -0.06457068 0.068000828 -0.9495573 3.423372e-01
## age         -0.06287055 0.012783090 -4.9182591 8.731727e-07
## wcyes        0.80727378 0.229979884  3.5101930 4.477816e-04
## hcyes        0.11173357 0.206039719  0.5422914 5.876178e-01
## lwg          0.60469312 0.150817565  4.0094343 6.086437e-05
## inc         -0.03444643 0.008208376 -4.1964976 2.710743e-05

Notice that the variable ‘hc’ is not significant, as the p-value from the Wald Z test is 0.5876. So we run the model again without this variable.

mroz.mod2=glm(lfp~.-hc,family=binomial(link=logit),data=Mroz)
summary(mroz.mod2)$coef
##                Estimate  Std. Error    z value     Pr(>|z|)
## (Intercept)  3.21663627 0.640902118  5.0189197 5.196286e-07
## k5          -1.45610304 0.196144713 -7.4236161 1.139652e-13
## k618        -0.06426895 0.067965445 -0.9456122 3.443464e-01
## age         -0.06363571 0.012701237 -5.0101977 5.437415e-07
## wcyes        0.86225483 0.206726686  4.1709895 3.032799e-05
## lwg          0.60453743 0.150619081  4.0136842 5.977830e-05
## inc         -0.03317664 0.007832993 -4.2354995 2.280442e-05

Now we see that k618 is not significant (p-value=0.344), so we remove it as well.

mroz.mod.final=glm(lfp~.-hc-k618,family=binomial(link=logit),data=Mroz)
summary(mroz.mod.final)$coef
##                Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept)  2.90192560 0.542897742  5.345253 9.029102e-08
## k5          -1.43180362 0.193195960 -7.411147 1.252119e-13
## age         -0.05853213 0.011415654 -5.127357 2.938381e-07
## wcyes        0.87237347 0.206389444  4.226832 2.370047e-05
## lwg          0.61568412 0.150142747  4.100658 4.119762e-05
## inc         -0.03367514 0.007800262 -4.317181 1.580346e-05

All the remaining predictors are now significant. So our final logistic model is:

\[\log \left(\frac{\pi}{1-\pi}\right) = 2.902 - 1.432(k5) - 0.059(age) + 0.872(wc) + 0.616(lwg) - 0.034(inc)\]

Based on the above model, when a woman ages by 1 year (and all other predictors remain the same), the odds of her being in the labor force are multiplied by \(e^{-0.059} \approx 0.94\). This implies that as a woman ages by a year, the odds of her working decrease by about 6%.

Similarly, a college-educated woman, all other predictors being equal, has odds of working about \(e^{0.8724} \approx 2.39\) times those of a woman who didn’t go to college.

If a woman is 27 years old, has 2 kids younger than 6 years old, went to college, and has \(lwg = 1.2\) and a household income of 21 thousand, then the estimated probability that she is working is about 0.34 (see the calculation below).

log.odds=2.902-1.432*2-0.059*27+0.872*1+1.2*0.616-0.034*21
exp(log.odds)/(1+exp(log.odds))
## [1] 0.341234
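
The same probability can be obtained (up to rounding of the coefficients) directly from predict(); the data frame below assumes the "yes"/"no" factor coding used for wc in Mroz:

newperson=data.frame(k5=2,age=27,wc="yes",lwg=1.2,inc=21)
predict(mroz.mod.final,newdata=newperson,type="response")   # about 0.34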

You can obtain the exponentiated coefficients (the estimated odds ratios) and their confidence intervals as follows:

round(exp(cbind(Estimate=coef(mroz.mod.final),confint(mroz.mod.final))),2)
## Waiting for profiling to be done...
##             Estimate 2.5 % 97.5 %
## (Intercept)    18.21  6.36  53.59
## k5              0.24  0.16   0.35
## age             0.94  0.92   0.96
## wcyes           2.39  1.60   3.61
## lwg             1.85  1.39   2.50
## inc             0.97  0.95   0.98
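
As an aside (not part of the original notes), R's step() function automates this kind of backward elimination, using AIC rather than Wald p-values as the criterion; here it should arrive at essentially the same model:

mroz.mod.step=step(mroz.mod.all,direction="backward",trace=0)   # AIC-based backward elimination
summary(mroz.mod.step)$coef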

Finally, if you want to remove multiple variables at once, say to check whether both variables relating to kids (k5 and k618) can be dropped from the logistic model, you may use the following likelihood-ratio chi-square test:

mroz.mod.nokids=glm(lfp~.-k5-k618,family=binomial(link=logit),data=Mroz)
anova(mroz.mod.nokids,mroz.mod.all,test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: lfp ~ (k5 + k618 + age + wc + hc + lwg + inc) - k5 - k618
## Model 2: lfp ~ k5 + k618 + age + wc + hc + lwg + inc
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       747     971.75                          
## 2       745     905.27  2   66.485 3.655e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The degrees of freedom for the chi-square test is 2 (because we removed 2 predictors from the full model), and the deviance statistic is the difference of the residual deviances (971.75 - 905.27 = 66.485), with a resulting p-value of 3.655e-15. Therefore, these two variables jointly have a significant effect on the odds that a woman would be working.