Logistic Regression: A detailed discussion

Logistic regression models a binary response variable, where the predictor variables can be continuous or ordinal (rank). It is a widely used classification technique.

The approach here is to predict the outcome as a probability value that ranges between 0 and 1. You could simply model the outcome as a linear function of all the predictors, but in that case the value of y would be unconstrained. Since it is desirable for the outcome (y) to vary between 0 and 1 for classification purposes (values closer to 0 are classified as negative and those closer to 1 as positive), a link function that constrains y to the (0, 1) range is used to build a generalised linear model. This model can then be used to estimate the probability of the outcome, and subsequently to classify observations based on the estimated probability.
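As a minimal illustration of the link function (the numbers below are made up for demonstration), base R's plogis() implements the logistic function, which maps any value of the linear predictor into the (0, 1) range:

linearPredictor <- seq(-6, 6, by = 2)  # hypothetical values of the linear combination of predictors
plogis(linearPredictor)  # 1 / (1 + exp(-x)); every value lies strictly between 0 and 1
qlogis(0.73)  # the logit (log odds), i.e. the inverse mapping back to the linear scale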
Some examples of binary outcomes that could be modeled using logistic regression are:

  1. A candidate passing or failing a competitive examination.
  2. A candidate winning or losing an election.
  3. A yet-to-be-released movie turning out to be a box-office hit or not.

Logit Model

Logistic regression is also known as the logit model. Before feeding predictor variables into the model, the ordinal variables (rank variables) must be converted to factors, as shown below.

inputData$rankPred <- factor(inputData$rankPred)  # convert the rank variable to a factor; it has 3 ranking levels
logitModel <- glm(Response ~ Pred1 + Pred2 + rankPred, data = inputData, family = "binomial")
summary(logitModel)  # view model summary and diagnostics

Obtain confidence intervals for coefficient estimates using:

confint(logitModel)          # confidence intervals based on the profiled log-likelihood
confint.default(logitModel)  # confidence intervals based on standard errors

For a predictor to be considered 'significant', its confidence interval should not contain zero, i.e. the 2.5% and 97.5% limits in the output below should have the same sign (both positive or both negative).

# Sample Output
                      2.5 %      97.5 %
(Intercept)    -0.774822875 2.256118188
Pred1          -0.002867999 0.008273849
Pred2          -0.001400580 0.011949674
Pred3           0.380088737 1.622517536
Pred4          -0.614677730 0.926307310
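As a small convenience, the predictors whose intervals exclude zero can be picked out programmatically. A minimal sketch, assuming the fitted logitModel from above:

ci <- confint.default(logitModel)  # confidence intervals based on standard errors
rownames(ci)[sign(ci[, 1]) == sign(ci[, 2])]  # coefficients whose limits share the same sign (i.e. exclude zero)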

How To Interpret Model Summary Results?

Call:
 glm(Response ~ Pred1 + Pred2 + rankPred, data = inputData, family = "binomial")

 Deviance Residuals: 
    Min      1Q  Median      3Q     Max  
 -1.627  -0.896  -0.639   1.249   2.079  
 
 Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
 (Intercept)     -3.08128    2.23095   -3.80  0.00047 ***
 Pred1            0.01326    0.00209    2.19  0.03847 *  
 Pred2            0.70214    0.34282    2.52  0.02539 *  
 rankPred2       -0.86544    0.42649   -2.63  0.03283 *  
 rankPred3       -2.34221    0.24532   -2.88  0.00030 ***
 ---
 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. The deviance residuals are a measure of model fit. The smaller the spread of the residual values, the better the fit.
  2. The Coefficients part of the output shows the coefficient estimates (beta), standard errors, z-values and p-values (the significance of each predictor). The number of stars at the far right indicates the significance strength of the respective predictor. The interpretation of the coefficients is slightly different for a continuous predictor (Pred1 and Pred2) as against an ordinal predictor (rankPred). For example, every one-unit increase in Pred2 increases the log odds of the response by 0.702. On the other hand, the ordinal predictor rankPred, though it has multiple entries in the output (rankPred2 and rankPred3), is actually one variable. Since rankPred has 3 levels, rankPred2 and rankPred3 are derived from it as binary (dummy) variables, with the first level acting as the reference. Had there been 4 levels in rankPred, the summary output would contain an additional entry (namely rankPred4). Exponentiating the estimates converts them from log odds to odds ratios, as sketched below.
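Since the estimates are reported on the log-odds scale, a minimal sketch (assuming the fitted logitModel from above) to read them as odds ratios:

exp(coef(logitModel))  # coefficient estimates expressed as odds ratios
exp(cbind(OddsRatio = coef(logitModel), confint.default(logitModel)))  # odds ratios with confidence limits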

Further, an n-level categorical variable can be expanded into n discrete binary (dummy) variables and used directly as predictors in glm().

model.matrix(~ 0 + factor.Var, inputData)  # convert an n-level categorical variable to n binary variables
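For instance, a quick hypothetical illustration (the data frame below is made up purely to show the shape of the output):

toyData <- data.frame(factor.Var = factor(c("low", "medium", "high", "medium")))
model.matrix(~ 0 + factor.Var, toyData)  # a 3-level factor expands to 3 indicator (0/1) columns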

Calculating Concordance and Discordance

Ideally, the model-calculated probability scores of all actual positives (a.k.a. ones) should be greater than the model-calculated probability scores of all the negatives (a.k.a. zeroes). If this occurs, the model is said to be perfectly concordant and is a highly reliable one. This property is measured by concordance and discordance, as shown in the CalculateConcordance() function below.

In other words, concordance is the percentage of pairs in which the predicted score (i.e. the probability of a positive outcome) of the actual positive is greater than that of the actual negative; it is 100% for a 'perfect' model. It is calculated over all possible pairs of positives and negatives.

At the time of writing, there is no built-in function to calculate concordance and discordance, so let's see how to calculate them with an example.

# Prepare the data and fit the logit model
accept <- c(1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1)
acad   <- c(66, 60, 80, 60, 52, 60, 47, 90, 75, 35, 46, 75, 66, 54, 76)
sports <- c(2.6, 4.6, 4.5, 3.3, 3.13, 4, 1.9, 3.5, 1.2, 1.8, 1, 5.1, 3.3, 5.2, 4.9)
rank   <- c(3, 3, 1, 4, 4, 2, 4, 4, 4, 3, 3, 3, 2, 2, 1)
inputData  <- data.frame(accept, acad, sports, rank)  # assemble the data frame
logitModel <- glm(accept ~ ., family = "binomial", data = inputData)

# Function to calculate concordance and discordance
CalculateConcordance <- function (myMod){
  fitted <- data.frame(cbind(myMod$y, myMod$fitted.values))  # actuals and fitted values
  colnames(fitted) <- c('response', 'score')  # rename columns
  ones  <- fitted[fitted$response == 1, ]  # subset ones
  zeros <- fitted[fitted$response == 0, ]  # subset zeros
  totalPairs <- nrow(ones) * nrow(zeros)  # total number of (one, zero) pairs to compare
  # a pair is concordant if the one's score exceeds the zero's, discordant if it is lower
  conc <- sum(c(vapply(ones$score, function(x) {x > zeros$score}, FUN.VALUE = logical(nrow(zeros)))))
  disc <- sum(c(vapply(ones$score, function(x) {x < zeros$score}, FUN.VALUE = logical(nrow(zeros)))))
  # calculate concordance, discordance and ties as proportions of all pairs
  concordance <- conc / totalPairs
  discordance <- disc / totalPairs
  tiesPercent <- 1 - concordance - discordance
  return(list("Concordance" = concordance, "Discordance" = discordance,
              "Tied" = tiesPercent, "Pairs" = totalPairs))
}
CalculateConcordance(logitModel) # call the fn

# Result
$Concordance
[1] 0.75
$Discordance
[1] 0.25
$Tied
[1] 0
$Pairs
[1] 56

How To Predict?

Predicting the outcome on test data can be achieved with predict(). The test data (testData) is normally set aside before building the model, typically by holding out about 20% of the observations.
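A minimal sketch of creating such a holdout, assuming inputData is the full dataset (the names are illustrative):

set.seed(100)  # for reproducibility
testRows  <- sample(nrow(inputData), size = round(0.2 * nrow(inputData)))  # ~20% of the rows
testData  <- inputData[testRows, ]   # holdout used only for prediction
trainData <- inputData[-testRows, ]  # used to fit the model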
modelPredictions <- predict(logitModel, testData, type = "response")  # predicted probabilities on the holdout
modelPredictionStatus <- rep("Negative", nrow(testData))  # start with all predictions set to "Negative"
modelPredictionStatus[modelPredictions > probabilityThreshold] <- "Positive"  # probabilityThreshold is the probability cutoff (commonly 0.5) that determines the predicted outcome

How Well Did Your Model Predict?

A matrix tabulation, known as a confusion matrix, helps summarise the prediction performance.
table(modelPredictionStatus, actualStatus)   # confusion matrix
mean(modelPredictionStatus != actualStatus)  # misclassification error rate
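For instance, if the holdout sketched earlier is used and accept is the response column, the actual labels could be derived as follows (names assumed for illustration):

actualStatus <- ifelse(testData$accept == 1, "Positive", "Negative")  # actual labels taken from the holdout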

Example Application of Logit Regression

Predicting the odds of a candidate winning an election based on predictors such as campaign spend, popularity on social media, number of campaign tours, number of previous elections contested, etc.

Similar To Logistic Regression

1. When the response variable is nominal (unordered categories), multinomial logistic regression is the natural choice.
2. When the response variable is ordered, i.e. its categories carry a ranking or priority, ordinal logistic regression is a likely approach (typical R calls for both are sketched after this list).
3. Interval regression, a variation of ordinal logistic regression, is used when the ordered category into which each observation falls is known but the exact value is unknown.
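For reference, both variants are available in standard R packages. A minimal, hedged sketch of the typical calls (the formula and data names are placeholders, not part of the original example):

library(nnet)  # provides multinom() for multinomial logistic regression
# multinomModel <- multinom(Response ~ Pred1 + Pred2, data = inputData)
library(MASS)  # provides polr() for ordinal (proportional odds) logistic regression; the response must be an ordered factor
# ordinalModel <- polr(Response ~ Pred1 + Pred2, data = inputData)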
