A regular logistic regression models a binary response variable, where the predictor variables can be continuous or ordinal (rank). It is a widely used classification technique.

The approach here is to predict the outcome as a probability value that ranges between 0 and 1. You could simply model the outcome as a linear function of all the predictors, but in that case the value of *y* would be unconstrained. Since it is desirable to have the outcome (*y*) vary between 0 and 1 for classification purposes (values closer to 0 can be classified as negative and those closer to 1 as positive), a link function for *y* that varies between 0 and 1 is used to build a generalised linear model. This model can then be used to estimate the probability of the outcome, and subsequently for classification based on the estimated probability.
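The logistic (sigmoid) link does exactly this. As a minimal sketch in base R, whose built-ins `qlogis()` and `plogis()` implement the logit and its inverse:

```r
# The logit maps a probability p in (0, 1) to the whole real line;
# its inverse (the logistic function) maps any real value back into (0, 1).
p <- 0.8
logit_p <- log(p / (1 - p))         # log-odds of p
back    <- 1 / (1 + exp(-logit_p))  # inverse logit recovers p

all.equal(logit_p, qlogis(p))     # matches base R's logit
all.equal(back, plogis(logit_p))  # matches base R's inverse logit
```

It is this inverse-logit transform that keeps the model's fitted probabilities inside (0, 1) no matter what values the linear predictor takes.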

Some examples of binary variables that could be modeled using logistic regression are as follows:

- Candidates passing or failing in a competitive examination.
- The probability of a candidate winning or losing an election.
- Probability that a yet-to-be released movie is a box-office hit.

### Logit Model

**Logistic regression (a.k.a. the logit model)** is fitted in R with `glm()`. Before feeding predictor variables to the model, the ordinal variables (rank variables) must be converted to factors, as shown below.

```r
inputData$rank <- factor(inputData$rank)  # convert rank variable to a factor; it has 3 ranking levels

logitModel <- glm(Response ~ Pred1 + Pred2 + rankPred, data = inputData, family = "binomial")

summary(logitModel)  # view summary and model diagnostics
```

Obtain **confidence intervals** for the coefficient estimates using:

```r
confint(logitModel)          # profile-likelihood confidence intervals
confint.default(logitModel)  # Wald confidence intervals based on standard errors
```

For a coefficient in the output below to be considered 'significant', its confidence interval should not include zero; that is, the lower and upper bounds should be of the same sign (both positive or both negative).

```
# Sample Output
                   2.5 %      97.5 %
(Intercept) -0.774822875 2.256118188
Pred1       -0.002867999 0.008273849
Pred2       -0.001400580 0.011949674
Pred3        0.380088737 1.622517536
Pred4       -0.614677730 0.926307310
```
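Since the coefficients (and the intervals above) are on the log-odds scale, exponentiating the bounds gives confidence intervals for the odds ratios; on that scale an interval is 'significant' when it excludes 1. A self-contained sketch on simulated data (the variable names here are illustrative, not from the model above):

```r
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(-0.5 + 1.2 * x))  # simulate a binary response
m <- glm(y ~ x, family = "binomial")

ci <- confint.default(m)  # Wald intervals on the log-odds scale
exp(ci)                   # odds-ratio scale: significant if the interval excludes 1
```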

### How To Interpret Model Summary Results?

```
Call:
glm(formula = Response ~ Pred1 + Pred2 + rankPred, family = "binomial", data = inputData)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.627  -0.896  -0.639   1.249   2.079

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.08128    2.23095   -3.80  0.00047 ***
Pred1        0.01326    0.00209    2.19  0.03847 *
Pred2        0.70214    0.34282    2.52  0.02539 *
rankPred2   -0.86544    0.42649   -2.63  0.03283 *
rankPred3   -2.34221    0.24532   -2.88  0.00030 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

- The deviance residuals are a measure of model fit: the smaller the range of residual values, the better the fit.
- The coefficients part of the output shows the coefficient estimates (beta), standard errors, z-values and p-values (the significance of each predictor). The number of stars at the far right indicates the significance level of the respective predictor. The interpretation of coefficients differs slightly for a continuous predictor (Pred1 and Pred2) as against an ordinal predictor (rankPred). For example, every one-unit increase in Pred2 increases the log odds of the response by 0.702. On the other hand, the ordinal predictor rankPred, though it has multiple entries in the output (rankPred2 and rankPred3), is actually one variable. rankPred2 and rankPred3 are binary (dummy) variables derived from the original variable, since it has 3 levels. Had there been 4 levels in rankPred, the summary output would contain an additional entry (namely rankPred4).
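To see why one 3-level factor produces two coefficient rows, here is a small simulated sketch (the data are made up purely to show the naming):

```r
set.seed(42)
rankPred <- factor(sample(1:3, 50, replace = TRUE))  # a 3-level ordinal predictor
y <- rbinom(50, 1, 0.5)                              # an arbitrary binary response
m <- glm(y ~ rankPred, family = "binomial")

names(coef(m))  # "(Intercept)" "rankPred2" "rankPred3" -- level 1 is the baseline
```

With R's default treatment contrasts, the first factor level is absorbed into the intercept as the baseline, and each remaining level gets its own dummy coefficient.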

Further, an n-level categorical variable can be encoded as 'n' discrete binary variables and used directly as predictors in glm().

`model.matrix ( ~0 + factor.Var, inputData) # convert n-level categorical variable to 'n' binary variables`
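For instance, with a hypothetical 3-level factor, `model.matrix()` with a `~ 0 + ...` formula (no intercept) yields one indicator column per level:

```r
factor.Var <- factor(c("low", "mid", "high", "mid", "low"))  # illustrative 3-level factor
inputData  <- data.frame(factor.Var)

mm <- model.matrix(~ 0 + factor.Var, inputData)
mm  # 5 rows x 3 binary columns; each row has exactly one 1
```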

### Calculating Concordance and Discordance

Ideally, the model-calculated probability scores of all actual positives (aka ones) should be greater than the model-calculated probability scores of all the negatives (aka zeroes). If this occurs, the model is said to be perfectly concordant and highly reliable. This phenomenon can be measured by **Concordance** and **Discordance**, as computed in the *CalculateConcordance()* function below. In other words, concordance is the percentage of predicted scores (a.k.a. probability of positive outcome) where the scores of actual positives are greater than the scores of actual negatives (100% for 'perfect' models). It is calculated by comparing the scores of all possible pairs of positives and negatives.

At the time of writing, there is no in-built function to calculate concordance and discordance, so let's study how to calculate them with an example.

```r
# Prepare the data and fit the logit model
accept <- c(1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1)
acad   <- c(66, 60, 80, 60, 52, 60, 47, 90, 75, 35, 46, 75, 66, 54, 76)
sports <- c(2.6, 4.6, 4.5, 3.3, 3.13, 4, 1.9, 3.5, 1.2, 1.8, 1, 5.1, 3.3, 5.2, 4.9)
rank   <- c(3, 3, 1, 4, 4, 2, 4, 4, 4, 3, 3, 3, 2, 2, 1)

inputData <- data.frame(accept, acad, sports, rank)  # assemble the data frame
logitModel <- glm(accept ~ ., family = "binomial", data = inputData)
```

```r
# Function to calculate concordance and discordance
CalculateConcordance <- function(myMod) {
  fitted <- data.frame(cbind(myMod$y, myMod$fitted.values))  # actuals and fitted values
  colnames(fitted) <- c('response', 'score')                 # rename columns
  ones  <- fitted[fitted$response == 1, ]                    # subset ones
  zeros <- fitted[fitted$response == 0, ]                    # subset zeros
  totalPairs <- nrow(ones) * nrow(zeros)  # total number of pairs to check

  # Count pairs where the positive's score beats (concordant) or
  # trails (discordant) the negative's score; equal scores are ties
  conc <- sum(vapply(ones$score, function(x) x > zeros$score, FUN.VALUE = logical(nrow(zeros))))
  disc <- sum(vapply(ones$score, function(x) x < zeros$score, FUN.VALUE = logical(nrow(zeros))))

  # Calc concordance, discordance and ties
  concordance <- conc / totalPairs
  discordance <- disc / totalPairs
  tiesPercent <- 1 - concordance - discordance

  return(list("Concordance" = concordance, "Discordance" = discordance,
              "Tied" = tiesPercent, "Pairs" = totalPairs))
}

CalculateConcordance(logitModel)  # call the function
```

Note that discordant pairs are counted directly (score of the positive strictly less than that of the negative) rather than as `totalPairs - conc`, which would wrongly lump tied pairs in with the discordant ones.

```
# Result
$Concordance
[1] 0.75

$Discordance
[1] 0.25

$Tied
[1] 0

$Pairs
[1] 56
```

### How To Predict?

Predicting the outcome on test data can be achieved with predict(). The test data is prepared before building the model, typically by holding out ~20% of the observations.

```r
modelPredictions <- predict(logitModel, testData, type = "response")

modelPredictionStatus <- rep("Negative", n_rows)  # n_rows is the number of rows in testData
modelPredictionStatus[modelPredictions > probabilityThreshold] <- "Positive"  # probabilityThreshold is the cutoff probability that determines the final outcome of the response
```

Note that the default label is "Negative" and observations whose predicted probability exceeds the threshold are flagged "Positive" (higher probability means a more likely positive outcome).

### How Well Did Your Model Predict?

A matrix tabulation, known as a confusion matrix, helps summarise the prediction performance.

```r
table(modelPredictionStatus, actualStatus)   # confusion matrix
mean(modelPredictionStatus != actualStatus)  # misclassification error rate
```
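From that table, the usual summary rates fall out directly. A self-contained sketch with made-up status vectors (the names and values are purely illustrative):

```r
actualStatus          <- c("Positive", "Positive", "Negative", "Negative", "Positive")
modelPredictionStatus <- c("Positive", "Negative", "Negative", "Positive", "Positive")

cm <- table(Predicted = modelPredictionStatus, Actual = actualStatus)
accuracy <- sum(diag(cm)) / sum(cm)                      # proportion correctly classified
misclass <- mean(modelPredictionStatus != actualStatus)  # misclassification error rate
c(accuracy = accuracy, misclass = misclass)              # the two always sum to 1
```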

### Example Application of Logit Regression

Predicting the odds of a candidate winning an election based on predictors such as campaign spend, popularity in social media, number of campaign tours, number of previous elections participated etc.

### Similar To Logistic Regression

1. When the response variable is nominal, multinomial logistic regression is the probable choice.

2. In case of an ordered response variable, i.e., the presence of ranking/priorities, ordinal logistic regression is a likely approach.

3. Interval regression, a variation of ordinal logistic regression, is used when the ordered category into which each observation falls is known but the exact value is unknown.