Regression coefficients have a useful interpretation with a binary dependent variable: they show the increase or decrease in the predicted probability of having a characteristic or experiencing an event due to a one-unit change in an independent variable.
Despite the simple interpretation of the coefficients, linear regression with a binary dependent variable faces two sorts of problems. One is conceptual in nature, while the other is statistical in nature.
The conceptual problem with linear regression with a binary dependent variable stems from the fact that probabilities have maximum and minimum values of 1 and 0. By definition, probabilities cannot exceed 1 or fall below 0. Yet, the linear regression line will continue to extend upward as the values of the independent variables increase, and continue to extend downward as the values of the independent variables decrease. Depending on the slope of the line and the observed X values, a model can give predicted values of the dependent variable above 1 and below 0. Such values make no sense and have little predictive use.
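To see the boundary problem concretely, here is a minimal sketch using synthetic data (the predictor range, cutoff, and noise level are illustrative assumptions, not values from the text):

```python
# Fit an ordinary least squares line to a binary outcome and inspect the
# fitted values; with data like this, some typically fall outside [0, 1].
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=200)                          # hypothetical predictor
y = (x + rng.normal(0, 15, size=200) > 50).astype(float)   # binary outcome (0 or 1)

slope, intercept = np.polyfit(x, y, deg=1)                 # least squares fit
predicted = intercept + slope * x

print(f"smallest predicted 'probability': {predicted.min():.3f}")  # can fall below 0
print(f"largest predicted 'probability':  {predicted.max():.3f}")  # can exceed 1
```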
One solution to the boundary problem would be to truncate any predicted value at or above 1 to the maximum value of 1. The regression line would be straight up to this maximum, but beyond it changes in the independent variables would have no influence on the dependent variable. The same would hold for small values, which could be truncated at 0. Such a pattern would define sudden discontinuities in the relationship, whereby at certain points the effect of X on Y would change immediately to 0.
However, another functional form of the relationship might make more theoretical sense than truncated linearity. With a floor and a ceiling, it seems likely that the effect of a unit change in the independent variable on the predicted probability would be smaller near the floor or ceiling than near the middle. Toward the middle of the relationship, the nonlinear curve may approximate linearity, but rather than continuing upward or downward at the same rate, the curve would bend slowly and smoothly so as to approach 0 and 1. As the probability gets closer and closer to 0 or 1, a larger and larger change in the independent variable is required to have the same impact that a smaller change has in the middle of the curve. To move the probability of the outcome from .95 to .96 requires a larger change in the independent variable than to move it from .45 to .46. The general principle is that the same additional input has less impact on the outcome near the ceiling or floor, so increasingly larger inputs are needed there to produce the same change in the outcome.
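The arithmetic behind the .45-to-.46 versus .95-to-.96 comparison can be checked directly. As a sketch, assume a logistic curve with a slope of 1 on the log-odds scale (an arbitrary choice; the ratio of the two answers does not depend on it). The required change in X is then just the difference in log-odds:

```python
import math

def logit(p):
    """Log-odds of p: ln(p / (1 - p))."""
    return math.log(p / (1 - p))

dx_middle  = logit(0.46) - logit(0.45)  # X change needed near the middle
dx_ceiling = logit(0.96) - logit(0.95)  # X change needed near the ceiling

print(f"X change for .45 -> .46: {dx_middle:.4f}")   # about 0.040
print(f"X change for .95 -> .96: {dx_ceiling:.4f}")  # about 0.234, roughly 6x larger
```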
A more appropriate nonlinear relationship would look like the Figure below, where the curve levels off and approaches the ceiling of 1 and the floor of 0. Conceptually, the S-shaped curve makes better sense than the straight line.
An example illustrates the nonlinear relationship. If income increases the likelihood of owning a home, a $10,000 increase in income from $70,000 to $80,000 would raise that likelihood more than an increase from $500,000 to $510,000. High-income persons would no doubt already have a high probability of home ownership, and an extra $10,000 would do little to increase their already high probability. The same would hold for an increase in income from $0 to $10,000: since neither income is likely to be sufficient to purchase a house, the increase would have little impact on ownership. In the middle range, however, the additional $10,000 may make the difference between being able to afford a house and not.
Within the range of a sample, the linear regression line may approximate a curvilinear relationship by taking the average of the diverse slopes implied by the curve. However, the linear relationship still understates the actual relationship in the middle and overstates it at the extremes.
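One way to see this averaging, as a sketch: the slope of the logistic curve at any point is b * p * (1 - p), where b is the slope on the log-odds scale (set to 1 here as an arbitrary assumption). The instantaneous slope in the middle is several times the slope near the floor or ceiling, so a single straight line must split the difference:

```python
b = 1.0  # assumed slope on the log-odds scale
for p in (0.05, 0.50, 0.95):
    print(f"p = {p:.2f}: slope of the curve = {b * p * (1 - p):.4f}")
# p = 0.50 gives 0.2500; p = 0.05 or 0.95 gives 0.0475 -- more than a
# fivefold difference that no single linear slope can capture.
```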
Linear regression with a binary dependent variable also violates the assumption of normality. If the residuals (the differences between the observed values and the values predicted by the model) are normally distributed, errors cluster symmetrically around zero, with small misses common and large misses rare; roughly speaking, the line is doing a consistent job of estimating across the data.
However, when the dependent variable is binary (i.e., it can take only two values, often coded as 0 and 1), this assumption cannot hold: at any given prediction the residual can take only two possible values, and its variance changes with the prediction, so the residuals are neither normal nor constant in variance. To address these issues when dealing with a binary dependent variable, researchers often turn to specialized models such as logistic regression.
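A one-line illustration of why this is so (the fitted probability here is a hypothetical value chosen for illustration): a binary outcome leaves only two possible residuals at any given prediction, and their variance depends on the prediction itself:

```python
p_hat = 0.8                      # hypothetical fitted probability
residual_if_y_is_1 = 1 - p_hat   # the only residual possible when y = 1 (0.2)
residual_if_y_is_0 = 0 - p_hat   # the only residual possible when y = 0 (-0.8)
variance = p_hat * (1 - p_hat)   # 0.16; changes with p_hat (heteroscedasticity)
print(residual_if_y_is_1, residual_if_y_is_0, variance)
```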
Although many nonlinear functions can represent the S-shaped curve, the logistic or logit transformation has become popular because of its desirable properties and relative simplicity. The logit function defines a relationship between the values of X and the S-shaped curve in probabilities. As will become clear, the probabilities need to be transformed in a way that defines a linear rather than nonlinear relationship with X. The logit transformation does this. Given the probability p, the logit transformation is:
logit(p) = ln(p / (1 - p))

That is, the natural log of the odds: the ratio of the probability of an event happening to the probability of that event not happening.
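As a quick sketch, the transformation and its inverse (the logistic function, which maps any log-odds value back into a probability between 0 and 1) can be written as:

```python
import math

def logit(p):
    """Natural log of the odds p / (1 - p); maps (0, 1) to the real line."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Logistic function; maps any real log-odds back into (0, 1)."""
    return 1 / (1 + math.exp(-z))

for p in (0.05, 0.25, 0.50, 0.75, 0.95):
    z = logit(p)
    print(f"p = {p:.2f} -> log-odds = {z:+.3f} -> back to p = {inv_logit(z):.2f}")
```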