In the second week of class, we will look at linear regression in more depth, and examine the challenge of overfitting.
3.1 Linear Regression in depth
Last week, we were introduced to the Linear Regression model. Today we will take it apart and look at it carefully to see how it works and what assumptions are needed to use this model.
3.1.1 One predictor
Suppose we use just one predictor, \(Age\), to predict our outcome \(MeanBloodPressure\):
\[
MeanBloodPressure= \beta_0 + \beta_1 \cdot Age
\]
Fitted to our Training data, the model looks like the red line below:
This model is formed by finding the line of best fit: the line that minimizes the sum of squared differences between the observed response (blue points) and the predicted response (red line). Using the terminology from last week, we say that we fit the line for which the Mean Squared Error (MSE) is minimized.
Left panel: difference between response and predicted response (residual). Right panel: squared difference between predicted response and response.
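In symbols, for \(n\) training observations the quantity being minimized is
\[
MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,
\]
where \(y_i\) is the observed response and \(\hat{y}_i\) is the predicted response for observation \(i\). This is exactly the quantity the code later in this chapter computes as np.mean((predicted - observed) ** 2).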
3.2 Assumptions of linear regression
Any model we use makes assumptions about the data, and those assumptions must hold for the model to make good predictions. Note that additional assumptions come into play if your modeling goal is inference rather than prediction.
Let’s take a look at the assumptions needed for a sound model, and what we can do when an assumption is not upheld.
3.2.1 Linearity of the response-predictor relationship
The linear regression model assumes that there is a straight line (linear) relationship between the predictors and the response. It doesn’t require the relationship to be perfectly linear, but rather that on average the cloud of points has a linear shape. If that is not true, our predictions will be less accurate.
To check for this relationship, we have to calculate the residual, which is the difference between the response value and the predicted response value (similar to the model performance metrics we examined last week). Then, we can make a residual plot of the predicted response vs. residual. Ideally, this residual plot should have no pattern - some residuals above 0, some below 0, but no strong trend.
If there’s a trend in the data, that means there are non-linear associations between some of the predictors and the response.
We see there’s a slight curve in our residual plot. We will look at ways to deal with this later in this lecture.
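Here is a minimal sketch of how this check can be coded, assuming the Age model from the start of this lecture, and assuming the model_matrix helper, the nhanes data frame, and the imports (statsmodels as sm, matplotlib's plt) used in the other code cells in this chapter are already loaded:

# Hedged sketch: residual plot for the Age model.
# model_matrix and nhanes are assumed available, as in the chapter's other cells.
y, X = model_matrix("MeanBloodPressure ~ Age", nhanes)
age_model = sm.OLS(y, X).fit()

predicted = age_model.predict(X)
residuals = y.MeanBloodPressure - predicted   # observed minus predicted

plt.scatter(predicted, residuals, alpha=.5)
plt.axhline(0, color="red")                   # residuals should scatter around 0 with no trend
plt.xlabel('Predicted Mean Blood Pressure')
plt.ylabel('Residual')
plt.show()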
3.2.2 No Outliers
An outlier is an observation for which the response is far from the predicted response (y-axis). An observation has high leverage if it has an unusual predictor value (x-axis). Outliers and high-leverage observations may arise from incorrect measurements, among many other causes. When these observations cause significant changes to the regression model, they are called influential. They can contribute heavily to the Mean Squared Error, because observations far from the majority of the data have very large squared residuals.
Some possible solutions:
We can eyeball for potential influential points during exploratory data analysis and see how the model changes if we remove them. We may also troubleshoot the instruments that generated the data in the first place to diagnose the problem.
We can detect an influential point by computing the studentized residuals or Cook’s distance and decide whether it makes sense to remove it.
We can use a different linear regression method, called Huber loss regression, that is more tolerant of outliers (a sketch of these approaches follows below).
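Here is a minimal sketch of these diagnostics, assuming the Age model and the same helpers and imports as elsewhere in this chapter. The robust fit uses statsmodels' RLM with a Huber norm, which is one way (among several) to carry out Huber loss regression:

# Hedged sketch: influence diagnostics and a robust (Huber) fit.
# model_matrix and nhanes are assumed available, as in the chapter's other cells.
y, X = model_matrix("MeanBloodPressure ~ Age", nhanes)
ols_fit = sm.OLS(y, X).fit()

influence = ols_fit.get_influence()
student_resid = influence.resid_studentized_external   # |value| > 3 is a common flag
cooks_d, cooks_p = influence.cooks_distance            # large distances mark influential points

# Robust regression with a Huber norm downweights large residuals
# instead of letting them dominate the fit.
huber_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()
print(huber_fit.params)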
3.2.3 Predictors are not collinear
Collinearity is the situation in which two or more predictors are linearly related to each other. If we put collinear predictors into our regression model, they provide redundant information and can degrade predictive performance.
Some possible solutions:
We can detect collinearity by looking at the correlation matrix between predictors. This works well for pairwise correlations.
When a collinear relationship involves three or more predictors, pairwise methods will fail. We can use the Variance Inflation Factor (VIF) to detect this, though it doesn’t necessarily tell us which variables to remove.
Suppose that we consider several predictors from our training set:
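As a minimal sketch of both checks (the specific predictor set Age, BMI, Weight, Height is hypothetical and chosen only to illustrate the computation; nhanes and the statsmodels import sm are assumed loaded as in the chapter's other cells):

from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hedged sketch: the predictor names below are hypothetical examples.
predictors = nhanes[["Age", "BMI", "Weight", "Height"]].dropna()

# Pairwise check: correlation matrix between predictors.
print(predictors.corr())

# VIF check: how much each predictor is explained by all the others.
# A VIF well above ~5-10 is a common rule of thumb for collinearity.
X = sm.add_constant(predictors)
for i, name in enumerate(predictors.columns, start=1):
    print(name, variance_inflation_factor(X.values, i))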
3.2.4 Number of predictors is less than the number of samples
Sometimes, in machine learning, we have more predictors than the number of samples. This is called a high dimensional problem. Our regression method will not work here and we need to find ways to reduce the number of predictors.
We recommend evaluating the assumptions of your linear regression model on the training set before evaluating the model on the test set.
3.3 Extensions of the Linear Model
3.3.1 Polynomial Regression
Let’s revisit the non-linearity of the response-predictor relationship. To deal with the slight curve in the residual plot, we can extend our model to accommodate non-linear relationships via Polynomial Regression.
Here is what polynomial regression is capable of, visually:
\[
MeanBloodPressure= \beta_0 + \beta_1 \cdot Age
\]
We can include transformed versions of the predictors to obtain other shapes, such as a quadratic shape:
\[
MeanBloodPressure= \beta_0 + \beta_1 \cdot Age + \beta_2 \cdot Age^2
\]
This is still a linear model (it is linear in the coefficients \(\beta\)) – we have simply added a new predictor, \(Age^2\), that gives us a quadratic shape. We use the poly() function to generate our polynomial predictor.
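As a minimal sketch, reusing the model_matrix helper and the poly() syntax that appear in the code later in this chapter (nhanes and the usual imports are assumed loaded):

# Hedged sketch: fit a quadratic model of Age.
# poly(Age, degree=2) generates the transformed predictor for us.
y, X = model_matrix("MeanBloodPressure ~ poly(Age, degree=2)", nhanes)
quad_model = sm.OLS(y, X).fit()

# Residual plot for the quadratic fit, to re-check the linearity assumption.
predicted = quad_model.predict(X)
residuals = y.MeanBloodPressure - predicted
plt.scatter(predicted, residuals, alpha=.5)
plt.axhline(0, color="red")
plt.xlabel('Predicted Mean Blood Pressure')
plt.ylabel('Residual')
plt.show()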
Okay, looks better! Less of a curved trend in the residual plot.
In general, it is rather unusual to see the polynomial degree grow beyond 4, as higher-degree polynomials become overly flexible and take on strange shapes. We will take a look at the consequences of this in a moment.
3.3.2 Interactions
Suppose we think that \(BMI\) and \(Gender\) may be good predictors of \(MeanBloodPressure\):
\[
MeanBloodPressure= \beta_0 + \beta_1 \cdot BMI + \beta_2 \cdot Gender
\]
Let’s explore the relationship between \(MeanBloodPressure\) and \(BMI\) separately for values of \(Gender\).
According to our model, the relationship between \(BMI\) and \(MeanBloodPressure\) is a straight line with slope \(\beta_1\), and the additional predictor \(Gender\) changes our prediction only by a constant, \(\beta_2\). Visually, that would look like two parallel lines, with \(\beta_2\) dictating the distance between them. However, this plot suggests that our original model isn’t quite right: the additional predictor \(Gender\) changes our prediction by more than a constant - its effect depends on \(BMI\) as well.
When multiple predictors have a synergistic effect on the outcome, their effect on the outcome occurs jointly - this is called an Interaction. To incorporate this into our model, we add an interaction term:
\[
MeanBloodPressure= \beta_0 + \beta_1 \cdot BMI + \beta_2 \cdot Gender + \beta_3 \cdot BMI \cdot Gender
\]
This creates distinct, non-parallel lines depending on the value of \(Gender\).
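As a minimal sketch of fitting this model (model_matrix and nhanes assumed as before; writing the interaction explicitly as BMI:Gender, which in this style of formula syntax should be equivalent to the usual BMI * Gender shorthand):

# Hedged sketch: main effects plus an interaction term.
y, X = model_matrix("MeanBloodPressure ~ BMI + Gender + BMI:Gender", nhanes)
interaction_model = sm.OLS(y, X).fit()
print(interaction_model.params)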
3.4 Overfitting
Last week, we discussed how a model learns to recognize patterns in whatever data it is trained on, and sometimes it recognizes patterns that are specific to that data and not reproducible anywhere else. This is called Overfitting, and it is why we constructed separate training and testing datasets to identify this phenomenon.
Let’s look at overfitting in more detail: there can be different magnitudes of overfitting. The more flexible the model we employ, the higher the risk of overfitting, because flexible models can identify patterns too specific to the training data that do not generalize to the test data. For instance, linear regression is a fairly inflexible approach, because it just uses a straight line to model the data. However, if we use polynomial regression, especially with higher-degree polynomials, the model becomes more flexible, with a higher risk of overfitting.
Let’s take a look at our first model again:
\[
MeanBloodPressure= \beta_0 + \beta_1 \cdot Age
\]
# Use a small part of the data to illustrate overfitting.
nhanes_tiny = nhanes.sample(n=300, random_state=2)
y, X = model_matrix("MeanBloodPressure ~ BMI", nhanes_tiny)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
linear_model = sm.OLS(y_train, X_train).fit()

plt.clf()
fig, (ax1, ax2) = plt.subplots(2, layout='constrained')

# Top panel: fit and error on the training set.
ax1.plot(X_train.BMI, linear_model.predict(X_train), label="fitted line")
ax1.scatter(X_train.BMI, y_train, alpha=.5, color="brown", label="Training set")
ax1.set(xlabel='BMI', ylabel='Mean Blood Pressure')
ax1.set_xlim(np.min(nhanes_tiny.BMI), np.max(nhanes_tiny.BMI))
ax1.set_ylim(np.min(nhanes_tiny.MeanBloodPressure), np.max(nhanes_tiny.MeanBloodPressure))
train_err = np.mean((linear_model.predict(X_train) - y_train.MeanBloodPressure) ** 2)
ax1.set_title('Training Error: ' + str(round(train_err, 2)))

# Bottom panel: predictions and error on the held-out testing set.
ax2.plot(X_test.BMI, linear_model.predict(X_test), label="fitted line")
ax2.scatter(X_test.BMI, y_test, alpha=.5, color="brown", label="Testing set")
ax2.set(xlabel='BMI', ylabel='Mean Blood Pressure')
ax2.set_xlim(np.min(nhanes_tiny.BMI), np.max(nhanes_tiny.BMI))
ax2.set_ylim(np.min(nhanes_tiny.MeanBloodPressure), np.max(nhanes_tiny.MeanBloodPressure))
test_err = np.mean((linear_model.predict(X_test) - y_test.MeanBloodPressure) ** 2)
ax2.set_title('Testing Error: ' + str(round(test_err, 2)))

plt.show()
We see that Training Error < Testing Error.
Let’s look at what happens if we increase the flexibility of the model by fitting polynomials of increasing degree:
# Fit polynomials of degree 1 through 9 and record the training and testing error.
train_err = []
test_err = []
polynomials = list(range(1, 10))

for p_degree in polynomials:
    if p_degree == 1:
        y, X = model_matrix("MeanBloodPressure ~ BMI", nhanes_tiny)
    else:
        y, X = model_matrix("MeanBloodPressure ~ poly(BMI, degree=" + str(p_degree) + ")", nhanes_tiny)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
    model = sm.OLS(y_train, X_train)
    linear_model = model.fit()
    train_err.append(np.mean((linear_model.predict(X_train) - y_train.MeanBloodPressure) ** 2))
    test_err.append(np.mean((linear_model.predict(X_test) - y_test.MeanBloodPressure) ** 2))

# Plot error as a function of polynomial degree.
plt.clf()
plt.plot(polynomials, train_err, color="blue", label="Training Error")
plt.plot(polynomials, test_err, color="red", label="Testing Error")
plt.xlabel('Polynomial Degree')
plt.ylabel('Error')
plt.legend()
plt.show()
As our Polynomial Degree increased, the following happened:
In the linear model, we see that the Training Error is fairly high, and the Testing Error is even higher. This makes sense, as the model does not generalize as well to the testing set.
As the degrees increased, both training and testing error decreased slightly.
After degree 4, we see that the Training Error continued to decrease, but the Testing Error blew up! This is an example of Overfitting, in which our model fitted the shape of the training set so well that it fails to generalize to the testing set at all.
We want to find a model that is “just right”: one that neither underfits nor overfits the data. Usually, as the model becomes more flexible, the Training Error keeps decreasing, while the Testing Error decreases for a while before increasing again. It seems that our ideal prediction model is around a polynomial of degree 4, where the Testing Error is minimal.
3.4.1 Another example
Here is another illustration of the phenomena, using synthetic controlled data:
The left panel shows the Training Data as black dots, along with three fitted models: linear regression (orange line) and two other models of increasing flexibility (blue and green curves).
The right panel shows the training error (grey curve) and the testing error (red curve) as flexibility increases, with the error of each of the three models from the left panel marked in its respective color.
Source: An Introduction to Statistical Learning, Ch. 2, by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, Jonathan Taylor.
Hopefully you are starting to see the importance of examining the Testing Error rather than the Training Error to evaluate our model. A highly flexible model will overfit the training data and make the Training Error look small, but it will not generalize to the Testing data, where the Testing Error will be large.
3.5 Appendix: Inference
For this course, we focus on prediction from our machine learning models. These models have an equally important usage in statistical inference. This appendix gives a quick overview of what that is about.
3.5.1 Population and Sample
The way we formulate machine learning models is based on some fundamental concepts in inferential statistics. We will review these quickly in the context of our problem. Recall the following definitions:
Population: The entire collection of individual units that a researcher is interested in studying. For NHANES, this could be the entire US population.
Sample: A smaller collection of individual units that the researcher has selected to study. For NHANES, this could be a random sampling of the US population.
Now let’s examine the function \(f()\)’s trained values, which are called parameters.
For the Linear Model, the model we fitted above was of the following form:
\[
MeanBloodPressure=\beta_0 + \beta_1 \cdot BMI
\]
which is an equation of a line.
\(\beta_0\) is a parameter describing the intercept of the line, and \(\beta_1\) is a parameter describing the slope of the line.
Suppose that from fitting the model on the Training Set, we obtain \(\beta_1=2\). That means increasing \(BMI\) by 1 is predicted to increase \(MeanBloodPressure\) by 2; for example, a person with a BMI of 26 would be predicted to have a \(MeanBloodPressure\) 2 units higher than an otherwise identical person with a BMI of 25. This measures the strength of association between a predictor and the outcome.
Let’s see this in practice:
import statsmodels.api as sm

y, X = model_matrix("MeanBloodPressure ~ BMI", nhanes_tiny)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
linear_model = sm.OLS(y_train, X_train).fit()
linear_model.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:      MeanBloodPressure   R-squared:                       0.078
Model:                            OLS   Adj. R-squared:                  0.071
Method:                 Least Squares   F-statistic:                     10.70
Date:                Wed, 18 Feb 2026   Prob (F-statistic):            0.00138
Time:                        22:38:51   Log-Likelihood:                -502.67
No. Observations:                 128   AIC:                             1009.
Df Residuals:                     126   BIC:                             1015.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     69.8978      4.044     17.284      0.000      61.895      77.901
BMI            0.4600      0.141      3.271      0.001       0.182       0.738
==============================================================================
Omnibus:                        2.841   Durbin-Watson:                   2.019
Prob(Omnibus):                  0.242   Jarque-Bera (JB):                2.458
Skew:                          -0.190   Prob(JB):                        0.293
Kurtosis:                       3.562   Cond. No.                         106.
==============================================================================
Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Based on the output, \(\beta_0=69.90\) and \(\beta_1=0.46\). We also see the associated standard errors, p-values, and confidence intervals. These are necessary to report and interpret because we derive the parameters from a sample of the data (here, the training set), so there is statistical uncertainty associated with them. For instance, the 95% confidence interval for the true population slope \(\beta_1\) is (0.18, 0.74). Notice that we used a different package, statsmodels, to look at the model inference.
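If we want these quantities programmatically rather than reading them off the summary table, the fitted statsmodels results object exposes them directly; a minimal sketch, using the linear_model fitted above:

# Hedged sketch: pull the estimates and their uncertainty from the fitted model.
print(linear_model.params)      # beta_0 (Intercept) and beta_1 (BMI)
print(linear_model.bse)         # standard errors
print(linear_model.pvalues)     # p-values
print(linear_model.conf_int())  # 95% confidence intervals by default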