3  Regression

In the second week of class, we will look at linear regression in more depth, and examine the challenge of over-fitting.

3.1 Linear Regression in depth

Last week, we were introduced to the Linear Regression model. Today we will take it apart and look at it carefully to see how it works and what assumptions are needed to use it.

3.1.1 One predictor

Suppose we use just one predictor, \(Age\), to predict our outcome \(MeanBloodPressure\).

\[ MeanBloodPressure= \beta_0 + \beta_1 \cdot Age \]

Fit to our Training data, our model would look like the red line below:

import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from formulaic import model_matrix
from sklearn import linear_model
import statsmodels.api as sm


# Load NHANES, drop duplicated rows, and compute the outcome: mean blood
# pressure, approximated as diastolic + (systolic - diastolic) / 3
nhanes = pd.read_csv("classroom_data/NHANES.csv")
nhanes.drop_duplicates(inplace=True)
nhanes['MeanBloodPressure'] = nhanes['BPDiaAve'] + (nhanes['BPSysAve'] - nhanes['BPDiaAve']) / 3
nhanes_train, nhanes_test = train_test_split(nhanes, test_size=0.2, random_state=42)

# Build the design matrix from a formula, fit the model, and get fitted values
y_train, X_train = model_matrix("MeanBloodPressure ~ Age", nhanes_train)
linear_reg = linear_model.LinearRegression()
linear_reg = linear_reg.fit(X_train, y_train)
y_train_predicted = linear_reg.predict(X_train)

plt.clf()
plt.scatter(X_train.Age, y_train, alpha=.2)
plt.plot(X_train.Age, y_train_predicted, color="red")
plt.xlabel('Age')
plt.ylabel('Mean Blood Pressure')
plt.show()

This model is formed by finding the line of best fit, determined by minimizing the sum of squared differences between the observed response (blue points) and the predicted response (red line). Using the terminology from last week, we say that we fit a line where the Mean Squared Error (MSE) is minimized.
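
As a quick sketch (assuming y_train and y_train_predicted from the block above), we can compute that training MSE by hand:

# Average squared difference between observed and predicted response
train_mse = np.mean((np.asarray(y_train).ravel() - y_train_predicted.ravel()) ** 2)
print("Training MSE:", round(float(train_mse), 2))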

Another illustration:

Image source: https://kenndanielso.github.io/mlrefined/blog_posts/8_Linear_regression/8_1_Least_squares_regression.html

Left panel: difference between response and predicted response (residual). Right panel: squared difference between predicted response and response.

3.2 Assumptions of linear regression

Any model that one uses has some assumptions about the data that allow the model to make good predictions. Note that there are other types of assumptions if your modeling technique is focused on inference.

Let’s take a look at the assumptions needed for a sound model, and what we can do when they are not upheld.

3.2.1 Linearity of the response-predictor relationship

The linear regression model assumes that there is a straight-line (linear) relationship between the predictors and the response. It doesn’t require the relationship to be perfectly linear, but rather that, on average, the cloud of points has a linear shape. If that is not true, then our predictions are going to be less accurate.

To check for this relationship, we calculate the residuals, which are the differences between the response values and the predicted response values (similar to the model performance metric we examined last week). Then, we can make a residual plot of the predicted response vs. the residual. Ideally, this residual plot should have no pattern - some residuals above 0, some below 0, but no strong trend.

If there’s a trend in the data, that means there are non-linear associations between some of the predictors and the response.

residual = y_train - y_train_predicted

plt.clf()
plt.scatter(y_train_predicted, residual, alpha=.2)
plt.xlabel('Predicted response')
plt.ylabel('Residual')
plt.show()

We see there’s a slight curve in our residual plot. We will look at ways to deal with this later in this lecture.

3.2.2 No Outliers

An outlier is an observation for which the response is far from the predicted response (y-axis). An observation has high leverage if it has an unusual predictor value (x-axis). Outliers and high leverage observations may arise from incorrect measurements, among many other causes. When these observations cause significant changes to the regression model, they are called influential. They may contribute greatly to the Mean Squared Error, as observations far from the majority of the data have large squared residuals.

Some possible solutions:

  • We can eyeball potential influential points during exploratory data analysis and see how the model changes if we remove them. We may also troubleshoot the instruments that generated the data in the first place to diagnose the issue.

  • We can detect an influential point by computing studentized residuals or Cook’s distance and decide whether it makes sense to remove it.

  • We can use a different linear regression method, called Huber loss regression, that is more tolerant of outliers (see the sketch after this list).
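
Here is a minimal sketch of the last two ideas, assuming the X_train and y_train design matrices from the MeanBloodPressure ~ Age fit above; this is one possible workflow, not the only one:

from sklearn.linear_model import HuberRegressor

# Flag potentially influential points with Cook's distance (statsmodels)
ols_fit = sm.OLS(y_train, X_train).fit()
cooks_d, _ = ols_fit.get_influence().cooks_distance
print("Row positions with the largest Cook's distance:", np.argsort(cooks_d)[-5:])

# Refit with Huber loss, which penalizes large residuals linearly rather than quadratically
huber_reg = HuberRegressor().fit(X_train, np.asarray(y_train).ravel())
print("Huber coefficients:", huber_reg.coef_)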

3.2.3 Predictors are not collinear

Collinearity is the situation where two or more predictors are linearly related to each other. If we put collinear predictors into our regression model, they serve as redundant information and can degrade predictive performance.

Some possible solutions:

  • We can detect collinearity by looking at the correlation matrix between predictors. This works well for pairwise correlations.

  • When there is a collinear relationship between three or more predictors, pairwise methods will fail. We may consider the Variance Inflation Factor (VIF) to detect them, though it doesn’t necessarily recommend which variables to remove (see the sketch at the end of this subsection).

Suppose that we consider the predictors of our training set:

# Some cleanup: convert text columns to categories, then encode the categories
# as numeric codes so they can be included in the correlation matrix
obj_columns = nhanes_train.select_dtypes(['object']).columns
nhanes_train[obj_columns] = nhanes_train[obj_columns].apply(lambda x: x.astype('category'))

cat_columns = nhanes_train.select_dtypes(['category']).columns
nhanes_train[cat_columns] = nhanes_train[cat_columns].apply(lambda x: x.cat.codes)

#create correlation matrix
corr_matrix = nhanes_train.corr()

plt.clf()
sns.heatmap(corr_matrix, cmap='coolwarm')
plt.show()

Let’s look at a pair of predictors up close:

plt.clf()
ax = sns.regplot(y="Age", x="BMI", data=nhanes_train, lowess=True, scatter_kws={'alpha':0.1}, line_kws={'color':"r"})
ax.set_xlim([10, 50])
plt.show()
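
As a complement to the correlation matrix, here is a sketch of the Variance Inflation Factor mentioned above, using statsmodels; the predictor columns chosen here (Age, BMI, Weight, Height) are an assumed illustrative subset of the NHANES data:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# Build a small numeric design matrix with an intercept, then compute one VIF
# per column; values well above ~5-10 are a common warning sign of collinearity
predictors = sm.add_constant(nhanes_train[["Age", "BMI", "Weight", "Height"]].dropna())
vif = pd.Series(
    [variance_inflation_factor(predictors.values, i) for i in range(predictors.shape[1])],
    index=predictors.columns,
)
print(vif)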

3.2.4 Number of predictors is less than the number of samples

Sometimes, in machine learning, we have more predictors than samples. This is called a high-dimensional problem. Our regression method will not work here (there is no longer a unique least-squares solution), and we need to find ways to reduce the number of predictors.
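
A quick sanity check is to compare the shape of the design matrix against the number of rows; this sketch assumes the X_train from the earlier fit:

# OLS needs more samples (rows) than predictors (columns) for a unique solution
n_samples, n_predictors = X_train.shape
print(n_samples, "samples,", n_predictors, "predictors")
assert n_samples > n_predictors, "High-dimensional setting: reduce the number of predictors"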

We recommend evaluating the assumptions of your linear regression model on the training set before evaluating the model on the test set.

3.3 Extensions of the Linear Model

3.3.1 Polynomial Regression

Let’s revisit the non-linearity of the response-predictor relationship. To deal with the slight curve in the residual plot, we can extend our model to accommodate non-linear relationships via Polynomial Regression.

Here is what polynomial regression is capable of, visually:

Source: https://madrury.github.io/jekyll/update/statistics/2017/08/04/basis-expansions.html

Recall our original equation:

\[ MeanBloodPressure= \beta_0 + \beta_1 \cdot Age \]

We can include transformed versions of the predictors to give the fit other shapes, such as a quadratic shape:

\[ MeanBloodPressure= \beta_0 + \beta_1 \cdot Age + \beta_2 \cdot Age^2 \]

This is still a linear model – we have added a new predictor that gives us a quadratic shape. We use the poly() function to generate our polynomial predictor.

y_train, X_train = model_matrix("MeanBloodPressure ~ poly(Age, degree=2, raw=True)", nhanes_train)

linear_reg = linear_model.LinearRegression()
linear_reg = linear_reg.fit(X_train, y_train)
y_train_predicted = linear_reg.predict(X_train)

plt.clf()
plt.scatter(X_train[X_train.columns[1]], y_train, alpha=.2)
plt.scatter(X_train[X_train.columns[1]], y_train_predicted, color="red")
plt.xlabel('Age')
plt.ylabel('Mean Blood Pressure')
plt.show()

Let’s look at our Residual Plot:

residual = y_train - y_train_predicted

plt.clf()
plt.scatter(y_train_predicted, residual, alpha=.2)
plt.xlabel('Predicted response')
plt.ylabel('Residual')
plt.show()

Okay, looks better! Less of a curved trend in the residual plot.

In general, it is rather unusual to see the polynomial degree go beyond 4, as higher-degree polynomials become overly flexible and take on strange shapes. We will take a look at the consequences of this in a moment.

3.3.2 Interactions

Suppose we think that \(BMI\) and \(Gender\) may be good predictors of \(MeanBloodPressure\).

Let’s explore the relationship between \(MeanBloodPressure\) and \(BMI\) separately for values of \(Gender\).

plt.clf()
ax = sns.lmplot(y="MeanBloodPressure", x="BMI", hue="Gender", data=nhanes_train, lowess=False, scatter_kws={'alpha':0.1})
ax.set(xlim=(10, 50)) 
plt.show()

The only model we know that relates all of these variables is:

\[ MeanBloodPressure= \beta_0 + \beta_1 \cdot BMI + \beta_2 \cdot Gender \]

According to this model, the relationship between \(BMI\) and \(MeanBloodPressure\) is a straight line with slope \(\beta_1\), and the additional predictor \(Gender\) changes our prediction by only a constant, \(\beta_2\). Visually, that would look like two parallel lines, with \(\beta_2\) dictating the distance between them. However, the plot suggests that this model isn’t quite right: the effect of \(Gender\) on our prediction is more than a constant shift - it also depends on \(BMI\).
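
To see those parallel lines concretely, here is a sketch (not part of the original lecture code) that fits the additive model from the equation above and plots its predictions:

# Fit MeanBloodPressure ~ BMI + Gender (no interaction): Gender only shifts the line
y_add, X_add = model_matrix("MeanBloodPressure ~ BMI + Gender", nhanes_train)
additive_reg = linear_model.LinearRegression().fit(X_add, y_add)
y_add_predicted = additive_reg.predict(X_add)

plt.clf()
plt.scatter(X_add.BMI, y_add, alpha=.2)
plt.scatter(X_add.BMI, y_add_predicted, color="red")
plt.xlabel('BMI')
plt.ylabel('Mean Blood Pressure')
plt.show()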

When multiple predictors have a synergistic effect on the outcome - that is, their effect on the outcome occurs jointly - this is called an Interaction. To incorporate this into our model, we add an interaction term:

\[ MeanBloodPressure= \beta_0 + \beta_1 \cdot BMI + \beta_2 \cdot Gender + \beta_3 \cdot BMI \cdot Gender \]

Let’s see what happens:

y_train, X_train = model_matrix("MeanBloodPressure ~ BMI + Gender + BMI*Gender", nhanes_train)
linear_reg = linear_model.LinearRegression()
linear_reg = linear_reg.fit(X_train, y_train)
y_train_predicted = linear_reg.predict(X_train)

plt.clf()
plt.scatter(X_train.BMI, y_train, alpha=.2)
plt.scatter(X_train.BMI, y_train_predicted, color="red")
plt.xlabel('BMI')
plt.ylabel('Mean Blood Pressure')
plt.show()

This creates distinct, non-parallel lines depending on the value of \(Gender\).
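
To see how the interaction enters the fitted model, here is a small sketch that prints each design-matrix column next to its estimated coefficient (assuming linear_reg and X_train from the block above):

# The interaction column gets its own coefficient (beta_3 in the equation above)
for name, coef in zip(X_train.columns, linear_reg.coef_.ravel()):
    print(f"{name}: {coef:.3f}")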

3.4 Overfitting

Last week, we discussed that as a model learns from data, it will learn to recognize patterns, and sometimes it will recognize patterns that are specific to that data and not reproducible anywhere else. This is called Overfitting, and it is why we constructed separate training and testing datasets: to identify this phenomenon.

Let’s look at overfitting in more detail: there can be different magnitudes of overfitting. The more flexible the model we employ, the higher the risk of overfitting, because flexible models can identify patterns that are too specific to the training data and do not generalize to the test data. For instance, linear regression is a fairly inflexible approach, because it just uses a straight line to model the data. However, if we use polynomial regression, especially with higher-degree polynomials, the model becomes more flexible, with a higher risk of overfitting.

Let’s take a look at a simple linear model again, this time using \(BMI\) as the single predictor:

\[ MeanBloodPressure= \beta_0 + \beta_1 \cdot BMI \]

# Use a small part of the data to illustrate overfitting.
nhanes_tiny = nhanes.sample(n=300, random_state=2)

y, X = model_matrix("MeanBloodPressure ~ BMI", nhanes_tiny)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
linear_model = sm.OLS(y_train, X_train).fit()

plt.clf()
fig, (ax1, ax2) = plt.subplots(2, layout='constrained')

ax1.plot(X_train.BMI, linear_model.predict(X_train), label="fitted line")
ax1.scatter(X_train.BMI, y_train, alpha=.5, color="brown", label="Training set")
ax1.set(xlabel='BMI', ylabel='Mean Blood Pressure')
ax1.set_xlim(np.min(nhanes_tiny.BMI), np.max(nhanes_tiny.BMI))
ax1.set_ylim(np.min(nhanes_tiny.MeanBloodPressure), np.max(nhanes_tiny.MeanBloodPressure))
train_err = np.mean((linear_model.predict(X_train) - y_train.MeanBloodPressure) ** 2)
ax1.set_title('Training Error: ' + str(round(train_err, 2)))

ax2.plot(X_test.BMI, linear_model.predict(X_test), label="fitted line")
ax2.scatter(X_test.BMI, y_test, alpha=.5, color="brown", label="Testing set")
ax2.set(xlabel='BMI', ylabel='Mean Blood Pressure')
ax2.set_xlim(np.min(nhanes_tiny.BMI), np.max(nhanes_tiny.BMI))
ax2.set_ylim(np.min(nhanes_tiny.MeanBloodPressure), np.max(nhanes_tiny.MeanBloodPressure))
test_err = np.mean((linear_model.predict(X_test) - y_test.MeanBloodPressure) ** 2)
ax2.set_title('Testing Error: ' + str(round(test_err, 2)))

plt.show()

We see that Training Error < Testing Error.

Let’s look at what happens if we increase the flexibility of the model by fitting a degree-2 polynomial:

p_degree = 2
y, X = model_matrix("MeanBloodPressure ~ BMI + poly(BMI, degree=" + str(p_degree) + ")", nhanes_tiny)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# Keep a copy of raw BMI for plotting, then drop it so that only the intercept
# and polynomial terms enter the model
X_train_BMI = X_train.BMI
X_test_BMI = X_test.BMI
X_train.drop('BMI', axis=1, inplace=True)
X_test.drop('BMI', axis=1, inplace=True)

model = sm.OLS(y_train, X_train)
linear_model = model.fit()

plt.clf()
fig, (ax1, ax2) = plt.subplots(2, layout='constrained')

ax1.scatter(X_train_BMI, y_train, alpha=.5, color="brown", label="Training set")
ax1.scatter(X_train_BMI, linear_model.predict(), label="fitted line")

ax1.set(xlabel='BMI', ylabel='Mean Blood Pressure')
train_err = np.mean((linear_model.predict(X_train) - y_train.MeanBloodPressure) ** 2)
ax1.set_title('Training Error: ' + str(round(train_err, 2)))
ax1.set_xlim(np.min(nhanes_tiny.BMI), np.max(nhanes_tiny.BMI))
ax1.set_ylim(np.min(nhanes_tiny.MeanBloodPressure), np.max(nhanes_tiny.MeanBloodPressure))

ax2.scatter(X_test_BMI, y_test, alpha=.5, color="brown", label="Testing set")
ax2.scatter(X_test_BMI, linear_model.predict(X_test), label="fitted line")
ax2.set(xlabel='BMI', ylabel='Blood Pressure')
test_err = np.mean((linear_model.predict(X_test) - y_test.MeanBloodPressure) ** 2)
ax2.set_title('Testing Error: ' + str(round(test_err, 2)))
ax2.set_xlim(np.min(nhanes_tiny.BMI), np.max(nhanes_tiny.BMI))
ax2.set_ylim(np.min(nhanes_tiny.MeanBloodPressure), np.max(nhanes_tiny.MeanBloodPressure))

fig.suptitle('Polynomial Degree: ' + str(p_degree))
plt.show()

We see that both the Training and Testing Errors decreased slightly!

What happens if we keep increasing the model complexity?

for p_degree in [4, 10]:
  y, X = model_matrix("MeanBloodPressure ~ BMI + poly(BMI, degree=" + str(p_degree) + ")", nhanes_tiny)
  
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
  
  X_train_BMI = X_train.BMI
  X_test_BMI = X_test.BMI
  X_train.drop('BMI', axis=1, inplace=True)
  X_test.drop('BMI', axis=1, inplace=True)
  
  model = sm.OLS(y_train, X_train)
  linear_model = model.fit()
  
  plt.clf()
  fig, (ax1, ax2) = plt.subplots(2, layout='constrained')
  
  ax1.scatter(X_train_BMI, y_train, alpha=.5, color="brown", label="Training set")
  ax1.scatter(X_train_BMI, linear_model.predict(), label="fitted line")
  
  ax1.set(xlabel='BMI', ylabel='Mean Blood Pressure')
  train_err = np.mean((linear_model.predict(X_train) - y_train.MeanBloodPressure) ** 2)
  ax1.set_title('Training Error: ' + str(round(train_err, 2)))
  ax1.set_xlim(np.min(nhanes_tiny.BMI), np.max(nhanes_tiny.BMI))
  ax1.set_ylim(np.min(nhanes_tiny.MeanBloodPressure), np.max(nhanes_tiny.MeanBloodPressure))

  ax2.scatter(X_test_BMI, y_test, alpha=.5, color="brown", label="Testing set")
  ax2.scatter(X_test_BMI, linear_model.predict(X_test), label="fitted line")
  ax2.set(xlabel='BMI', ylabel='Mean Blood Pressure')
  test_err = np.mean((linear_model.predict(X_test) - y_test.MeanBloodPressure) ** 2)
  ax2.set_title('Testing Error: ' + str(round(test_err, 2)))
  ax2.set_xlim(np.min(nhanes_tiny.BMI), np.max(nhanes_tiny.BMI))
  ax2.set_ylim(np.min(nhanes_tiny.MeanBloodPressure), np.max(nhanes_tiny.MeanBloodPressure))

  fig.suptitle('Polynomial Degree: ' + str(p_degree))
  plt.show()

Let’s summarize it:

train_err = []
test_err = []
polynomials = list(range(1, 10))

for p_degree in polynomials:
  if p_degree == 1:
    y, X = model_matrix("MeanBloodPressure ~ BMI", nhanes_tiny)
  else:
    y, X = model_matrix("MeanBloodPressure ~ poly(BMI, degree=" + str(p_degree) + ")", nhanes_tiny)
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
  
  model = sm.OLS(y_train, X_train)
  linear_model = model.fit()

  train_err.append(np.mean((linear_model.predict(X_train) - y_train.MeanBloodPressure) ** 2))
  test_err.append(np.mean((linear_model.predict(X_test) - y_test.MeanBloodPressure) ** 2))
  
plt.clf()
plt.plot(polynomials, train_err, color="blue", label="Training Error")
plt.plot(polynomials, test_err, color="red", label="Testing Error")
plt.xlabel('Polynomial Degree')
plt.ylabel('Error')
plt.legend()
plt.show()  

As our Polynomial Degree increased, the following happened:

  • In the linear model, we see that the Training Error is fairly high, and the Testing Error is even higher. This makes sense, as the model does not generalize as well to the testing set.

  • As the degrees increased, both training and testing error decreased slightly.

  • After degree 4, we see that the Training Error continued to decrease, but the Testing Error blew up! This is an example of Overfitting, in which our model fit the shape of the training set so well that it fails to generalize to the testing set at all.

We want to find a model that is “just right” - one that neither underfits nor overfits the data. Usually, as the model becomes more flexible, the Training Error keeps decreasing, while the Testing Error decreases a bit before increasing again. It seems that our ideal prediction model is around a polynomial of degree 4, where the Testing Error is minimal.
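
A small sketch to read the best degree off the curves programmatically, using the train_err, test_err, and polynomials lists computed above:

# The degree with the lowest Testing Error is our "just right" candidate
best_degree = polynomials[int(np.argmin(test_err))]
print("Lowest testing error at polynomial degree:", best_degree)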

3.4.1 Another example

Here is another illustration of the phenomenon, using controlled synthetic data:

The left panel shows the Training Data as black dots, along with three models: linear regression (orange line) and two other models of increasing complexity (blue and green).

The right panel shows the training error (grey curve), the testing error (red curve), and where each of the three models from the left panel lands on these curves, in their respective colors.

Source: An Introduction to Statistical Learning, Ch. 2, by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, Jonathan Taylor.

Lastly, we recommend examining this interactive tutorial: https://mlu-explain.github.io/bias-variance/

Hopefully you are starting to see the importance of examining the Testing Error instead of the Training Error to evaluate our model. A highly flexible model will overfit the training data and make the Training Error look small, but it will not generalize to the Testing data, which will show a high Testing Error.

3.5 Appendix: Inference

For this course, we focus on prediction from our machine learning models. These models have an equally important use in statistical inference. This appendix gives a quick overview of what that is about.

3.5.1 Population and Sample

The way we formulate a machine learning model is based on some fundamental concepts in inferential statistics. We will refresh these quickly in the context of our problem. Recall the following definitions:

Population: The entire collection of individual units that a researcher is interested in studying. For NHANES, this could be the entire US population.

Sample: A smaller collection of individual units that the researcher has selected to study. For NHANES, this could be a random sample of the US population.

If the concepts of population, sample, estimation, p-value, and confidence interval are new to you, we recommend doing a bit of reading here: https://www.nature.com/collections/qghhqm/pointsofsignificance.

3.5.2 Parameter inference

Now let’s examine the trained values of the function \(f()\), which are called parameters.

For the Linear Model, consider the model relating the outcome to \(BMI\):

\[ MeanBloodPressure=\beta_0 + \beta_1 \cdot BMI \]

which is an equation of a line.

\(\beta_0\) is a parameter describing the intercept of the line, and \(\beta_1\) is a parameter describing the slope of the line.

Suppose that from fitting the model on the Training Set, we get \(\beta_1=2\). That means an increase in \(BMI\) of 1 is associated with an increase in predicted \(MeanBloodPressure\) of 2. This measures the strength of association between a predictor and the outcome.
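
As a toy numeric illustration (the parameter values here are assumed for the example, not fitted from data):

# With beta0 = 70 and beta1 = 2, raising BMI by 1 raises the prediction by 2
beta0, beta1 = 70, 2
print(beta0 + beta1 * 25)  # predicted MeanBloodPressure at BMI = 25 -> 120
print(beta0 + beta1 * 26)  # at BMI = 26 -> 122, an increase of 2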

Let’s see this in practice:

import statsmodels.api as sm

y, X = model_matrix("MeanBloodPressure ~ BMI", nhanes_tiny)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
linear_model = sm.OLS(y_train, X_train).fit()

linear_model.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:      MeanBloodPressure   R-squared:                       0.078
Model:                            OLS   Adj. R-squared:                  0.071
Method:                 Least Squares   F-statistic:                     10.70
Date:                Wed, 18 Feb 2026   Prob (F-statistic):            0.00138
Time:                        22:38:51   Log-Likelihood:                -502.67
No. Observations:                 128   AIC:                             1009.
Df Residuals:                     126   BIC:                             1015.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     69.8978      4.044     17.284      0.000      61.895      77.901
BMI            0.4600      0.141      3.271      0.001       0.182       0.738
==============================================================================
Omnibus:                        2.841   Durbin-Watson:                   2.019
Prob(Omnibus):                  0.242   Jarque-Bera (JB):                2.458
Skew:                          -0.190   Prob(JB):                        0.293
Kurtosis:                       3.562   Cond. No.                         106.
==============================================================================


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Based on the output, \(\beta_0 \approx 69.9\) and \(\beta_1 \approx 0.46\). We also see associated standard errors, p-values, and confidence intervals. It is necessary to report and interpret these because we derive the parameters from a sample of the data (the training set), so there are statistical uncertainties associated with them. For instance, the 95% confidence interval for the true population slope is (0.18, 0.74). Notice that we used a different package, statsmodels, to look at the model inference.
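
These quantities can also be pulled out of the fitted statsmodels results object programmatically; a quick sketch:

print(linear_model.params)                # beta_0 (Intercept) and beta_1 (BMI)
print(linear_model.pvalues)               # p-values for each parameter
print(linear_model.conf_int(alpha=0.05))  # 95% confidence intervals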
