2  The Whole Game

In the first week of class, we will look at Machine Learning from an end-to-end perspective to see what the course will be about! Then, we will get our hands on the NumPy package to prepare our data for the machine learning models we build in the rest of the course.

2.1 Learning Objectives for today

  • Describe the main use cases of a Machine Learning model.

  • Describe the overarching workflow of Machine Learning:

    • Describe the importance of allocating the data into Training and Testing sets.

    • Describe some criteria for choosing predictors for a Machine Learning model.

    • Describe the process of evaluating a Machine Learning model.

  • Describe the components of a Python Data Structure.

2.2 A conceptual example

Suppose that we are given the National Health And Nutrition Examination Survey (NHANES) dataset and want to build a machine learning model to predict a person’s Mean Blood Pressure (also known as Mean Arterial Pressure), which is related to a person’s diastolic and systolic blood pressures.
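Concretely, the standard approximation, which we will compute in code later in this chapter, is:

\[ MeanBloodPressure = DiastolicBP + \frac{SystolicBP - DiastolicBP}{3} \]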

The Mean Blood Pressure is our Response Variable or Outcome we wish to make predictions on, given Predictor Variables we have in the NHANES dataset. We have ~70 predictors to consider, some of which include: Age, BMI, Average Number of Hours Slept, Gender, Income, Highest Education, Marital Status, Mental health questionnaires, and so forth.

Suppose that we decide to use the predictors \(Age\) and \(BMI\) for our machine learning model, and formulate it algebraically:

\[ MeanBloodPressure=f(Age, BMI) \]

where \(f(Age, BMI)\) is a function that represents a machine learning model that takes in the predictors \(Age\) and \(BMI\) and makes a prediction of one’s \(MeanBloodPressure\). We use some of the data to build this machine learning model:

\[ MeanBloodPressure= 20 + 3 \cdot Age - .2 \cdot BMI \]

How did we arrive at this equation form and the numbers \(20\), \(3\), and \(-.2\)? The numbers were estimated from the data, under the assumption that the equation form is linear.

Given this model, suppose we are given a person’s \(Age\) of 30 and \(BMI\) of 34. Then, we can make a prediction using this model:

\[ MeanBloodPressure=20 + 3 \cdot 30 - .2 \cdot 34 = 103.2 \]

To see how accurate our prediction is, we can compare the predicted response to the true response, if we have that data. That gives us feedback on how good our model is and lets us make improvements to it.

2.2.1 Broader Usage

A machine learning model, such as the one described above, has two main uses:

  1. Classification and Prediction (Focus of this course): How accurately can we predict or classify the outcome?

    • Prediction: Given a person’s \(Age, BMI\), predict the person’s \(MeanBloodPressure\) value. The outcome is a continuous value.
    • Classification: This is when the response is a categorical value, such as Yes/No. As an example, we can classify whether or not a person has \(Hypertension\) using the predictors \(Age\) and \(BMI\).
  2. Inference (Secondary in this course): Which predictors are associated with the response, and how strong is the association?

    • Prediction model example: Using the linear model above, each predictor has a relationship to the outcome: an increase of \(Age\) by 1 leads to an increase of \(MeanBloodPressure\) by 3. This coefficient measures the strength of association between a variable and the outcome.
    • Classification model example: What is the odds ratio of \(Age\) on \(Hypertension\)? If the odds ratio of \(Age\) on \(Hypertension\) is 2, then an increase of 1 in \(Age\) multiplies the odds of \(Hypertension\) by 2; for instance, odds of 1:4 would become 1:2.

2.3 The conceptual example, in more depth

Let’s walk through our conceptual example again, but in more depth, and with code examples. This will illustrate the complexity and the strategies involved in the Machine Learning roadmap we will explore carefully throughout the course.

2.3.1 Visualizing the outcome

Building a sound machine learning model requires careful understanding of the data, and we often start by looking at the response variable.

import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from formulaic import model_matrix
import statsmodels.api as sm

# Load the NHANES dataset and remove duplicate rows
nhanes = pd.read_csv("classroom_data/NHANES.csv")
nhanes.drop_duplicates(inplace=True)
# Mean Arterial Pressure: diastolic plus one third of the pulse pressure
nhanes['MeanBloodPressure'] = nhanes['BPDiaAve'] + (nhanes['BPSysAve'] - nhanes['BPDiaAve']) / 3

plt.clf()
g = sns.displot(x="MeanBloodPressure", data=nhanes)
g.refline(x=70, color='r', linestyle='--')
g.refline(x=90, color='r', linestyle='--')
plt.show()

We see that \(MeanBloodPressure\) is fairly symmetrically distributed. A person has normal blood pressure if their \(MeanBloodPressure\) is between 70 and ~90, the range marked by the dashed vertical lines. There seem to be more people with elevated blood pressure, as there are more counts above 80 than below 80.

Mean Arterial Pressure    Interpretation
<70                       Hypotension
70-92                     Normal
92-96                     Stage 1 Hypertension
>96                       Stage 2 Hypertension

Usually, a symmetric, continuous distribution for the response variable is a great way to start the machine learning modeling process, as many models assume a symmetric, continuous response in order to perform well.

If the response distribution is strongly skewed, then we may want to perform mathematical transformations to fix it.

If the response distribution has multiple peaks (multi-modal), we may want to perform mathematical transformations, or consider reformulating the problem as a (multi-class) classification problem, if the interpretation makes sense.
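As a minimal sketch of both fixes (hypothetical here, since our response is already symmetric; the new column names are illustrative, and the bin edges follow the blood-pressure table above):

# Hypothetical fixes, not needed for NHANES: a log transform to reduce
# right skew, and binning the response into classes using the table's thresholds.
nhanes['LogMeanBP'] = np.log(nhanes['MeanBloodPressure'])
nhanes['BPClass'] = pd.cut(nhanes['MeanBloodPressure'],
                           bins=[0, 70, 92, 96, np.inf],
                           labels=['Hypotension', 'Normal',
                                   'Stage 1 Hypertension', 'Stage 2 Hypertension'])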

2.3.2 Splitting the data

Our dataset has 7832 data points:

print(nhanes.shape)
(7832, 77)

In Machine Learning, we need to carefully allocate how we use our data. In the model development process, we reserve some of the data, called the Training Set, to allow the model to learn from the data, and reserve the rest, called the Testing Set, to test the model on new, unseen data.

A logistical and psychological challenge of Machine Learning is to not let the model know anything about the Testing Set as you develop it. As the model learns from any data, it will learn to recognize its patterns, and sometimes it will recognize patterns that are specific to that data and not reproducible anywhere else. This is called Overfitting. This is why we keep a separate, unseen testing set: to see whether the model’s performance generalizes.

In the previous section, we looked at the response data of the entire dataset before splitting; if we don’t have a lot of data, or if the distribution has multiple peaks, we may need to split the data more carefully.

Below, we split the entire data randomly into Training and Testing sets, giving 80% of the data to Training, and 20% of the data to Testing. There are other ways of splitting data to consider, in scenarios such as:

  • Time series data

  • Spatial data

  • The response data is small or has multiple peaks

but random splitting will suffice for this example.

nhanes_train, nhanes_test = train_test_split(nhanes, test_size=0.2, random_state=42)

And let’s look at the number of data points after splitting:

print("Training size:", nhanes_train.shape)
print("Testing size:", nhanes_test.shape)
Training size: (6265, 77)
Testing size: (1567, 77)
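As one example of a more careful split, here is a hedged sketch of a stratified split: we bin the response into quartiles and ask train_test_split to keep the bins proportionally represented in both sets. The variable names here are illustrative, not part of our analysis.

# Stratified split sketch: drop rows missing the response, bin it into
# quartiles, and stratify the split on those bins
complete = nhanes.dropna(subset=['MeanBloodPressure'])
bp_bins = pd.qcut(complete['MeanBloodPressure'], q=4, labels=False)
strat_train, strat_test = train_test_split(complete, test_size=0.2,
                                           random_state=42, stratify=bp_bins)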

2.3.3 Exploratory Data Analysis

Now, using only the Training Set, we try to discern which variables might be good predictors of our response, as well as how they relate to it: linearly, nonlinearly, or in some other way. There are many ways to pick predictors for a model, ranging from Exploratory Data Analysis to quantitative methods, and we will be more comprehensive later in this course.

Let’s look at the relationship between \(MeanBloodPressure\) and the potential predictor \(BMI\). We add a smooth line fit to the scatterplot because it shows the average trend between the two variables. The black dashed lines mark the range of healthy mean blood pressure from our response histogram.

plt.clf()
ax = sns.regplot(y="MeanBloodPressure", x="BMI", data=nhanes_train, lowess=True, scatter_kws={'alpha':0.2}, line_kws={'color':"r"})
ax.axhline(y=70, color='black', linestyle='--')
ax.axhline(y=90, color='black', linestyle='--')
ax.set_xlim([10, 50])
plt.show()

Okay, great: it looks like people with higher \(BMI\) tend to have higher mean blood pressure.

Let’s look at \(Age\):

plt.clf()
ax = sns.regplot(y="MeanBloodPressure", x="Age", data=nhanes_train, lowess=True, scatter_kws={'alpha':0.2}, line_kws={'color':"r"})
ax.axhline(y=70, color='black', linestyle='--')
ax.axhline(y=90, color='black', linestyle='--')
plt.show()

We see a similar trend. How about \(Gender\)?

plt.clf()
ax = sns.boxplot(x="Gender", y="MeanBloodPressure", data=nhanes_train)
ax.axhline(y=70, color='black', linestyle='--')
ax.axhline(y=90, color='black', linestyle='--')
plt.show()

Males tend to have a higher \(MeanBloodPressure\). Let’s look at one more, \(DirectChol\):

plt.clf()
ax = sns.regplot(y="MeanBloodPressure", x="DirectChol", data=nhanes_train, lowess=True, scatter_kws={'alpha':0.1}, line_kws={'color':"r"})
ax.axhline(y=70, color='black', linestyle='--')
ax.axhline(y=90, color='black', linestyle='--')
plt.show()

This has a more nonlinear trend to it.

Given our (very brief) Exploratory Data Analysis, let’s pick a few predictors that seem promising: \(BMI\), \(Age\), \(Gender\).

As we work with data that have a huge number of predictors, we will have to find ways to automate this process, called feature selection.
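As a taste, one crude automated screen (a sketch, not a method we rely on here) ranks the numeric predictors by their absolute correlation with the response on the training set:

# Rank numeric training-set columns by |correlation| with the response
corrs = nhanes_train.corr(numeric_only=True)['MeanBloodPressure'].abs()
print(corrs.sort_values(ascending=False).head(10))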

2.4 Picking a model: Linear Regression

We will explore a whole range of models in this course, but one of the most fundamental, a model that has proven robust and connects to nearly all Machine Learning models, is Linear Regression. A common strategy is to start with a simple model such as Linear Regression and build out complexity from it. So let’s see it in action.

Given our decided predictors, the model will look like this:

\[ MeanBloodPressure= \beta_0 + \beta_1 \cdot Age + \beta_2 \cdot BMI + \beta_3 \cdot Gender \]

where the unknown variables \(\beta_0\), \(\beta_1\), \(\beta_2\), \(\beta_3\), called parameters or coefficients, will be learned in the model training process.

We specify this form:

y_train, X_train = model_matrix("MeanBloodPressure ~ Age + BMI + Gender", nhanes_train)

And fit the model, which gives our parameters:

from sklearn import linear_model
linear_reg = linear_model.LinearRegression()
linear_reg = linear_reg.fit(X_train, y_train)
print(linear_reg.intercept_)
print(linear_reg.coef_)
[62.7177058]
[[0.         0.23590657 0.36493118 3.3580882 ]]
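Reading this output: the leading 0 coefficient belongs to the Intercept column that model_matrix adds (LinearRegression estimates its own intercept, so that column’s coefficient ends up zero), and the remaining coefficients line up with the \(Age\), \(BMI\), and Gender dummy columns, assuming the usual formulaic column ordering. That gives the fitted model:

\[ MeanBloodPressure \approx 62.72 + 0.24 \cdot Age + 0.36 \cdot BMI + 3.36 \cdot Gender \]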

2.5 Picking a model: Decision Tree

A different model is called a Decision Tree. It is composed of a set of hierarchical if/then statements based on the predictors, ending in nodes that dictate what the response prediction should be. Below, we fit an example Decision Tree with three levels of if/then statements:

from sklearn.tree import DecisionTreeRegressor
from sklearn import tree

# Fit a regression tree whose depth (levels of if/then splits) is capped at 3
decision_tree = DecisionTreeRegressor(max_depth=3)
decision_tree = decision_tree.fit(X_train, y_train)

# Draw the fitted tree; proportion=True shows the fraction of samples per node
tree.plot_tree(decision_tree, proportion=True)
plt.show()

Here is how we would interpret the leftmost node:

“If \(Age\) is less than 17.5, and less than 12.5, and \(BMI\) is less than 22.65, then predict 64.7 for \(MeanBloodPressure\).”
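If the plot is hard to read, here is a sketch of a text rendering of the same rules, assuming X_train’s columns carry the predictor names (as formulaic’s pandas-backed output does):

# Print the tree's if/then rules as indented text
from sklearn.tree import export_text
print(export_text(decision_tree, feature_names=list(X_train.columns)))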

2.6 Model Evaluation

Now that we have fitted our models on the training set, we can see how they perform on the testing set. In our testing set, we already know all the values of the predictors and outcomes, so to simulate a realistic scenario, we feed the predictors to the model, let it make a prediction, and compare it to the true outcome.

When we compare the predicted outcome vs. true outcome, we need to use a metric to compare the two quantities. Here are two popular metrics:

  • Mean Absolute Error (MAE): the average of the absolute difference between the predicted and true outcomes. The scale of this metric is the same as the outcome’s scale, which makes it easy to interpret.

  • Mean Squared Error (MSE): the average of the squared difference between the predicted and true outcomes. The scale of this metric is harder to interpret, but it has such nice mathematical properties for model fitting that it remains very popular.
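In symbols, for \(n\) test points with true outcomes \(y_i\) and predictions \(\hat{y}_i\):

\[ MAE = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|, \qquad MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]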

We will consider other model evaluation metrics throughout the course, especially in situations where the dataset isn’t big enough for a training/testing split.

Let’s look at our MAE of our Linear Regression model on the test data:

from sklearn.metrics import mean_absolute_error
y_test, X_test = model_matrix("MeanBloodPressure ~ Age + BMI + Gender", nhanes_test)
y_test_predicted = linear_reg.predict(X_test)

test_err = round(mean_absolute_error(y_test, y_test_predicted), 2)
test_err
8.65

Okay, on average our model is off by 8.65 on the scale of \(MeanBloodPressure\).
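If we also wanted the MSE of the same predictions, a quick sketch:

# MSE of the linear model's test-set predictions
from sklearn.metrics import mean_squared_error
print(round(mean_squared_error(y_test, y_test_predicted), 2))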

Let’s visualize this:

plt.clf()
plt.scatter(y_test_predicted, y_test, alpha=.5)
plt.axline((70, 70), slope=1, color='r', linestyle='--')
plt.xlabel('Predicted MeanBloodPressure')
plt.ylabel('True MeanBloodPressure')
plt.title('Linear Regression MAE:' + str(test_err))
plt.show()

Let’s do the same for the Regression Tree model:

y_test_predicted = decision_tree.predict(X_test)

test_err = round(mean_absolute_error(y_test, y_test_predicted), 2)
plt.clf()
plt.scatter(y_test_predicted, y_test, alpha=.5)
plt.axline((70, 70), slope=1, color='r', linestyle='--')
plt.xlabel('Predicted MeanBloodPressure')
plt.ylabel('True MeanBloodPressure')
plt.title('Regression Tree MAE:' + str(test_err))
plt.show()

The predictions look very discrete because a tree of depth 3 can produce at most \(2^3 = 8\) distinct leaf predictions.
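We can confirm the leaf count directly (a quick sketch):

# A depth-3 tree has at most 2**3 = 8 leaves
print(decision_tree.get_n_leaves())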

In a full analysis, we would look at the performance of each model and ask what can be improved.

2.7 Review of Data Structures

We will see many different data structures in this course beyond DataFrames, Series, and Lists. So let’s review how we think about learning new data structures, to make our lives easier when we encounter new ones.

For any data structure, we ask the following:

  • What does it contain (in terms of data)?

  • What can it do (in terms of functions)?

And if it “makes sense” to us, then it is a well-designed data structure.

Formally, a data structure in Python (also known as an Object) may contain the following:

  • Value that holds the essential data for the data structure.

  • Attributes that hold subset or additional data for the data structure.

  • Functions, called Methods, that belong to the data structure and implicitly take the referenced variable as an input.

Let’s see how this applies to the List:

  • Value: the contents of the list, such as [2, 3, 4].

  • Attributes that store additional values: Not relevant for lists.

  • Methods that can be used on the object: my_list.append(x)

How about a Dataframe?

  • Value: the 2-dimensional spreadsheet of the dataframe.

  • Attributes that store additional values: df.shape gives the number of rows and columns. df.my_col_name accesses the column called “my_col_name”.

  • Methods that can be used on the object: df.merge(other_df, on="column_name")
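To make this concrete, here is a small sketch illustrating values, attributes, and methods; the toy data is made up for illustration:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
print(df.shape)       # attribute: (number of rows, number of columns)
print(df.a)           # attribute-style access to the column "a"

my_list = [2, 3, 4]
my_list.append(5)     # method: modifies the list in place
print(my_list)        # [2, 3, 4, 5]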

Feel free to look at the cheatsheet on data structures from Intro to Python to refresh yourself.

2.7.1 NumPy

A new Data Structure we will work with in this course is NumPy’s ndarray (“n-dimensional array”), commonly referred to as a “NumPy Array”. It is very similar to a Dataframe, but has the following characteristics that make it well suited for building machine learning models:

  • All elements are homogeneous (for our purposes, numeric).

  • There are no column or row names.

  • Mathematical operations are optimized to be fast.

So, let’s see some examples:

  • Value: the 2-dimensional numerical table. It can actually be any dimension, but we will work with 1-dimensional (similar to a List) and 2-dimensional arrays.

  • Attributes that store additional values:

    • Two-dimensional subsetting, similar to lists: data[:5, :3] subsets the first 5 rows and first 3 columns. data[:5, [0, 2, 3]] subsets the first 5 rows and the 1st, 3rd, and 4th columns.

    • data.shape gives the shape of the NumPy Array. data.ndim will tell you the number of dimensions of the NumPy Array.

  • Methods that can be used on the object:

    • data.sum(axis=0) sums over rows (one total per column), while data.sum(axis=1) sums over columns (one total per row).
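Here is a minimal sketch of the ndarray behaviors described above, on a made-up array:

import numpy as np

data = np.arange(24).reshape(6, 4)   # a 6-row, 4-column 2-dimensional array
print(data.shape)           # (6, 4)
print(data.ndim)            # 2
print(data[:5, :3])         # first 5 rows, first 3 columns
print(data[:5, [0, 2, 3]])  # first 5 rows; 1st, 3rd, and 4th columns
print(data.sum(axis=0))     # sums over rows: one total per column
print(data.sum(axis=1))     # sums over columns: one total per row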

For this course, we often load a dataset in the Pandas Dataframe format, and then, once we pick our outcome and predictors, we transform the Dataframe into NumPy Arrays, as in the line of code we saw earlier: y_train, X_train = model_matrix("MeanBloodPressure ~ Age + BMI + Gender", nhanes_train).

We specify our outcome, predictors, and Dataframe for the model_matrix() function, and the outputs are two model matrices that behave like NumPy Arrays, one for the outcome and one for the predictors. Any downstream Machine Learning modeling works off y_train and X_train.
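If you ever need a plain NumPy Array explicitly, a quick sketch of the conversion, using the variables from earlier:

# Convert the model matrix to a plain ndarray
X_train_np = np.asarray(X_train)
print(type(X_train_np), X_train_np.shape)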

A fuller introduction can be found in NumPy’s tutorial guide.

2.7.2 What is this data structure?

If you are not sure what your variable’s data structure is, use the type() function, such as type(mystery_data), and it will tell you.

2.8 Exercises

Exercises for week 1 can be found here.