| Type | Transformation | Description |
|-----------|----------------|-------------|
| Numerical | Binarization | Convert numeric to binary. |
| Numerical | Binning | Group numeric values. |
| Numerical | Log Transformation | Manage data scale disparities. |
| Numerical | Scaling | Standardize or scale features. |
| Categorical | One-Hot Encoding | Convert categories to binary columns. |
| Categorical | Feature Hashing | Compress categories into hash vectors. |
| Temporal | Temporal Features | Convert to bins, manage time zones. |
| Missing | Imputation | Fill missing values. [L2](https://drc-cs.github.io/cs326/lectures/L02_data_sources/#/25).|
| High-Dimensional | Feature Selection | Choose relevant features. |
| High-Dimensional | Feature Sampling | Select a random subset of features to reduce the feature space. |
| High-Dimensional | Random Projection | Project features onto a lower-dimensional space using a random matrix. |
| High-Dimensional | Principal Component Analysis (PCA) | Reduce dimensions while preserving data variability. |
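As a concrete illustration of a few of these transformations, here is a minimal sketch using pandas and scikit-learn (the data and column names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer

# Hypothetical dataset with one numerical and one categorical feature.
df = pd.DataFrame({
    "income": [35_000, 52_000, 78_000, 120_000],
    "city": ["Chicago", "Evanston", "Chicago", "Skokie"],
})

# Scaling: standardize the numerical feature to zero mean, unit variance.
income_scaled = StandardScaler().fit_transform(df[["income"]])

# Log transformation: compress large scale disparities.
income_log = np.log1p(df["income"])

# Binning: group numeric values into discrete intervals.
income_binned = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform").fit_transform(df[["income"]])

# One-hot encoding: convert the categorical feature into binary columns.
city_onehot = pd.get_dummies(df["city"])
```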
| Method | Description | When to Use |
| --- | --- | --- |
| Forward / backward fill | Fill missing value using the last / next valid value. | Time Series |
| Imputation by interpolation | Use interpolation to estimate missing values. | Time Series |
| Mean value imputation | Fill missing value with mean from column. | Random missing values |
| Conditional mean imputation | Estimate mean from other variables in the dataset. | Random missing values |
| Random imputation | Sample random values from a column. | Random missing values |
| KNN imputation | Use K-nearest neighbors to fill missing values. | Random missing values |
| Multiple Imputation | Uses many regression models and other variables to fill missing values. | Random missing values |
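A minimal sketch of a few of these imputation strategies using scikit-learn and pandas (the arrays below are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical feature matrix with missing values.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [4.0, np.nan]])

# Mean value imputation: fill each missing value with its column mean.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# KNN imputation: fill missing values from the k nearest rows.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# Forward fill and interpolation for time series (pandas).
s = pd.Series([1.0, np.nan, np.nan, 4.0])
s_ffill = s.ffill()
s_interp = s.interpolate()
```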
## High-Dimensional Data | Why Reduce Dimensions?
- **Curse of Dimensionality**: As the number of features increases, the amount of data required to cover the feature space grows exponentially.
- **Overfitting**: High-dimensional data is more likely to overfit the model, leading to poor generalization.
- **Computational Complexity**: High-dimensional data requires more computational resources to process.
- **Interpretability**: High-dimensional data can be difficult to interpret and visualize.
## High-Dimensional Data | The Curse of Dimensionality
**tldr;** As the number of features increases, the amount of data required to cover the feature space grows exponentially. This can lead to overfitting and poor generalization.
**Intuition**: Consider covering a feature space with cells that are 0.1 wide along each axis. In 1D this takes 10 cells, in 2D it takes $10^2 = 100$ cells, in 3D it takes $10^3 = 1{,}000$ cells, and with 10 features it would take $10^{10}$ cells. The number of cells needed to cover the space grows exponentially with the number of dimensions, so without an exponentially growing number of data points, a model must make more and more inferences about regions it has never seen.
**Takeaway**: For machine learning, this means that as the number of features increases, the amount of data required to cover the feature space grows exponentially. Since we rarely have that much data, models trained on high-dimensional data are prone to overfitting and poor generalization.
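A quick back-of-the-envelope sketch of this growth, assuming 10 bins per feature:

```python
# Number of cells needed to cover the feature space with 10 bins per feature.
bins_per_feature = 10
for n_features in [1, 2, 3, 5, 10]:
    print(n_features, bins_per_feature ** n_features)
# 1 -> 10, 2 -> 100, 3 -> 1000, 5 -> 100000, 10 -> 10000000000
```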
## Data Splitting
### Random Split
Ensure you shuffle the data to avoid bias. Important when your dataset is ordered.
### Stratified Split
Used with imbalanced data to ensure each set reflects the overall distribution of the target variable. Important when your dataset has a class imbalance.
### Time-Based Split
Used for time series data to ensure the model is evaluated on future data. Important when your dataset is time-dependent.
### Group-Based Split
Used when data points are not independent, such as in medical studies. Important when your dataset has groups of related data points.
```python
from sklearn.model_selection import train_test_split

# First split: hold out 30% of the data.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
# Second split: divide the held-out 30% evenly into validation and test sets (15% each).
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
```
Chavan, 2023
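For a stratified split, the same `train_test_split` helper accepts a `stratify` argument so each subset preserves the class distribution (a minimal sketch, assuming a classification target `y`):

```python
from sklearn.model_selection import train_test_split

# Stratified split: preserve the class proportions of y in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```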
## Why Split on Groups?
Imagine we're working with a medical dataset aimed at predicting the likelihood of a patient having a particular disease based on various features, such as age, blood pressure, cholesterol levels, etc.
### Dataset
- **Patient A:**
- Visit 1: Age 50, Blood Pressure 130/85, Cholesterol 210
- Visit 2: Age 51, Blood Pressure 135/88, Cholesterol 215
- **Patient B:**
- Visit 1: Age 60, Blood Pressure 140/90, Cholesterol 225
- Visit 2: Age 61, Blood Pressure 145/92, Cholesterol 230
## Incorrect Splitting
- **Training Set:**
- Patient A, Visit 1
- Patient B, Visit 1
- **Testing Set:**
- Patient A, Visit 2
- Patient B, Visit 2
In this splitting scenario, the model could learn specific patterns from Patient A and Patient B in the training set and then simply recall them in the testing set. Since it has already seen data from these patients, even with slightly different features, **it may perform well without actually generalizing to unseen patients**.
## Correct Splitting
- **Training Set:**
- Patient A, Visit 1
- Patient A, Visit 2
- **Testing Set:**
- Patient B, Visit 1
- Patient B, Visit 2
In these cases, the model does not have prior exposure to the patients in the testing set, ensuring an unbiased evaluation of its performance. It will need to apply its learning to truly "new" data, similar to real-world scenarios where new patients must be diagnosed based on features the model has learned from different patients.
## Key Takeaways for Group-Based Splitting
- **Independence:** Keeping data separate between training and testing sets maintains the independence necessary for unbiased model evaluation.
- **Generalization:** This approach ensures that the model can generalize its learning from one set of data to another, which is crucial for effective predictions.
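A minimal sketch of a group-aware split with scikit-learn's `GroupShuffleSplit`, assuming NumPy arrays `X` and `y` and a `patient_id` array marking which rows belong to the same patient:

```python
from sklearn.model_selection import GroupShuffleSplit

# All visits from the same patient land entirely in train or entirely in test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_id))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```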
## Linear Regression | Concept
Linear regression models the relationship between two or more variables by fitting a linear equation to observed data. The model takes the form:
$$ \widehat{y} = X\beta $$
Where $ \widehat{y} $ is the predicted value, $ X $ is the feature matrix, and $ \beta $ is the coefficient vector. The goal is to find the coefficients that minimize the error between the predicted value and the actual value.

## Linear Regression | Cost Function
The objective of linear regression is to minimize the cost function $ J(\beta) $:
$$ J(\beta) = \frac{1}{2m} \sum_{i=1}^m (\widehat{y}_i - y_i)^2 $$
Where $ \widehat{y} = X\beta $ is the prediction. This is most easily solved by finding the normal equation solution:
$$ \beta = (X^T X)^{-1} X^T y $$
The normal equation is derived by setting the gradient of $J(\beta) $ to zero. This is a closed-form solution that can be computed directly.
```python
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
```
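For comparison, the normal equation itself can be computed directly with NumPy (a sketch, assuming `X_train` and `y_train` are NumPy arrays and `X_train` already contains any bias column):

```python
import numpy as np

# Closed-form solution: beta = (X^T X)^(-1) X^T y.
# np.linalg.solve is preferred over explicitly inverting X^T X.
beta = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)
y_pred = X_train @ beta
```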
## Linear Regression | Normal Equation Notes
### Adding a Bias Term
Practically, if we want to include a bias term in the model, we can add a column of ones to the feature matrix $ X $. Your H3 will illustrate this concept.
$$ \widehat{y} = X\beta $$
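For example, the column of ones can be prepended like this (a sketch, assuming `X` is a NumPy array):

```python
import numpy as np

# Prepend a column of ones so that the first coefficient acts as the intercept.
X_with_bias = np.hstack([np.ones((X.shape[0], 1)), X])
```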
### Gradient Descent
For large datasets, the normal equation can be computationally expensive. Instead, we can use gradient descent to minimize the cost function iteratively. We'll talk about gradient descent within the context of logistic regression later today.
## Linear Regression | Regression Model Evaluation
To evaluate a regression model, we can use metrics such as mean squared error (MSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and R-squared.
| Metric | Formula | Notes |
| --- | --- | --- |
| Mean Squared Error (MSE) | `$\frac{1}{m} \sum_{i=1}^m (\widehat{y}_i - y_i)^2$` | Punishes large errors more than small errors. |
| Mean Absolute Error (MAE) | `$\frac{1}{m} \sum_{i=1}^m \lvert \widehat{y}_i - y_i \rvert$` | Less sensitive to outliers than MSE. |
| Mean Absolute Percentage Error (MAPE) | `$\frac{1}{m} \sum_{i=1}^m \left\lvert \frac{\widehat{y}_i - y_i}{y_i} \right\rvert \times 100$` | Useful for comparing models with different scales. |
| R-squared | `$1 - \frac{\sum(\widehat{y}_i - y_i)^2}{\sum(\bar{y} - y_i)^2}$` | Proportion of the variance in the dependent variable that is predictable from the independent variables. |
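All four metrics are available in `sklearn.metrics`; a minimal sketch, assuming `y_test` and `y_pred` are already defined:

```python
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error, r2_score)

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)  # returned as a fraction, not a percent
r2 = r2_score(y_test, y_pred)
```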
## Linear Regression | Pros and Cons
### Pros
- Simple and easy to understand.
- Fast to train.
- Provides a good, very interpretable baseline model.
### Cons
- Assumes a linear relationship between the features and the target variable.
- Sensitive to outliers.
## Linear Regression | A Brief Note on Regularization
Regularization is a technique used to prevent overfitting by adding a penalty term to the cost function. The two most common types of regularization are L1 (Lasso) and L2 (Ridge) regularization.
Recall that the cost function for linear regression is:
$$ J(\beta) = \frac{1}{2m} \sum_{i=1}^m (\widehat{y}_i - y_i)^2 $$
**L1 Regularization**: Adds the absolute value of the coefficients to the cost function. This effectively performs feature selection by pushing some coefficients towards zero.
$$ J_{L1}(\beta) = J(\beta) + \lambda \sum_{j=1}^n |\beta_j| $$
**L2 Regularization**: Adds the square of the coefficients to the cost function. This shrinks the coefficients, but does not set them to zero. This is useful when all features are assumed to be relevant.
$$ J_{L2}(\beta) = J(\beta) + \lambda \sum_{j=1}^n \beta_j^2 $$
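In scikit-learn, L1- and L2-regularized linear regression correspond to `Lasso` and `Ridge` (a minimal sketch; `alpha` plays the role of $\lambda$, and `X_train`/`y_train` are assumed to be defined):

```python
from sklearn.linear_model import Lasso, Ridge

# L1 (Lasso): can drive some coefficients exactly to zero (feature selection).
lasso = Lasso(alpha=0.1).fit(X_train, y_train)

# L2 (Ridge): shrinks coefficients without zeroing them out.
ridge = Ridge(alpha=0.1).fit(X_train, y_train)
```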
## L3 | Q4
When is L2 regularization (Ridge) preferred over L1 regularization (Lasso)?
A. When all features are assumed to be relevant.
B. When some features are assumed to be irrelevant.
## Logistic Regression | Concept
Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function.
$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$
Joshi, 2019
## Logistic Regression | Formula
This model is based on the sigmoid function $\sigma(z)$:
$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$
Where
$$ z = X\beta $$
Note that $\sigma(z)$ is the probability that the dependent variable is 1 given the input $X$. Consider the similar form of the linear regression model:
$$ \widehat{y} = X\beta $$
The key difference is that the output of logistic regression is passed through the sigmoid function to obtain a value between 0 and 1, which can be interpreted as a probability. This works because the sigmoid function maps any real number to the range [0, 1]. While linear regression predicts the value of the dependent variable, logistic regression predicts the probability that the dependent variable is 1.
## Logistic Regression | No Closed-Form Solution
In linear regression, we can calculate the optimal coefficients $\beta$ directly. However, in logistic regression, we cannot do this because the sigmoid function is non-linear. This means that there is no closed-form solution for logistic regression.
Instead, we use gradient descent to minimize the cost function. Gradient descent is an optimization algorithm that iteratively updates the parameters to minimize the cost function, and forms the basis of many machine learning algorithms.
machinelearningspace.com (2013)
## Logistic Regression | Cost Function
The cost function used in logistic regression is the cross-entropy loss:
$$ J(\beta) = -\frac{1}{m} \sum_{i=1}^m [y_i \log(\widehat{y}_i) + (1 - y_i) \log(1 - \widehat{y}_i)] $$
$$ \widehat{y} = \sigma(X\beta) $$
Let's make sure we understand the intuition behind the cost function $J(\beta)$.
If the true label ($y$) is 1, we want the predicted probability ($\widehat{y}$) to be close to 1. If the true label ($y$) is 0, we want the predicted probability ($\widehat{y}$) to be close to 0. The cost goes up as the predicted probability diverges from the true label.
## Logistic Regression | Gradient Descent
To minimize $ J(\beta) $, we update $ \beta $ iteratively using the gradient of $ J(\beta) $:
$$ \beta := \beta - \alpha \frac{\partial J}{\partial \beta} $$
Where $ \alpha $ is the learning rate, and the gradient $ \frac{\partial J}{\partial \beta} $ is:
$$ \frac{\partial J}{\partial \beta} = \frac{1}{m} X^T (\sigma(X\beta) - y) $$
Where $ \sigma(X\beta) $ is the predicted probability, $ y $ is the true label, $ X $ is the feature matrix, $ m $ is the number of instances, $ \beta $ is the coefficient vector, and $ \alpha $ is the learning rate.
This is a simple concept that forms the basis of many gradient-based optimization algorithms, and is widely used in deep learning.
Similar to linear regression -- if we want to include a bias term, we can add a column of ones to the feature matrix $ X $.
machinelearningspace.com (2013)
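Putting the update rule together, here is a minimal NumPy sketch of gradient descent for logistic regression (assuming `X` already includes a bias column and `y` is a vector of 0s and 1s):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.1, n_iters=1000):
    m, n = X.shape
    beta = np.zeros(n)
    for _ in range(n_iters):
        y_hat = sigmoid(X @ beta)          # current predictions
        gradient = X.T @ (y_hat - y) / m   # dJ/dbeta
        beta -= alpha * gradient           # gradient descent update
    return beta
```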
## L3 | Q5
Okay, so let's walk through an example. Suppose you have already done the following:
1. Obtained the current prediction ($\widehat{y}$) with $ \sigma(X\beta) $.
2. Calculated the gradient $ \frac{\partial J}{\partial \beta} $.
What do you do next?
## Logistic Regression | Classifier
Once we have the optimal coefficients, we can use the logistic function to predict the probability that the dependent variable is 1.
We can then use a threshold to classify the instance as 0 or 1 (usually 0.5). The following code snippet shows how to use the scikit-learn library to fit a logistic regression model and make predictions.
```python
from sklearn.linear_model import LogisticRegression

X = [[1, 2], [2, 3], [3, 4], [4, 5]]
y = [0, 0, 1, 1]
logistic_regression_model = LogisticRegression()
logistic_regression_model.fit(X, y)
# Predicted class probabilities for a new point.
logistic_regression_model.predict_proba([[3.5, 3.5]])
```
```python
array([[0.3361201, 0.6638799]])
```
## L3 | Q6
Our logistic regression model was trained with $X = [[1, 2], [2, 3], [3, 4], [4, 5]]$ and $y = [0, 0, 1, 1]$. We then called `predict_proba` on the point $[3.5, 3.5]$.
What does this output represent?
```python
array([[0.3361201, 0.6638799]])
```
## k-Nearest Neighbors (KNN) | Concept
KNN is a non-parametric method used for classification (and regression!).
The principle behind nearest neighbor methods is to find a predefined number of samples closest in distance to the new point, and predict the label from these using majority vote.
## k-Nearest Neighbors (KNN) | What is it doing?
Given a new instance $ x' $, KNN classification computes the distance between $ x' $ and all other examples. The k closest points are selected and the predicted label is determined by majority vote.
### Euclidean Distance
`$ d(x, x') =\sqrt{\sum_{i=1}^n (x_i - x'_i)^2} $`
### Manhattan Distance
`$ d(x, x') = \sum_{i=1}^n |x_i - x'_i| $`
### Cosine Distance
`$ d(x, x') = 1 - \frac{x \cdot x'}{||x|| \cdot ||x'||} $`
### Jaccard Distance (useful for categorical data!)
`$ d(x, x') = 1 - \frac{|x \cap x'|}{|x \cup x'|} $`
### Hamming Distance (useful for strings!)
`$ d(x, x') = \frac{1}{n} \sum_{i=1}^n \mathbb{1}[x_i \neq x'_i] $`
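A few of these distances computed directly, as a sketch with NumPy and SciPy:

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
x_prime = np.array([2.0, 4.0, 6.0])

euclidean = np.sqrt(np.sum((x - x_prime) ** 2))   # or distance.euclidean(x, x_prime)
manhattan = np.sum(np.abs(x - x_prime))           # or distance.cityblock(x, x_prime)
cosine = distance.cosine(x, x_prime)              # 1 - cosine similarity
```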
```python
from sklearn.neighbors import KNeighborsClassifier
# Default is Minkowski distance.
knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn.fit(X_train, y_train)
```
## k-Nearest Neighbors (KNN) | Example
Given the following data points (X) and their corresponding labels (y), what is the predicted label for the point (3.5, 3.5) using KNN with k=3?
```python
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 2], [2, 3], [3, 4], [4, 5]]
y = [0, 0, 1, 1]
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
knn.predict([[3.5, 3.5]])
```
## k-Nearest Neighbors (KNN) | Hyperparameters
### Number of Neighbors (k)
The number of neighbors to consider when making predictions.
### Distance Metric
The metric used to calculate the distance between points.
### Weights
Uniform weights give equal weight to all neighbors, while distance weights give more weight to closer neighbors.
### Algorithm
The algorithm used to compute the nearest neighbors. Some examples include Ball Tree, KD Tree, and Brute Force.
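These hyperparameters map directly onto `KNeighborsClassifier` arguments (a sketch; the particular values are illustrative):

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(
    n_neighbors=5,        # k: number of neighbors
    metric="manhattan",   # distance metric
    weights="distance",   # closer neighbors count more
    algorithm="kd_tree",  # neighbor search strategy
)
```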
## k-Nearest Neighbors (KNN) | Pros and Cons
### Pros
- Simple and easy to understand.
- No training phase.
- Can be used for both classification and regression.
### Cons
- Computationally expensive.
- Sensitive to the scale of the data.
- Requires a large amount of memory.
## k-Nearest Neighbors (KNN) | Classification Model Evaluation
To evaluate a binary classification model like this, we can use metrics such as accuracy, precision, recall, F1 score, and ROC-AUC.
| Metric | Formula | Notes |
| --- | --- | --- |
| Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$ | Easy to interpret, but misleading on imbalanced datasets. |
| Precision | $\frac{TP}{TP + FP}$ | Useful when the cost of false positives is high. |
| Recall | $\frac{TP}{TP + FN}$ | Useful when the cost of false negatives is high. |
| F1 Score | $2 \times \frac{Precision \times Recall}{Precision + Recall}$ | Harmonic mean of precision and recall. |
| ROC-AUC | Area under the ROC curve. | Useful for imbalanced datasets. |
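All of these metrics are available in `sklearn.metrics`; a minimal sketch, assuming `y_test`, hard predictions `y_pred`, and predicted probabilities `y_proba` are defined:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_proba)  # uses probabilities, not hard labels
```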
## Summary
- We discussed the importance of splitting data into training, validation, and test sets.
- We delved into k-Nearest Neighbors, Linear Regression, and Logistic Regression with Gradient Descent, exploring practical implementations and theoretical foundations.
- Understanding these foundational concepts is crucial for advanced machine learning and model fine-tuning!