# Introduction to Data Science Pipelines
## L9
# Welcome to CS 326
## Please check in and enter the provided code.
# Lockdown Browser (Canvas)

1. [Download Lockdown Browser](https://download.respondus.com/lockdown/download7.php?id=171646780).
2. Install the browser on your computer.
3. Ensure that you can log in to your Northwestern University Canvas account using the browser.

**Note**: I have enabled the Monitoring feature for students who are unable to join us in person, but I ask that everyone use it to ensure a fair testing environment.
# Intro Poll
## On a scale of 1-5, how confident are you feeling about Exam Part II?
# Exam Part II Review
## Exam Part II

You will **not** be expected to do any programming on the exam. You will be asked to interpret code, identify errors, and explain concepts.

The exam will be **closed book** and **closed notes**. You will not be allowed to use any resources during the exam. You will not need a calculator.

The following topics will likely be covered on the exam; the list is not exhaustive, but it should give you a good idea of what to expect. Anything talked about in lecture or on your homework is fair game.

The exam is weighted approximately 60/40 in point value between Part I and Part II of the course. There are ~50 questions written as of today.

# Data Sources [[slides]](https://drc-cs.github.io/SUMMER25-CS326/lectures/L1/#/)

- Be able to identify structured, semi-structured, and unstructured data, as well as the advantages and disadvantages of each.
- Given a scenario with missing data, pick the appropriate method to handle it.
- Be able to describe several methods for identifying outliers in a dataset.

# Hello World [[homework]](https://github.com/drc-cs/SUMMER25-CS326/tree/main/homeworks/H1)

- Understand the roles of conda, GitHub, vscode, and pytest in your development workflow.

# Exploratory Data Analysis [[slides]](https://drc-cs.github.io/SUMMER25-CS326/lectures/L1/#/)

- Understand central tendency (mean, median, mode) and spread (range, variance, standard deviation).
- Identify skew (positive, negative).
- Identify kurtosis (leptokurtic, mesokurtic, platykurtic).
- Know key properties of a normal distribution.

# Correlation [[slides]](https://drc-cs.github.io/SUMMER25-CS326/lectures/L2/#/)

- Differentiate when to use Pearson vs Spearman correlation.
- Interpret correlation results (negative / positive / no relationship).
- Identify a scenario as Simpson's Paradox (or not).
- Recall Support, Confidence, and Lift and how they are used in association rule mining.

# Hypothesis Testing [[slides]](https://drc-cs.github.io/SUMMER25-CS326/lectures/L2/#/) [[homework]](https://github.com/drc-cs/SUMMER25-CS326/tree/main/homeworks/H2)

- Construct an A/B Test to test a hypothesis.
- Define hypothesis testing for a scenario in terms of $H_0$ and $H_1$.
- Given a scenario, identify the hypothesis test to use (t-test, paired t-test, chi-squared test, ANOVA).
- Understand what a p-value represents, how it is used in hypothesis testing, and how to interpret it.
- Know the non-parametric analogs to the tests we covered in lecture.

# Data Preprocessing [[slides]](https://drc-cs.github.io/SUMMER25-CS326/lectures/L3/#/)

- Define feature engineering in the context of machine learning applications.
- Define and be able to identify data that has been scaled and the method used to scale it (min-max, standard).
- Describe the curse of dimensionality and how it affects machine learning models.
- Understand dimensionality reduction techniques (Feature Selection, Feature Sampling, or PCA).

# Machine Learning I [[slides]](https://drc-cs.github.io/SUMMER25-CS326/lectures/L3/#/)

- Define the terms training set, validation set, and test set, and describe their primary uses.
- Identify a scenario as a classification or regression problem.
- Explain the KNN algorithm and how it works.
- Explain where the normal equation for linear regression comes from (a short numpy sketch follows this list).
- Be able to identify L1 and L2 regularization and explain at a high level how they work.
- Understand the intuition behind the cross-entropy loss function.
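# Code Reading Practice: The Normal Equation

As a quick refresher for the normal-equation bullet above, here is a minimal numpy sketch (not taken from the homework; the data values and variable names are invented for illustration). Setting the gradient of the squared-error loss to zero gives the closed form $\theta = (X^T X)^{-1} X^T y$:

```python
import numpy as np

# Toy data: 5 samples, 2 features (values invented for illustration).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([3.0, 3.0, 7.0, 7.0, 10.0])

# Prepend a column of ones so the intercept is learned as an extra weight.
X_b = np.hstack([np.ones((X.shape[0], 1)), X])   # shape (5, 3)

# Normal equation: theta = (X^T X)^{-1} X^T y
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

print(theta)        # [intercept, w1, w2]
print(X_b @ theta)  # fitted values; these should be close to y
```

In practice `np.linalg.lstsq` or `np.linalg.pinv` is preferred over an explicit inverse for numerical stability, but the explicit form above is the one worth recognizing when reading code.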
# Machine Learning [[homework]](https://github.com/drc-cs/SUMMER25-CS326/tree/main/homeworks/H3)

- Be able to look at code for logistic regression gradient descent and identify missing or incorrect components (a reference sketch appears in the code-reading practice slides at the end of this review).
- Provided with a **simple** numpy operation, identify the shape of the output. This may include an axis argument. [[🔗]](https://numpy.org/doc/stable/user/basics.broadcasting.html) (A shape/broadcasting sketch also appears at the end of this review.)

# Machine Learning II [[slides]](https://drc-cs.github.io/SUMMER25-CS326/lectures/L4/#/)

- Explain ROC curves (axes) and what the AUC represents.
- Explain the value of k-fold cross-validation.
- Explain the value of a softmax function in the context of a multi-class classification problem.

# Machine Learning III [[slides]](https://drc-cs.github.io/SUMMER25-CS326/lectures/L4/#/)

- Explain the ID3 algorithm and how it works (understand entropy & information gain).
- Be able to identify a decision tree model as overfitting or underfitting.
- Differentiate between and be able to explain different ensemble modeling methods (bagging, boosting, stacking).

# Clustering [[slides]](https://drc-cs.github.io/SUMMER25-CS326/lectures/L6/#/)

- Understand the k-means algorithm.
- Be able to look at an elbow plot or silhouette score and identify the optimal number of clusters for k-means.
- Understand agglomerative (bottom-up) clustering, and be able to identify different linkage patterns.
- Understand the difference between hierarchical and partitioning clustering.
- Describe DBSCAN in the context of $\epsilon$ and $\text{minPts}$.

# Recommendation Systems [[slides]](https://drc-cs.github.io/SUMMER25-CS326/lectures/L6/#/)

- Differentiate between content and collaborative filtering.
- Understand User-User vs Item-Item collaborative filtering in a User-Item Matrix.
- Understand the intuition behind matrix factorization in collaborative filtering.

# Natural Language Processing I [[slides]](https://drc-cs.github.io/SUMMER25-CS326/lectures/L7/#/)

- Explain TF-IDF and how it is used in text analysis.
- Define a bag-of-words model and explain how it is used in text analysis.
- Explain the intuition behind a Markov model and how it is used in text analysis.

# Natural Language Processing II [[slides]](https://drc-cs.github.io/SUMMER25-CS326/lectures/L7/#/)

- Explain the benefits of word embeddings over one-hot encoding.
- Explain the difference between traditional word embeddings and contextual embeddings.
- Differentiate between a CBOW and Skip-Gram model.
- Know what vector databases do and how they are used in NLP / RAG systems.

# Time Series [[slides]](https://drc-cs.github.io/SUMMER25-CS326/lectures/L8/#/)

- Identify additive vs multiplicative decomposition in time series data.
- Know the value of differencing in time series data (i.e., what it does, why it is important, and how we evaluate the number of differences).
- Look at ACF / PACF plots and determine what order of AR or MA to use in an ARIMA model.
- Explain walk-forward validation in the context of time series data.

# Ethics [[slides]](https://drc-cs.github.io/SUMMER25-CS326/lectures/L8/#/)

- Evaluate metrics for their interpretability.
- Know GDPR, CCPA, and HIPAA and how they relate to data science.
- Differentiate between opt-in and opt-out consent.
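# Code Reading Practice: NumPy Shapes & Broadcasting

For the "identify the shape of the output" item under the Machine Learning homework above, here is a small self-contained sketch (array values are arbitrary) covering the broadcasting and `axis` behaviors most likely to appear:

```python
import numpy as np

A = np.ones((3, 4))   # 3 rows, 4 columns
v = np.arange(4)      # shape (4,)

print((A + v).shape)                        # (3, 4): v is broadcast across the 3 rows
print(A.sum(axis=0).shape)                  # (4,):   axis=0 collapses the rows
print(A.sum(axis=1).shape)                  # (3,):   axis=1 collapses the columns
print(A.mean(axis=1, keepdims=True).shape)  # (3, 1): keepdims preserves the reduced axis
print((A @ v).shape)                        # (3,):   matrix-vector product
```

Rule of thumb: reductions drop the axis you pass (unless `keepdims=True`), and broadcasting aligns shapes starting from the trailing dimensions.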
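# Code Reading Practice: Logistic Regression Gradient Descent

The Machine Learning homework bullet above asks you to spot missing or incorrect components in logistic regression gradient descent code. Below is a minimal reference sketch (not the homework solution; the function names and toy data are invented for illustration). Common things to check: the sigmoid, the `p - y` gradient term, the $1/n$ scaling, and the sign of the update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """Plain batch gradient descent on the cross-entropy loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)             # predicted probabilities, shape (n_samples,)
        error = p - y                      # d(loss)/d(logits) for sigmoid + cross-entropy
        grad_w = X.T @ error / n_samples   # shape (n_features,)
        grad_b = error.mean()
        w -= lr * grad_w                   # step *against* the gradient
        b -= lr * grad_b
    return w, b

# Tiny toy problem (values invented for illustration).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
w, b = fit_logistic(X, y)
print(sigmoid(X @ w + b))  # predicted probabilities should increase with X
```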
# Exit Poll
## On a scale of 1-5, how confident are you feeling about Exam Part II?
# Exam Part II