# Introduction to Data Science Pipelines

## L5
# Welcome to CS 326

## Please check in and enter the provided code.
# Exam Part I Review
## Ground Rules

You will **not** be expected to do any programming on the exam. You will be asked to interpret code, identify errors, and explain concepts.

The exam will be **closed book** and **closed notes**. You will not be allowed to use any resources during the exam. You will not need a calculator.

The following topics will be covered on the exam. The list is not exhaustive, but it should give you a good idea of what to expect.

# Data Sources [[slides]](https://drc-cs.github.io/SUMMER25-CS326/lectures/L1/#/)

- Be able to identify structured, semi-structured, and unstructured data, as well as the advantages and disadvantages of each.
- Given a scenario with missing data, pick the appropriate method to handle it.
- Be able to describe several methods for identifying outliers in a dataset.

# Hello World [[homework]](https://github.com/drc-cs/SUMMER25-CS326/tree/main/homeworks/H1)

- Understand the roles of conda, GitHub, VS Code, and pytest in your development workflow.

# Exploratory Data Analysis [[slides]](https://drc-cs.github.io/SUMMER25-CS326/lectures/L1/#/)

- Understand central tendency (mean, median, mode) and spread (range, variance, standard deviation).
- Identify skew (positive, negative).
- Identify kurtosis (leptokurtic, mesokurtic, platykurtic).
- Know key properties of a normal distribution.

# Correlation [[slides]](https://drc-cs.github.io/SUMMER25-CS326/lectures/L2/#/)

- Differentiate when to use Pearson vs Spearman correlation.
- Interpret correlation results (negative / positive / no relationship).
- Identify a scenario as Simpson's Paradox (or not).
- Recall Support, Confidence, and Lift and how they are used in association rule mining.

# Hypothesis Testing [[slides]](https://drc-cs.github.io/SUMMER25-CS326/lectures/L2/#/) [[homework]](https://github.com/drc-cs/SUMMER25-CS326/tree/main/homeworks/H2)

- Construct an A/B Test to test a hypothesis.
- Define hypothesis testing for a scenario in terms of $H_0$ and $H_1$.
- Provided a scenario, identify the hypothesis test to use (t-test, paired t-test, chi-squared test, ANOVA).
- Understand what a p-value represents, how it is used in hypothesis testing, and how to interpret it.
- Know the non-parametric analogs to the tests we covered in lecture.

# Data Preprocessing [[slides]](https://drc-cs.github.io/SUMMER25-CS326/lectures/L3/#/)

- Define feature engineering in the context of machine learning applications.
- Define and be able to identify data that has been scaled and the method used to scale it (min-max, standard).
- Describe the curse of dimensionality and how it affects machine learning models.
- Understand dimensionality reduction techniques (Feature Selection, Feature Sampling, or PCA).

# Machine Learning I [[slides]](https://drc-cs.github.io/SUMMER25-CS326/lectures/L3/#/)

- Define the terms training set, validation set, and test set, and their primary uses.
- Identify a scenario as a classification or regression problem.
- Explain the KNN algorithm and how it works.
- Explain where the normal equation for linear regression comes from.
- Be able to identify L1 and L2 regularization and explain at a high level how they work.
- Understand the intuition behind the cross-entropy loss function.

# Machine Learning [[homework]](https://github.com/drc-cs/SUMMER25-CS326/tree/main/homeworks/H3)

- Be able to look at code for logistic regression gradient descent and identify missing or incorrect components.
- Provided with a **simple** numpy operation, identify the shape of the output. This may include an axis argument. [[🔗]](https://numpy.org/doc/stable/user/basics.broadcasting.html)

# Machine Learning II [[slides]](https://drc-cs.github.io/SUMMER25-CS326/lectures/L4/#/)

- Explain ROC curves (axes) and what the AUC represents.
- Explain the value of k-fold cross-validation.
- Explain the value of a softmax function in the context of a multi-class classification problem.
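As a refresher for the softmax bullet above, here is a minimal sketch using only the standard library. The function name `softmax` and the example scores are illustrative assumptions, not code from the lecture or homework.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtracting the largest score keeps exp() from overflowing.
    # It does not change the resulting probabilities.
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for a three-class problem.
probs = softmax([2.0, 1.0, 0.1])

# The predicted class is the index with the highest probability.
predicted_class = probs.index(max(probs))
```

The value for multi-class problems is that the outputs are non-negative, sum to 1, and preserve the ranking of the raw scores, so they can be read as class probabilities and fed into a cross-entropy loss.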
# Machine Learning III [[slides]](https://drc-cs.github.io/SUMMER25-CS326/lectures/L4/#/)

- Explain the ID3 algorithm and how it works (understand entropy & information gain).
- Be able to identify a decision tree model as overfitting or underfitting.
- Differentiate between and be able to explain different ensemble modeling methods (bagging, boosting, stacking).
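To make entropy and information gain concrete, the sketch below computes both for a tiny, made-up dataset. The helper names (`entropy`, `information_gain`) and the toy rows are assumptions for illustration. This is only the split-selection step that ID3 repeats at each node, not a full tree builder.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting on rows[i][feature]."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Weighted average entropy of the child nodes after the split.
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


# Hypothetical toy data: "outlook" separates the classes perfectly, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["no", "no", "yes", "yes"]

# ID3 greedily picks the feature with the highest information gain.
best_feature = max(["outlook", "windy"], key=lambda f: information_gain(rows, labels, f))
```

Here the labels start with 1 bit of entropy. Splitting on `outlook` removes all of it, while splitting on `windy` removes none, so ID3 would split on `outlook` first.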
# Exam Part I
## Lockdown Browser

You will be required to use the Lockdown Browser for the exam. [[Click Here to Download](https://download.respondus.com/lockdown/download7.php?id=171646780)]

After installing the Lockdown Browser, you can access the exam by visiting the [Canvas page](https://canvas.northwestern.edu/courses/233999/quizzes) for the course and clicking on the quiz tab.
# Docker Essentials
## Docker Essentials | Agenda
1. Why Containerization?
2. What is Docker and Why is it Useful?
3. Benefits of Using Docker
4. Docker Theory
5. Using Docker
6. Docker Demo
> Vocabulary will be placed in boxes.
## Why Containerization?

A **container** is a lightweight, portable, and efficient way to package applications and their dependencies. Containers isolate applications from the host system and other containers, making them easier to deploy and manage.

| **Container** | **Virtual Machine** |
|---------------|---------------------|
| Lightweight | Heavyweight |
| Faster startup | Slower startup |
| Less resource usage | More resource usage |
| Shared kernel | Separate kernel |

> The **kernel** is the core of an operating system that manages system resources.

> **Virtual machines** are software emulations of physical computers that run an operating system and applications.
## What is Docker and Why is it Useful?

**Docker** is an open-source platform designed to simplify the process of creating, deploying, and running applications.

You can think of Docker as producing a self-contained package that includes everything an application needs to run: the code, runtime, system tools, libraries, and settings. This package is called an **image**. When you run an image, Docker creates a **container**, an isolated environment that runs the application.

> An **image** is a snapshot of an application and its dependencies.

> A **container** is a running instance of an image.
## Benefits of Using Docker

1. Consistency Across Environments
2. Efficiency
3. Scalability
4. Isolation and Security

## Benefits of Using Docker | Consistency

Docker ensures that software behaves the same on every machine. Developers can be confident that applications that work on their computers will work in production. Docker containers are typically based on a Linux distribution, which provides a consistent environment for applications.

> **Linux** distributions: Variants of the Linux operating system. Linux is by far the most popular OS in the world for web servers, cloud computing, and supercomputers.
## Benefits of Using Docker | Efficiency

Containers share the host's operating system kernel, which makes them more lightweight and efficient than traditional virtual machines. This results in faster application delivery, reduced resource consumption, and lower overhead.
Docker 2022
## Benefits of Using Docker | Scalability

Docker makes it easy to scale applications horizontally by adding more containers. This supports modern cloud-native development practices. Containers can be easily replicated and distributed across multiple hosts, providing flexibility and scalability.
Thakur 2024
## Benefits of Using Docker | Isolation and Security
Containers encapsulate applications and their dependencies completely, providing isolation that improves security.
Miller 2023
## Docker Architecture
NordicAPIs
## Docker Architecture | Key Docker Components
- Docker Daemon
- Docker Client
- Docker Images
- Docker Containers
- Docker Registries
- Namespaces and Control Groups
## Docker Architecture | Daemon

The **Docker Daemon** (`dockerd`) is the heart of Docker, responsible for running containers on a host. It listens for API requests and manages Docker objects (images, containers, networks, etc.).
> A **daemon** is a background process that runs continuously, waiting for requests to process.
## Docker Architecture | Client

The **Docker Client** is a command-line interface (CLI) that users run to interact with the Docker daemon.

Common CLI commands include `docker pull`, `docker build`, and `docker run`. The client sends commands to the daemon, which executes them on the host.
## Docker Architecture | Images

**Docker Images** are immutable, read-only templates used to create containers.

An image might include an OS, application code, and dependencies required to run an application. Images are built from a series of layers. Each layer represents a modification to the previous layer, allowing for efficient storage and distribution of images.
## Docker Architecture | Containers

**Docker Containers** are running instances of Docker images.

They can be started, stopped, moved, or deleted using Docker commands. Containers are isolated from each other and the host system, but they share the host OS's kernel. This makes them lightweight and efficient.
## Docker Architecture | Registries

**Docker Registries** store Docker images. A popular public registry is Docker Hub, but private registries can also be used. Docker images can be pushed to and pulled from registries, allowing for easy distribution and sharing of images.

Other examples of registries include:

- **Amazon Elastic Container Registry (ECR)**
- **Google Container Registry (GCR)**
- **Azure Container Registry (ACR)**
## Docker Architecture | Core Concepts

### Namespaces and Control Groups

Docker uses Linux namespaces to provide isolation for containers and control groups (cgroups) to limit resource usage.

### Union File System

Layers are used to create Docker images. Each layer is a modification over the previous one, which allows efficient storage and reduced bandwidth usage when distributing an image. A Union File System (UFS) combines these layers into a single view (union) of the file system.

> **Namespaces**: Isolate containers from each other and the host system.

> **Control Groups (cgroups)**: Limit resource usage for containers.

> **Union File System**: Efficiently store and distribute Docker images.
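The layered "single view" idea can be illustrated with a toy model. The sketch below is an analogy only, not how Docker is implemented: it uses Python's `collections.ChainMap` to stack dictionaries the way a union file system stacks layers. The layer names and file paths are made up, and the writable top layer reflects the thin read-write layer Docker adds for each running container.

```python
from collections import ChainMap

# Each dict stands in for one read-only image layer: path -> file contents.
base_layer = {"/bin/python": "python binary", "/etc/os-release": "debian"}
deps_layer = {"/app/requirements.txt": "numpy", "/usr/lib/numpy": "numpy package"}
app_layer = {"/app/app.py": "print('v1')"}

# A running container gets a thin writable layer on top of the image layers.
writable_layer = {}

# ChainMap searches left to right, so upper layers shadow lower ones.
container_view = ChainMap(writable_layer, app_layer, deps_layer, base_layer)

# Writes land in the top (writable) layer; the image layers are untouched.
container_view["/app/app.py"] = "print('v2')"
```

Because lower layers are never modified, several images (and several containers) can share the same base layers, which is where the storage and bandwidth savings come from.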
## Using Docker

1. Installing Docker
2. Building from a Dockerfile
3. Running Containers

## Using Docker | Installing Docker

To start using Docker, you need to install the Docker Engine on your machine. It can be downloaded from the Docker website and is available for various operating systems, including Windows, macOS, and Linux.

Download Docker Desktop: [Docker Desktop](https://www.docker.com/products/docker-desktop)

## Using Docker | Building from a Dockerfile

A **Dockerfile** is a text document that contains all the commands needed to assemble a Docker **image**. It starts with a `FROM` instruction that specifies the base image. Usually, it also includes instructions like `WORKDIR`, `COPY`, `RUN`, and `CMD` to set up the environment and run the application.
```dockerfile
# Use an official Python runtime as a parent image.
FROM python:3.8-slim

# Set the working directory.
WORKDIR /app

# Copy the current directory contents into the container at /app.
COPY . /app

# Install any needed packages specified in requirements.txt.
# RUN is used to execute commands during the build process.
RUN pip install --no-cache-dir -r requirements.txt

# Start the application. CMD specifies the command to run when the container starts.
CMD ["python", "app.py"]
```
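The Dockerfile above expects an `app.py` and a `requirements.txt` in the build directory, but the slides do not show them. A minimal, hypothetical `app.py` is sketched here so the example can be built end to end. The function name `describe_environment` is an assumption, and `requirements.txt` can be an empty file for this script.

```python
# app.py (hypothetical): the script that CMD ["python", "app.py"] would start.
import os
import sys


def describe_environment() -> str:
    """Report the interpreter version and working directory seen by the app."""
    version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return f"Hello from Python {version}, working directory {os.getcwd()}"


if __name__ == "__main__":
    print(describe_environment())
```

Run inside the container, this should report the interpreter from the base image and the `/app` directory set by `WORKDIR`, regardless of which Python version is installed on the host.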
## Dockerfile | Cheatsheet

Here are some common Dockerfile instructions.

| Command | Description |
|---------|-------------|
| `FROM` | Specifies the base image to use. |
| `WORKDIR` | Sets the working directory for subsequent commands. |
| `COPY` | Copies files from the host to the container. |
| `RUN` | Executes commands during the build process. |
| `CMD` | Specifies the command to run when the container starts. |
| `EXPOSE` | Documents the port the container listens on. It does not publish the port by itself; use `-p` with `docker run` to map it to the host. |
| `ENV` | Sets environment variables. |
| `ENTRYPOINT` | Configures the container to run as an executable. |

## Docker | CLI Cheatsheet

Here are some common Docker CLI commands.

| Command | Description |
|---------|-------------|
| `docker --version` | Checks the installed version of Docker. |
| `docker pull [image_name]` | Pulls an image from a registry. |
| `docker build -t [image_name] .` | Builds an image from a Dockerfile. |
| `docker run [image_name]` | Runs a container from an image. Common flags include `-d` for detached mode, `-p` for port mapping, and `-v` for volume mounting. |
| `docker ps` | Lists running containers. |
| `docker images` | Lists images on the host. |

## Docker Demo

Let's see Docker in action. Many of you working on Windows had difficulty installing the `cs326` requirements for local development. Imagine you could run the course environment locally with minimal setup. Let's see how Docker can help with that.

## Docker Demo | Course Environment

The following Dockerfile installs the course environment and runs `code-server` in a container. This allows you to run the course environment locally without installing any dependencies on your operating system (Windows, macOS, Linux).
```dockerfile
# Use the official Miniconda3 image as a parent image.
FROM continuumio/miniconda3

# Install git and curl, then clone the class repository.
RUN apt-get update && apt-get install -y git curl
RUN git clone https://github.com/drc-cs/SUMMER25-CS326.git

# Set the working directory.
WORKDIR /SUMMER25-CS326

# Create a new Conda environment from the environment.yml file.
RUN conda env create -f environment.yml

# Install vscode server.
RUN curl -fsSL https://code-server.dev/install.sh | bash

# Add code-server to PATH.
ENV PATH="/root/.local/bin:${PATH}"

# Document the port that the server listens on.
EXPOSE 8080

# Run the code-server command when the container starts.
CMD ["code-server", "--auth", "none", "--bind-addr", "0.0.0.0:8080", "."]
```
## Docker Demo | Running the Course Environment

Since `code-server` listens on port 8080 inside the container, we need to map that port to the host machine. We can do this using the `-p` flag, which takes the form `host_port:container_port`. Here we map the host machine's port 8080 to the container's port 8080.

Because code-server starts when the container starts, you can access the course environment by visiting `localhost:8080` in your browser at any time while the container is running. It will even work offline!
```bash
# Build the Docker image.
docker build -t cs326-env .

# Run the Docker container on port 8080 (local).
docker run -p 8080:8080 cs326-env
```
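If the browser cannot reach `localhost:8080`, it helps to confirm whether the port mapping is working at all. The sketch below is one way to check from Python using only the standard library. The helper name `is_port_open` is an assumption for illustration and is not part of Docker or the course repository.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # Check the host side of the -p 8080:8080 mapping used above.
    status = "reachable" if is_port_open("localhost", 8080) else "not reachable"
    print(f"localhost:8080 is {status}")
```

A result of "not reachable" usually means the container is not running or was started without the `-p` flag. `docker ps` lists running containers along with their port mappings.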
## Conclusion

Docker is a powerful tool that simplifies the process of creating, deploying, and running applications. It provides consistency, efficiency, scalability, isolation, and portability for modern software development. Understanding Docker's core concepts and best practices can help you leverage its benefits in your projects.