## H1 | Installing Conda
We will be using Python for this course. Conda is a package manager that will help us install Python and other packages.
Don't have conda installed? [[click-here]](https://docs.conda.io/en/latest/miniconda.html)
## H1 | Cloning Repo & Installing Environment
We will be using a public GitHub repository for this course. Enter the following commands in your terminal to clone the repository and install the class environment.
```bash
git clone https://github.com/drc-cs/SUMMER25-CS326.git
cd SUMMER25-CS326
```
We will be using a conda environment (cs326) for this course.
```bash
conda env create -f environment.yml
conda activate cs326
```
Don't have git installed? [[click-here]](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
## H1 | VSCode
All demonstrations will be in VSCode, which is a popular IDE. You're welcome to use any IDE you want, but I will be best equipped to help you if you use VSCode.
Don't have visual studio code installed? [[click-here]](https://code.visualstudio.com/)
After installing VSCode, you will need to install the Python extension:
1. Open VSCode
2. Click on the "Extensions" icon in the "Activity Bar" on the side of the window
3. Search for "Python" and click "Install"
Add the `code` command to your PATH so you can open VSCode from the terminal:
1. Open the "Command Palette" (Ctrl+Shift+P on PC or Cmd+Shift+P on Mac)
2. Search for "Shell Command: Install 'code' command in PATH" and click it
## H1 | Opening Repo in VSCode
Restart your terminal and open the cloned repository in VSCode using the following command:
```bash
code SUMMER25-CS326
```
You should see the following files and folders:
- README.md: Contains the course syllabus.
- lectures/: Contains the lecture slides in markdown format.
- homeworks/: Contains the homework assignments.
## H1 | Pulling
Before you start working on any homework, make sure you have the latest version of the repository.
The following command (when used inside the class folder) will pull the latest version of the repository and give you access to the most up-to-date homework:
```bash
git pull
```
If you have any issues with using this git-based system, please reach out.
## H1 | Opening Homework
Open the homeworks/ folder in VSCode. You should see a folder called H1/. Open the folder and you will see three files:
- hello_world.py: This file contains placeholders for the methods you will write.
- hello_world.ipynb: This is a Jupyter notebook that provides a useful narrative for the homework and methods found in the hello_world.py file.
- hello_world_test.py: This is the file that will be used to test your code. Future homeworks will not include this file, and this is for demonstration purposes only.
We'll do the first homework together.
## Homework Demonstration
## H1 | Submitting Homework
You will submit your homework using the provided submission script.
But first, you need a username (your **northwestern** email, e.g. JaneDoe2024@u.northwestern.edu) and a password!
```bash
python account.py --create-account
```
Once you have a username and password, you can submit your completed homework. You should receive your score or feedback within a few seconds, but this may take longer as the homeworks get more involved.
```bash
python submit.py --homework H1/hello_world.py --username your_username --password your_password
```
You can save your username and password as environment variables so you don't have to enter them every time (or expose them in your notebooks)!
```bash
export AG_USERNAME="your_username"
export AG_PASSWORD="your_password"
```
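If you want to read these variables back in your own code (for example, inside a notebook), here is a minimal sketch using the variable names above; how `submit.py` itself consumes them may differ:
```python
import os

# Read the credentials exported above so they never appear in your notebook.
username = os.environ.get("AG_USERNAME")
password = os.environ.get("AG_PASSWORD")
```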
## H1 | Homework Grading
The highest score will be recorded, so long as it is submitted before the deadline! You have 2 attempts for every homework.
Late homeworks will be penalized 10% per day.
## H1 | Expected Learnings / TLDR;
With this hello_world assignment, we worked on the following:
1. Environment installation practice (conda)
2. Exposure to git management (GitHub)
3. Local development IDE practice (vscode)
4. Familiarity with unit testing (pytest)
These tools are all critical for any industry position and are often expected for entry-level positions. Please continue to familiarize yourself with them over the course of the quarter.
# Data Sources
## Data Sources
Data is at the heart of all data science! It's the raw material that we use to build models, make predictions, and draw conclusions. Data can be gathered from a variety of sources, and it comes in many different forms.
Common Data Sources:
- Bulk Downloads
- APIs
- Scraping, Web Crawling
- BYOD (Bring Your Own Data)
*Source: Forbes 2022.*
## Data Sources | Bulk Downloads
Large datasets that are available for download from the internet.
| Source | Examples | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Government | data.gov | often free, very large datasets | Non-specific |
| Academic | UCI Machine Learning Repository | usually free, large datasets | Non-specific |
| Industry | Kaggle, HuggingFace | sometimes free, large datasets | Non-specific, sometimes expensive |
```python
# Demonstration of pulling from the UCI machine learning repository.
import pandas as pd
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
df = pd.read_csv(url, header=None)
```
```python
# Demonstration of pulling from the HuggingFace datasets library.
from datasets import load_dataset
dataset = load_dataset('imdb')
```
## Data Sources | APIs
Data APIs are often readily available for sign up (usually for a fee).
While offering a lot of data, APIs can be restrictive in terms of the data they provide (and the rate at which you can pull it!).
APIs often have better structure (usually backed by a database), and they often have the additional benefit of streaming or real-time data that you can poll for updates.
```python
import requests
API_KEY = 'your_api_key'
url = f'https://api.yourapi.com/data?api_key={API_KEY}'
r = requests.get(url)
data = r.json()
```
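Since polling for updates was mentioned above, here is a minimal polling sketch, assuming the same hypothetical endpoint and a fixed one-minute interval to stay under rate limits:
```python
import time
import requests

API_KEY = 'your_api_key'
url = f'https://api.yourapi.com/data?api_key={API_KEY}'  # hypothetical endpoint

# Poll a few times, waiting between requests to respect rate limits.
for _ in range(5):
    r = requests.get(url)
    if r.status_code == 200:
        data = r.json()
        print(data)
    time.sleep(60)  # wait before polling again
```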
## Data Sources | Scraping, Web Crawling
Web crawling is a free way to collect data from the internet. But be **cautious**. Many websites have terms of service that prohibit scraping, and you can easily overstep those and find yourself in legal trouble.
Examples of packages built for webscraping in Python include beautifulsoup and scrapy.
```python
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
soup.find_all('p')
```
## Data Sources | BYOD (Bring Your Own Data)
Data that you collect yourself. In this case, you determine the format.
Examples for collecting survey data include Qualtrics and SurveyMonkey.
```python
import pandas as pd
df = pd.read_csv('survey_data.csv')
```
## Data Sources | BYOD (Bring Your Own Data)
Collecting data yourself is common in academia and industry. But you need to be careful.
**Policies Exist**
- *GDPR*: Controllers of personal data must put in place appropriate technical and organizational measures to implement the data protection principles.
- *HIPAA*: Covered entities must have in place appropriate administrative, technical, and physical safeguards to protect the privacy of protected health information.
**Bias *Always* Exists**
Bias is everywhere. It's in the data you collect, the data you don't collect, and the data you use to train your models. In almost all cases, we perform a *sampling*. It's your job to ensure it is a reasonable sample.
**Ethical Concerns**
Don't collect data that you don't need. Definitely don't collect data that you don't have permission to collect.
## L1 | Q2
You're building a model to predict the price of a house based on its location, size, and number of bedrooms. Which of the following data sources would be a great first place to look?
A. Bulk Downloads
B. APIs
C. Scraping, web crawling
D. BYOD (Bring Your Own Data)
## L1 | Q3
You just built an amazing stock market forecasting model. Congrats! Now, you want to test it on real-time data. Which of the following data sources would be a great first place to look?
A. Bulk Downloads
B. APIs
C. Scraping, web crawling
D. BYOD (Bring Your Own Data)
# Data Structures
## Data Structures
| Type | Example | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Structured | Relational Database | easy to query, fast | not flexible, hard to update schema |
| Semi-Structured | XML, CSV, JSON | moderate flexibility, easy to add more data | slow, harder to query |
| Unstructured | Plain Text, Images, Audio, Video | very flexible, easy to add more data | slow, hardest to query |
## Data Structures | Structured Example (Relational Database)
| Type | Example | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Structured | Relational Database | easy to query, fast | not flexible, hard to update schema |
| Semi-Structured | XML, CSV, JSON | moderate flexibility, easy to add more data | slow, harder to query |
| Unstructured | Plain Text, Images, Audio, Video | very flexible, easy to add more data | slow, hardest to query |
```sql
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
name VARCHAR(100),
hire_date DATE
);
```
## Data Structures | Semi-Structured Example (JSON)
| Type | Example | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Structured | Relational Database | easy to query, fast | not flexible, hard to update schema |
| Semi-Structured | XML, CSV, JSON | moderate flexibility, easy to add more data | slow, harder to query |
| Unstructured | Plain Text, Images, Audio, Video | very flexible, easy to add more data | slow, hardest to query |
```json
{
  "employee_id": 1234567,
  "name": "Jeff Fox",
  "hire_date": "1/1/2013"
}
```
## Data Structures | Unstructured Example (Plain Text)
| Type | Example | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Structured | Relational Database | easy to query, fast | not flexible, hard to update schema |
| Semi-Structured | XML, CSV, JSON | moderate flexibility, easy to add more data | slow, harder to query |
| Unstructured | Plain Text, Images, Audio, Video | very flexible, easy to add more data | slow, hardest to query |
```text
Dear Sir or Madam,
My name is Jeff Fox (employee id 1234567). I'm excited to start on 1/1/2013.
Sincerely,
Jeff Fox
```
# Relational Databases
## Relational Databases
*Source: Ramos 2022.*
## Relational Databases
Relational databases are a type of database management system (DBMS) that store and manage data in a structured format using tables. Each table, or relation, consists of rows and columns, where rows represent individual records and columns represent the attributes of the data. Widely used systems include MySQL and PostgreSQL.
### Key Vocabulary:
- **Tables:** Organized into rows and columns, with each row being a unique data entry and each column representing a data attribute.
- **Relationships:** Tables are connected through keys, with primary keys uniquely identifying each record and foreign keys linking related records across tables.
- **SQL (Structured Query Language):** The standard language used to query and manipulate data within a relational database.
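A minimal sketch of these key and relationship ideas using Python's built-in sqlite3 (the table and column names are illustrative):
```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON;')  # enforce foreign keys in SQLite

# Each department row is uniquely identified by its primary key.
conn.execute('''CREATE TABLE departments (
    department_id INTEGER PRIMARY KEY,
    name TEXT
);''')

# The foreign key links each employee to a related department record.
conn.execute('''CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    name TEXT,
    department_id INTEGER,
    FOREIGN KEY (department_id) REFERENCES departments(department_id)
);''')
```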
## SQL Query Cheat Sheet (Part 1)
### `CREATE TABLE`
```sql
/* Create a table called table_name with column1, column2, and column3. */
CREATE TABLE table_name (
column1 INT PRIMARY KEY, /* Primary key is a unique identifier for each row. */
column2 VARCHAR(100), /* VARCHAR is a variable-length string up to 100 characters. */
column3 DATE /* DATE is a date type. */
);
```
### `INSERT INTO`
```sql
/* Insert values into column1, column2, and column3 in table_name. */
INSERT INTO table_name (column1, column2, column3) VALUES (value1, value2, value3);
```
### `UPDATE`
```sql
/* Update column1 in table_name to 'value' where column2 is equal to 'value'. */
UPDATE table_name SET column1 = 'value' WHERE column2 = 'value';
```
### `DELETE`
```sql
/* Delete from table_name where column1 is equal to 'value'. */
DELETE FROM table_name WHERE column1 = 'value';
```
## SQL Query Cheat Sheet (Part 2)
### `SELECT`
```sql
/* Select column1 and column2 from table_name.*/
SELECT column1, column2 FROM table_name;
```
### `WHERE`
```sql
/* Select column1 and column2 from table_name where column1 is equal to 'value' and column2 is equal to 'value'. */
SELECT column1, column2 FROM table_name WHERE column1 = 'value' AND column2 = 'value';
```
### `ORDER BY`
```sql
/* Select column1 and column2 from table_name and order by column1 in descending order. */
SELECT column1, column2 FROM table_name ORDER BY column1 DESC;
```
### `LIMIT`
```sql
/* Select column1 and column2 from table_name and limit the results to 10. */
SELECT column1, column2 FROM table_name LIMIT 10;
```
## SQL Query Cheat Sheet (Part 3)
### `JOIN`
```sql
/* Select column1 and column2 from table1 and table2 where column1 is equal to column2. */
SELECT column1, column2 FROM table1 JOIN table2 ON table1.column1 = table2.column2;
```
### `GROUP BY`
```sql
/* Group rows by column1 and count the number of rows in each group. */
SELECT column1, COUNT(*) FROM table_name GROUP BY column1;
```
### `COUNT`
```sql
/* Select the count of column1 from table_name. */
SELECT COUNT(column1) FROM table_name;
/* Group by column2 and select the count of column1 from table_name. */
SELECT column2, COUNT(column1) FROM table_name GROUP BY column2;
```
### `SUM`
```sql
/* Select the sum of column1 from table_name. */
SELECT SUM(column1) FROM table_name;
/* Group by column2 and select the sum of column1 from table_name. */
SELECT column2, SUM(column1) FROM table_name GROUP BY column2;
```
### SQL and Pandas (🔥)
```python
import pandas as pd
import sqlite3
# Create a connection to a SQLite database.
conn = sqlite3.connect('example.db')
# Load a DataFrame into the database.
df = pd.DataFrame({'column1': [1, 2, 3], 'column2': ['a', 'b', 'c']})
df.to_sql('table_name', conn, if_exists='replace')
# Query the database.
query = 'SELECT * FROM table_name'
df = pd.read_sql(query, conn)
```
## L1 | Q4
Write a SQL query that selects the name and year columns from a movies table where the year is greater than 2000.
## Practice Your SQL!
The majority of data science interviews will have a SQL component. It's a good idea to practice your SQL skills. Here are a few resources to get you started:
- ### [SQLZoo](https://sqlzoo.net/)
- ### [W3Schools](https://www.w3schools.com/sql/)
- ### [LeetCode](https://leetcode.com/problemset/database/)
- ### [HackerRank](https://www.hackerrank.com/domains/sql)
- ### [SQL Practice](https://www.sql-practice.com/)
# Data Cleaning
## Data Cleaning
Data is often dirty! Don't ever give your machine learning or statistical model dirty data.
Remember the age-old adage:
> Garbage in, garbage out.
Data cleaning is the process of converting source data into target data without errors, duplicates, or inconsistencies. You will often need to structure data in a way that is useful for your analysis, so learning some basic data manipulation is **essential**.
## Data Cleaning | Common Data Issues
1. Incompatible data
2. Missing values
3. Extreme Outliers
## Data Cleaning | Handling Incompatible Data
| Data Issue | Description | Example | Solution |
| --- | --- | --- | --- |
| Unit Conversions | Numerical data conversions can be tricky. | 1 mile != 1.6 km | Measure in a common unit, or convert with caution. |
| Precision Representations | Data can be represented differently in different programs. | 64-bit float to 16-bit integer | Use the precision necessary and hold consistent. |
| Character Representations | Data is in different character encodings. | ASCII, UTF-8, ... | Create using the same encoding, or convert with caution. |
| Text Unification | Data is in different formats. | D'Arcy; Darcy; DArcy; D Arcy; `D&#39;Arcy` | Use a common format, or convert with caution. RegEx will be your best friend.|
| Time / Date Unification | Data is in different formats. | 10/11/2019 vs 11/10/2019 | Use standard libraries & UTC. A personal favorite is seconds since epoch. |
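A minimal sketch of two of the fixes above with pandas, using a hypothetical column layout: normalizing name variants with a regular expression, and converting date strings to UTC timestamps / seconds since epoch:
```python
import pandas as pd

# Hypothetical messy columns.
df = pd.DataFrame({
    'name': ["D'Arcy", "Darcy", "D Arcy"],
    'hired': ['10/11/2019', '11/10/2019', '12/01/2019'],
})

# Text unification: strip apostrophes/spaces and lowercase so all variants match.
df['name_clean'] = df['name'].str.replace(r"['\s]", '', regex=True).str.lower()

# Time unification: parse with an explicit format and store as UTC...
df['hired_utc'] = pd.to_datetime(df['hired'], format='%m/%d/%Y', utc=True)
# ...or as seconds since epoch.
df['hired_epoch'] = (df['hired_utc'] - pd.Timestamp('1970-01-01', tz='UTC')) // pd.Timedelta('1s')
```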
## Data Cleaning | Handling Missing Values
Data is often missing from datasets. It's important to identify why it's missing. Once you have established that it is **missing at random**, you can proceed with **substitution**.
When missing data, we have a few options at our disposal:
1. Drop the entire row
2. Drop the entire column
3. Substitute with a reasonable value
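A minimal sketch of the first two options with pandas (substitution is covered on the next slide):
```python
import pandas as pd

df = pd.read_csv('data.csv')

# Option 1: drop any row containing a missing value.
df_rows_dropped = df.dropna()

# Option 2: drop any column containing a missing value.
df_cols_dropped = df.dropna(axis=1)
```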
## Data Cleaning | Handling Missing Values with Substitution
| Method | Description | When to Use |
| --- | --- | --- |
| Forward / backward fill | Fill missing value using the last / next valid value. | Time Series |
| Imputation by interpolation | Use interpolation to estimate missing values. | Time Series |
| Mean value imputation | Fill missing value with mean from column. | Random missing values |
| Conditional mean imputation | Estimate mean from other variables in the dataset. | Random missing values |
| Random imputation | Sample random values from a column. | Random missing values |
| KNN imputation | Use K-nearest neighbors to fill missing values. | Random missing values |
| Multiple Imputation | Uses many regression models and other variables to fill missing values. | Random missing values |
```python
import pandas as pd
# Load data.
df = pd.read_csv('data.csv')
# Forward fill: propagate the last valid value forward.
df_ffill = df.ffill()
# Backward fill: propagate the next valid value backward.
df_bfill = df.bfill()
# Mean value imputation (numeric columns only).
df_mean = df.fillna(df.mean(numeric_only=True))
# Random value imputation: fill each gap with a randomly sampled observed value.
df_random = df.copy()
for column in df_random.columns:
    missing = df_random[column].isna()
    df_random.loc[missing, column] = (
        df_random[column].dropna().sample(missing.sum(), replace=True).values
    )
# Imputation by interpolation.
df_interpolate = df.interpolate()
```
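The table above also lists KNN imputation; here is a minimal sketch using scikit-learn's `KNNImputer`, assuming you impute only the numeric columns:
```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv('data.csv')
numeric = df.select_dtypes('number')

# Fill each missing value with the mean of its 5 nearest neighbors.
imputer = KNNImputer(n_neighbors=5)
df_knn = pd.DataFrame(imputer.fit_transform(numeric), columns=numeric.columns, index=numeric.index)
```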
## L1 | Q5
You have streaming data that is occasionally dropping values. Which of the following methods would be appropriate to fill missing values when signal fails to update?
*Please note, in this scenario, you can't use the future to predict the past.*
A. Forward fill
B. Imputation by interpolation
C. Mean value imputation
D. Backward fill
## Data Cleaning | Handling Outliers
Outliers are extreme values that deviate from the other observations in a dataset. They may indicate variability in a measurement, experimental error, or a genuine novelty.
Outliers can be detected using:
- **Boxplots**: A visual representation of the five-number summary.
- **Scatterplots**: A visual representation of the relationship between two variables.
- **z-scores**: A measure of how many standard deviations a data point is from the mean.
- **IQR**: A measure of statistical dispersion, being equal to the difference between the upper and lower quartiles.
Handling outliers should be done on a case-by-case basis. Don't throw away data unless you have a very compelling (and **documented**) reason to do so!
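A minimal sketch of the z-score and IQR rules above (the 3-standard-deviation and 1.5×IQR cutoffs are common conventions, not hard rules):
```python
import numpy as np

data = np.random.normal(0, 1, 1000)

# z-score rule: flag points more than 3 standard deviations from the mean.
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 3]

# IQR rule: flag points more than 1.5 * IQR beyond the quartiles.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
```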
## Data Cleaning | Keep in Mind
- Cleaning your data is an iterative process.
- **Hot tip**: Focus on making your data preprocessing *fast*. You will be doing it a lot, and you'll want to be able to iterate quickly. Look into libraries like dask, pyspark, and ray for large datasets.
- Data cleaning is often planned with visualization.
- Always look at the data. Always. We'll go over plotting approaches soon.
- Data cleaning can fix modeling problems.
- Your first assumption should always be "something is wrong with the data", not "I should try another model".
- Data cleaning is not a one-size-fits-all process, and often requires domain expertise.
# EDA
## Exploratory Data Analysis (EDA) | Introduction
**Exploratory Data Analysis** (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.
This should be done every single time you are exposed to a new dataset. **ALWAYS** look at the data (x3).
EDA will help you to identify early issues or patterns in the data, and will guide you in the next steps of your analysis. It is an absolutely **critical**, but often overlooked step.
## Exploratory Data Analysis (EDA) | Methodology
We can break down EDA into two main topics:
- **Descriptive EDA**: Summarizing the main characteristics of the data.
- **Graphical EDA**: Visualizing the data to understand its structure and patterns.
# Descriptive EDA
## Descriptive EDA | Describe $x$
## Descriptive EDA | Examples
- **Central tendency**
- Mean, Median, Mode
- **Spread**
- Range, Variance, interquartile range (IQR)
- **Skewness**
- A measure of the asymmetry of the distribution. Typically close to 0 for a normal distribution.
- **Kurtosis**
- A measure of the "tailedness" of the distribution. Typically close to 3 for a normal distribution.
- **Modality**
- The number of peaks in the distribution.
## Central Tendency
- **Mean**: The average of the data.
- $ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i $
- np.mean(data)
- **Median**: The middle value of the data, when sorted.
- [1, 2, **4**, 5, 6]
- np.median(data)
- **Mode**: The most frequent value in the data.
```python
import numpy as np
from scipy.stats import mode
data = np.random.normal(0, 1, 1000)
print(mode(data))
```
## Spread
- **Range**: The difference between the maximum and minimum values in the data.
- np.max(data) - np.min(data)
- **Variance**: The average of the squared differences from the mean.
- $ \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 $
- np.var(data)
- **Standard Deviation**: The square root of the variance.
- $ \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2} $
- np.std(data)
- **Interquartile Range (IQR)**: The difference between the 75th and 25th percentiles.
- np.percentile(data, 75) - np.percentile(data, 25)
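Putting the calls above together on a random sample:
```python
import numpy as np

data = np.random.normal(0, 1, 1000)

data_range = np.max(data) - np.min(data)                 # range
variance = np.var(data)                                  # variance
std_dev = np.std(data)                                   # standard deviation
iqr = np.percentile(data, 75) - np.percentile(data, 25)  # interquartile range
print(data_range, variance, std_dev, iqr)
```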
## Skewness
A measure of the lack of "symmetry" in the data.
**Positive skew (> 0)**: the right tail is longer; the mass of the distribution is concentrated on the left of the figure.
**Negative skew (< 0)**: the left tail is longer; the mass of the distribution is concentrated on the right of the figure.
```python
import numpy as np
from scipy.stats import skew
data = np.random.normal(0, 1, 1000)
print(skew(data))
```
## Skewness | Plot
## L1 | Q6
Which of the following would correctly calculate the median of the following list?
```python
data = [1, 2, 4, 3, 5]
```
A.
```python
median = sorted(data)[len(data) // 2]
```
B.
```python
median = sorted(data)[len(data) // 2 - 1]
```
C.
```python
median = sorted(data)[len(data) // 2 + 1]
```
## L1 | Q7
Is this distribution positively or negatively skewed?
A. Positively skewed
B. Negatively skewed
C. No skew
## Kurtosis
A measure of the "tailedness" of the distribution.
- **Leptokurtic (> 3)**: the tails are fatter than the normal distribution.
- **Mesokurtic (3)**: the tails are the same as the normal distribution.
- **Platykurtic (< 3)**: the tails are thinner than the normal distribution.
```python
import numpy as np
from scipy.stats import kurtosis
data = np.random.normal(0, 1, 1000)
# scipy returns *excess* kurtosis (normal ≈ 0) by default; fisher=False compares against 3.
print(kurtosis(data, fisher=False))
```
## Kurtosis | Plot
## Modality
The number of peaks in the distribution.
- **Unimodal**: One peak.
- **Bimodal**: Two peaks.
- **Multimodal**: More than two peaks.
*Source: Trees In Space 2016.*
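A quick sketch of generating bimodal data by mixing two normal components (the component parameters are arbitrary):
```python
import numpy as np
from plotly import express as px

# Mix two normal components to create a bimodal distribution.
data = np.concatenate([
    np.random.normal(-2, 1, 500),
    np.random.normal(3, 1, 500),
])
fig = px.histogram(data, nbins=50)
fig.show()
```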
## Normal Distribution | Definition
A normal distribution is a continuous probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. Mathematically, it is defined as:
$ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}} $
Where:
- $ \mu $ is the mean.
- $ \sigma^2 $ is the variance.
## Normal Distribution | Properties
The normal distribution has several key properties:
- **Symmetry**: The distribution is symmetric about the mean.
- **Unimodality**: The distribution has a single peak.
- **68-95-99.7 Rule**:
- 68% of the data falls within 1 standard deviation of the mean.
- 95% within 2 standard deviations.
- 99.7% within 3 standard deviations.
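A quick empirical check of the rule on a large standard normal sample:
```python
import numpy as np

data = np.random.normal(0, 1, 100_000)
for k in (1, 2, 3):
    within = np.mean(np.abs(data) < k)
    print(f"within {k} std: {within:.3f}")  # roughly 0.68, 0.95, 0.997
```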
## Normal Distribution | Properties
How can we tell if our data is normally distributed?
- **Skewness**: close to 0.
- **Kurtosis**: close to 3.
- **QQ Plot**: A plot of the quantiles of the data against the quantiles of the normal distribution.
- **Shapiro-Wilk Test**: A statistical test to determine if the data is normally distributed.
*Source: Khandelwal 2023.*
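Minimal sketches of the last two checks using scipy (`scipy.stats.probplot` draws a QQ plot against the normal distribution):
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

data = np.random.normal(0, 1, 500)

# Shapiro-Wilk test: a small p-value suggests the data is not normally distributed.
stat, p_value = stats.shapiro(data)
print(stat, p_value)

# QQ plot: points near the diagonal reference line suggest normality.
stats.probplot(data, dist="norm", plot=plt)
plt.show()
```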
## Descriptive EDA | Not Covered
There are many ways to summarize data, and we have only covered a few of them. Here are some common methods that we did not cover today:
- **Covariance**: A measure of the relationship between two variables.
- **Correlation**: A normalized measure of the relationship between two variables.
- **Outliers**: Data points that are significantly different from the rest of the data.
- **Missing Values**: Data points that are missing from the dataset.
- **Percentiles**: The value below which a given percentage of the data falls.
- **Frequency**: The number of times a value occurs in the data.
# Graphical EDA
## Graphical EDA | Data Types
There are three primary types of data -- nominal, ordinal, and numerical.
| Data Type | Definition | Example |
| --- | --- | --- |
| Nominal | Categorical data without an inherent order | ["red", "green", "orange"] |
| Ordinal | Categorical data with an inherent order | ["small", "medium", "large"], ["1", "2", "3"] |
| Numerical | Continuous or discrete numerical data | [3.1, 2.1, 2.4] |
## Graphical EDA | Choosing a Visualization
The type of visualization you choose will depend on:
- **Data type**: nominal, ordinal, numerical.
- **Dimensionality**: 1D, 2D, 3D+.
- **Story**: The story you want to tell with the data.
Whatever type of plot you choose, make sure your visualization is information dense **and** easy to interpret. It should always be clear what the plot is trying to convey.
## Graphical EDA | A Note on Tools
Matplotlib and Plotly are the most popular libraries for data visualization in Python.
| Library | Pros | Cons |
| --- | --- | --- |
| Matplotlib | Excellent for static publication-quality plots, very fast rendering, mature and well supported. | Steeper learning curve, many ways to do the same thing, no interactivity, dated default color schemes. |
| Plotly | Excellent for interactive plots, easy to use, easy tooling for animations, built-in support for dashboarding and publishing online. | Not as good for static plots, less fine-grained control, high density renders can be non-trivial. |
## Graphical EDA | Basic Visualization Types
### 1D Data
- Bar chart
- Pie chart
- Histogram
- Boxplot
- Violin plot
- Line plot
### 2D Data
- Scatter plot
- Heatmap
- Bubble plot
- Line plot
- Boxplot
- Violin plot
### 3D+ Data
- 3D scatter plot
- Bubble plot
- Color scatter plot
- Scatter plot matrix
## 1D Data | Histograms
When you have numerical data, histograms are a great way to visualize the distribution of the data. If there is a clear distribution, it's often useful to fit a probability density function (PDF).
```python
import numpy as np
from plotly import express as px
data = np.random.normal(0, 1, 100)
fig = px.histogram(data, nbins = 50)
fig.show()
```
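Since fitting a PDF was mentioned above, here is a minimal sketch overlaying a fitted normal curve on a density-normalized histogram (the normal fit is illustrative; pick a distribution that matches your data):
```python
import numpy as np
from scipy import stats
import plotly.graph_objects as go
from plotly import express as px

data = np.random.normal(0, 1, 1000)

# Fit a normal PDF to the sample and overlay it on the histogram.
mu, sigma = stats.norm.fit(data)
xs = np.linspace(data.min(), data.max(), 200)

fig = px.histogram(data, nbins=50, histnorm='probability density')
fig.add_trace(go.Scatter(x=xs, y=stats.norm.pdf(xs, mu, sigma), name='fitted normal PDF'))
fig.show()
```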
## 1D Data | Boxplots
Boxplots are a great way to visualize the distribution of the data, and to identify outliers.
```python
import numpy as np
from plotly import express as px
data = np.random.normal(0, 1, 100)
fig = px.box(data)
fig.show()
```
## 1D Data | Violin Plots
Violin plots are similar to box plots, but they also show the probability density of the data at different values.
```python
import numpy as np
from plotly import express as px
data = np.random.normal(0, 1, 100)
fig = px.violin(data)
fig.show()
```
## 1D Data | Bar Charts
Bar charts are a great way to visualize the distribution of categorical data.
```python
import numpy as np
from plotly import express as px
data = np.random.choice(["A", "B", "C"], 1000)
fig = px.histogram(data)
fig.show()
```
## 1D Data | Pie Charts
Pie charts are another way to visualize the distribution of categorical data.
```python
import numpy as np
from plotly import express as px
data = np.random.choice(["A", "B", "C"], 100)
labels, counts = np.unique(data, return_counts=True)
fig = px.pie(names=labels, values=counts)
fig.show()
```
## 2D Data | Scatter Plots
Scatter plots can help visualize the relationship between two numerical variables.
```python
import numpy as np
from plotly import express as px
x = np.random.normal(0, 1, 100)
y = np.random.normal(0, 1, 100)
fig = px.scatter(x = x, y = y)
fig.show()
```
## 2D Data | Heatmaps (2D Histogram)
Heatmaps help to visualize the density of data in 2D.
```python
import numpy as np
from plotly import express as px
x = np.random.normal(0, 1, 100)
y = np.random.normal(0, 1, 100)
fig = px.density_heatmap(x = x, y = y)
fig.show()
```
## 3D+ Data | Bubble Plots
Bubble plots are a great way to visualize the relationship between three numerical variables and a categorical variable.
```python
import numpy as np
from plotly import express as px
x = np.random.normal(0, 1, 100)
y = np.random.normal(0, 1, 100)
z = np.random.uniform(1, 10, 100)  # marker sizes must be non-negative
c = np.random.choice(["A", "B", "C"], 100)
fig = px.scatter(x = x, y = y, size = z, color = c)
fig.show()
```
## 3D+ Data | Scatter Plots
Instead of using the size of the markers (as in the bubble plot), you can use another axis to represent a third numerical variable. And, you still have the option to color by a categorical variable.
```python
import numpy as np
from plotly import express as px
x = np.random.normal(0, 1, 100)
y = np.random.normal(0, 1, 100)
z = np.random.normal(0, 1, 100)
c = np.random.choice(["A", "B", "C"], 100)
fig = px.scatter_3d(x = x, y = y, z = z, color = c)
fig.show()
```
## Graphical EDA | Advanced Visualization Types
- ML Results
- Residual Plots
- Regression in 3D
- Decision Boundary
- Parallel Coordinates
- Maps / Choropleth
### Residual Plots
Residual plots are a great way to visualize the residuals of a model. They can help you identify patterns in the residuals, which can help you identify issues with your model.
### Regression in 3D
Regression in 3D is a great way to visualize the relationship between three numerical variables.
### Decision Boundary
Decision boundaries are a great way to visualize the decision-making process of a classification model.
### Parallel Coordinates
Parallel coordinates are a great way to visualize the relationship between multiple numerical variables, often used to represent hyperparameter tuning results.
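A minimal sketch using plotly's built-in iris dataset:
```python
from plotly import express as px

# Each line is one observation; each vertical axis is one numerical variable.
df = px.data.iris()
fig = px.parallel_coordinates(
    df,
    color="species_id",
    dimensions=["sepal_width", "sepal_length", "petal_width", "petal_length"],
)
fig.show()
```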
### Maps / Choropleth
Choropleth maps are a great way to visualize geographically distributed data. They can help you understand the spatial distribution of the data and identify regional patterns.
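A minimal sketch using plotly's built-in gapminder dataset:
```python
from plotly import express as px

# Color each country by GDP per capita for a single year.
df = px.data.gapminder().query("year == 2007")
fig = px.choropleth(df, locations="iso_alpha", color="gdpPercap", hover_name="country")
fig.show()
```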
## EDA TLDR;
- **Descriptive EDA**: Summarizing the main characteristics of the data.
- **Graphical EDA**: Visualizing the data to understand its structure and patterns.