Reference Datasets for the Workshop

This page lists ten classic, freely available datasets that can be loaded directly from the internet — no local files needed. Each entry includes the original academic source, a code snippet to load the data, a quick exploration command, and a basic visualisation.

Note

The last three datasets (MNIST, Fashion-MNIST, CIFAR-10) are large image collections and may take several minutes to download. They are included here for reference but are not recommended for use during the timed hands-on exercises. Start with Titanic, Iris, Palmer Penguins, or Gapminder for the exercises.

Quick reference

| Dataset | Size | Good for |
| --- | --- | --- |
| Titanic | 891 rows | Bar charts, survival analysis, categorical comparisons |
| Iris | 150 rows | Scatter plots, species comparison, classification basics |
| Palmer Penguins | 344 rows | Scatter plots, multi-species comparison, missing values |
| Wine Quality | 1 599 rows | Distribution plots, regression, continuous quality scores |
| Tips | 244 rows | Scatter, correlation, categorical colour coding |
| Diamonds | 53 940 rows | Large-scale exploration, overplotting, sampling strategies |
| Gapminder | Multi-year country data | Time series, bubble charts, storytelling with data |
| MNIST | 70 000 images (28×28 px) | Image classification, neural network basics |
| Fashion-MNIST | 70 000 images (28×28 px) | Image classification, drop-in MNIST replacement |
| CIFAR-10 | 60 000 images (32×32 px, colour) | Colour image classification |


Titanic

Passenger survival data from the 1912 disaster of the RMS Titanic. Features include passenger class (Pclass), sex, age, fare paid, cabin, port of embarkation, and whether the passenger survived. The most widely used introductory dataset for classification and exploratory data analysis.

Source: Various versions in circulation; this one comes from the pandas documentation. Underlying records from Encyclopedia Titanica.

Load:

import pandas as pd
url = "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
titanic = pd.read_csv(url)
titanic.head()

Explore:

titanic.describe()
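For the survival-analysis angle, grouping by a categorical column and averaging the 0/1 Survived flag gives survival rates directly. A minimal sketch on a toy frame that reuses the Titanic column names (not the real data):

```python
import pandas as pd

# toy frame using the Titanic column names (not the real dataset)
df = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Survived': [0, 1, 1, 1],
})

# the mean of a 0/1 indicator column is the survival rate per group
rates = df.groupby('Sex')['Survived'].mean()
print(rates)
```

On the real frame the same pattern works with 'Pclass', or with ['Sex', 'Pclass'] for a two-way breakdown.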

Visualise:

import altair as alt
alt.Chart(titanic).mark_bar().encode(
    x=alt.X('Age:Q', bin=True),
    y='count()',
    color='Survived:N'
).properties(title='Survival by Age')

Iris

Measurements of 150 iris flowers across three species (Iris setosa, I. versicolor, I. virginica). Four numeric features: sepal length, sepal width, petal length, petal width (all in cm). The most famous classification benchmark dataset in machine learning, introduced by statistician R. A. Fisher in 1936.

Source: Fisher, R. A. (1936). “The use of multiple measurements in taxonomic problems.” Annals of Eugenics, 7(2), 179–188. DOI: 10.1111/j.1469-1809.1936.tb02137.x. Data originally collected by Edgar Anderson (1935).

Load:

import pandas as pd
from sklearn.datasets import load_iris

iris_raw = load_iris()
iris = pd.DataFrame(iris_raw.data, columns=iris_raw.feature_names)
iris['species'] = [iris_raw.target_names[t] for t in iris_raw.target]
iris.head()

Explore:

iris.describe()

Visualise:

import altair as alt
alt.Chart(iris).mark_circle(size=60).encode(
    x='petal length (cm):Q',
    y='petal width (cm):Q',
    color='species:N',
    tooltip=iris.columns.tolist()
).properties(title='Iris: Petal Dimensions by Species')
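To make the "classification benchmark" claim concrete, here is a minimal scikit-learn sketch; the choice of a k-nearest-neighbours model and the split parameters are illustrative, not part of the workshop material:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# hold out a quarter of the flowers, fit a 3-NN classifier on the rest
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(f"Held-out accuracy: {score:.3f}")
```

Even this naive model scores well above 90% on the held-out flowers, which is part of why Iris is considered an easy benchmark.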

Palmer Penguins

Measurements of 344 penguins from three species (Adélie, Chinstrap, Gentoo) on three islands in the Palmer Archipelago, Antarctica. Features include bill length, bill depth, flipper length, body mass, island, and sex. Recommended as a modern, more intuitive replacement for the Iris dataset — and it has a more interesting story behind it.

Source: Horst, A. M., Hill, A. P., & Gorman, K. B. (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package. DOI: 10.5281/zenodo.3960218. Original measurements: Gorman, K. B., Williams, T. D., & Fraser, W. R. (2014). PLOS ONE, 9(3), e90081. DOI: 10.1371/journal.pone.0090081.

Load:

import pandas as pd
url = "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv"
penguins = pd.read_csv(url)
penguins.head()

Explore:

penguins.describe()
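One reason this dataset is recommended over Iris is that it contains missing values, so penguins.isna().sum() is worth running right after loading. The counting pattern, sketched on a toy frame with similar columns (not the real data):

```python
import numpy as np
import pandas as pd

# toy frame mimicking the penguins columns, with deliberate gaps
df = pd.DataFrame({
    'species': ['Adelie', 'Gentoo', 'Chinstrap'],
    'bill_length_mm': [39.1, np.nan, 46.5],
    'sex': ['male', None, 'female'],
})

# count missing values per column
missing = df.isna().sum()
print(missing)
```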

Visualise:

import altair as alt
alt.Chart(penguins).mark_circle(size=60).encode(
    x='bill_length_mm:Q',
    y='flipper_length_mm:Q',
    color='species:N',
    tooltip=['species', 'island', 'bill_length_mm', 'body_mass_g']
).properties(title='Penguins: Bill Length vs Flipper Length')

Wine Quality

Chemical analysis of 1 599 red Portuguese Vinho Verde wines, each rated for quality on a scale from 0 to 10 by wine experts. Features include fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free and total sulfur dioxide, density, pH, sulphates, and alcohol. Good for both regression (predicting the quality score) and classification (good vs. average vs. poor wine).

Source: Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). “Modeling wine preferences by data mining from physicochemical properties.” Decision Support Systems, 47(4), 547–553. Available at the UCI Machine Learning Repository.

Load:

import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine = pd.read_csv(url, sep=';')
wine.head()

Explore:

wine.describe()
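For the classification framing (good vs. average vs. poor), the continuous quality score can be bucketed with pd.cut. A sketch on toy scores; the thresholds below are illustrative, not taken from the paper:

```python
import pandas as pd

# toy quality scores on the dataset's 0-10 scale
quality = pd.Series([3, 5, 5, 6, 7, 8])

# right-inclusive bins: (0, 4] poor, (4, 6] average, (6, 10] good
labels = pd.cut(quality, bins=[0, 4, 6, 10],
                labels=['poor', 'average', 'good'])
print(labels.tolist())
```

On the real frame, pd.cut(wine['quality'], ...) with the same bins yields a categorical column ready for a classifier.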

Visualise:

import altair as alt
alt.Chart(wine).mark_bar().encode(
    x='quality:O',
    y='count()',
).properties(title='Distribution of Wine Quality Ratings')

Tips

244 restaurant bills recorded by a waiter over several months. Features include total bill, tip amount, sex of the bill payer, smoker status, day of the week, meal time (Lunch/Dinner), and party size. A classic dataset for exploring correlations and building simple regression models.

Source: Bryant, P. G. & Smith, M. A. (1995). Practical Data Analysis: Case Studies in Business Statistics. Richard D. Irwin Publishing. Popularised in Python by the seaborn library.

Load:

import seaborn as sns
tips = sns.load_dataset('tips')
tips.head()

Explore:

tips.describe()
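For the correlation angle, Series.corr gives the Pearson coefficient directly; a sketch on toy values (perfectly linear by construction, so r is exactly 1):

```python
import pandas as pd

# toy bills and tips, linear by construction
df = pd.DataFrame({
    'total_bill': [10.0, 20.0, 30.0, 40.0],
    'tip': [1.5, 3.0, 4.5, 6.0],
})
r = df['total_bill'].corr(df['tip'])
print(r)
```

On the real data, tips['total_bill'].corr(tips['tip']) is the same call, with a noisier result.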

Visualise:

import altair as alt
alt.Chart(tips).mark_circle(size=60).encode(
    x='total_bill:Q',
    y='tip:Q',
    color='time:N',
    size='size:Q',
    tooltip=tips.columns.tolist()
).properties(title='Tip Amount vs Total Bill')

Diamonds

Prices and physical attributes of 53 940 diamonds. Features include carat, cut (Fair/Good/Very Good/Premium/Ideal), color (D–J), clarity (I1–IF), depth percentage, table percentage, and xyz dimensions in mm. Large enough to illustrate overplotting and why you sometimes need to sample or aggregate data before visualising.

Source: Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer. Originally sourced from diamondse.info (2008). Bundled in the seaborn and ggplot2 libraries.

Overplotting with large datasets

With 53 940 rows, plotting every point creates a dense blob where nothing is visible. Use random sampling (diamonds.sample(2000)) or 2D binning (mark_rect() in Altair) when exploring this dataset.

Load:

import seaborn as sns
diamonds = sns.load_dataset('diamonds')
diamonds.head()

Explore:

diamonds.describe()

Visualise (sampled to avoid overplotting):

import altair as alt
alt.Chart(diamonds.sample(2000, random_state=42)).mark_circle(
    size=20, opacity=0.5
).encode(
    x='carat:Q',
    y='price:Q',
    color='cut:N',
    tooltip=['carat', 'cut', 'color', 'clarity', 'price']
).properties(title='Diamond Price vs Carat (random sample of 2 000)')

Gapminder

Country-level life expectancy, fertility rate, and population, recorded at five-year intervals. Made famous worldwide by the late statistician and public health researcher Hans Rosling in his 2006 TED talk; note that the vega_datasets version loaded below carries a fertility column rather than the GDP-per-capita axis of Rosling's original chart. The animated bubble chart Rosling used is one of the most influential data visualisations of the 21st century.

Source: Gapminder Foundation. Rosling, H. (2006). TED talk: “The best stats you’ve ever seen”. Data bundled in the vega_datasets Python package.

Load:

from vega_datasets import data
gapminder = data.gapminder()
gapminder.head()

Explore:

gapminder.describe()

Visualise (one year snapshot):

import altair as alt
alt.Chart(gapminder[gapminder.year == 2000]).mark_circle().encode(
    x=alt.X('fertility:Q', title='Fertility Rate'),
    y=alt.Y('life_expect:Q', title='Life Expectancy'),
    size=alt.Size('pop:Q', legend=None),
    color='cluster:N',
    tooltip=['country', 'life_expect', 'fertility', 'pop']
).properties(title='Life Expectancy vs Fertility Rate (Year 2000)')

MNIST — Handwritten Digits

Note

This dataset is approximately 55 MB. Downloading it for the first time may take a few minutes depending on your connection speed. Not recommended for the timed hands-on exercises.

70 000 grayscale images (28 × 28 pixels) of handwritten digits 0–9, split into 60 000 training and 10 000 test images. The “hello world” benchmark of image classification and neural networks. Every machine learning practitioner has seen this dataset.

Source: LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). “Gradient-based learning applied to document recognition.” Proceedings of the IEEE, 86(11), 2278–2324. Dataset: http://yann.lecun.com/exdb/mnist/.

Load:

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=True, parser='auto')
print(f"Images: {mnist.data.shape}")
print(f"Labels: {sorted(mnist.target.unique())}")

Visualise a grid of sample digits:

import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 8, figsize=(12, 3))
for i, ax in enumerate(axes.flat):
    ax.imshow(mnist.data.iloc[i].values.reshape(28, 28), cmap='gray')
    ax.set_title(mnist.target.iloc[i])
    ax.axis('off')
plt.suptitle('MNIST: Sample Handwritten Digits')
plt.tight_layout()
plt.show()

Fashion-MNIST

Note

This dataset is approximately 30 MB. Downloading it for the first time may take a minute or two. Not recommended for the timed hands-on exercises.

70 000 grayscale images (28 × 28 pixels) of clothing items across 10 categories: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot. Drop-in replacement for MNIST (same format, same size) but considered harder and more visually interesting. Created by Zalando Research as a more realistic benchmark.

Source: Xiao, H., Rasul, K., & Vollgraf, R. (2017). “Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms.” arXiv:1708.07747. Created by Zalando Research.

Load:

from sklearn.datasets import fetch_openml
fashion = fetch_openml('Fashion-MNIST', version=1, as_frame=True, parser='auto')
label_names = ['T-shirt', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
print(f"Images: {fashion.data.shape}")

Visualise a grid of sample items:

import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 8, figsize=(12, 3))
for i, ax in enumerate(axes.flat):
    ax.imshow(fashion.data.iloc[i].values.reshape(28, 28), cmap='gray')
    ax.set_title(label_names[int(fashion.target.iloc[i])], fontsize=8)
    ax.axis('off')
plt.suptitle('Fashion-MNIST: Sample Items')
plt.tight_layout()
plt.show()

CIFAR-10 — Colour Images

Note

This dataset is approximately 150 MB. Downloading it for the first time will take several minutes. Not recommended for the timed hands-on exercises.

60 000 small colour images (32 × 32 pixels, RGB) across 10 real-world categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck. The natural next step from MNIST — images are in colour and represent real-world objects. A standard benchmark for convolutional neural networks.

Source: Krizhevsky, A. (2009). “Learning Multiple Layers of Features from Tiny Images.” Technical Report, University of Toronto. Dataset: https://www.cs.toronto.edu/~kriz/cifar.html.

Load:

from sklearn.datasets import fetch_openml
# note: 'CIFAR_10_small' is a reduced OpenML subset; the full CIFAR-10 has 60 000 images
cifar = fetch_openml('CIFAR_10_small', version=1, as_frame=True, parser='auto')
label_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']
print(f"Images: {cifar.data.shape}")

Visualise a grid of sample images:

import matplotlib.pyplot as plt
import numpy as np
fig, axes = plt.subplots(2, 8, figsize=(12, 3))
for i, ax in enumerate(axes.flat):
    img = cifar.data.iloc[i].values.reshape(3, 32, 32).transpose(1, 2, 0)
    ax.imshow(img.astype('uint8'))
    ax.set_title(label_names[int(cifar.target.iloc[i])], fontsize=8)
    ax.axis('off')
plt.suptitle('CIFAR-10: Sample Images')
plt.tight_layout()
plt.show()