Introduction to Pandas

Objectives

  • Explain the relationship between Pandas and NumPy, highlighting when to use each library effectively.

  • Define and differentiate between Series and DataFrame data structures in Pandas.

  • Demonstrate the creation of Pandas objects from NumPy arrays, solidifying foundational knowledge.

Instructor note

  • Teaching : 10 min

  • Demo: 10 min

What is Pandas?

  • A powerful Python library for data manipulation and analysis

  • Built on top of NumPy (which you’re already familiar with)

  • Essential tool in data science, machine learning, and analytics

  • Name comes from “panel data” - economic term for multidimensional data

Note

  • “Pandas is to data analysis what NumPy is to numerical computing.”

  • “If you’ve ever used Excel or SQL, Pandas will feel somewhat familiar.”

  • “Today we’ll see how the NumPy skills you’ve learned provide the foundation for Pandas.”

Pandas vs NumPy Relationship

Key Differences:

NumPy

Pandas

Homogeneous arrays

Heterogeneous data

Numerical focus

Tabular data focus

Unlabeled axes

Labeled axes

Fast mathematical operations

Data manipulation operations

Memory efficient

Feature rich

Demo

import numpy as np
import pandas as pd

# Try to put mixed types in NumPy - notice what happens
mixed_numpy = np.array([1, 'string', 3.14])
print(f"Mixed NumPy array: {mixed_numpy}")
print(f"Mixed array dtype: {mixed_numpy.dtype}")  # Converts to common type

# Pandas Series - can handle mixed types
pandas_series = pd.Series([1, 'string', 3.14])
print("\nPandas Series with mixed types:")
print(pandas_series)

Demo

print([type(i) for i in mixed_numpy])
print([type(i) for i in pandas_series])

DataFrame Data Structures

  • Pandas introduces two new data structures

    • Series: A labeled one-dimensional array

    • DataFrames: Labeled Tables for Powerful Data Analysis

Pandas Series

  • 1D labeled array capable of holding any data type

  • Like a cross between a list and a dictionary

  • Has an Index (labels) and values (data)

  • Built on NumPy, so supports vectorized operations

Demo

# Creating a Series
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print("Pandas Series with custom index:")
print(s)
print(f"The index: {s.index}")
print(f"The values: {s.values}")  # Notice this returns a NumPy array

# Dictionary-like access
print(f"\nValue at index 'b': {s['b']}")
print(f"Values at indices 'a' and 'c': {s[['a', 'c']]}")

# NumPy-like operations
print(f"\nAdding 5 to all values: \n{s + 5}")
print(f"Values greater than 25: \n{s > 25}")

More info

Additional notes: Why ues labels?

Note

Why ues labels and not numerical indexing?

  • Labels provide context and meaning to your data (similar to Python Dictionaries)

  • Flexibility in reshaping and manipulating:

    • Numerical indexing assumes a fixed order

    • With labels, you can add, remove, or reorder elements without affecting the data retrieval process

  • Handling Missing Data:

    • Numerical indexing can become problematic with missing entries in data. However, Labels let you explicitly handle missing entries using designated labels (e.g., NaN)

  • Intrinsic Data alignment with labels:

    • When performing operations between two Series, pandas aligns data based on the label of the indices, not the integer location

    • This makes operations across data structures with potentially differently ordered labels possible and coherent

The Pandas DataFrame

  • 2D labeled data structure with columns of potentially different types

  • Think: spreadsheet or SQL table

  • Collection of Series objects that share the same index

  • Primary data structure you’ll use in data analysis

Demo

# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Boston', 'Chicago']
}
df = pd.DataFrame(data)
print("Pandas DataFrame:")
print(df)

# DataFrame info
print(f"\nDataFrame shape: {df.shape}")
print(f"DataFrame columns: {df.columns}")
print(f"DataFrame index: {df.index}")

# Accessing a column (returns a Series)
print("\nAccessing the Age column:")
print(df['Age'])
print(f"Type of column: {type(df['Age'])}")

More info

Additional notes: Pandas data structures
  • A Series is the building block in Pandas - think of it as a single column of data

  • A DataFrame is a collection of Series objects sharing an index - like a table made up of columns

  • The real power comes from being able to manipulate, filter, and analyze these structures easily

DataFrame is made up of Series objects, and that both have labeled axes that make data access intuitive.

External resources: Intro to data structures

Creating Pandas Objects from NumPy Arrays

  • Pandas objects can be created directly from NumPy arrays

  • Indexes can be automatically generated or custom-defined

  • The underlying NumPy array is still accessible

  • Pandas adds powerful indexing and data manipulation features

Numpy array to Pandas series

Demo

# NumPy array to Pandas Series
numpy_data = np.array([100, 200, 300, 400])
s1 = pd.Series(numpy_data)
print("Series with default index:")
print(s1)

s2 = pd.Series(numpy_data, index=['w', 'x', 'y', 'z'])
print("\nSeries with custom index:")
print(s2)

More info

Additional notes: Pandas dataframe from Numpy 2D array

Numpy 2D array to Pandas dataframe:

# NumPy 2D array to DataFrame
array_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df1 = pd.DataFrame(array_2d)
print("\nDataFrame from 2D array (default column names):")
print(df1)

df2 = pd.DataFrame(array_2d, 
                  columns=['A', 'B', 'C'],
                  index=['row1', 'row2', 'row3'])
print("\nDataFrame with custom row and column labels:")
print(df2)

# Converting back to NumPy
numpy_again = df2.to_numpy()
print("\nConverted back to NumPy:")
print(numpy_again)

DataFrame from 2D array (default column names):
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9

DataFrame with custom row and column labels:
      A  B  C
row1  1  2  3
row2  4  5  6
row3  7  8  9

Converted back to NumPy:
[[1 2 3]
 [4 5 6]
 [7 8 9]]

Note:

  • Your existing NumPy knowledge transfers directly to Pandas

  • The key difference is the addition of labels, which makes data more self-describing

  • Notice how Pandas automatically names axes when none are provided, but lets you customize when needed

  • You can always convert back to NumPy when you need pure numerical performance

    • When you convert a Pandas Series to a NumPy array using .to_numpy(), you are essentially extracting the data values

    • The labeled index, which is a key feature of Pandas Series, is not part of the underlying NumPy array structure

    • The resulting NumPy array reverts to its default numerical indexing.

Additional notes: Other ways to create Pandas DataFrames

Creating Pandas DataFrames:

import pandas as pd

# 1. From Dictionaries

# Standard dictionary method
data = {'Name': 'John', 'Age': 30, 'City': 'New York'}
df1a = pd.DataFrame(data, index=[0])

# Using from_dict() method
df1b = pd.DataFrame.from_dict(data, orient='index').T
print("DataFrame from simple dictionary using from_dict:")
print(df1b)

# Dictionary with equal-length lists
data = {'Name': ['John', 'Anna', 'Peter'], 
        'Age': [30, 28, 45], 
        'City': ['New York', 'Boston', 'Chicago']}
df2a = pd.DataFrame(data)

# Using from_dict() method (default orient='columns')
df2b = pd.DataFrame.from_dict(data)
print("\nDataFrame from dictionary with lists using from_dict:")
print(df2b)

# 2. From Dictionaries with Lists

# Nested dictionary (records format)
records = [
    {'Name': 'John', 'Age': 30, 'City': 'New York'},
    {'Name': 'Anna', 'Age': 28, 'City': 'Boston'},
    {'Name': 'Peter', 'Age': 45, 'City': 'Chicago'}
]
df3a = pd.DataFrame(records)

# Using from_records instead of from_dict for this format
df3b = pd.DataFrame.from_records(records)
print("\nDataFrame from nested dictionary using from_records:")
print(df3b)

# Using from_dict with orient='index' to transpose dictionary
indexed_dict = {
    'Person1': {'Name': 'John', 'Age': 30, 'City': 'New York'},
    'Person2': {'Name': 'Anna', 'Age': 28, 'City': 'Boston'},
    'Person3': {'Name': 'Peter', 'Age': 45, 'City': 'Chicago'}
}
df3c = pd.DataFrame.from_dict(indexed_dict, orient='index')
print("\nDataFrame from nested dictionary using from_dict with orient='index':")
print(df3c)

# 3. From Lists of Lists

# Create from list of lists with column names
data = [
    ['John', 30, 'New York'],
    ['Anna', 28, 'Boston'],
    ['Peter', 45, 'Chicago']
]
columns = ['Name', 'Age', 'City']
df4 = pd.DataFrame(data, columns=columns)

# For lists of lists, we typically use the DataFrame constructor
# But we can convert to dict first if needed
dict_from_lists = {columns[i]: [row[i] for row in data] for i in range(len(columns))}
df5 = pd.DataFrame.from_dict(dict_from_lists)
print("\nDataFrame from list of lists converted to dict first, then using from_dict:")
print(df5)

More info

Additional notes: Exercises

Exercise 1:

Imagine the random_array represents quarterly sales data for three different products. Explain how the provided code transforms this raw numerical data into a more informative and manageable structure using Pandas, allowing for labeled access and calculations.

Have students execute:

# Create a 3x4 NumPy array of random integers
random_array = np.random.randint(1, 100, size=(3, 4))
print("Original NumPy array:")
print(random_array)

Exercise 1 - Solution:

# Convert to DataFrame with custom labels
quarters_df = pd.DataFrame(
    random_array,
    columns=['Q1', 'Q2', 'Q3', 'Q4'],
    index=['Product A', 'Product B', 'Product C']
)
print("\nAs Pandas DataFrame with labels:")
print(quarters_df)

# Extract Q2 column as a Series
q2_series = quarters_df['Q2']
print("\nQ2 column as a Series:")
print(q2_series)
print(f"Series type: {type(q2_series)}")

Exercise 2: Creating pandas DataFrames from list of dictionaries:

  • Each dictionary represents a row and its keys are the column names

 data = [
     {'Name': 'Alice', 'Age': 25, 'Score': 85},
     {'Name': 'Bob', 'Age': 30, 'Score': 70},
     {'Name': 'Charlie', 'Age': 22, 'Score': 90}
 ]
df = pd.DataFrame(data)
print(df)

Key Takeaways

Keypoints

  1. Building on NumPy: Pandas uses NumPy under the hood, adding labels and mixed-type support

  2. Series: 1D labeled array similar to a dictionary + array hybrid

  3. DataFrame: 2D labeled structure, like a collection of Series sharing an index

  4. Labels Matter: Named axes make your data self-documenting and easier to work with

  5. Data Types: Pandas handles mixed types better than NumPy, crucial for real-world data