Introduction to Pandas
Objectives
Explain the relationship between Pandas and NumPy, highlighting when to use each library effectively.
Define and differentiate between Series and DataFrame data structures in Pandas.
Demonstrate the creation of Pandas objects from NumPy arrays, solidifying foundational knowledge.
Instructor note
Teaching: 10 min
Demo: 10 min
What is Pandas?
A powerful Python library for data manipulation and analysis
Built on top of NumPy (which you’re already familiar with)
Essential tool in data science, machine learning, and analytics
Name comes from “panel data”, an econometrics term for multidimensional data
Note
“Pandas is to data analysis what NumPy is to numerical computing.”
“If you’ve ever used Excel or SQL, Pandas will feel somewhat familiar.”
“Today we’ll see how the NumPy skills you’ve learned provide the foundation for Pandas.”
Pandas vs NumPy Relationship
Key Differences:
NumPy | Pandas |
---|---|
Homogeneous arrays | Heterogeneous data |
Numerical focus | Tabular data focus |
Unlabeled axes | Labeled axes |
Fast mathematical operations | Data manipulation operations |
Memory efficient | Feature rich |
Demo
import numpy as np
import pandas as pd
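# A homogeneous NumPy array - every element shares one dtype
numpy_array = np.array([1, 2, 3, 4, 5])
print(f"NumPy array: {numpy_array}")
print(f"NumPy data type: {numpy_array.dtype}")
# Try to put mixed types in NumPy - notice what happens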
mixed_numpy = np.array([1, 'string', 3.14])
print(f"Mixed NumPy array: {mixed_numpy}")
print(f"Mixed array dtype: {mixed_numpy.dtype}") # Converts to common type
# Pandas Series - can handle mixed types
pandas_series = pd.Series([1, 'string', 3.14])
print("\nPandas Series with mixed types:")
print(pandas_series)
Output
NumPy array: [1 2 3 4 5]
NumPy data type: int64
Mixed NumPy array: ['1' 'string' '3.14']
Mixed array dtype: <U32 # Unicode string with a maximum length of 32 characters
Pandas Series with mixed types:
0 1
1 string
2 3.14
dtype: object
Demo
print([type(i) for i in mixed_numpy])
print([type(i) for i in pandas_series])
Output
[<class 'numpy.str_'>, <class 'numpy.str_'>, <class 'numpy.str_'>]
# NumPy converts everything to strings when faced with mixed types
[<class 'int'>, <class 'str'>, <class 'float'>]
# Pandas preserves the original types using an 'object' dtype
Pandas Data Structures
Pandas introduces two new data structures:
Series: A labeled one-dimensional array
DataFrame: A labeled two-dimensional table for data analysis
Pandas Series
1D labeled array capable of holding any data type
Like a cross between a list and a dictionary
Has an Index (labels) and values (data)
Built on NumPy, so supports vectorized operations
Demo
# Creating a Series
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print("Pandas Series with custom index:")
print(s)
print(f"The index: {s.index}")
print(f"The values: {s.values}") # Notice this returns a NumPy array
# Dictionary-like access
print(f"\nValue at index 'b': {s['b']}")
print(f"Values at indices 'a' and 'c': {s[['a', 'c']]}")
# NumPy-like operations
print(f"\nAdding 5 to all values: \n{s + 5}")
print(f"Values greater than 25: \n{s > 25}")
Output
Pandas Series with custom index:
a 10
b 20
c 30
d 40
dtype: int64
The index: Index(['a', 'b', 'c', 'd'], dtype='object')
The values: [10 20 30 40]
Value at index 'b': 20
Values at indices 'a' and 'c': a 10
c 30
dtype: int64
Adding 5 to all values:
a 15
b 25
c 35
d 45
dtype: int64
Values greater than 25:
a False
b False
c True
d True
dtype: bool
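The “cross between a list and a dictionary” description can be sketched as follows. This is an illustrative example with made-up labels and values, not part of the original demo.

```python
import pandas as pd

# Build a Series from a dictionary: keys become the index, values the data
# (labels and counts are invented for illustration)
stock = pd.Series({"apples": 3, "pears": 5, "plums": 7})

# Dictionary-like behavior: membership tests check the labels
has_pears = "pears" in stock
has_kiwis = "kiwis" in stock

# .get() returns a default instead of raising for a missing label
kiwi_count = stock.get("kiwis", 0)

# List-like behavior: it has a length and a fixed order
n_items = len(stock)
ordered_labels = list(stock.index)

print(has_pears, has_kiwis, kiwi_count, n_items, ordered_labels)
```

Note that `in` tests the labels, not the values, which is the same convention a Python dictionary follows.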
More info
Additional notes: Why use labels?
Note
Why use labels and not numerical indexing?
Labels provide context and meaning to your data (similar to Python Dictionaries)
Flexibility in reshaping and manipulating:
Numerical indexing assumes a fixed order
With labels, you can add, remove, or reorder elements without affecting the data retrieval process
Handling Missing Data:
Numerical indexing can become problematic when entries are missing. With labels, a missing entry stays explicit: its label remains in the index and its value is marked with a placeholder (e.g., NaN)
Intrinsic Data alignment with labels:
When performing operations between two Series, pandas aligns data based on the label of the indices, not the integer location
This makes operations across data structures with potentially differently ordered labels possible and coherent
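A small sketch of the points above, using two hypothetical Series whose labels are ordered differently and only partly overlap. The names and numbers are invented for illustration.

```python
import pandas as pd

# Two Series with overlapping labels in different orders
jan = pd.Series([10, 20, 30], index=["a", "b", "c"])
feb = pd.Series([3, 1, 2], index=["c", "a", "d"])

# Addition matches values by label, not by position
total = jan + feb
print(total)

# Labels present in only one Series yield NaN rather than a wrong pairing
missing_labels = total[total.isna()].index.tolist()
print(f"Labels without a match: {missing_labels}")

# Reordering does not change what a label lookup returns
reordered = jan[["c", "a", "b"]]
print(reordered["a"] == jan["a"])
```

Here `'a'` is paired with `'a'` (10 + 1) and `'c'` with `'c'` (30 + 3), even though they sit at different positions in the two Series.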
The Pandas DataFrame
2D labeled data structure with columns of potentially different types
Think: spreadsheet or SQL table
Collection of Series objects that share the same index
Primary data structure you’ll use in data analysis
Demo
# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Boston', 'Chicago']
}
df = pd.DataFrame(data)
print("Pandas DataFrame:")
print(df)
# DataFrame info
print(f"\nDataFrame shape: {df.shape}")
print(f"DataFrame columns: {df.columns}")
print(f"DataFrame index: {df.index}")
# Accessing a column (returns a Series)
print("\nAccessing the Age column:")
print(df['Age'])
print(f"Type of column: {type(df['Age'])}")
Output
Pandas DataFrame:
Name Age City
0 Alice 25 New York
1 Bob 30 Boston
2 Charlie 35 Chicago
DataFrame shape: (3, 3)
DataFrame columns: Index(['Name', 'Age', 'City'], dtype='object')
DataFrame index: RangeIndex(start=0, stop=3, step=1)
Accessing the Age column:
0 25
1 30
2 35
Name: Age, dtype: int64
Type of column: <class 'pandas.core.series.Series'>
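To see the “columns of potentially different types” point concretely, the following sketch builds a small table and inspects each column's dtype. It reuses the demo's `Name` and `Age` columns; the `Height` column is a hypothetical addition.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Height": [1.65, 1.80, 1.75],  # hypothetical column for illustration
})

# One dtype per column: integers, floats, and text live side by side
print(df.dtypes)

# Numeric columns keep their numeric dtype, so vectorized math still works
mean_age = df["Age"].mean()
age_next_year = df["Age"] + 1
print(f"Mean age: {mean_age}")
print(age_next_year.tolist())
```

Because types are tracked per column, adding a text column does not turn the numeric columns into strings, unlike the mixed NumPy array shown earlier.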
More info
Additional notes: Pandas data structures
A Series is the building block in Pandas - think of it as a single column of data
A DataFrame is a collection of Series objects sharing an index - like a table made up of columns
The real power comes from being able to manipulate, filter, and analyze these structures easily
A DataFrame is made up of Series objects, and both have labeled axes that make data access intuitive.
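As a rough sketch of “a collection of Series sharing an index”, the example below assembles a DataFrame from two Series and then pulls a column back out. The labels and values are invented for illustration.

```python
import pandas as pd

# Two Series that share the same index labels
labels = ["alice", "bob", "charlie"]
ages = pd.Series([25, 30, 35], index=labels)
cities = pd.Series(["New York", "Boston", "Chicago"], index=labels)

# Each Series becomes one column; the shared index becomes the row labels
people = pd.DataFrame({"Age": ages, "City": cities})
print(people)

# Pulling a column back out gives a Series with the same index
age_column = people["Age"]
print(type(age_column).__name__)
print(age_column.index.equals(ages.index))
```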
Creating Pandas Objects from NumPy Arrays
Pandas objects can be created directly from NumPy arrays
Indexes can be automatically generated or custom-defined
The underlying NumPy array is still accessible
Pandas adds powerful indexing and data manipulation features
NumPy array to Pandas Series
Demo
# NumPy array to Pandas Series
numpy_data = np.array([100, 200, 300, 400])
s1 = pd.Series(numpy_data)
print("Series with default index:")
print(s1)
s2 = pd.Series(numpy_data, index=['w', 'x', 'y', 'z'])
print("\nSeries with custom index:")
print(s2)
Output
Series with default index:
0 100
1 200
2 300
3 400
dtype: int64
Series with custom index:
w 100
x 200
y 300
z 400
dtype: int64
Demo
print("Original Pandas Series (s2):")
print(s2)
print("\nIndex of the Pandas Series (s2.index):")
print(s2.index)
# Convert the Pandas Series to a NumPy array
numpy_array_from_s2 = s2.to_numpy()
print("\nNumPy Array created from s2 (s2.to_numpy()):")
print(numpy_array_from_s2)
# There's no explicit index in the NumPy array, it's implicitly 0, 1, 2, ...
# You can demonstrate accessing elements using numerical indices in the NumPy array
print("\nAccessing elements in the NumPy array (numerical indexing):")
print(f"numpy_array_from_s2[0]: {numpy_array_from_s2[0]}")
print(f"numpy_array_from_s2[2]: {numpy_array_from_s2[2]}")
Output
Original Pandas Series (s2):
w 100
x 200
y 300
z 400
dtype: int64
Index of the Pandas Series (s2.index):
Index(['w', 'x', 'y', 'z'], dtype='object')
NumPy Array created from s2 (s2.to_numpy()):
[100 200 300 400]
Accessing elements in the NumPy array (numerical indexing):
numpy_array_from_s2[0]: 100
numpy_array_from_s2[2]: 300
More info
Additional notes: Pandas DataFrame from NumPy 2D array
NumPy 2D array to Pandas DataFrame:
# NumPy 2D array to DataFrame
array_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df1 = pd.DataFrame(array_2d)
print("\nDataFrame from 2D array (default column names):")
print(df1)
df2 = pd.DataFrame(array_2d,
                   columns=['A', 'B', 'C'],
                   index=['row1', 'row2', 'row3'])
print("\nDataFrame with custom row and column labels:")
print(df2)
# Converting back to NumPy
numpy_again = df2.to_numpy()
print("\nConverted back to NumPy:")
print(numpy_again)
Output
DataFrame from 2D array (default column names):
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
DataFrame with custom row and column labels:
A B C
row1 1 2 3
row2 4 5 6
row3 7 8 9
Converted back to NumPy:
[[1 2 3]
[4 5 6]
[7 8 9]]
Note:
Your existing NumPy knowledge transfers directly to Pandas
The key difference is the addition of labels, which makes data more self-describing
Notice how Pandas automatically assigns default integer labels (0, 1, 2, ...) to rows and columns when none are provided, but lets you customize them when needed
You can always convert back to NumPy when you need pure numerical performance
When you convert a Pandas Series to a NumPy array using .to_numpy(), you are essentially extracting the data values
The labeled index, which is a key feature of Pandas Series, is not part of the underlying NumPy array structure
The resulting NumPy array reverts to its default numerical indexing.
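One caveat worth sketching when converting back: a NumPy array has a single dtype, so the result of `.to_numpy()` on a DataFrame depends on the columns involved. The example below uses hypothetical column names and values.

```python
import pandas as pd

# All-numeric columns (integers and floats)
numeric_df = pd.DataFrame({"A": [1, 2, 3], "B": [0.5, 1.5, 2.5]})
numeric_array = numeric_df.to_numpy()
print(numeric_array.dtype)  # one floating-point dtype holds both columns

# Adding a text column forces a generic object array
mixed_df = numeric_df.assign(C=["x", "y", "z"])
mixed_array = mixed_df.to_numpy()
print(mixed_array.dtype)

# Selecting only the numeric columns first keeps the array numeric
numeric_only = mixed_df[["A", "B"]].to_numpy()
print(numeric_only.dtype)
```

If the goal is fast numerical work in NumPy, selecting the numeric columns before converting avoids ending up with an object array.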
Additional notes: Other ways to create Pandas DataFrames
Creating Pandas DataFrames:
import pandas as pd
# 1. From Dictionaries
# Standard dictionary method
data = {'Name': 'John', 'Age': 30, 'City': 'New York'}
df1a = pd.DataFrame(data, index=[0])
# Using from_dict() method
df1b = pd.DataFrame.from_dict(data, orient='index').T
print("DataFrame from simple dictionary using from_dict:")
print(df1b)
# Dictionary with equal-length lists
data = {'Name': ['John', 'Anna', 'Peter'],
        'Age': [30, 28, 45],
        'City': ['New York', 'Boston', 'Chicago']}
df2a = pd.DataFrame(data)
# Using from_dict() method (default orient='columns')
df2b = pd.DataFrame.from_dict(data)
print("\nDataFrame from dictionary with lists using from_dict:")
print(df2b)
# 2. From Lists of Dictionaries
# List of dictionaries (records format): each dictionary is one row
records = [
    {'Name': 'John', 'Age': 30, 'City': 'New York'},
    {'Name': 'Anna', 'Age': 28, 'City': 'Boston'},
    {'Name': 'Peter', 'Age': 45, 'City': 'Chicago'}
]
df3a = pd.DataFrame(records)
# Using from_records instead of from_dict for this format
df3b = pd.DataFrame.from_records(records)
print("\nDataFrame from a list of dictionaries using from_records:")
print(df3b)
# Using from_dict with orient='index': the outer keys become the row labels
indexed_dict = {
    'Person1': {'Name': 'John', 'Age': 30, 'City': 'New York'},
    'Person2': {'Name': 'Anna', 'Age': 28, 'City': 'Boston'},
    'Person3': {'Name': 'Peter', 'Age': 45, 'City': 'Chicago'}
}
df3c = pd.DataFrame.from_dict(indexed_dict, orient='index')
print("\nDataFrame from nested dictionary using from_dict with orient='index':")
print(df3c)
# 3. From Lists of Lists
# Create from list of lists with column names
data = [
    ['John', 30, 'New York'],
    ['Anna', 28, 'Boston'],
    ['Peter', 45, 'Chicago']
]
columns = ['Name', 'Age', 'City']
df4 = pd.DataFrame(data, columns=columns)
# For lists of lists, we typically use the DataFrame constructor
# But we can convert to dict first if needed
dict_from_lists = {columns[i]: [row[i] for row in data] for i in range(len(columns))}
df5 = pd.DataFrame.from_dict(dict_from_lists)
print("\nDataFrame from list of lists converted to dict first, then using from_dict:")
print(df5)
More info
Additional notes: Exercises
Exercise 1:
Imagine the random_array represents quarterly sales data for three different products. Explain how the provided code transforms this raw numerical data into a more informative and manageable structure using Pandas, allowing for labeled access and calculations.
Have students execute:
# Create a 3x4 NumPy array of random integers
random_array = np.random.randint(1, 100, size=(3, 4))
print("Original NumPy array:")
print(random_array)
Exercise 1 - Solution:
# Convert to DataFrame with custom labels
quarters_df = pd.DataFrame(
    random_array,
    columns=['Q1', 'Q2', 'Q3', 'Q4'],
    index=['Product A', 'Product B', 'Product C']
)
print("\nAs Pandas DataFrame with labels:")
print(quarters_df)
# Extract Q2 column as a Series
q2_series = quarters_df['Q2']
print("\nQ2 column as a Series:")
print(q2_series)
print(f"Series type: {type(q2_series)}")
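The solution above shows labeled access; the “calculations” part of the exercise could look like the sketch below. It uses a fixed, made-up array instead of `random_array` so that the results are reproducible.

```python
import numpy as np
import pandas as pd

# Fixed stand-in for random_array (values invented for illustration)
sales = np.array([
    [10, 20, 30, 40],
    [5, 15, 25, 35],
    [1, 2, 3, 4],
])
quarters_df = pd.DataFrame(
    sales,
    columns=['Q1', 'Q2', 'Q3', 'Q4'],
    index=['Product A', 'Product B', 'Product C']
)

# Total sales per product: sum across the quarter columns
annual_totals = quarters_df.sum(axis=1)
print(annual_totals)

# Average Q2 sales across products
q2_mean = quarters_df['Q2'].mean()
print(f"Mean Q2 sales: {q2_mean:.2f}")

# A single cell, addressed by row label and column label
product_b_q3 = quarters_df.loc['Product B', 'Q3']
print(f"Product B, Q3: {product_b_q3}")
```

The same calculations on the raw array would require remembering that row 1 is Product B and column 2 is Q3; the labels carry that information instead.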
Exercise 2: Creating a pandas DataFrame from a list of dictionaries:
Each dictionary represents a row, and its keys are the column names
data = [
    {'Name': 'Alice', 'Age': 25, 'Score': 85},
    {'Name': 'Bob', 'Age': 30, 'Score': 70},
    {'Name': 'Charlie', 'Age': 22, 'Score': 90}
]
df = pd.DataFrame(data)
print(df)
Output
Name Age Score
0 Alice 25 85
1 Bob 30 70
2 Charlie 22 90
Exercise 3: Creating a pandas DataFrame from a dictionary of lists:
The dictionary keys are the column names, and each value is a list holding that column's data, one entry per row
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'Score': [85, 70, 90]
}
df = pd.DataFrame(data)
print(df)
Output
Name Age Score
0 Alice 25 85
1 Bob 30 70
2 Charlie 22 90
Key Takeaways
Keypoints
Building on NumPy: Pandas uses NumPy under the hood, adding labels and mixed-type support
Series: 1D labeled array similar to a dictionary + array hybrid
DataFrame: 2D labeled structure, like a collection of Series sharing an index
Labels Matter: Named axes make your data self-documenting and easier to work with
Data Types: Pandas handles mixed types better than NumPy, crucial for real-world data