Introduction to Pandas

Objectives

  • Explain the relationship between Pandas and NumPy, highlighting when to use each library effectively.

  • Define and differentiate between Series and DataFrame data structures in Pandas.

  • Demonstrate the creation of Pandas objects from NumPy arrays, solidifying foundational knowledge.

Time

15 Minutes

What is Pandas?

  • A powerful Python library for data manipulation and analysis

  • Built on top of NumPy (which you’re already familiar with)

  • Essential tool in data science, machine learning, and analytics

  • The name comes from “panel data”, an econometrics term for multidimensional data sets

Discussion

  • “Pandas is to data analysis what NumPy is to numerical computing.”

  • “If you’ve ever used Excel or SQL, Pandas will feel somewhat familiar.”

  • “Today we’ll see how the NumPy skills you’ve learned provide the foundation for Pandas.”

Pandas vs NumPy Relationship

Key Differences:

NumPy                           Pandas
------------------------------  ------------------------------
Homogeneous arrays              Heterogeneous data
Numerical focus                 Tabular data focus
Unlabeled axes                  Labeled axes
Fast mathematical operations    Data manipulation operations
Memory efficient                Feature rich

Demo

import numpy as np
import pandas as pd

# NumPy array - homogeneous data type
numpy_array = np.array([1, 2, 3, 4, 5])
print(f"NumPy array: {numpy_array}")
print(f"NumPy data type: {numpy_array.dtype}")

# Try to put mixed types in NumPy - notice what happens
mixed_numpy = np.array([1, 'string', 3.14])
print(f"Mixed NumPy array: {mixed_numpy}")
print(f"Mixed array dtype: {mixed_numpy.dtype}")  # Converts to common type

# Pandas Series - can handle mixed types
pandas_series = pd.Series([1, 'string', 3.14])
print("\nPandas Series with mixed types:")
print(pandas_series)

Output

NumPy array: [1 2 3 4 5]
NumPy data type: int64
Mixed NumPy array: ['1' 'string' '3.14']
Mixed array dtype: <U32 # Unicode string with a maximum length of 32 characters

Pandas Series with mixed types:
0         1
1    string
2      3.14
dtype: object

print([type(i) for i in mixed_numpy])
print([type(i) for i in pandas_series])

Output

[<class 'numpy.str_'>, <class 'numpy.str_'>, <class 'numpy.str_'>]
[<class 'int'>, <class 'str'>, <class 'float'>]

Discussion

  • Notice how NumPy converts everything to strings when faced with mixed types

  • Pandas, on the other hand, preserves the original types using an ‘object’ dtype.

  • This flexibility is crucial when working with real-world data that rarely comes in neat, homogeneous packages

By default, NumPy coerces mixed values to a single common dtype (strings in this case), so the elements lose their original types and behaviors. A Pandas Series with the object dtype keeps each value as its original Python object.
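To make the difference concrete, here is a minimal sketch that reuses the three values from the demo above. It checks what each container actually holds, and it also shows that NumPy can keep the original objects when you explicitly request dtype=object. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

values = [1, 'string', 3.14]

# Default NumPy behavior: every element is coerced to a string
coerced = np.array(values)
print(type(coerced[0]))             # a NumPy string type, not int

# The number can only be used again after converting it back by hand
print(int(coerced[0]) + 1)

# Pandas keeps the original Python objects in an object-dtype Series
series = pd.Series(values)
print(series.dtype)
print(series.iloc[0] + 1)           # integer arithmetic still works

# NumPy can do the same, but only when asked explicitly
as_objects = np.array(values, dtype=object)
print(type(as_objects[0]))
```

Keeping Python objects is flexible, but it gives up the compact, typed storage that makes NumPy fast, so columns of a single type remain preferable wherever the data allows it.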

Pandas Data Structures

  • Pandas introduces two new data structures:

    • Series: A labeled one-dimensional array

    • DataFrame: A labeled two-dimensional table

Pandas Series

  • 1D labeled array capable of holding any data type

  • Like a cross between a list and a dictionary

  • Has an Index (labels) and values (data)

  • Built on NumPy, so supports vectorized operations

Demo

# Creating a Series
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print("Pandas Series with custom index:")
print(s)
print(f"The index: {s.index}")
print(f"The values: {s.values}")  # Notice this returns a NumPy array

# Dictionary-like access
print(f"\nValue at index 'b': {s['b']}")
print(f"Values at indices 'a' and 'c': {s[['a', 'c']]}")

# NumPy-like operations
print(f"\nAdding 5 to all values: \n{s + 5}")
print(f"Values greater than 25: \n{s > 25}")

Output

Pandas Series with custom index:
a    10
b    20
c    30
d    40
dtype: int64
The index: Index(['a', 'b', 'c', 'd'], dtype='object')
The values: [10 20 30 40]

Value at index 'b': 20
Values at indices 'a' and 'c': a    10
c    30
dtype: int64

Adding 5 to all values:
a    15
b    25
c    35
d    45
dtype: int64
Values greater than 25:
a    False
b    False
c     True
d     True
dtype: bool
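
Note that s > 25 returns a Series of True/False values (a boolean mask), not the matching values themselves. As in NumPy, the mask can be passed back into the Series to select the matching entries. A short sketch, assuming the same s as in the demo:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

mask = s > 25          # boolean Series with the same index as s
selected = s[mask]     # keeps only the entries where the mask is True
print(selected)
```

The selected entries keep their original labels, which is one of the practical benefits of a labeled index.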

Discussion

Why use labels instead of numerical indexing?

  • Labels provide context and meaning to your data (similar to Python Dictionaries)

  • Flexibility in reshaping and manipulating:

    • Numerical indexing assumes a fixed order

    • With labels, you can add, remove, or reorder elements without affecting the data retrieval process

  • Handling Missing Data:

    • Positional indexing becomes fragile when entries are missing, because the remaining positions shift. With labels, a missing entry keeps its label and its value is marked with an explicit placeholder such as NaN

  • Automatic data alignment by label:

    • When performing operations between two Series, Pandas aligns data by index label, not by integer position

    • As a result, operations on two structures whose labels are in a different order still pair up the right values
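
Label alignment and missing data can be seen together in a small sketch. The two Series below use made-up fruit labels and numbers, chosen only for illustration; their labels are ordered differently and only partly overlap.

```python
import pandas as pd

# Hypothetical data: labels overlap only partly and appear in a different order
q1 = pd.Series([100, 200, 300], index=['apples', 'pears', 'plums'])
q2 = pd.Series([30, 10, 20], index=['plums', 'apples', 'kiwis'])

# Values are paired by label, not by position.
# Labels present in only one Series produce NaN.
total = q1 + q2
print(total)

# The add method accepts fill_value to treat a missing entry as 0 instead
filled = q1.add(q2, fill_value=0)
print(filled)
```

With plain NumPy arrays the same addition would pair values purely by position, so 100 would be added to 30 even though they describe different items.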

The Pandas DataFrame

  • 2D labeled data structure with columns of potentially different types

  • Think: spreadsheet or SQL table

  • Collection of Series objects that share the same index

  • Primary data structure you’ll use in data analysis

Demo

# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Boston', 'Chicago']
}
df = pd.DataFrame(data)
print("Pandas DataFrame:")
print(df)

# DataFrame info
print(f"\nDataFrame shape: {df.shape}")
print(f"DataFrame columns: {df.columns}")
print(f"DataFrame index: {df.index}")

# Accessing a column (returns a Series)
print("\nAccessing the Age column:")
print(df['Age'])
print(f"Type of column: {type(df['Age'])}")

Output

Pandas DataFrame:
      Name  Age      City
0    Alice   25  New York
1      Bob   30    Boston
2  Charlie   35   Chicago

DataFrame shape: (3, 3)
DataFrame columns: Index(['Name', 'Age', 'City'], dtype='object')
DataFrame index: RangeIndex(start=0, stop=3, step=1)

Accessing the Age column:
0    25
1    30
2    35
Name: Age, dtype: int64
Type of column: <class 'pandas.core.series.Series'>

Discussion

  • A Series is the building block in Pandas - think of it as a single column of data

  • A DataFrame is a collection of Series objects sharing an index - like a table made up of columns

  • The real power comes from being able to manipulate, filter, and analyze these structures easily

A DataFrame is made up of Series objects, and both have labeled axes that make data access intuitive.
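
The relationship can also be shown in the other direction, by assembling a DataFrame from individual Series. This is a minimal sketch; the row labels p1, p2, p3 are hypothetical and exist only to make the shared index visible.

```python
import pandas as pd

# Hypothetical row labels shared by both Series
labels = ['p1', 'p2', 'p3']
names = pd.Series(['Alice', 'Bob', 'Charlie'], index=labels)
ages = pd.Series([25, 30, 35], index=labels)

# Each Series becomes one column; the shared index becomes the row labels
df = pd.DataFrame({'Name': names, 'Age': ages})
print(df)

# Pulling a column back out returns a Series with the same index
age_column = df['Age']
print(age_column.index.equals(df.index))

# Row label plus column label identifies a single cell
print(df.loc['p2', 'Age'])
```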

Creating Pandas Objects from NumPy Arrays

  • Pandas objects can be created directly from NumPy arrays

  • Indexes can be automatically generated or custom-defined

  • The underlying NumPy array is still accessible

  • Pandas adds powerful indexing and data manipulation features

NumPy array to Pandas Series

Demo

# NumPy array to Pandas Series
numpy_data = np.array([100, 200, 300, 400])
s1 = pd.Series(numpy_data)
print("Series with default index:")
print(s1)

s2 = pd.Series(numpy_data, index=['w', 'x', 'y', 'z'])
print("\nSeries with custom index:")
print(s2)

Output

Series with default index:
0    100
1    200
2    300
3    400
dtype: int64

Series with custom index:
w    100
x    200
y    300
z    400
dtype: int64

print("Original Pandas Series (s2):")
print(s2)
print("\nIndex of the Pandas Series (s2.index):")
print(s2.index)

# Convert the Pandas Series to a NumPy array
numpy_array_from_s2 = s2.to_numpy()

print("\nNumPy Array created from s2 (s2.to_numpy()):")
print(numpy_array_from_s2)

# The NumPy array carries no labels; its elements are addressed by position 0, 1, 2, ...
# Access elements by integer position
print("\nAccessing elements in the NumPy array (numerical indexing):")
print(f"numpy_array_from_s2[0]: {numpy_array_from_s2[0]}")
print(f"numpy_array_from_s2[2]: {numpy_array_from_s2[2]}")

Output

Original Pandas Series (s2):
w    100
x    200
y    300
z    400
dtype: int64

Index of the Pandas Series (s2.index):
Index(['w', 'x', 'y', 'z'], dtype='object')

NumPy Array created from s2 (s2.to_numpy()):
[100 200 300 400]

Accessing elements in the NumPy array (numerical indexing):
numpy_array_from_s2[0]: 100
numpy_array_from_s2[2]: 300

NumPy 2D array to Pandas DataFrame

Demo

# NumPy 2D array to DataFrame
array_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df1 = pd.DataFrame(array_2d)
print("\nDataFrame from 2D array (default column names):")
print(df1)

df2 = pd.DataFrame(array_2d,
                   columns=['A', 'B', 'C'],
                   index=['row1', 'row2', 'row3'])
print("\nDataFrame with custom row and column labels:")
print(df2)

# Converting back to NumPy
numpy_again = df2.to_numpy()
print("\nConverted back to NumPy:")
print(numpy_again)

Output

DataFrame from 2D array (default column names):
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9

DataFrame with custom row and column labels:
      A  B  C
row1  1  2  3
row2  4  5  6
row3  7  8  9

Converted back to NumPy:
[[1 2 3]
 [4 5 6]
 [7 8 9]]

Discussion

  • Your existing NumPy knowledge transfers directly to Pandas

  • The key difference is the addition of labels, which makes data more self-describing

  • Notice how Pandas automatically names axes when none are provided, but lets you customize when needed

  • You can always convert back to NumPy when you need pure numerical performance

    • When you convert a Pandas Series to a NumPy array using .to_numpy(), you are essentially extracting the data values

    • The labeled index, which is a key feature of Pandas Series, is not part of the underlying NumPy array structure

    • The resulting NumPy array reverts to its default numerical indexing.
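
The round trip described above can be sketched as follows, assuming the same s2 as in the earlier demo. The labels are lost in the conversion, but they can be reattached by passing the saved index back to pd.Series.

```python
import numpy as np
import pandas as pd

s2 = pd.Series(np.array([100, 200, 300, 400]), index=['w', 'x', 'y', 'z'])

# Extract only the data values; the labels are left behind
values = s2.to_numpy()
print(type(values))

# Label access and positional access on the Series, positional access on the array
print(s2['y'], s2.iloc[2], values[2])

# Reattach the original labels to get an equivalent Series back
restored = pd.Series(values, index=s2.index)
print(restored.equals(s2))
```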

Exercise

Imagine that random_array represents quarterly sales data for three different products. Explain how wrapping this raw numerical data in a Pandas DataFrame with row and column labels would turn it into a more informative and manageable structure that allows labeled access and calculations.

Have students execute:

# Create a 3x4 NumPy array of random integers
random_array = np.random.randint(1, 100, size=(3, 4))
print("Original NumPy array:")
print(random_array)
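
One possible continuation is sketched below. The product names and quarter labels are assumptions made up for this illustration; any descriptive labels would serve the same purpose.

```python
import numpy as np
import pandas as pd

random_array = np.random.randint(1, 100, size=(3, 4))

# Hypothetical labels: one row per product, one column per quarter
sales = pd.DataFrame(
    random_array,
    index=['Product A', 'Product B', 'Product C'],
    columns=['Q1', 'Q2', 'Q3', 'Q4'],
)
print(sales)

# Labeled access: no need to remember that this is position [1, 2]
print(sales.loc['Product B', 'Q3'])

# Labeled calculations: yearly total per product, average sales in Q1
yearly_totals = sales.sum(axis=1)
print(yearly_totals)
print(sales['Q1'].mean())
```

Because the values are random, the printed numbers differ on every run; the point of the exercise is the labeled structure, not the specific values.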

Key Takeaways

Keypoints

  1. Building on NumPy: Pandas uses NumPy under the hood, adding labels and mixed-type support

  2. Series: 1D labeled array, a hybrid of a dictionary and an array

  3. DataFrame: 2D labeled structure, like a collection of Series sharing an index

  4. Labels Matter: Named axes make your data self-documenting and easier to work with

  5. Data Types: Pandas handles mixed types better than NumPy, crucial for real-world data

Homework

Creating Pandas DataFrames from a list of dictionaries:

  • Each dictionary represents a row and its keys are the column names

data = [
    {'Name': 'Alice', 'Age': 25, 'Score': 85},
    {'Name': 'Bob', 'Age': 30, 'Score': 70},
    {'Name': 'Charlie', 'Age': 22, 'Score': 90}
]
df = pd.DataFrame(data)
print(df)

Output

      Name  Age  Score
0    Alice   25     85
1      Bob   30     70
2  Charlie   22     90

Creating Pandas DataFrames from a dictionary of lists:

  • The dictionary keys are the column names, and each list holds that column's values, one per row

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'Score': [85, 70, 90]
}

df = pd.DataFrame(data)
print(df)

Output

      Name  Age  Score
0    Alice   25     85
1      Bob   30     70
2  Charlie   22     90
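
As a quick check, the sketch below builds the same table both ways and compares the results. It also shows one practical difference: with a list of dictionaries, a key that is missing from a row is filled with NaN. The second data set (Dana and Eli) is invented purely for illustration.

```python
import pandas as pd

rows = [
    {'Name': 'Alice', 'Age': 25, 'Score': 85},
    {'Name': 'Bob', 'Age': 30, 'Score': 70},
    {'Name': 'Charlie', 'Age': 22, 'Score': 90}
]
columns = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'Score': [85, 70, 90]
}

df_from_rows = pd.DataFrame(rows)
df_from_columns = pd.DataFrame(columns)

# Both constructions describe the same table
print(df_from_rows.equals(df_from_columns))

# Hypothetical rows with missing keys: the gaps are filled with NaN
partial = pd.DataFrame([
    {'Name': 'Dana', 'Age': 41},
    {'Name': 'Eli', 'Score': 88}
])
print(partial)
```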