DataFrame Manipulation & Sorting

Objectives

  • Demonstrate effective techniques for selecting specific rows and columns from DataFrames.

  • Equip participants with skills to add and remove columns within a DataFrame.

  • Implement sorting by values and by index, and perform multi-column sorting with custom orders.

Instructor note

  • Teaching: 15 min

  • Demo: 25 min

Note

  • Now that we can import data, we need to reshape and manipulate it for analysis

  • In real-world data analysis, a common rule of thumb is that you’ll spend about 80% of your time cleaning and manipulating data, and only 20% on actual analysis

  • The skills we’re covering in this session form the backbone of data wrangling in Python

  • Think of these operations as transforming raw data into analysis-ready information

Column and Row Selection

Different Ways to Select Data:

Selection Type    | Purpose                    | Example Syntax
------------------|----------------------------|--------------------------------------
Single column     | Get one variable           | df['column_name'] or df.column_name
Multiple columns  | Get specific variables     | df[['col1', 'col2']]
Row by index      | Get a specific observation | df.loc['index_label']
Row by position   | Get the nth observation    | df.iloc[n]
Row and column    | Get specific value(s)      | df.loc['index', 'column']
Slicing           | Get ranges of data         | df.loc['idx1':'idx2', 'col1':'col2']

Basic Selection

Demo

import pandas as pd
import numpy as np

# Create a sample dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 30, 35, 42, 28],
    'City': ['New York', 'Boston', 'Chicago', 'Seattle', 'Miami'],
    'Salary': [65000, 72000, 85000, 92000, 70000],
    'Department': ['HR', 'Sales', 'Tech', 'Tech', 'Finance']
}
df = pd.DataFrame(data)
df.index = ['emp001', 'emp002', 'emp003', 'emp004', 'emp005']  # Custom index
print("Original DataFrame:")
print(df)

More info

Additional notes: DataFrame set_index

Instead of assigning to df.index directly, you can promote an existing column to the index with set_index():
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 30, 35, 42, 28],
    'City': ['New York', 'Boston', 'Chicago', 'Seattle', 'Miami'],
    'Salary': [65000, 72000, 85000, 92000, 70000],
    'Department': ['HR', 'Sales', 'Tech', 'Tech', 'Finance'],
    'employee_id': ['emp001', 'emp002', 'emp003', 'emp004', 'emp005'],
}

df = pd.DataFrame(data)
df.set_index("employee_id", inplace=True)
df.head()

Demo

# Single column selection - two methods
print("\n1. Single column as Series:")
print(df['Age'])  # Returns a Series

print("\n2. Alternative syntax for columns without spaces:")
print(df.Age)  # Also returns a Series; works only if the column name is a valid Python identifier (no spaces)

# Multiple column selection
print("\n3. Selecting multiple columns:")
print(df[['Name', 'Salary']])  # Returns a DataFrame

# Row selection by index label using .loc
print("\n4. Selecting row by index label:")
print(df.loc['emp003'])  # Returns a Series representing the row

# Row selection by position using .iloc
print("\n5. Selecting row by position (third row):")
print(df.iloc[2])  # Also returns a Series

# Selecting a subset of rows
print("\n6. Selecting multiple rows by position:")
print(df.iloc[1:4])  # Row indices 1, 2, and 3 (not including 4)

# Selecting specific cells
print("\n7. Selecting specific value (cell):")
print(df.loc['emp002', 'Salary'])  # Returns the value (72000)

# Selecting a subset of rows and columns
print("\n8. Selecting subset of rows and columns:")
print(df.loc['emp001':'emp003', ['Name', 'Age', 'Salary']])
# Position-based equivalent: df.iloc[0:3, [0, 1, 3]]
# Note that .loc slices include the end label, while .iloc slices exclude the end position

Note

  • Notice that selecting a single column returns a Series, while selecting multiple columns maintains the DataFrame structure

  • The .loc accessor is used for label-based indexing, while .iloc is for position-based indexing
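
A quick way to verify both points is to inspect what each kind of selection returns; a minimal sketch using the same employee DataFrame:

# Single vs. double brackets: Series vs. DataFrame
print(type(df['Age']))    # <class 'pandas.core.series.Series'>
print(type(df[['Age']]))  # <class 'pandas.core.frame.DataFrame'>

# .loc (by label) and .iloc (by position) reach the same row here
print(df.loc['emp003'].equals(df.iloc[2]))  # True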

Advanced Selection with Conditions

Demo

# Boolean selection - rows where Age > 30
print("\n9. Boolean selection - employees over 30:")
print(df[df['Age'] > 30])

# Multiple conditions using & (and) and | (or)
print("\n10. Multiple conditions - Tech department with salary > 80000:")
print(df[(df['Department'] == 'Tech') & (df['Salary'] > 80000)])

# Using .query() method for cleaner syntax - it accepts and/or instead of &/|
print("\n11. Using query method - same condition:")
print(df.query("Department == 'Tech' and Salary > 80000"))

# Row selection with .isin()
print("\n12. Using .isin() - employees in HR or Finance:")
print(df[df['Department'].isin(['HR', 'Finance'])])

Note

  • Boolean selection is incredibly powerful - it lets you filter data based on specific conditions

  • These selection methods can be combined in powerful ways to extract exactly the data you need
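
These filters can also be combined with column selection in a single .loc call; a short sketch using the same employee DataFrame:

# Filter rows by condition and keep only the columns of interest
print(df.loc[(df['Department'] == 'Tech') & (df['Salary'] > 80000), ['Name', 'Salary']])

# .isin() mixes with other conditions in the same way
print(df.loc[df['Department'].isin(['HR', 'Finance']) & (df['Age'] < 30), 'Name'])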

More info

Additional notes: Exercise

Selection Practice:

Use the inventory DataFrame defined below and:

  • Select just the Product_Name and Price columns

  • Select all products that are in stock (In_Stock is True)

  • Select all electronics that cost less than 500

  • Select the 2nd and 3rd products using position-based indexing

# Create a dataset of product inventory
products = {
    'Product_ID': ['P001', 'P002', 'P003', 'P004', 'P005', 'P006'],
    'Product_Name': ['Laptop', 'Smartphone', 'Tablet', 'Monitor', 'Keyboard', 'Mouse'],
    'Category': ['Electronics', 'Electronics', 'Electronics', 
                 'Electronics', 'Accessories', 'Accessories'],
    'Price': [1200, 800, 350, 250, 75, 25],
    'In_Stock': [True, True, False, True, True, False],
    'Units': [15, 28, 0, 10, 45, 0]
}
inventory = pd.DataFrame(products)

Solution

# Tasks:
# 1. Select just the Product_Name and Price columns
names_prices = inventory[['Product_Name', 'Price']]
print("Product names and prices:")
print(names_prices)

# 2. Select all products that are in stock (In_Stock is True)
in_stock = inventory[inventory['In_Stock']]  # the boolean column itself is the mask (no need for == True)
print("\nProducts in stock:")
print(in_stock)

# 3. Select all electronics that cost less than 500
cheap_electronics = inventory[(inventory['Category'] == 'Electronics') & 
                              (inventory['Price'] < 500)]
print("\nAffordable electronics:")
print(cheap_electronics)

# 4. Select the 2nd and 3rd products using position-based indexing
second_third = inventory.iloc[1:3]
print("\nSecond and third products:")
print(second_third)

Understand the different ways to select data from a DataFrame, including column selection, row selection by label and position, boolean filtering, and combinations of these methods

Adding and Removing Columns/Rows

Modifying DataFrame Structure:

Operation      | Method                 | Example
---------------|------------------------|-----------------------------------------------------
Add column     | Direct assignment      | df['new_col'] = values
Add column     | From existing columns  | df['new_col'] = df['col1'] + df['col2']
Add column     | Using apply/lambda     | df['new_col'] = df.apply(lambda x: func(x), axis=1)
Remove column  | Using drop             | df.drop('column', axis=1)
Remove row     | Using drop             | df.drop('index_label')
Add row        | Using loc              | df.loc['new_index'] = values
Add row        | Using append/concat    | pd.concat([df, new_row])

Adding and Removing Columns

  • Adding a new column with

    • a scalar value

    • a list of values

    • calculated values from other columns (a pandas Series)

    • conditional values (e.g. via np.where)

  • Removing

    • a single column

    • multiple columns

    • columns in-place

Demo


# Create a sample dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 30, 35, 42, 28],
    'City': ['New York', 'Boston', 'Chicago', 'Seattle', 'Miami'],
    'Salary': [65000, 72000, 85000, 92000, 70000],
    'Department': ['HR', 'Sales', 'Tech', 'Tech', 'Finance']
}
df = pd.DataFrame(data)
df.index = ['emp001', 'emp002', 'emp003', 'emp004', 'emp005']  # Custom index
print("Original DataFrame:")
print(df)

# 1. Adding a new column with scalar value
df['Active'] = True
print("\n1. Adding 'Active' column with same value for all rows:")
print(df)

# 2. Adding a column with a list of values
df['Performance'] = [4.5, 4.0, 3.8, 4.7, 4.2]
print("\n2. Adding 'Performance' column with different values:")
print(df)

# 3. Adding a calculated column
df['Bonus'] = df['Salary'] * df['Performance'] / 100
print("\n3. Adding calculated 'Bonus' column:")
print(df)

# 4. Adding a column with conditional values
df['Experience'] = np.where(df['Age'] > 30, 'Senior', 'Junior')
print("\n4. Adding conditional 'Experience' column:")
print(df)

# 5. Removing a single column
df_no_city = df.drop('City', axis=1)
print("\n5. Removing 'City' column:")
print(df_no_city)

# 6. Removing multiple columns
df_minimal = df.drop(['Active', 'Performance', 'Bonus'], axis=1)
print("\n6. Removing multiple columns:")
print(df_minimal)

# 7. Remove columns in-place
df.drop(['City', 'Active'], axis=1, inplace=True)
print("\n7. Removing multiple columns (inplace=True):")
print(df)

Adding and Removing Rows

  • Removing

    • a row by index label

    • multiple rows

  • Adding a new row

    • with .loc

    • with a Series

Demo


data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 30, 35, 42, 28],
    'City': ['New York', 'Boston', 'Chicago', 'Seattle', 'Miami'],
    'Salary': [65000, 72000, 85000, 92000, 70000],
    'Department': ['HR', 'Sales', 'Tech', 'Tech', 'Finance']
}
df = pd.DataFrame(data)
df.index = ['emp001', 'emp002', 'emp003', 'emp004', 'emp005']  # Custom index

# 1. Removing a row by index label
df_no_bob = df.drop('emp002')
print("\n1. DataFrame without Bob (emp002):")
print(df_no_bob)

# 2. Removing multiple rows
df_reduced = df.drop(['emp001', 'emp005'])
print("\n2. DataFrame without emp001 and emp005:")
print(df_reduced)

# 3. Adding a new row with .loc
# Work on a copy so the original DataFrame stays unchanged
df_new = df.copy()
df_new.loc['emp006'] = ['Frank', 38, 'Dallas', 88000, 'Sales']
print("\n3. DataFrame with new employee:")
print(df_new)

# 4. Adding a row with a Series
new_employee = pd.Series({
    'Name': 'Grace', 'Age': 27, 'City': 'Denver', 'Salary': 67000, 'Department': 'HR',
}, name='emp007')

df_newer = pd.concat([df_new, new_employee.to_frame().T])
print("\n4. DataFrame with another new employee:")
print(df_newer)

Note

  • Notice that we can add columns based on calculations from other columns - this is ideal for metrics and KPIs

  • The drop() function is powerful but doesn’t modify the original DataFrame unless you specify inplace=True

  • Adding rows is less common but useful for simulation, testing, or creating summary rows

  • Always be careful with the axis parameter - axis=0 is for rows, axis=1 is for columns
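
If keeping track of the axis parameter feels error-prone, drop() also accepts the columns= and index= keyword arguments, which make the intent explicit; a brief equivalent sketch using the employee DataFrame from the demo above:

# Same result as df.drop('City', axis=1): drop by column name
df_no_city = df.drop(columns=['City'])

# Same result as df.drop('emp002'): drop by index label
df_no_bob = df.drop(index=['emp002'])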

More info

Additional notes: Additional info and exercise

How Pandas Handles Series-to-DataFrame Conversions:

When converting a pandas Series to a DataFrame, there are several methods that produce different results

  1. Direct Conversion: pd.DataFrame(series)

    new_employee = pd.Series({
    'Name': 'Grace', 'Age': 27, 'City': 'Denver', 
    'Salary': 67000, 'Department': 'HR'
    }, name='emp007')
    
    # Direct conversion
    df1 = pd.DataFrame(new_employee)
    
    

    Output

                  emp007
    Name          Grace
    Age              27
    City          Denver
    Salary        67000
    Department       HR
    
    * The Series values become a single column in the DataFrame
    * The Series keys become the row index of the DataFrame
    * The Series name becomes the column name in the DataFrame
    
  2. Wrapped in a List: pd.DataFrame([series])

    # Wrapping in a list
    df2 = pd.DataFrame([new_employee])
    

    Output

                 Name  Age    City  Salary Department
        emp007  Grace   27  Denver   67000         HR

    * The Series becomes a single row in the DataFrame
    * The Series keys become the column names in the DataFrame
    * The Series name ('emp007') becomes the row index label
    * If the Series had no name, a default integer index (0) would be used instead
    
  3. to_frame() Method

    # These produce identical results:
    df1 = new_employee.to_frame()
    df2 = pd.DataFrame(new_employee)
    

    Output

                  emp007
    Name          Grace
    Age              27
    City          Denver
    Salary        67000
    Department       HR
    
    * The Series values become a single column
    * The Series index becomes the DataFrame's index
    * By default, the Series name becomes the column name
    * You can specify a different column name: `series.to_frame('new_column_name')`
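
This also explains the .to_frame().T pattern used in the row-adding demo: to_frame() produces a one-column DataFrame, and transposing it yields a one-row DataFrame whose index label is the Series name. A short sketch, assuming the same new_employee Series:

# One row, index 'emp007', columns taken from the Series keys
row = new_employee.to_frame().T
print(row)
# pd.concat([df, row]) then appends it as a new row, as in the demo above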
    

Exercise

Adding and Removing Data:

products = {
    'Product_ID': ['P001', 'P002', 'P003', 'P004', 'P005', 'P006'],
    'Product_Name': ['Laptop', 'Smartphone', 'Tablet', 'Monitor', 'Keyboard', 'Mouse'],
    'Category': ['Electronics', 'Electronics', 'Electronics', 
                 'Electronics', 'Accessories', 'Accessories'],
    'Price': [1200, 800, 350, 250, 75, 25],
    'In_Stock': [True, True, False, True, True, False],
    'Units': [15, 28, 0, 10, 45, 0]
}
inventory = pd.DataFrame(products)

Use the inventory DataFrame above and:

  • Add a ‘Value’ column that multiplies Price by Units

  • Add a ‘Status’ column: ‘Available’ if In_Stock is True, ‘Out of Stock’ otherwise

  • Remove the In_Stock column (now redundant with Status)

  • Add a new_product row

new_product = pd.Series({
    'Product_ID': 'P007',
    'Product_Name': 'Headphones',
    'Category': 'Accessories',
    'Price': 150,
    'Units': 20,
    'Value': 3000,
    'Status': 'Available'
})
Solution

# Continue with the inventory DataFrame defined above
print("Original inventory:")
print(inventory)

# 1. Add a 'Value' column that multiplies Price by Units
inventory['Value'] = inventory['Price'] * inventory['Units']
print("\nInventory with Value column:")
print(inventory)

# 2. Add a 'Status' column: 'Available' if In_Stock is True, 'Out of Stock' otherwise
inventory['Status'] = inventory['In_Stock'].map({True: 'Available', False: 'Out of Stock'})
print("\nInventory with Status column:")
print(inventory)

# 3. Remove the In_Stock column (now redundant with Status)
inventory_updated = inventory.drop('In_Stock', axis=1)
print("\nInventory without In_Stock column:")
print(inventory_updated)

# 4. Add a new product row
inventory_final = pd.concat([inventory_updated, pd.DataFrame([new_product])],
                            ignore_index=True)  # ignore_index avoids a duplicate index label of 0
print("\nInventory with new product:")
print(inventory_final)

DataFrame Sorting

Sorting functions

Method                | Description               | Key Parameters
----------------------|---------------------------|-------------------------------------
sort_values()         | Sort by column values     | by, ascending, inplace, na_position
sort_index()          | Sort by index             | ascending, inplace
Multi-column sorting  | Sort by multiple columns  | by=['col1', 'col2']
Custom sorting        | Sort with custom orders   | by=col, key=function

Basic Sorting

  • Sorting by a single column - ascending and descending

  • Sorting by index

  • Sorting by multiple columns

  • Sorting with different directions for each column

Demo

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 30, 35, 42, 28],
    'City': ['New York', 'Boston', 'Chicago', 'Seattle', 'Miami'],
    'Salary': [65000, 72000, 85000, 92000, 70000],
    'Department': ['HR', 'Sales', 'Tech', 'Tech', 'Finance']
}
df = pd.DataFrame(data)
df.index = ['emp001', 'emp002', 'emp003', 'emp004', 'emp005']  # Custom index

# Continuing with our employee DataFrame
print("Original DataFrame:")
print(df)

# 1. Sorting by a single column (ascending by default)
df_by_age = df.sort_values('Age')
print("\n1. Sorted by Age (ascending):")
print(df_by_age)

# 2. Sorting by a single column (descending)
df_by_salary_desc = df.sort_values('Salary', ascending=False)
print("\n2. Sorted by Salary (descending):")
print(df_by_salary_desc)

# 3. Sorting by index
df_by_index = df.sort_index()
print("\n3. Sorted by index:")
print(df_by_index)

# 4. Sorting by multiple columns
df_multi_sort = df.sort_values(['Department', 'Salary'])
print("\n4. Sorted by Department, then by Salary within each department:")
print(df_multi_sort)

# 5. Sorting with different directions for each column
df_complex = df.sort_values(['Department', 'Salary'], 
                          ascending=[True, False])
print("\n5. Sorted by Department (asc), then by Salary (desc) within each department:")
print(df_complex)

More info

Additional notes: Additional note

Advanced Sorting:

  • Sort with NaNs at the beginning or end

  • Custom categorical sorting (custom order)

Demo

# 1. Sorting with NaN values
# Create a DataFrame with some missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 30, 35, 42, 28],
    'City': ['New York', 'Boston', 'Chicago', 'Seattle', 'Miami'],
    'Salary': [65000, 72000, 85000, 92000, 70000],
    'Department': ['HR', 'Sales', 'Tech', 'Tech', 'Finance']
}
df = pd.DataFrame(data)
df.index = ['emp001', 'emp002', 'emp003', 'emp004', 'emp005']  # Custom index

df.loc['emp001', 'City'] = np.nan
df.loc['emp003', 'City'] = np.nan

print("1. DataFrame with NaN values:")
print(df)

# Sort with NaNs at the beginning
print("\n1a. Sorted with NaNs first:")
print(df.sort_values('City', na_position='first'))

# Sort with NaNs at the end
print("\n1b. Sorted with NaNs last:")
print(df.sort_values('City', na_position='last'))

# 2. Custom categorical sorting
# Create a custom order for departments
dept_order = ['HR', 'Sales', 'Finance', 'Tech']
df['Dept_Coded'] = pd.Categorical(df['Department'], 
                                 categories=dept_order, 
                                 ordered=True)

print("\n3. Custom department order sorting:")
print(df.sort_values('Dept_Coded'))
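
The sorting table above also lists a key= parameter for sort_values(): it applies a vectorized function to the column before comparing values, which is handy for case-insensitive sorting or custom orders without adding a helper column. A brief sketch (key= requires pandas 1.1 or newer), reusing the same DataFrame:

# Case-insensitive sort on Name; key receives the whole column as a Series
print(df.sort_values('Name', key=lambda col: col.str.lower()))

# Custom department order without an extra column: map each value to its rank
dept_rank = {'HR': 0, 'Sales': 1, 'Finance': 2, 'Tech': 3}
print(df.sort_values('Department', key=lambda col: col.map(dept_rank)))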

Key Takeaways

Keypoints

  1. Flexible Selection: Pandas provides multiple ways to select, filter, and extract data

  2. Structure Modification: Adding and removing columns/rows enables dataset evolution

  3. Meaningful Sorting: Proper sorting is essential for analysis and presentation

  4. Custom Orders: Categorical data types allow you to enforce domain-specific ordering

  5. Combined Operations: The real power comes from combining these techniques
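
As a small illustration of keypoint 5, the selection, column-creation, and sorting steps from this session can be chained into a single pipeline; a sketch assuming the employee DataFrame from the demos (Bonus_Pct is just an illustrative column name):

# Filter rows, derive a column, sort, and keep selected columns in one chain
report = (
    df[df['Salary'] > 68000]                                          # boolean selection
    .assign(Bonus_Pct=lambda d: d['Salary'] * 0.05)                   # add a derived column
    .sort_values(['Department', 'Salary'], ascending=[True, False])   # multi-column sort
    .loc[:, ['Name', 'Department', 'Salary', 'Bonus_Pct']]            # column selection
)
print(report)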