DataFrame Manipulation & Sorting

Objectives

  • Demonstrate effective techniques for selecting specific rows and from DataFrames.

  • Equip participants with skills to add and remove columns within a DataFrame.

  • Implement sorting methods by values, by index, and perform multiple column sorting with custom orders.

Time

20 minutes

Discussion

  • Now that we can import data, we need to reshape and manipulate it for analysis

  • In real-world data analysis, you’ll spend about 80% of your time cleaning and manipulating data, and only 20% on actual analysis

  • The skills we’re covering in this session form the backbone of data wrangling in Python

  • Think of these operations as transforming raw data into analysis-ready information

Column and Row Selection

Different Ways to Select Data:

Selection Type

Purpose

Example Syntax

Single column

Get one variable

df['column_name'] or df.column_name

Multiple columns

Get specific variables

df[['col1', 'col2']]

Row by index

Get specific observation

df.loc['index_label']

Row by position

Get nth observation

df.iloc[n]

Row and column

Get specific value(s)

df.loc['index', 'column']

Slicing

Get ranges of data

df.loc['idx1':'idx2', 'col1':'col2']

Basic Selection

Demo

import pandas as pd
import numpy as np

# Create a sample dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 30, 35, 42, 28],
    'City': ['New York', 'Boston', 'Chicago', 'Seattle', 'Miami'],
    'Salary': [65000, 72000, 85000, 92000, 70000],
    'Department': ['HR', 'Sales', 'Tech', 'Tech', 'Finance']
}
df = pd.DataFrame(data)
df.index = ['emp001', 'emp002', 'emp003', 'emp004', 'emp005']  # Custom index
print("Original DataFrame:")
print(df)

# Single column selection - two methods
print("\n1. Single column as Series:")
print(df['Age'])  # Returns a Series

print("\n2. Alternative syntax for columns without spaces:")
print(df.Age)  # Also returns a Series, but only works if column name has no spaces

# Multiple column selection
print("\n3. Selecting multiple columns:")
print(df[['Name', 'Salary']])  # Returns a DataFrame

# Row selection by index label using .loc
print("\n4. Selecting row by index label:")
print(df.loc['emp003'])  # Returns a Series representing the row

# Row selection by position using .iloc
print("\n5. Selecting row by position (third row):")
print(df.iloc[2])  # Also returns a Series

# Selecting a subset of rows
print("\n6. Selecting multiple rows by position:")
print(df.iloc[1:4])  # Rows 1, 2, and 3 (not including 4)

# Selecting specific cells
print("\n7. Selecting specific value (cell):")
print(df.loc['emp002', 'Salary'])  # Returns the value (72000)

# Selecting a subset of rows and columns
print("\n8. Selecting subset of rows and columns:")
print(df.loc['emp001':'emp003', ['Name', 'Age', 'Salary']])

Output

Original DataFrame:
           Name  Age      City  Salary Department
emp001    Alice   24  New York   65000         HR
emp002      Bob   30    Boston   72000      Sales
emp003  Charlie   35   Chicago   85000       Tech
emp004    David   42   Seattle   92000       Tech
emp005      Eva   28     Miami   70000    Finance

1. Single column as Series:
emp001    24
emp002    30
emp003    35
emp004    42
emp005    28
Name: Age, dtype: int64

2. Alternative syntax for columns without spaces:
emp001    24
emp002    30
emp003    35
emp004    42
emp005    28
Name: Age, dtype: int64

3. Selecting multiple columns:
           Name  Salary
emp001    Alice   65000
emp002      Bob   72000
emp003  Charlie   85000
emp004    David   92000
emp005      Eva   70000

4. Selecting row by index label:
Name          Charlie
Age                35
City          Chicago
Salary          85000
Department       Tech
Name: emp003, dtype: object

5. Selecting row by position (third row):
Name          Charlie
Age                35
City          Chicago
Salary          85000
Department       Tech
Name: emp003, dtype: object

6. Selecting multiple rows by position:
           Name  Age     City  Salary Department
emp002      Bob   30   Boston   72000      Sales
emp003  Charlie   35  Chicago   85000       Tech
emp004    David   42  Seattle   92000       Tech

7. Selecting specific value (cell):
72000

8. Selecting subset of rows and columns:
           Name  Age  Salary
emp001    Alice   24   65000
emp002      Bob   30   72000
emp003  Charlie   35   85000

Discussion

  • Notice that selecting a single column returns a Series, while selecting multiple columns maintains the DataFrame structure

  • The .loc accessor is used for label-based indexing, while .iloc is for position-based indexing

Advanced Selection with Conditions

Demo

# Boolean selection - rows where Age > 30
print("\n9. Boolean selection - employees over 30:")
print(df[df['Age'] > 30])

# Multiple conditions using & (and) and | (or)
print("\n10. Multiple conditions - Tech department with salary > 80000:")
print(df[(df['Department'] == 'Tech') & (df['Salary'] > 80000)])

# Using .query() method for cleaner syntax
print("\n11. Using query method - same condition:")
print(df.query("Department == 'Tech' and Salary > 80000"))

# Row selection with .isin()
print("\n12. Using .isin() - employees in HR or Finance:")
print(df[df['Department'].isin(['HR', 'Finance'])])

Output

9. Boolean selection - employees over 30:
           Name  Age     City  Salary Department
emp003  Charlie   35  Chicago   85000       Tech
emp004    David   42  Seattle   92000       Tech

10. Multiple conditions - Tech department with salary > 80000:
           Name  Age     City  Salary Department
emp003  Charlie   35  Chicago   85000       Tech
emp004    David   42  Seattle   92000       Tech

11. Using query method - same condition:
           Name  Age     City  Salary Department
emp003  Charlie   35  Chicago   85000       Tech
emp004    David   42  Seattle   92000       Tech

12. Using .isin() - employees in HR or Finance:
         Name  Age      City  Salary Department
emp001  Alice   24  New York   65000         HR
emp005    Eva   28     Miami   70000    Finance

Discussion

  • Boolean selection is incredibly powerful - it lets you filter data based on specific conditions

  • These selection methods can be combined in powerful ways to extract exactly the data you need

Exercise

Selection Practice:

Use inventory dataframe and

  • Select just the Product_Name and Price columns

  • Select all products that are in stock (In_Stock is True)

  • Select all electronics that cost less than 500

  • Select the 2nd and 3rd products using position-based indexing

# Create a dataset of product inventory
products = {
    'Product_ID': ['P001', 'P002', 'P003', 'P004', 'P005', 'P006'],
    'Product_Name': ['Laptop', 'Smartphone', 'Tablet', 'Monitor', 'Keyboard', 'Mouse'],
    'Category': ['Electronics', 'Electronics', 'Electronics', 
                 'Electronics', 'Accessories', 'Accessories'],
    'Price': [1200, 800, 350, 250, 75, 25],
    'In_Stock': [True, True, False, True, True, False],
    'Units': [15, 28, 0, 10, 45, 0]
}
inventory = pd.DataFrame(products)

Adding and Removing Columns/Rows

Modifying DataFrame Structure:

Operation

Method

Example

Add column

Direct assignment

df['new_col'] = values

Add column

From existing columns

df['new_col'] = df['col1'] + df['col2']

Add column

Using apply/lambda

df['new_col'] = df.apply(lambda x: func(x), axis=1)

Remove column

Using drop

df.drop('column', axis=1)

Remove row

Using drop

df.drop('index_label')

Add row

Using loc

df.loc['new_index'] = values

Add row

Using append/concat

pd.concat([df, new_row])

Adding and Removing Columns

  • Adding new column with

    • a scalar value

    • a list of values

    • a calculated values in a pandas series

    • a column with conditional values

  • Removing

    • a single column

    • multiple columns

    • columns in-place

Demo


# Create a sample dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 30, 35, 42, 28],
    'City': ['New York', 'Boston', 'Chicago', 'Seattle', 'Miami'],
    'Salary': [65000, 72000, 85000, 92000, 70000],
    'Department': ['HR', 'Sales', 'Tech', 'Tech', 'Finance']
}
df = pd.DataFrame(data)
df.index = ['emp001', 'emp002', 'emp003', 'emp004', 'emp005']  # Custom index
print("Original DataFrame:")
print(df)

# 1. Adding a new column with scalar value
df['Active'] = True
print("\n1. Adding 'Active' column with same value for all rows:")
print(df)

# 2. Adding a column with a list of values
df['Performance'] = [4.5, 4.0, 3.8, 4.7, 4.2]
print("\n2. Adding 'Performance' column with different values:")
print(df)

# 3. Adding a calculated column
df['Bonus'] = df['Salary'] * df['Performance'] / 100
print("\n3. Adding calculated 'Bonus' column:")
print(df)

# 4. Adding a column with conditional values
df['Experience'] = np.where(df['Age'] > 30, 'Senior', 'Junior')
print("\n4. Adding conditional 'Experience' column:")
print(df)

# 5. Removing a single column
df_no_city = df.drop('City', axis=1)
print("\n5. Removing 'City' column:")
print(df_no_city)

# 6. Removing multiple columns
df_minimal = df.drop(['Active', 'Performance', 'Bonus'], axis=1)
print("\n6. Removing multiple columns:")
print(df_minimal)

# 7. Remove columns in-place
df.drop(['City', 'Active'], axis=1, inplace=True)
print("\n7. Removing multiple columns (inplace=True):")
print(df)

Output

Original DataFrame:
           Name  Age      City  Salary Department
emp001    Alice   24  New York   65000         HR
emp002      Bob   30    Boston   72000      Sales
emp003  Charlie   35   Chicago   85000       Tech
emp004    David   42   Seattle   92000       Tech
emp005      Eva   28     Miami   70000    Finance

1. Adding 'Active' column with same value for all rows:
           Name  Age      City  Salary Department  Active
emp001    Alice   24  New York   65000         HR    True
emp002      Bob   30    Boston   72000      Sales    True
emp003  Charlie   35   Chicago   85000       Tech    True
emp004    David   42   Seattle   92000       Tech    True
emp005      Eva   28     Miami   70000    Finance    True

2. Adding 'Performance' column with different values:
           Name  Age      City  Salary Department  Active  Performance
emp001    Alice   24  New York   65000         HR    True          4.5
emp002      Bob   30    Boston   72000      Sales    True          4.0
emp003  Charlie   35   Chicago   85000       Tech    True          3.8
emp004    David   42   Seattle   92000       Tech    True          4.7
emp005      Eva   28     Miami   70000    Finance    True          4.2

3. Adding calculated 'Bonus' column:
           Name  Age      City  Salary Department  Active  Performance   Bonus
emp001    Alice   24  New York   65000         HR    True          4.5  2925.0
emp002      Bob   30    Boston   72000      Sales    True          4.0  2880.0
emp003  Charlie   35   Chicago   85000       Tech    True          3.8  3230.0
emp004    David   42   Seattle   92000       Tech    True          4.7  4324.0
emp005      Eva   28     Miami   70000    Finance    True          4.2  2940.0

4. Adding conditional 'Experience' column:
           Name  Age      City  Salary Department  Active  Performance   Bonus Experience
emp001    Alice   24  New York   65000         HR    True          4.5  2925.0     Junior
emp002      Bob   30    Boston   72000      Sales    True          4.0  2880.0     Junior
emp003  Charlie   35   Chicago   85000       Tech    True          3.8  3230.0     Senior
emp004    David   42   Seattle   92000       Tech    True          4.7  4324.0     Senior
emp005      Eva   28     Miami   70000    Finance    True          4.2  2940.0     Junior

5. Removing 'City' column:
           Name  Age  Salary Department  Active  Performance   Bonus Experience
emp001    Alice   24   65000         HR    True          4.5  2925.0     Junior
emp002      Bob   30   72000      Sales    True          4.0  2880.0     Junior
emp003  Charlie   35   85000       Tech    True          3.8  3230.0     Senior
emp004    David   42   92000       Tech    True          4.7  4324.0     Senior
emp005      Eva   28   70000    Finance    True          4.2  2940.0     Junior

6. Removing multiple columns:
           Name  Age      City  Salary Department Experience
emp001    Alice   24  New York   65000         HR     Junior
emp002      Bob   30    Boston   72000      Sales     Junior
emp003  Charlie   35   Chicago   85000       Tech     Senior
emp004    David   42   Seattle   92000       Tech     Senior
emp005      Eva   28     Miami   70000    Finance     Junior

7. Removing multiple columns (inplace=True):
           Name  Age  Salary Department  Performance   Bonus
emp001    Alice   24   65000         HR          4.5  2925.0
emp002      Bob   30   72000      Sales          4.0  2880.0
emp003  Charlie   35   85000       Tech          3.8  3230.0
emp004    David   42   92000       Tech          4.7  4324.0
emp005      Eva   28   70000    Finance          4.2  2940.0

Adding and Removing Rows

  • Removing

    • a row by index label

    • multiple rows

  • Adding a new row

    • with .loc

    • with a Series

Demo


data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 30, 35, 42, 28],
    'City': ['New York', 'Boston', 'Chicago', 'Seattle', 'Miami'],
    'Salary': [65000, 72000, 85000, 92000, 70000],
    'Department': ['HR', 'Sales', 'Tech', 'Tech', 'Finance']
}
df = pd.DataFrame(data)
df.index = ['emp001', 'emp002', 'emp003', 'emp004', 'emp005']  # Custom index

# 1. Removing a row by index label
df_no_bob = df.drop('emp002')
print("\n1. DataFrame without Bob (emp002):")
print(df_no_bob)

# 2. Removing multiple rows
df_reduced = df.drop(['emp001', 'emp005'])
print("\n2. DataFrame without emp001 and emp005:")
print(df_reduced)

# 3. Adding a new row with .loc
# Create a copy to avoid SettingWithCopyWarning
df_new = df.copy()
df_new.loc['emp006'] = ['Frank', 38, 'Dallas', 88000, 'Sales']
print("\n3. DataFrame with new employee:")
print(df_new)

# 4. Adding a row with a Series
new_employee = pd.Series({
    'Name': 'Grace', 'Age': 27, 'City': 'Denver', 'Salary': 67000, 'Department': 'HR',
}, name='emp007')

df_newer = pd.concat([df_new, pd.DataFrame([new_employee])])
print("\n4. DataFrame with another new employee:")
print(df_newer)

Output

1. DataFrame without Bob (emp002):
           Name  Age      City  Salary Department
emp001    Alice   24  New York   65000         HR
emp003  Charlie   35   Chicago   85000       Tech
emp004    David   42   Seattle   92000       Tech
emp005      Eva   28     Miami   70000    Finance

2. DataFrame without emp001 and emp005:
           Name  Age     City  Salary Department
emp002      Bob   30   Boston   72000      Sales
emp003  Charlie   35  Chicago   85000       Tech
emp004    David   42  Seattle   92000       Tech

3. DataFrame with new employee:
           Name  Age      City  Salary Department
emp001    Alice   24  New York   65000         HR
emp002      Bob   30    Boston   72000      Sales
emp003  Charlie   35   Chicago   85000       Tech
emp004    David   42   Seattle   92000       Tech
emp005      Eva   28     Miami   70000    Finance
emp006    Frank   38    Dallas   88000      Sales

4. DataFrame with another new employee:
           Name  Age      City  Salary Department
emp001    Alice   24  New York   65000         HR
emp002      Bob   30    Boston   72000      Sales
emp003  Charlie   35   Chicago   85000       Tech
emp004    David   42   Seattle   92000       Tech
emp005      Eva   28     Miami   70000    Finance
emp006    Frank   38    Dallas   88000      Sales
emp007    Grace   27    Denver   67000         HR

Discussion

  • Adding columns is a common operation, especially when you need to create derived fields or features

  • Notice that we can add columns based on calculations from other columns - this is ideal for metrics and KPIs

  • The drop() function is powerful but doesn’t modify the original DataFrame unless you specify inplace=True

  • Adding rows is less common but useful for simulation, testing, or creating summary rows

  • Always be careful with the axis parameter - axis=0 is for rows, axis=1 is for columns

Exercise

Adding and Removing Data:

products = {
    'Product_ID': ['P001', 'P002', 'P003', 'P004', 'P005', 'P006'],
    'Product_Name': ['Laptop', 'Smartphone', 'Tablet', 'Monitor', 'Keyboard', 'Mouse'],
    'Category': ['Electronics', 'Electronics', 'Electronics', 
                 'Electronics', 'Accessories', 'Accessories'],
    'Price': [1200, 800, 350, 250, 75, 25],
    'In_Stock': [True, True, False, True, True, False],
    'Units': [15, 28, 0, 10, 45, 0]
}
inventory = pd.DataFrame(products)

Use inventory dataframe and

  • Add a ‘Value’ column that multiplies Price by Units

  • Add a ‘Status’ column: ‘Available’ if In_Stock is True, ‘Out of Stock’ otherwise

  • Remove the In_Stock column (now redundant with Status)

  • Add a new_product row

new_product = pd.Series({
    'Product_ID': 'P007',
    'Product_Name': 'Headphones',
    'Category': 'Accessories',
    'Price': 150,
    'Units': 20,
    'Value': 3000,
    'Status': 'Available'
})

DataFrame Sorting

Sorting functions

Method

Description

Key Parameters

sort_values()

Sort by column values

by, ascending, inplace, na_position

sort_index()

Sort by index

ascending, inplace

Multi-column sorting

Sort by multiple columns

by=['col1', 'col2']

Custom sorting

Sort with custom orders

by=col, key=function

Basic Sorting

  • Sorting by a single column - ascending and descending

  • Sorting by index

  • Sorting by multiple columns

  • Sorting with different directions for each column

Demo

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 30, 35, 42, 28],
    'City': ['New York', 'Boston', 'Chicago', 'Seattle', 'Miami'],
    'Salary': [65000, 72000, 85000, 92000, 70000],
    'Department': ['HR', 'Sales', 'Tech', 'Tech', 'Finance']
}
df = pd.DataFrame(data)
df.index = ['emp001', 'emp002', 'emp003', 'emp004', 'emp005']  # Custom index

# Continuing with our employee DataFrame
print("Original DataFrame:")
print(df)

# 1. Sorting by a single column (ascending by default)
df_by_age = df.sort_values('Age')
print("\n1. Sorted by Age (ascending):")
print(df_by_age)

# 2. Sorting by a single column (descending)
df_by_salary_desc = df.sort_values('Salary', ascending=False)
print("\n2. Sorted by Salary (descending):")
print(df_by_salary_desc)

# 3. Sorting by index
df_by_index = df.sort_index()
print("\n3. Sorted by index:")
print(df_by_index)

# 4. Sorting by multiple columns
df_multi_sort = df.sort_values(['Department', 'Salary'])
print("\n4. Sorted by Department, then by Salary within each department:")
print(df_multi_sort)

# 5. Sorting with different directions for each column
df_complex = df.sort_values(['Department', 'Salary'], 
                          ascending=[True, False])
print("\n5. Sorted by Department (asc), then by Salary (desc) within each department:")
print(df_complex)

Output

Original DataFrame:
           Name  Age      City  Salary Department
emp001    Alice   24  New York   65000         HR
emp002      Bob   30    Boston   72000      Sales
emp003  Charlie   35   Chicago   85000       Tech
emp004    David   42   Seattle   92000       Tech
emp005      Eva   28     Miami   70000    Finance

1. Sorted by Age (ascending):
           Name  Age      City  Salary Department
emp001    Alice   24  New York   65000         HR
emp005      Eva   28     Miami   70000    Finance
emp002      Bob   30    Boston   72000      Sales
emp003  Charlie   35   Chicago   85000       Tech
emp004    David   42   Seattle   92000       Tech

2. Sorted by Salary (descending):
           Name  Age      City  Salary Department
emp004    David   42   Seattle   92000       Tech
emp003  Charlie   35   Chicago   85000       Tech
emp002      Bob   30    Boston   72000      Sales
emp005      Eva   28     Miami   70000    Finance
emp001    Alice   24  New York   65000         HR

3. Sorted by index:
           Name  Age      City  Salary Department
emp001    Alice   24  New York   65000         HR
emp002      Bob   30    Boston   72000      Sales
emp003  Charlie   35   Chicago   85000       Tech
emp004    David   42   Seattle   92000       Tech
emp005      Eva   28     Miami   70000    Finance

4. Sorted by Department, then by Salary within each department:
           Name  Age      City  Salary Department
emp005      Eva   28     Miami   70000    Finance
emp001    Alice   24  New York   65000         HR
emp002      Bob   30    Boston   72000      Sales
emp003  Charlie   35   Chicago   85000       Tech
emp004    David   42   Seattle   92000       Tech

5. Sorted by Department (asc), then by Salary (desc) within each department:
           Name  Age      City  Salary Department
emp005      Eva   28     Miami   70000    Finance
emp001    Alice   24  New York   65000         HR
emp002      Bob   30    Boston   72000      Sales
emp004    David   42   Seattle   92000       Tech
emp003  Charlie   35   Chicago   85000       Tech

Advanced Sorting

  • Sort with NaNs at the beginning or end

  • Custom categorical sorting (custom order)

Demo

# 1. Sorting with NaN values
# Create a DataFrame with some missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 30, 35, 42, 28],
    'City': ['New York', 'Boston', 'Chicago', 'Seattle', 'Miami'],
    'Salary': [65000, 72000, 85000, 92000, 70000],
    'Department': ['HR', 'Sales', 'Tech', 'Tech', 'Finance']
}
df = pd.DataFrame(data)
df.index = ['emp001', 'emp002', 'emp003', 'emp004', 'emp005']  # Custom index

df.loc['emp001', 'City'] = np.nan
df.loc['emp003', 'City'] = np.nan

print("1. DataFrame with NaN values:")
print(df)

# Sort with NaNs at the beginning
print("\n1a. Sorted with NaNs first:")
print(df.sort_values('City', na_position='first'))

# Sort with NaNs at the end
print("\n1b. Sorted with NaNs last:")
print(df.sort_values('City', na_position='last'))

# 3. Custom categorical sorting
# Create a custom order for departments
dept_order = ['HR', 'Sales', 'Finance', 'Tech']
df['Dept_Coded'] = pd.Categorical(df['Department'], 
                                 categories=dept_order, 
                                 ordered=True)

print("\n3. Custom department order sorting:")
print(df.sort_values('Dept_Coded'))

Output

1. DataFrame with NaN values:
           Name  Age     City  Salary Department
emp001    Alice   24      NaN   65000         HR
emp002      Bob   30   Boston   72000      Sales
emp003  Charlie   35      NaN   85000       Tech
emp004    David   42  Seattle   92000       Tech
emp005      Eva   28    Miami   70000    Finance

1a. Sorted with NaNs first:
           Name  Age     City  Salary Department
emp001    Alice   24      NaN   65000         HR
emp003  Charlie   35      NaN   85000       Tech
emp002      Bob   30   Boston   72000      Sales
emp005      Eva   28    Miami   70000    Finance
emp004    David   42  Seattle   92000       Tech

1b. Sorted with NaNs last:
           Name  Age     City  Salary Department
emp002      Bob   30   Boston   72000      Sales
emp005      Eva   28    Miami   70000    Finance
emp004    David   42  Seattle   92000       Tech
emp001    Alice   24      NaN   65000         HR
emp003  Charlie   35      NaN   85000       Tech

3. Custom department order sorting:
           Name  Age     City  Salary Department Dept_Coded
emp001    Alice   24      NaN   65000         HR         HR
emp002      Bob   30   Boston   72000      Sales      Sales
emp005      Eva   28    Miami   70000    Finance    Finance
emp003  Charlie   35      NaN   85000       Tech       Tech
emp004    David   42  Seattle   92000       Tech       Tech

Discussion

  • Sorting is essential for both data analysis and presentation.

  • Multi-column sorting is particularly useful for hierarchical data.

  • The ascending parameter can be a single boolean or a list of booleans for multi-column sorts.

  • Custom sorting allows for domain-specific ordering beyond simple alphabetical or numerical order.

  • The inplace=True parameter can be used to modify the original DataFrame rather than creating a new one.

Exercise

Multiple Column Sorting with Custom Orders:

This dataset represents results from three independent candidate gene studies investigating the association between common genetic variants and breast cancer risk, where odds ratios indicate the increased likelihood of developing breast cancer for carriers of each gene variant compared to non-carriers.

  1. Create correct chromosome ordering: ‘Chr1’ to ‘ChrX’, ‘ChrY’

  2. Create study category with custom order: ‘Study1’, ‘Study2’, ‘Study3’

  3. Sort by Chromosome (numerical), then by Study (custom order)

  4. Sort by statistical significance (p-value) within each Chromosome

  5. Select genes sorted by P-value and effect size within each chromosome

import pandas as pd

# Create a genetic dataset
genetic_data = {
    'Gene': ['BRCA1', 'TP53', 'APOE', 'PTEN', 'BRAF', 'KRAS', 
            'BRCA2', 'EGFR', 'TNF', 'IL6'],
    'Chromosome': ['Chr17', 'Chr17', 'Chr19', 'Chr10', 'Chr7', 'Chr12', 
                  'Chr13', 'Chr7', 'Chr6', 'Chr7'],
    'Study': ['Study1', 'Study2', 'Study1', 'Study3', 'Study2', 'Study3', 
             'Study1', 'Study2', 'Study3', 'Study1'],
    'PValue': [0.0001, 0.0050, 0.0200, 0.0150, 0.0300, 0.0400, 
              0.0005, 0.0250, 0.0100, 0.0450],
    'EffectSize': [2.5, 1.8, 3.2, 2.1, 1.5, 1.2, 2.7, 1.9, 2.3, 1.1]
}

Key Takeaways

Keypoints

  1. Flexible Selection: Pandas provides multiple ways to select, filter, and extract data

  2. Structure Modification: Adding and removing columns/rows enables dataset evolution

  3. Meaningful Sorting: Proper sorting is essential for analysis and presentation

  4. Custom Orders: Categorical data types allow you to enforce domain-specific ordering

  5. Combined Operations: The real power comes from combining these techniques