Data Import and Export in Pandas

Objectives

  • Illustrate how to read data from various formats, including CSV, Excel, and JSON, into Pandas.

  • Show how to export DataFrames to different formats, ensuring participants can save their analyses.

  • Utilize data inspection methods (head, info, describe) to gain an understanding of data structure and content.

Instructor note

  • Teaching: 10 min

  • Demo: 20 min

Introduction to Data Import and Export

  • Real-world data analysis starts with getting data into Pandas

  • Pandas supports many file formats natively

  • The process: Import → Inspect → Manipulate → Export (see the sketch below)
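
As a preview, the whole round trip fits in a few lines. This is a minimal sketch: the file name and column used here are placeholders, not part of the demo data.

import pandas as pd

df = pd.read_csv('measurements.csv')            # Import (placeholder file)
print(df.head())                                # Inspect
df['value'] = df['value'] * 2                   # Manipulate (example transformation)
df.to_csv('measurements_out.csv', index=False)  # Export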

Reading Data from Various Formats

Reading Data - Main Methods

Function         Purpose               Common Parameters
pd.read_csv()    Read CSV files        filepath_or_buffer, sep, header, index_col
pd.read_excel()  Read Excel files      io, sheet_name, header, index_col
pd.read_json()   Read JSON files       path_or_buf, orient, typ
pd.read_sql()    Read SQL query/table  sql, con, index_col
pd.read_html()   Read HTML tables      io, match, flavor
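
pd.read_sql() is not demonstrated later in this episode, so here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database (the table and column names are invented for illustration):

import sqlite3
import pandas as pd

con = sqlite3.connect(':memory:')  # throwaway in-memory database
con.execute("CREATE TABLE samples (id INTEGER, name TEXT, value REAL)")
con.executemany("INSERT INTO samples VALUES (?, ?, ?)",
                [(1, 'A', 0.5), (2, 'B', 1.5)])

df_sql = pd.read_sql("SELECT * FROM samples", con, index_col='id')
print(df_sql)
con.close()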

Reading CSV Data

Demo

import pandas as pd
from urllib.request import urlopen

# Basic CSV reading

sample_data_csv_url = "https://coderefinery.github.io/NumPy-and-Pandas-fundamentals-for-handling-biological-datasets/_downloads/4c11eeffe234e7bc729008f767fe631a/sample_data.csv"

df_csv = pd.read_csv(urlopen(sample_data_csv_url))
# df_csv = pd.read_csv('test_data/sample_data.csv')  # local-file alternative

print("Basic CSV import:")
print(df_csv.head())  # Show first 5 rows (head's default)

# Customizing CSV import
df_csv_custom = pd.read_csv(urlopen(sample_data_csv_url),
                           sep=',',            # Delimiter (comma is default)
                           header=0,           # Row to use as column names (0 is default)
                           index_col='ID',     # Column to set as index
                           usecols=['ID', 'Name', 'Value'],  # Only import specific columns
                           nrows=5)            # Only read first 5 rows
print("\nCustomized CSV import:")
print(df_csv_custom)

Reading Excel Data

Demo

# Basic Excel reading

# Alternative: read the XLSX file directly from a URL
# import requests
# from io import BytesIO
# sample_data_xls_url = "https://coderefinery.github.io/NumPy-and-Pandas-fundamentals-for-handling-biological-datasets/_downloads/04c10398f323fcf8ff5ec474e4b22205/sample_data.xlsx"
# response = requests.get(sample_data_xls_url)
# df_excel = pd.read_excel(BytesIO(response.content))

df_excel = pd.read_excel('test_data/sample_data.xlsx')
print("Basic Excel import:")
print(df_excel.head())

# Reading specific sheet
df_excel_sheet2 = pd.read_excel('test_data/sample_data.xlsx', 
                               sheet_name='Sheet 1 - sample_data',
                               skiprows=1)     # Skip first row
print("\nReading specific Excel sheet:")
print(df_excel_sheet2.head())

# List all sheets in workbook
xl = pd.ExcelFile('test_data/sample_data.xlsx')
print(f"\nAll sheets in workbook: {xl.sheet_names}")

More info

Additional notes: Reading JSON Data

# Reading JSON
df_json = pd.read_json('test_data/sample_data.json')
print("JSON import:")
print(df_json.head())

JSON import:
      Name Department  Salary   Hire_Date
0     Alex      Sales   55000  2020-01-15
1     Beth         HR   60000  2019-07-10
2  Charlie       Tech   75000  2021-03-22
3    Diana    Finance   70000  2018-11-05

Note

  • Notice that all these different file formats are loaded into the same DataFrame structure.

  • Each import function has many parameters for fine-tuning the import process.

  • The defaults work well in many cases, but customization gives you control over exactly what data gets loaded and how.

  • For real-world data, you’ll often need these parameters to handle messy or improperly formatted files (see the sketch below).
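
For instance (a hedged sketch; the inline CSV below is invented for illustration), sep, na_values, and parse_dates handle three common quirks at once:

from io import StringIO

# Invented example: semicolon-delimited, ad-hoc missing-value markers, a date column
messy = StringIO(
    "ID;Name;Value;Date\n"
    "1;Alex;12.5;2023-01-15\n"
    "2;Beth;N/A;2023-02-20\n"
    "3;Charlie;-999;2023-03-05\n"
)

df_messy = pd.read_csv(messy,
                       sep=';',                    # non-default delimiter
                       na_values=['N/A', '-999'],  # treat these markers as missing
                       parse_dates=['Date'])       # parse to datetime64
print(df_messy.dtypes)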

Writing Data to Files

Writing Data - Main Methods

Function     Purpose             Common Parameters
to_csv()     Write to CSV        path_or_buf, index, header, sep
to_excel()   Write to Excel      excel_writer, sheet_name, index, header
to_json()    Write to JSON       path_or_buf, orient, lines
to_sql()     Write to SQL table  name, con, if_exists, index
to_html()    Write to HTML       buf, columns, header, index
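
to_sql() does not appear in the demo below, so here is a self-contained sketch pairing it with pd.read_sql() through sqlite3 (the DataFrame and table name are invented):

import sqlite3
import pandas as pd

df_tmp = pd.DataFrame({'name': ['Alex', 'Beth'], 'salary': [55000, 60000]})
con = sqlite3.connect(':memory:')
df_tmp.to_sql('employees', con, if_exists='replace', index=False)  # write
print(pd.read_sql('SELECT * FROM employees', con))                 # read back
con.close()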

Writing to Different Formats

Demo

# Create a small DataFrame to export
data = {
    'Name': ['Alex', 'Beth', 'Charlie', 'Diana'],
    'Department': ['Sales', 'HR', 'Tech', 'Finance'],
    'Salary': [55000, 60000, 75000, 70000],
    'Hire_Date': ['2020-01-15', '2019-07-10', '2021-03-22', '2018-11-05']
}
df = pd.DataFrame(data)
print("DataFrame to export:")
print(df)

# Export to CSV
df.to_csv('employees.csv', index=False)
print("\nExported to CSV (employees.csv)")

# Export to Excel
df.to_excel('employees.xlsx', sheet_name='Employees', index=False)
print("Exported to Excel (employees.xlsx)")

More info

Additional notes: Export to JSON
# Export to JSON (with different orientations)
df.to_json('employees_records.json', orient='records')
print("Exported to JSON (employees_records.json) in records format")

df.to_json('employees_split.json', orient='split')
print("Exported to JSON (employees_split.json) in split format")

Note

  • Writing data is even simpler than reading it, since Pandas already knows the structure.

  • Setting index=False is common when exporting, to avoid writing the row index as an extra column.

  • Different export formats are useful in different contexts:

    • CSV is universal but loses data types (demonstrated below)

    • Excel preserves formatting but needs spreadsheet software to open

    • JSON is well suited to nested data and metadata
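
A small sketch of the first point: a datetime column survives a CSV round trip only as text unless re-parsed (the file name is illustrative):

# CSV stores everything as text, so the datetime dtype is lost on read-back
df_typed = pd.DataFrame({'when': pd.to_datetime(['2023-01-01', '2023-06-15'])})
df_typed.to_csv('typed_example.csv', index=False)
print(pd.read_csv('typed_example.csv').dtypes)                        # when: object
print(pd.read_csv('typed_example.csv', parse_dates=['when']).dtypes)  # when: datetime64[ns]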

Data Inspection Methods

Key Inspection Methods

Method           Purpose                  What It Shows
head() / tail()  View first/last n rows   A preview of the data
info()           Summary of DataFrame     Data types, non-null counts, memory usage
describe()       Statistical summary      Count, mean, std, min, 25%, 50%, 75%, max
dtypes           Column data types        Data type of each column
shape            Dimensions               (rows, columns)
columns          Column names             Index of column names
value_counts()   Count unique values      Frequency of each unique value

Inspection Methods

Demo

# Let's inspect a more interesting dataset
import numpy as np

# Create a sample dataset with different data types
np.random.seed(42)  # For reproducibility
data = {
    'Category': np.random.choice(['A', 'B', 'C', 'D'], size=1000),
    'Value': np.random.normal(100, 20, size=1000),
    'Count': np.random.randint(1, 100, size=1000),
    'Date': pd.date_range(start='2023-01-01', periods=1000),
    'Flag': np.random.choice([True, False], size=1000)
}
df = pd.DataFrame(data)

# Add some missing values
df.loc[10:20, 'Value'] = np.nan
df.loc[500:510, 'Category'] = None

# Now let's inspect
print("First 5 rows with head():")
print(df.head())

print("\nLast 3 rows with tail(3):")
print(df.tail(3))

print("\nDataFrame information with info():")
df.info()

print("\nStatistical summary with describe():")
print(df.describe())

print("\nInclude all columns in describe (even non-numeric):")
print(df.describe(include='all'))

print("\nData types of each column:")
print(df.dtypes)

print("\nShape (rows, columns):", df.shape)

print("\nValue counts for 'Category':")
print(df['Category'].value_counts())

Note

  • These inspection methods are critical to understanding your data before analysis.

  • Always start data exploration with these methods: they quickly reveal dataset characteristics and potential issues.

  • info() shows non-null counts and data types - crucial for planning cleaning strategies (see the sketch below).

  • describe() gives statistical summaries - great for spotting outliers or distribution patterns.

  • Real data is messy - these tools help you understand what needs to be cleaned.
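
Following up on the info() point, a quick sketch that counts missing values explicitly, using the df built in the demo above:

# Missing values per column (complements info()'s non-null counts)
print(df.isnull().sum())

# Fraction of missing cells per column
print(df.isnull().mean().round(3))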

More info

Additional notes: Exercises

Exercise 1: Writing Data:

Use the grades_df dataframe (created below) and complete these tasks:

  1. Export to CSV without the index, but including headers

  2. Export to JSON in 'records' format

  3. Verify by reading the files back

# Create a dataframe of student grades
grades_data = {
    'Student': ['Alex', 'Barbara', 'Carlos', 'Diana', 'Evan'],
    'Math': [85, 92, 78, 96, 88],
    'Science': [92, 85, 91, 88, 95],
    'History': [79, 89, 85, 82, 91]
}
grades_df = pd.DataFrame(grades_data)

Exercise 1: solution:

# Task 1: Export to CSV without the index, but including headers
grades_df.to_csv('student_grades.csv', index=False)

# Task 2: Export to JSON in 'records' format
grades_df.to_json('student_grades.json', orient='records')

# Verify by reading the files back
print("CSV file contents:")
print(pd.read_csv('student_grades.csv').head())

print("\nJSON file contents:")
print(pd.read_json('student_grades.json').head())

Exercise 2: Data Inspection:

Perform a comprehensive inspection of products_df (created below) and answer these questions:

  1. How many rows and columns are in this dataset?

  2. What data type is the ‘Rating’ column?

  3. How many null values are in the dataset?

  4. What’s the average price?

# Create a dataframe with some intentional quirks
data = {
    'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'],
    'Price': [1200, 800, 350, 250, 75],
    'Stock': [10, 25, 30, 5, 50],
    'Rating': [4.5, 4.8, 4.2, 4.7, None],  # Note the None value
    'OnSale': [True, False, True, False, False]
}
products_df = pd.DataFrame(data)

Exercise 2: solution:

# Inspect the data
print("Shape:", products_df.shape)
products_df.info()
print("\nStatistical summary:")
print(products_df.describe())
print("\nIncluding all columns:")
print(products_df.describe(include='all'))

# Answers:
print("\nAnswers:")
print(f"1. {products_df.shape[0]} rows and {products_df.shape[1]} columns")
print(f"2. The 'Rating' column has dtype: {products_df['Rating'].dtype}")
print(f"3. Total null values: {products_df.isnull().sum().sum()}")
print(f"4. Average price: ${products_df['Price'].mean():.2f}")

Key Takeaways

Keypoints

  1. Data Import Flexibility: Pandas can read data from virtually any structured data format

  2. Customized Reading: Parameters let you control exactly how data is imported

  3. Easy Export: Writing data is straightforward with similar syntax across formats

  4. Always Inspect First: head(), info(), and describe() should be your first steps with any new dataset

  5. Real-World Ready: These tools prepare you for working with messy, real-world data