Data Import and Export in Pandas
Objectives
Illustrate how to read data from various formats, including CSV, Excel, and JSON, into Pandas.
Show how to export DataFrames to different formats, ensuring participants can save their analyses.
Utilize data inspection methods (head, info, describe) to gain an understanding of data structure and content.
Instructor note
Teaching: 10 min
Demo: 20 min
Introduction to Data Import and Export
Real-world data analysis starts with getting data into Pandas
Pandas supports many file formats natively
The process: Import → Inspect → Manipulate → Export
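A minimal sketch of that process end to end (the data and file names below are made up for illustration):
import pandas as pd
from io import StringIO
# Stand-in for a real file on disk
raw = "ID,Value\n1,10.5\n2,20.1\n3,15.8\n"
df = pd.read_csv(StringIO(raw))          # Import
print(df.head())                         # Inspect a preview
df.info()                                # Inspect structure and dtypes
df['Value'] = df['Value'].round()        # Manipulate
df.to_csv('cleaned.csv', index=False)    # Export (placeholder file name)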
Reading Data from Various Formats
Reading Data - Main Methods
Function | Purpose | Common Parameters
---|---|---
pd.read_csv() | Read CSV files | sep, header, index_col, usecols, nrows
pd.read_excel() | Read Excel files | sheet_name, header, skiprows, usecols
pd.read_json() | Read JSON files | orient, lines
pd.read_sql() | Read SQL query/table | sql, con, index_col
pd.read_html() | Read HTML tables | match, header, attrs
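pd.read_sql() and pd.read_html() are not demonstrated in this section, so here is a minimal, self-contained sketch of pd.read_html(); it parses every <table> in an HTML document and needs an HTML parser such as lxml or html5lib installed:
import pandas as pd
from io import StringIO
html = """
<table>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>A</td><td>1</td></tr>
  <tr><td>B</td><td>2</td></tr>
</table>
"""
# read_html returns a list of DataFrames, one per table found
tables = pd.read_html(StringIO(html))
print(tables[0])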
Reading CSV Data
Demo
import pandas as pd
from urllib.request import urlopen
# Basic CSV reading
sample_data_csv_url = "https://coderefinery.github.io/NumPy-and-Pandas-fundamentals-for-handling-biological-datasets/_downloads/4c11eeffe234e7bc729008f767fe631a/sample_data.csv"
df_csv = pd.read_csv(urlopen(sample_data_csv_url))
# df_csv = pd.read_csv('test_data/sample_data.csv')
print("Basic CSV import:")
print(df_csv.head()) # Show first 5 rows
# Customizing CSV import
df_csv_custom = pd.read_csv(urlopen(sample_data_csv_url),
                            sep=',',           # Delimiter (comma is default)
                            header=0,          # Row to use as column names (0 is default)
                            index_col='ID',    # Column to set as index
                            usecols=['ID', 'Name', 'Value'],  # Only import specific columns
                            nrows=5)           # Only read first 5 rows
print("\nCustomized CSV import:")
print(df_csv_custom)
Output
Basic CSV import:
ID Name Age City Value Category Date
0 1 Person_1 22 Chicago 34.61 B 2023-01-01
1 2 Person_2 52 Phoenix 47.46 A 2023-01-02
2 3 Person_3 38 Phoenix 48.85 D 2023-01-03
3 4 Person_4 27 Houston 75.10 A 2023-01-04
4 5 Person_5 34 Houston 47.88 B 2023-01-05
Customized CSV import:
Name Value
ID
1 Person_1 34.61
2 Person_2 47.46
3 Person_3 48.85
4 Person_4 75.10
5 Person_5 47.88
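For files too large to load in one go, read_csv() can also iterate over the file in chunks. A small sketch, assuming the local copy of sample_data.csv referenced above:
# chunksize makes read_csv yield DataFrames of (up to) that many rows
total = 0.0
for chunk in pd.read_csv('test_data/sample_data.csv', chunksize=2):
    total += chunk['Value'].sum()
print(f"Sum of 'Value' across all chunks: {total}")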
Reading Excel Data
Demo
# Basic Excel reading
import requests
from io import BytesIO
## Read XLSX file directly from a URL
# sample_data_xls_url = "https://coderefinery.github.io/NumPy-and-Pandas-fundamentals-for-handling-biological-datasets/_downloads/04c10398f323fcf8ff5ec474e4b22205/sample_data.xlsx"
# response = requests.get(sample_data_xls_url)
# df_excel = pd.read_excel(BytesIO(response.content))
df_excel = pd.read_excel('test_data/sample_data.xlsx')
print("Basic Excel import:")
print(df_excel.head())
# Reading specific sheet
df_excel_sheet2 = pd.read_excel('test_data/sample_data.xlsx',
                                sheet_name='Sheet 1 - sample_data',
                                skiprows=1)  # Skip first row
print("\nReading specific Excel sheet:")
print(df_excel_sheet2.head())
# List all sheets in workbook
xl = pd.ExcelFile('test_data/sample_data.xlsx')
print(f"\nAll sheets in workbook: {xl.sheet_names}")
Output
Basic Excel import:
sample_data Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6
0 ID Name Age City Value Category Date
1 1 Person_1 22 Chicago 34.61 B 2023-01-01
2 2 Person_2 52 Phoenix 47.46 A 2023-01-02
3 3 Person_3 38 Phoenix 48.85 D 2023-01-03
4 4 Person_4 27 Houston 75.1 A 2023-01-04
Reading specific Excel sheet:
ID Name Age City Value Category Date
0 1 Person_1 22 Chicago 34.61 B 2023-01-01
1 2 Person_2 52 Phoenix 47.46 A 2023-01-02
2 3 Person_3 38 Phoenix 48.85 D 2023-01-03
3 4 Person_4 27 Houston 75.10 A 2023-01-04
4 5 Person_5 34 Houston 47.88 B 2023-01-05
All sheets in workbook: ['Sheet 1 - sample_data']
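If you need every sheet at once, sheet_name=None returns a dict mapping sheet names to DataFrames:
all_sheets = pd.read_excel('test_data/sample_data.xlsx', sheet_name=None)
for name, sheet_df in all_sheets.items():
    print(f"Sheet '{name}': {sheet_df.shape[0]} rows, {sheet_df.shape[1]} columns")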
More info
Additional notes: Reading JSON Data
# Reading JSON
df_json = pd.read_json('test_data/sample_data.json')
print("JSON import:")
print(df_json.head())
JSON import:
Name Department Salary Hire_Date
0 Alex Sales 55000 2020-01-15
1 Beth HR 60000 2019-07-10
2 Charlie Tech 75000 2021-03-22
3 Diana Finance 70000 2018-11-05
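Real JSON is often nested rather than flat. pd.json_normalize() flattens nested records into columns; the records below are invented for illustration:
records = [
    {"name": "Alex", "contact": {"email": "alex@example.com", "city": "Oslo"}},
    {"name": "Beth", "contact": {"email": "beth@example.com", "city": "Turku"}},
]
df_nested = pd.json_normalize(records)
print(df_nested)  # columns: name, contact.email, contact.city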
Note
Notice that all these different file formats are loaded into the same DataFrame structure.
Each import function has many parameters for fine-tuning the import process.
The defaults work well in many cases, but customization gives you control over exactly what data gets loaded and how.
For real-world data, you’ll often need to use these parameters to handle messy or improperly formatted files.
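As a concrete sketch (the file contents below are invented), several of those parameters can be combined to tame a messy file:
from io import StringIO
messy = """# exported 2023-01-01
ID;Name;Value;Date
1;Alex;34.61;2023-01-01
2;Beth;N/A;2023-01-02
3;;48.85;2023-01-03
"""
df_messy = pd.read_csv(StringIO(messy),
                       sep=';',               # non-default delimiter
                       skiprows=1,            # skip the comment line
                       na_values=['N/A'],     # treat 'N/A' as missing
                       parse_dates=['Date'])  # convert Date to datetime64
print(df_messy)
print(df_messy.dtypes)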
Writing Data to Files
Writing Data - Main Methods
Function | Purpose | Common Parameters
---|---|---
df.to_csv() | Write to CSV | index, sep, header, columns
df.to_excel() | Write to Excel | sheet_name, index, startrow
df.to_json() | Write to JSON | orient, date_format
df.to_sql() | Write to SQL table | name, con, if_exists, index
df.to_html() | Write to HTML | index, columns, border
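df.to_sql() and pd.read_sql() are not demonstrated below; here is a minimal round-trip sketch using an in-memory SQLite database from the standard library:
import sqlite3
con = sqlite3.connect(':memory:')  # throwaway in-memory database
df_sql = pd.DataFrame({'Name': ['Alex', 'Beth'], 'Salary': [55000, 60000]})
# Write the DataFrame to a SQL table, then read it back with a query
df_sql.to_sql('employees', con, index=False, if_exists='replace')
print(pd.read_sql('SELECT * FROM employees WHERE Salary > 55000', con))
con.close()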
Writing to Different Formats
Demo
# Create a small DataFrame to export
data = {
    'Name': ['Alex', 'Beth', 'Charlie', 'Diana'],
    'Department': ['Sales', 'HR', 'Tech', 'Finance'],
    'Salary': [55000, 60000, 75000, 70000],
    'Hire_Date': ['2020-01-15', '2019-07-10', '2021-03-22', '2018-11-05']
}
df = pd.DataFrame(data)
print("DataFrame to export:")
print(df)
# Export to CSV
df.to_csv('employees.csv', index=False)
print("\nExported to CSV (employees.csv)")
# Export to Excel
df.to_excel('employees.xlsx', sheet_name='Employees', index=False)
print("Exported to Excel (employees.xlsx)")
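To write several sheets into one workbook, use pd.ExcelWriter (this needs an engine such as openpyxl; the sheet split below is only for illustration):
with pd.ExcelWriter('employees_multi.xlsx') as writer:
    df.to_excel(writer, sheet_name='All', index=False)
    df[df['Salary'] > 60000].to_excel(writer, sheet_name='HighEarners', index=False)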
More info
Additional notes: Export to JSON
# Export to JSON (with different orientations)
df.to_json('employees_records.json', orient='records')
print("Exported to JSON (employees_records.json) in records format")
df.to_json('employees_split.json', orient='split')
print("Exported to JSON (employees_split.json) in split format")
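To see how the two orientations differ, serialize to strings instead of files (output in the comments is abbreviated):
print(df.head(2).to_json(orient='records'))
# [{"Name":"Alex",...},{"Name":"Beth",...}]      -- a list of row objects
print(df.head(2).to_json(orient='split'))
# {"columns":[...],"index":[...],"data":[...]}   -- columns, index, and data kept separate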
Note
Writing data is even simpler than reading it, since Pandas already knows the structure.
Setting index=False is common when exporting, to avoid writing the index as an extra column.
Different export formats are useful in different contexts:
CSV is universal but loses data types
Excel preserves formatting but requires a spreadsheet application to open
JSON is great for metadata
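The CSV caveat is easy to demonstrate with a round trip, reusing the df exported above:
df['Hire_Date'] = pd.to_datetime(df['Hire_Date'])
print(df.dtypes)  # Hire_Date is datetime64[ns]
df.to_csv('employees.csv', index=False)
df_back = pd.read_csv('employees.csv')
print(df_back.dtypes)  # Hire_Date comes back as object (plain strings)
# Recover the type explicitly at import time
df_back = pd.read_csv('employees.csv', parse_dates=['Hire_Date'])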
Data Inspection Methods
Key Inspection Methods
Method | Purpose | What It Shows
---|---|---
df.head(n) / df.tail(n) | View first/last n rows | A preview of the data
df.info() | Summary of DataFrame | Data types, non-null counts, memory usage
df.describe() | Statistical summary | Count, mean, std, min, 25%, 50%, 75%, max
df.dtypes | Column data types | Data type of each column
df.shape | Dimensions | (rows, columns)
df.columns | Column names | Index of column names
df['col'].value_counts() | Count unique values | Frequency of each unique value
Demo
Inspection Methods
# Let's inspect a more interesting dataset
import numpy as np
# Create a sample dataset with different data types
np.random.seed(42) # For reproducibility
data = {
    'Category': np.random.choice(['A', 'B', 'C', 'D'], size=1000),
    'Value': np.random.normal(100, 20, size=1000),
    'Count': np.random.randint(1, 100, size=1000),
    'Date': pd.date_range(start='2023-01-01', periods=1000),
    'Flag': np.random.choice([True, False], size=1000)
}
df = pd.DataFrame(data)
# Add some missing values
df.loc[10:20, 'Value'] = np.nan
df.loc[500:510, 'Category'] = None
# Now let's inspect
print("First 5 rows with head():")
print(df.head())
print("\nLast 3 rows with tail(3):")
print(df.tail(3))
print("\nDataFrame information with info():")
df.info()
print("\nStatistical summary with describe():")
print(df.describe())
print("\nInclude all columns in describe (even non-numeric):")
print(df.describe(include='all'))
print("\nData types of each column:")
print(df.dtypes)
print("\nShape (rows, columns):", df.shape)
print("\nValue counts for 'Category':")
print(df['Category'].value_counts())
Output
First 5 rows with head():
Category Value Count Date Flag
0 C 106.835120 72 2023-01-01 False
1 D 137.523417 49 2023-01-02 False
2 A 119.008477 79 2023-01-03 True
3 C 88.461927 17 2023-01-04 True
4 C 82.031707 60 2023-01-05 False
Last 3 rows with tail(3):
Category Value Count Date Flag
997 D 87.885702 77 2025-09-24 True
998 D 114.841907 59 2025-09-25 False
999 C 105.985852 47 2025-09-26 False
DataFrame information with info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Category 989 non-null object
1 Value 989 non-null float64
2 Count 1000 non-null int64
3 Date 1000 non-null datetime64[ns]
4 Flag 1000 non-null bool
dtypes: bool(1), datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 32.4+ KB
Statistical summary with describe():
Value Count Date
count 989.000000 1000.000000 1000
mean 100.858217 49.007000 2024-05-14 12:00:00
min 42.074892 1.000000 2023-01-01 00:00:00
25% 87.503628 24.000000 2023-09-07 18:00:00
50% 100.811834 48.000000 2024-05-14 12:00:00
75% 113.637830 73.000000 2025-01-19 06:00:00
max 161.577616 99.000000 2025-09-26 00:00:00
std 20.024572 28.269703 NaN
Include all columns in describe (even non-numeric):
Category Value Count Date Flag
count 989 989.000000 1000.000000 1000 1000
unique 4 NaN NaN NaN 2
top D NaN NaN NaN True
freq 277 NaN NaN NaN 512
mean NaN 100.858217 49.007000 2024-05-14 12:00:00 NaN
min NaN 42.074892 1.000000 2023-01-01 00:00:00 NaN
25% NaN 87.503628 24.000000 2023-09-07 18:00:00 NaN
50% NaN 100.811834 48.000000 2024-05-14 12:00:00 NaN
75% NaN 113.637830 73.000000 2025-01-19 06:00:00 NaN
max NaN 161.577616 99.000000 2025-09-26 00:00:00 NaN
std NaN 20.024572 28.269703 NaN NaN
Data types of each column:
Category object
Value float64
Count int64
Date datetime64[ns]
Flag bool
dtype: object
Shape (rows, columns): (1000, 5)
Value counts for 'Category':
Category
D 277
A 254
C 229
B 229
Name: count, dtype: int64
Note
These inspection methods are critical to understanding your data before analysis.
Always start data exploration with these methods - they quickly reveal dataset characteristics and potential issues.
info() shows missing values and data types - crucial for cleaning strategies.
describe() gives statistical summaries - great for spotting outliers or distribution patterns.
Real data is messy - these tools help you understand what needs to be cleaned.
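As a quick complement to info(), you can count missing values per column directly (reusing the df from the demo above):
print(df.isnull().sum())  # missing values per column
# value_counts(dropna=False) also shows how many NaN a column holds
print(df['Category'].value_counts(dropna=False))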
More info
Additional notes: Exercises
Exercise 1: Writing Data:
Use the grades_df dataframe and export it to:
CSV without the index, but including headers
JSON in 'records' format
Verify by reading the files back
# Create a dataframe of student grades
grades_data = {
    'Student': ['Alex', 'Barbara', 'Carlos', 'Diana', 'Evan'],
    'Math': [85, 92, 78, 96, 88],
    'Science': [92, 85, 91, 88, 95],
    'History': [79, 89, 85, 82, 91]
}
grades_df = pd.DataFrame(grades_data)
Exercise 1: solution:
# Task 1: Export to CSV without the index, but including headers
grades_df.to_csv('student_grades.csv', index=False)
# Task 2: Export to JSON in 'records' format
grades_df.to_json('student_grades.json', orient='records')
# Verify by reading the files back
print("CSV file contents:")
print(pd.read_csv('student_grades.csv').head())
print("\nJSON file contents:")
print(pd.read_json('student_grades.json').head())
Exercise 2: Data Inspection:
Perform a comprehensive inspection using the products_df dataframe and answer these questions:
How many rows and columns are in this dataset?
What data type is the ‘Rating’ column?
How many null values are in the dataset?
What’s the average price?
# Create a dataframe with some intentional quirks
data = {
    'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'],
    'Price': [1200, 800, 350, 250, 75],
    'Stock': [10, 25, 30, 5, 50],
    'Rating': [4.5, 4.8, 4.2, 4.7, None],  # Note the None value
    'OnSale': [True, False, True, False, False]
}
products_df = pd.DataFrame(data)
Exercise 2: solution:
# Inspect the data
print("Shape:", products_df.shape)
products_df.info()
print("\nStatistical summary:")
print(products_df.describe())
print("\nIncluding all columns:")
print(products_df.describe(include='all'))
# Answers:
print("\nAnswers:")
print(f"1. {products_df.shape[0]} rows and {products_df.shape[1]} columns")
print(f"2. The 'Rating' column has dtype: {products_df['Rating'].dtype}")
print(f"3. Total null values: {products_df.isnull().sum().sum()}")
print(f"4. Average price: ${products_df['Price'].mean():.2f}")
Key Takeaways
Keypoints
Data Import Flexibility: Pandas can read data from virtually any structured data format
Customized Reading: Parameters let you control exactly how data is imported
Easy Export: Writing data is straightforward with similar syntax across formats
Always Inspect First: head(), info(), and describe() should be your first steps with any new dataset
Real-World Ready: These tools prepare you for working with messy, real-world data