Data Import and Export in Pandas
Objectives
Illustrate how to read data from various formats, including CSV, Excel, and JSON, into Pandas.
Show how to export DataFrames to different formats, ensuring participants can save their analyses.
Utilize data inspection methods (head, info, describe) to gain an understanding of data structure and content.
Instructor note
Teaching: 10 min
Demo: 20 min
Introduction to Data Import and Export
Real-world data analysis starts with getting data into Pandas
Pandas supports many file formats natively
The process: Import → Inspect → Manipulate → Export
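A minimal sketch of that process end to end (the data and file names below are made up for illustration):
import pandas as pd
from io import StringIO
# Stand-in for a real file on disk
raw = "ID,Value\n1,10.5\n2,20.1\n3,15.8\n"
df = pd.read_csv(StringIO(raw))          # Import
print(df.head())                         # Inspect a preview
df.info()                                # Inspect structure and dtypes
df['Value'] = df['Value'].round()        # Manipulate
df.to_csv('cleaned.csv', index=False)    # Export (placeholder file name)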
Reading Data from Various Formats
Reading Data - Main Methods
Function | Purpose | Common Parameters
---|---|---
pd.read_csv() | Read CSV files | sep, header, index_col, usecols, nrows
pd.read_excel() | Read Excel files | sheet_name, header, skiprows, usecols
pd.read_json() | Read JSON files | orient, lines
pd.read_sql() | Read SQL query/table | sql, con, index_col
pd.read_html() | Read HTML tables | match, header, attrs
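pd.read_sql() and pd.read_html() are not demonstrated in this section, so here is a minimal, self-contained sketch of pd.read_html(); it parses every <table> in an HTML document and needs an HTML parser such as lxml or html5lib installed:
import pandas as pd
from io import StringIO
html = """
<table>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>A</td><td>1</td></tr>
  <tr><td>B</td><td>2</td></tr>
</table>
"""
# read_html returns a list of DataFrames, one per table found
tables = pd.read_html(StringIO(html))
print(tables[0])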
Reading CSV Data
Demo
import pandas as pd
from urllib.request import urlopen
# Basic CSV reading
sample_data_csv_url = "https://coderefinery.github.io/NumPy-and-Pandas-fundamentals-for-handling-biological-datasets/_downloads/4c11eeffe234e7bc729008f767fe631a/sample_data.csv"
df_csv = pd.read_csv(urlopen(sample_data_csv_url))
# df_csv = pd.read_csv('test_data/sample_data.csv')
print("Basic CSV import:")
print(df_csv.head()) # Show first 5 rows
# Customizing CSV import
df_csv_custom = pd.read_csv(urlopen(sample_data_csv_url),
                            sep=',',           # Delimiter (comma is default)
                            header=0,          # Row to use as column names (0 is default)
                            index_col='ID',    # Column to set as index
                            usecols=['ID', 'Name', 'Value'],  # Only import specific columns
                            nrows=5)           # Only read first 5 rows
print("\nCustomized CSV import:")
print(df_csv_custom)
Output
Basic CSV import:
ID Name Age City Value Category Date
0 1 Person_1 22 Chicago 34.61 B 2023-01-01
1 2 Person_2 52 Phoenix 47.46 A 2023-01-02
2 3 Person_3 38 Phoenix 48.85 D 2023-01-03
3 4 Person_4 27 Houston 75.10 A 2023-01-04
4 5 Person_5 34 Houston 47.88 B 2023-01-05
Customized CSV import:
Name Value
ID
1 Person_1 34.61
2 Person_2 47.46
3 Person_3 48.85
4 Person_4 75.10
5 Person_5 47.88
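For files too large to load in one go, read_csv() can also iterate over the file in chunks. A small sketch, assuming the local copy of sample_data.csv referenced above:
# chunksize makes read_csv yield DataFrames of (up to) that many rows
total = 0.0
for chunk in pd.read_csv('test_data/sample_data.csv', chunksize=2):
    total += chunk['Value'].sum()
print(f"Sum of 'Value' across all chunks: {total}")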
Reading Excel Data
Demo
# Basic Excel reading
import requests
from io import BytesIO
## Read XLSX file directly from a URL
# sample_data_xls_url = "https://coderefinery.github.io/NumPy-and-Pandas-fundamentals-for-handling-biological-datasets/_downloads/04c10398f323fcf8ff5ec474e4b22205/sample_data.xlsx"
# response = requests.get(sample_data_xls_url)
# df_excel = pd.read_excel(BytesIO(response.content))
df_excel = pd.read_excel('test_data/sample_data.xlsx')
print("Basic Excel import:")
print(df_excel.head())
# Reading specific sheet
df_excel_sheet2 = pd.read_excel('test_data/sample_data.xlsx',
                                sheet_name='Sheet 1 - sample_data',
                                skiprows=1)  # Skip first row
print("\nReading specific Excel sheet:")
print(df_excel_sheet2.head())
# List all sheets in workbook
xl = pd.ExcelFile('test_data/sample_data.xlsx')
print(f"\nAll sheets in workbook: {xl.sheet_names}")
Output
Basic Excel import:
sample_data Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6
0 ID Name Age City Value Category Date
1 1 Person_1 22 Chicago 34.61 B 2023-01-01
2 2 Person_2 52 Phoenix 47.46 A 2023-01-02
3 3 Person_3 38 Phoenix 48.85 D 2023-01-03
4 4 Person_4 27 Houston 75.1 A 2023-01-04
Reading specific Excel sheet:
ID Name Age City Value Category Date
0 1 Person_1 22 Chicago 34.61 B 2023-01-01
1 2 Person_2 52 Phoenix 47.46 A 2023-01-02
2 3 Person_3 38 Phoenix 48.85 D 2023-01-03
3 4 Person_4 27 Houston 75.10 A 2023-01-04
4 5 Person_5 34 Houston 47.88 B 2023-01-05
All sheets in workbook: ['Sheet 1 - sample_data']
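If you need every sheet at once, sheet_name=None returns a dict mapping sheet names to DataFrames:
all_sheets = pd.read_excel('test_data/sample_data.xlsx', sheet_name=None)
for name, sheet_df in all_sheets.items():
    print(f"Sheet '{name}': {sheet_df.shape[0]} rows, {sheet_df.shape[1]} columns")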
More info
Additional notes: Reading JSON Data
# Reading JSON
df_json = pd.read_json('test_data/sample_data.json')
print("JSON import:")
print(df_json.head())
JSON import:
Name Department Salary Hire_Date
0 Alex Sales 55000 2020-01-15
1 Beth HR 60000 2019-07-10
2 Charlie Tech 75000 2021-03-22
3 Diana Finance 70000 2018-11-05
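Real JSON is often nested rather than flat. pd.json_normalize() flattens nested records into columns; the records below are invented for illustration:
records = [
    {"name": "Alex", "contact": {"email": "alex@example.com", "city": "Oslo"}},
    {"name": "Beth", "contact": {"email": "beth@example.com", "city": "Turku"}},
]
df_nested = pd.json_normalize(records)
print(df_nested)  # columns: name, contact.email, contact.city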
Note
Notice that all these different file formats are loaded into the same DataFrame structure.
Each import function has many parameters for fine-tuning the import process.
The defaults work well in many cases, but customization gives you control over exactly what data gets loaded and how.
For real-world data, you’ll often need to use these parameters to handle messy or improperly formatted files.
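As a concrete sketch (the file contents below are invented), several of those parameters can be combined to tame a messy file:
from io import StringIO
messy = """# exported 2023-01-01
ID;Name;Value;Date
1;Alex;34.61;2023-01-01
2;Beth;N/A;2023-01-02
3;;48.85;2023-01-03
"""
df_messy = pd.read_csv(StringIO(messy),
                       sep=';',               # non-default delimiter
                       skiprows=1,            # skip the comment line
                       na_values=['N/A'],     # treat 'N/A' as missing
                       parse_dates=['Date'])  # convert Date to datetime64
print(df_messy)
print(df_messy.dtypes)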
Writing Data to Files
Writing Data - Main Methods
Function | Purpose | Common Parameters
---|---|---
df.to_csv() | Write to CSV | index, sep, header, columns
df.to_excel() | Write to Excel | sheet_name, index, startrow
df.to_json() | Write to JSON | orient, date_format
df.to_sql() | Write to SQL table | name, con, if_exists, index
df.to_html() | Write to HTML | index, columns, border
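df.to_sql() and pd.read_sql() are not demonstrated below; here is a minimal round-trip sketch using an in-memory SQLite database from the standard library:
import sqlite3
con = sqlite3.connect(':memory:')  # throwaway in-memory database
df_sql = pd.DataFrame({'Name': ['Alex', 'Beth'], 'Salary': [55000, 60000]})
# Write the DataFrame to a SQL table, then read it back with a query
df_sql.to_sql('employees', con, index=False, if_exists='replace')
print(pd.read_sql('SELECT * FROM employees WHERE Salary > 55000', con))
con.close()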
Writing to Different Formats
Demo
# Create a small DataFrame to export
data = {
    'Name': ['Alex', 'Beth', 'Charlie', 'Diana'],
    'Department': ['Sales', 'HR', 'Tech', 'Finance'],
    'Salary': [55000, 60000, 75000, 70000],
    'Hire_Date': ['2020-01-15', '2019-07-10', '2021-03-22', '2018-11-05']
}
df = pd.DataFrame(data)
print("DataFrame to export:")
print(df)
# Export to CSV
df.to_csv('employees.csv', index=False)
print("\nExported to CSV (employees.csv)")
# Export to Excel
df.to_excel('employees.xlsx', sheet_name='Employees', index=False)
print("Exported to Excel (employees.xlsx)")
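To write several sheets into one workbook, use pd.ExcelWriter (this needs an engine such as openpyxl; the sheet split below is only for illustration):
with pd.ExcelWriter('employees_multi.xlsx') as writer:
    df.to_excel(writer, sheet_name='All', index=False)
    df[df['Salary'] > 60000].to_excel(writer, sheet_name='HighEarners', index=False)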
More info
Additional notes: Export to JSON
# Export to JSON (with different orientations)
df.to_json('employees_records.json', orient='records')
print("Exported to JSON (employees_records.json) in records format")
df.to_json('employees_split.json', orient='split')
print("Exported to JSON (employees_split.json) in split format")
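To see how the two orientations differ, serialize to strings instead of files (output in the comments is abbreviated):
print(df.head(2).to_json(orient='records'))
# [{"Name":"Alex",...},{"Name":"Beth",...}]      -- a list of row objects
print(df.head(2).to_json(orient='split'))
# {"columns":[...],"index":[...],"data":[...]}   -- columns, index, and data kept separate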
Note
Writing data is even simpler than reading it, since Pandas already knows the structure.
Setting index=False is common when exporting, to avoid writing the index as an extra column.
Different export formats are useful in different contexts:
CSV is universal but loses data types
Excel preserves formatting but requires a spreadsheet application to open
JSON is great for metadata
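The CSV caveat is easy to demonstrate with a round trip, reusing the df exported above:
df['Hire_Date'] = pd.to_datetime(df['Hire_Date'])
print(df.dtypes)  # Hire_Date is datetime64[ns]
df.to_csv('employees.csv', index=False)
df_back = pd.read_csv('employees.csv')
print(df_back.dtypes)  # Hire_Date comes back as object (plain strings)
# Recover the type explicitly at import time
df_back = pd.read_csv('employees.csv', parse_dates=['Hire_Date'])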
Data Inspection Methods
Key Inspection Methods
Method | Purpose | What It Shows
---|---|---
df.head(n) / df.tail(n) | View first/last n rows | A preview of the data
df.info() | Summary of DataFrame | Data types, non-null counts, memory usage
df.describe() | Statistical summary | Count, mean, std, min, 25%, 50%, 75%, max
df.dtypes | Column data types | Data type of each column
df.shape | Dimensions | (rows, columns)
df.columns | Column names | Index of column names
df['col'].value_counts() | Count unique values | Frequency of each unique value
Demo
Inspection Methods
# Let's inspect a more interesting dataset
import numpy as np
# Create a sample dataset with different data types
np.random.seed(42) # For reproducibility
data = {
    'Category': np.random.choice(['A', 'B', 'C', 'D'], size=1000),
    'Value': np.random.normal(100, 20, size=1000),
    'Count': np.random.randint(1, 100, size=1000),
    'Date': pd.date_range(start='2023-01-01', periods=1000),
    'Flag': np.random.choice([True, False], size=1000)
}
df = pd.DataFrame(data)
# Add some missing values
df.loc[10:20, 'Value'] = np.nan
df.loc[500:510, 'Category'] = None
# Now let's inspect
print("First 5 rows with head():")
print(df.head())
print("\nLast 3 rows with tail(3):")
print(df.tail(3))
print("\nDataFrame information with info():")
df.info()
print("\nStatistical summary with describe():")
print(df.describe())
print("\nInclude all columns in describe (even non-numeric):")
print(df.describe(include='all'))
print("\nData types of each column:")
print(df.dtypes)
print("\nShape (rows, columns):", df.shape)
print("\nValue counts for 'Category':")
print(df['Category'].value_counts())
Output
First 5 rows with head():
Category Value Count Date Flag
0 C 106.835120 72 2023-01-01 False
1 D 137.523417 49 2023-01-02 False
2 A 119.008477 79 2023-01-03 True
3 C 88.461927 17 2023-01-04 True
4 C 82.031707 60 2023-01-05 False
Last 3 rows with tail(3):
Category Value Count Date Flag
997 D 87.885702 77 2025-09-24 True
998 D 114.841907 59 2025-09-25 False
999 C 105.985852 47 2025-09-26 False
DataFrame information with info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Category 989 non-null object
1 Value 989 non-null float64
2 Count 1000 non-null int64
3 Date 1000 non-null datetime64[ns]
4 Flag 1000 non-null bool
dtypes: bool(1), datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 32.4+ KB
Statistical summary with describe():
Value Count Date
count 989.000000 1000.000000 1000
mean 100.858217 49.007000 2024-05-14 12:00:00
min 42.074892 1.000000 2023-01-01 00:00:00
25% 87.503628 24.000000 2023-09-07 18:00:00
50% 100.811834 48.000000 2024-05-14 12:00:00
75% 113.637830 73.000000 2025-01-19 06:00:00
max 161.577616 99.000000 2025-09-26 00:00:00
std 20.024572 28.269703 NaN
Include all columns in describe (even non-numeric):
Category Value Count Date Flag
count 989 989.000000 1000.000000 1000 1000
unique 4 NaN NaN NaN 2
top D NaN NaN NaN True
freq 277 NaN NaN NaN 512
mean NaN 100.858217 49.007000 2024-05-14 12:00:00 NaN
min NaN 42.074892 1.000000 2023-01-01 00:00:00 NaN
25% NaN 87.503628 24.000000 2023-09-07 18:00:00 NaN
50% NaN 100.811834 48.000000 2024-05-14 12:00:00 NaN
75% NaN 113.637830 73.000000 2025-01-19 06:00:00 NaN
max NaN 161.577616 99.000000 2025-09-26 00:00:00 NaN
std NaN 20.024572 28.269703 NaN NaN
Data types of each column:
Category object
Value float64
Count int64
Date datetime64[ns]
Flag bool
dtype: object
Shape (rows, columns): (1000, 5)
Value counts for 'Category':
Category
D 277
A 254
C 229
B 229
Name: count, dtype: int64
Note
These inspection methods are critical to understanding your data before analysis.
Always start data exploration with these methods - they quickly reveal dataset characteristics and potential issues.
info() shows missing values and data types - crucial for cleaning strategies.
describe() gives statistical summaries - great for spotting outliers or distribution patterns.
Real data is messy - these tools help you understand what needs to be cleaned.
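As a quick complement to info(), you can count missing values per column directly (reusing the df from the demo above):
print(df.isnull().sum())  # missing values per column
# value_counts(dropna=False) also shows how many NaN a column holds
print(df['Category'].value_counts(dropna=False))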
More info
Additional notes: Exercises
Exercise 1: Writing Data:
Use the grades_df dataframe and export it to:
CSV without the index, but including headers
JSON in 'records' format
Verify by reading the files back
# Create a dataframe of student grades
grades_data = {
    'Student': ['Alex', 'Barbara', 'Carlos', 'Diana', 'Evan'],
    'Math': [85, 92, 78, 96, 88],
    'Science': [92, 85, 91, 88, 95],
    'History': [79, 89, 85, 82, 91]
}
grades_df = pd.DataFrame(grades_data)
Exercise 1: solution:
# Task 1: Export to CSV without the index, but including headers
grades_df.to_csv('student_grades.csv', index=False)
# Task 2: Export to JSON in 'records' format
grades_df.to_json('student_grades.json', orient='records')
# Verify by reading the files back
print("CSV file contents:")
print(pd.read_csv('student_grades.csv').head())
print("\nJSON file contents:")
print(pd.read_json('student_grades.json').head())
Exercise 2: Data Inspection:
Perform a comprehensive inspection using the products_df dataframe and answer these questions:
How many rows and columns are in this dataset?
What data type is the ‘Rating’ column?
How many null values are in the dataset?
What’s the average price?
# Create a dataframe with some intentional quirks
data = {
    'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'],
    'Price': [1200, 800, 350, 250, 75],
    'Stock': [10, 25, 30, 5, 50],
    'Rating': [4.5, 4.8, 4.2, 4.7, None],  # Note the None value
    'OnSale': [True, False, True, False, False]
}
products_df = pd.DataFrame(data)
Exercise 2: solution:
# Inspect the data
print("Shape:", products_df.shape)
products_df.info()
print("\nStatistical summary:")
print(products_df.describe())
print("\nIncluding all columns:")
print(products_df.describe(include='all'))
# Answers:
print("\nAnswers:")
print(f"1. {products_df.shape[0]} rows and {products_df.shape[1]} columns")
print(f"2. The 'Rating' column has dtype: {products_df['Rating'].dtype}")
print(f"3. Total null values: {products_df.isnull().sum().sum()}")
print(f"4. Average price: ${products_df['Price'].mean():.2f}")
Key Takeaways
Keypoints
Data Import Flexibility: Pandas can read data from virtually any structured data format
Customized Reading: Parameters let you control exactly how data is imported
Easy Export: Writing data is straightforward with similar syntax across formats
Always Inspect First: head(), info(), and describe() should be your first steps with any new dataset
Real-World Ready: These tools prepare you for working with messy, real-world data