Data Import and Export in Pandas

Objectives

  • Illustrate how to read data from various formats, including CSV, Excel, and JSON, into Pandas.

  • Show how to export DataFrames to different formats, ensuring participants can save their analyses.

  • Utilize data inspection methods (head, info, describe) to gain an understanding of data structure and content.

Time

15 Minutes

Introduction to Data Import and Export

  • Real-world data analysis starts with getting data into Pandas

  • Pandas supports many file formats natively

  • The process: Import → Inspect → Manipulate → Export

Discussion

  • “Now that we understand Pandas data structures, let’s learn how to fill them with real data.”

  • “This is where Pandas really starts to shine compared to NumPy - it handles diverse data sources seamlessly.”

  • “The workflow we’re learning today is the foundation of data analysis in Python.”

Reading Data from Various Formats

Reading Data - Main Methods

Function           Purpose                Common Parameters
pd.read_csv()      Read CSV files         filepath_or_buffer, sep, header, index_col
pd.read_excel()    Read Excel files       io, sheet_name, header, index_col
pd.read_json()     Read JSON files        path_or_buf, orient, typ
pd.read_sql()      Read SQL query/table   sql, con, index_col
pd.read_html()     Read HTML tables       io, match, flavor
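
read_sql() and read_html() are not demonstrated in this section, so here is a minimal read_sql() sketch. It uses Python's built-in sqlite3 module and builds a throwaway in-memory table so it runs without an external database (the employees table and its columns are made up for illustration):

import sqlite3
import pandas as pd

# Build a throwaway in-memory database so the sketch is self-contained
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)", [(1, 'Alex'), (2, 'Beth')])

# read_sql takes a SQL query plus a connection; index_col sets the index
df_sql = pd.read_sql('SELECT * FROM employees', conn, index_col='id')
print(df_sql)
conn.close()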

Reading CSV Data

Demo

import pandas as pd

# Basic CSV reading
df_csv = pd.read_csv('test_data/sample_data.csv')
print("Basic CSV import:")
print(df_csv.head())  # Show first 5 rows (head's default)

# Customizing CSV import
df_csv_custom = pd.read_csv('test_data/sample_data.csv',
                           sep=',',            # Delimiter (comma is default)
                           header=0,           # Row to use as column names (0 is default)
                           index_col='ID',     # Column to set as index
                           usecols=['ID', 'Name', 'Value'],  # Only import specific columns
                           nrows=5)            # Only read first 5 rows
print("\nCustomized CSV import:")
print(df_csv_custom)

Output

Basic CSV import:
   ID      Name  Age     City  Value Category        Date
0   1  Person_1   22  Chicago  34.61        B  2023-01-01
1   2  Person_2   52  Phoenix  47.46        A  2023-01-02
2   3  Person_3   38  Phoenix  48.85        D  2023-01-03
3   4  Person_4   27  Houston  75.10        A  2023-01-04
4   5  Person_5   34  Houston  47.88        B  2023-01-05

Customized CSV import:
        Name  Value
ID
1   Person_1  34.61
2   Person_2  47.46
3   Person_3  48.85
4   Person_4  75.10
5   Person_5  47.88

Reading Excel Data

Demo

# Basic Excel reading
df_excel = pd.read_excel('test_data/sample_data.xlsx')
print("Basic Excel import:")
print(df_excel.head())

# Reading specific sheet
df_excel_sheet2 = pd.read_excel('test_data/sample_data.xlsx', 
                               sheet_name='Sheet 1 - sample_data',
                               skiprows=1)     # Skip first row
print("\nReading specific Excel sheet:")
print(df_excel_sheet2.head())

# List all sheets in workbook
xl = pd.ExcelFile('test_data/sample_data.xlsx')
print(f"\nAll sheets in workbook: {xl.sheet_names}")

Output

Basic Excel import:
  sample_data Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5  Unnamed: 6
0          ID       Name        Age       City      Value   Category        Date
1           1   Person_1         22    Chicago      34.61          B  2023-01-01
2           2   Person_2         52    Phoenix      47.46          A  2023-01-02
3           3   Person_3         38    Phoenix      48.85          D  2023-01-03
4           4   Person_4         27    Houston       75.1          A  2023-01-04

Reading specific Excel sheet:
   ID      Name  Age     City  Value Category        Date
0   1  Person_1   22  Chicago  34.61        B  2023-01-01
1   2  Person_2   52  Phoenix  47.46        A  2023-01-02
2   3  Person_3   38  Phoenix  48.85        D  2023-01-03
3   4  Person_4   27  Houston  75.10        A  2023-01-04
4   5  Person_5   34  Houston  47.88        B  2023-01-05

All sheets in workbook: ['Sheet 1 - sample_data']
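
If a workbook contains several sheets, passing sheet_name=None loads all of them at once and returns a dict of DataFrames keyed by sheet name. A short sketch using the same sample file:

# sheet_name=None returns {sheet_name: DataFrame} for every sheet
all_sheets = pd.read_excel('test_data/sample_data.xlsx', sheet_name=None)
for name, sheet_df in all_sheets.items():
    print(f"{name}: {sheet_df.shape[0]} rows x {sheet_df.shape[1]} columns")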

Reading JSON Data

Demo

# Reading JSON
df_json = pd.read_json('test_data/sample_data.json')
print("JSON import:")
print(df_json.head())

Output

JSON import:
      Name Department  Salary   Hire_Date
0     Alex      Sales   55000  2020-01-15
1     Beth         HR   60000  2019-07-10
2  Charlie       Tech   75000  2021-03-22
3    Diana    Finance   70000  2018-11-05

Discussion

  • “Notice that all these different file formats are loaded into the same DataFrame structure.”

  • “Each import function has many parameters for fine-tuning the import process.”

  • “The defaults work well in many cases, but customization gives you control over exactly what data gets loaded and how.”

  • “For real-world data, you’ll often need to use these parameters to handle messy or improperly formatted files.”
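
As a concrete illustration of that last point, here is a sketch combining several cleanup parameters on the same sample file (the extra NA strings are hypothetical - adapt them to whatever your file actually contains):

# Sketch: defensive parameters for a messy CSV
df_messy = pd.read_csv('test_data/sample_data.csv',
                       na_values=['N/A', 'missing', '?'],  # Extra strings to treat as NaN
                       parse_dates=['Date'],               # Parse this column as datetime
                       dtype={'Category': 'category'},     # Force a memory-efficient dtype
                       skip_blank_lines=True)              # Ignore empty rows
print(df_messy.dtypes)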

Writing Data to Files

Writing Data - Main Methods

Function     Purpose             Common Parameters
to_csv()     Write to CSV        path_or_buf, index, header, sep
to_excel()   Write to Excel      excel_writer, sheet_name, index, header
to_json()    Write to JSON       path_or_buf, orient, lines
to_sql()     Write to SQL table  name, con, if_exists, index
to_html()    Write to HTML       buf, columns, header, index
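
to_sql() follows the same pattern as the other writers; here is a minimal sketch with an in-memory SQLite database (the employees table name and the tiny demo_df are made up for illustration):

import sqlite3
import pandas as pd

demo_df = pd.DataFrame({'name': ['Alex', 'Beth'], 'salary': [55000, 60000]})
conn = sqlite3.connect(':memory:')
demo_df.to_sql('employees', conn, if_exists='replace', index=False)  # Create or overwrite the table
print(pd.read_sql('SELECT * FROM employees', conn))                  # Round-trip check
conn.close()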

Writing to Different Formats

Demo

# Create a small DataFrame to export
data = {
    'Name': ['Alex', 'Beth', 'Charlie', 'Diana'],
    'Department': ['Sales', 'HR', 'Tech', 'Finance'],
    'Salary': [55000, 60000, 75000, 70000],
    'Hire_Date': ['2020-01-15', '2019-07-10', '2021-03-22', '2018-11-05']
}
df = pd.DataFrame(data)
print("DataFrame to export:")
print(df)

# Export to CSV
df.to_csv('employees.csv', index=False)
print("\nExported to CSV (employees.csv)")

# Export to Excel
df.to_excel('employees.xlsx', sheet_name='Employees', index=False)
print("Exported to Excel (employees.xlsx)")

# Export to JSON (with different orientations)
df.to_json('employees_records.json', orient='records')
print("Exported to JSON (employees_records.json) in records format")

df.to_json('employees_split.json', orient='split')
print("Exported to JSON (employees_split.json) in split format")

Discussion

  • Writing data is even simpler than reading it, since Pandas already knows the structure.

  • Setting index=False is common when exporting, to avoid having an extra column.

  • Different export formats are useful in different contexts:

    • CSV is universal but loses data types

    • Excel supports multiple sheets and formatting, but needs a spreadsheet application to open

    • JSON is great for web applications and APIs

Exercise

Writing Data:

Use the grades_df DataFrame below and:

  1. Export to CSV without the index, but including headers

  2. Export to JSON in records format

  3. Verify by reading both files back

# Create a dataframe of student grades
grades_data = {
    'Student': ['Alex', 'Barbara', 'Carlos', 'Diana', 'Evan'],
    'Math': [85, 92, 78, 96, 88],
    'Science': [92, 85, 91, 88, 95],
    'History': [79, 89, 85, 82, 91]
}
grades_df = pd.DataFrame(grades_data)
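
One possible solution sketch (the file names are arbitrary):

# 1. CSV without the index, keeping headers (header=True is the default)
grades_df.to_csv('grades.csv', index=False)

# 2. JSON in records format
grades_df.to_json('grades.json', orient='records')

# 3. Read both files back to verify the round trip
print(pd.read_csv('grades.csv'))
print(pd.read_json('grades.json'))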

Data Inspection Methods

Key Inspection Methods

Method           Purpose                  What It Shows
head() / tail()  View first/last n rows   A preview of the data
info()           Summary of DataFrame     Data types, non-null counts, memory usage
describe()       Statistical summary      Count, mean, std, min, 25%, 50%, 75%, max
dtypes           Column data types        Data type of each column
shape            Dimensions               (rows, columns)
columns          Column names             Index of column names
value_counts()   Count unique values      Frequency of each unique value

Demo

Inspection Methods

# Let's inspect a more interesting dataset
import numpy as np

# Create a sample dataset with different data types
np.random.seed(42)  # For reproducibility
data = {
    'Category': np.random.choice(['A', 'B', 'C', 'D'], size=1000),
    'Value': np.random.normal(100, 20, size=1000),
    'Count': np.random.randint(1, 100, size=1000),
    'Date': pd.date_range(start='2023-01-01', periods=1000),
    'Flag': np.random.choice([True, False], size=1000)
}
df = pd.DataFrame(data)

# Add some missing values
df.loc[10:20, 'Value'] = np.nan
df.loc[500:510, 'Category'] = None

# Now let's inspect
print("First 5 rows with head():")
print(df.head())

print("\nLast 3 rows with tail(3):")
print(df.tail(3))

print("\nDataFrame information with info():")
df.info()

print("\nStatistical summary with describe():")
print(df.describe())

print("\nInclude all columns in describe (even non-numeric):")
print(df.describe(include='all'))

print("\nData types of each column:")
print(df.dtypes)

print("\nShape (rows, columns):", df.shape)

print("\nValue counts for 'Category':")
print(df['Category'].value_counts())

Output

First 5 rows with head():
  Category       Value  Count       Date   Flag
0        C  106.835120     72 2023-01-01  False
1        D  137.523417     49 2023-01-02  False
2        A  119.008477     79 2023-01-03   True
3        C   88.461927     17 2023-01-04   True
4        C   82.031707     60 2023-01-05  False

Last 3 rows with tail(3):
    Category       Value  Count       Date   Flag
997        D   87.885702     77 2025-09-24   True
998        D  114.841907     59 2025-09-25  False
999        C  105.985852     47 2025-09-26  False

DataFrame information with info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Category  989 non-null    object
 1   Value     989 non-null    float64
 2   Count     1000 non-null   int64
 3   Date      1000 non-null   datetime64[ns]
 4   Flag      1000 non-null   bool
dtypes: bool(1), datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 32.4+ KB

Statistical summary with describe():
            Value        Count                 Date
count  989.000000  1000.000000                 1000
mean   100.858217    49.007000  2024-05-14 12:00:00
min     42.074892     1.000000  2023-01-01 00:00:00
25%     87.503628    24.000000  2023-09-07 18:00:00
50%    100.811834    48.000000  2024-05-14 12:00:00
75%    113.637830    73.000000  2025-01-19 06:00:00
max    161.577616    99.000000  2025-09-26 00:00:00
std     20.024572    28.269703                  NaN

Include all columns in describe (even non-numeric):
       Category       Value        Count                 Date  Flag
count       989  989.000000  1000.000000                 1000  1000
unique        4         NaN          NaN                  NaN     2
top           D         NaN          NaN                  NaN  True
freq        277         NaN          NaN                  NaN   512
mean        NaN  100.858217    49.007000  2024-05-14 12:00:00   NaN
min         NaN   42.074892     1.000000  2023-01-01 00:00:00   NaN
25%         NaN   87.503628    24.000000  2023-09-07 18:00:00   NaN
50%         NaN  100.811834    48.000000  2024-05-14 12:00:00   NaN
75%         NaN  113.637830    73.000000  2025-01-19 06:00:00   NaN
max         NaN  161.577616    99.000000  2025-09-26 00:00:00   NaN
std         NaN   20.024572    28.269703                  NaN   NaN

Data types of each column:
Category            object
Value              float64
Count                int64
Date        datetime64[ns]
Flag                  bool
dtype: object

Shape (rows, columns): (1000, 5)

Value counts for 'Category':
Category
D    277
A    254
C    229
B    229
Name: count, dtype: int64

Discussion

  • “These inspection methods are critical to understanding your data before analysis.”

  • “Always start data exploration with these methods - they quickly reveal dataset characteristics and potential issues.”

  • “info() shows missing values and data types - crucial for cleaning strategies; see the isnull() sketch after this list for exact counts.”

  • “describe() gives statistical summaries - great for spotting outliers or distribution patterns.”

  • “Real data is messy - these tools help you understand what needs to be cleaned.”
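
A short sketch of that missing-value check, using the df built in the demo above:

# Exact missing-value counts per column (complements the info() summary)
print(df.isnull().sum())        # NaN/None count for each column
print(df.isnull().sum().sum())  # Total missing values in the whole DataFrame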

Exercise

Data Inspection: Perform a comprehensive inspection of products_df and answer these questions:

  1. How many rows and columns are in this dataset?

  2. What data type is the ‘Rating’ column?

  3. How many null values are in the dataset?

  4. What’s the average price?

# Create a dataframe with some intentional quirks
data = {
    'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'],
    'Price': [1200, 800, 350, 250, 75],
    'Stock': [10, 25, 30, 5, 50],
    'Rating': [4.5, 4.8, 4.2, 4.7, None],  # Note the None value
    'OnSale': [True, False, True, False, False]
}
products_df = pd.DataFrame(data)
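
One possible solution sketch:

# One way to answer the four questions
print(products_df.shape)                  # 1. (5, 5) - five rows, five columns
print(products_df['Rating'].dtype)        # 2. float64 (None is stored as NaN)
print(products_df.isnull().sum().sum())   # 3. 1 null value in total
print(products_df['Price'].mean())        # 4. 535.0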

Key Takeaways

Keypoints

  1. Data Import Flexibility: Pandas can read data from virtually any structured data format

  2. Customized Reading: Parameters let you control exactly how data is imported

  3. Easy Export: Writing data is straightforward with similar syntax across formats

  4. Always Inspect First: head(), info(), and describe() should be your first steps with any new dataset

  5. Real-World Ready: These tools prepare you for working with messy, real-world data