I/O Best Practices

Objectives

  • Reading files is a common bottleneck

  • One large file is usually faster than many small files

  • Local hard disks and ramdisks can be much faster

Prerequisites

  • No prerequisites for following the demo

You can find the demo code and setup instructions at https://github.com/coderefinery/ttt4hpc-io-examples.

If you wish to run the code, you will need Python and the packages listed in requirements.txt. You can install them with pip install -r requirements.txt.

In this lesson we discuss common I/O problems, how to diagnose them and how to avoid them. I/O issues are usually about how the workflow is structured and can be fixed or alleviated by restructuring it, moving reads and writes to more convenient times, or using the right storage system.

This lesson is mostly a demonstration, with some general discussion. We recommend you follow the demonstration and keep these points in mind:

  • The file system can be the slowest part of many jobs

  • Networked file systems tend to be good at large files and bad at many small ones

Quick primer: How files are accessed

Figure 1: File structure

Files on file systems consist of:

  • metadata (who owns the file, when was the file last modified, how big is the file)

  • the file contents (the actual data)

Programs check the metadata with stat-calls. File contents are accessed by creating a file descriptor with an open-call; the contents are then either read using a read-call or written using a write-call.

For example, ls -l will check the metadata of a file while cat will open the file contents.
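To make the distinction concrete, here is a small Python sketch (the file name is just an example) that first inspects only the metadata and then opens the file to read its contents:

import os

# Metadata only: a stat call, no file contents are touched
info = os.stat("data/example.csv")
print(info.st_size, info.st_mtime)

# Contents: an open call creates a file descriptor,
# read calls then fetch the actual data
with open("data/example.csv") as f:
    contents = f.read()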

Figure 2: Different programs might access files in different ways

Data Format Demos

We show the effect of reading many small files vs one large file. In a shared file system, file system operations are slower than one would expect from experience with a local disk. Files are stored on many disks and reading one usually requires some communication between nodes. But it’s actually a bit worse than that. The file system needs to find metadata for the file first, to determine where to actually find the data.

Note

The demo examples are at https://github.com/coderefinery/ttt4hpc-io-examples. Expected results are included in collapsed sections titled “expected result”.

  1. Setup

# Clone the repository and setup
git clone https://github.com/coderefinery/ttt4hpc-io-examples
cd ttt4hpc-io-examples
pip install --extra-index-url https://download.pytorch.org/whl/cpu -r requirements.txt

# Generate the example data and pack it into a tar archive
cd data_formats_1
python generate_data.py
python create_archive.py

First, let's check the files we created. They are in the data folder. Each CSV file contains an activity measurement for each hour of the day. There is data for 20 years, so 175200 rows altogether in 7300 files.

  2. Naive example reading all the files

Here we read all of these files into memory and create a pandas dataframe. The approach in read_files_naive.py is the simplest way we would first write this. The version in read_files.py only reads the files in the loop, which gives a fairer comparison.
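The scripts in the repository contain the actual code; roughly, the naive loop might look like this (the file pattern is illustrative):

import glob

import pandas as pd

# Every iteration opens, reads and closes one small file, so the
# file system does thousands of metadata lookups and open/read calls
frames = []
for filename in sorted(glob.glob("data/*.csv")):
    frames.append(pd.read_csv(filename))
data = pd.concat(frames, ignore_index=True)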

strace -c -e trace=%desc python read_files.py

strace shows the number of system calls the program makes. With -e trace=%desc we count only the calls that operate on file descriptors, i.e. the file system operations.

  3. Example reading a single archive sequentially

This example reads the same data from the tar archive. An uncompressed tar file is essentially just a concatenation of the contents of the files.

We use the streaming mode for reading the archive. This means the files have to be read in order. Otherwise we would still generate a large number of file system calls.
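A rough sketch of streaming reads with Python's tarfile module (the archive name is assumed; the actual demo script is in the repository):

import tarfile

import pandas as pd

frames = []
# "r|" opens the archive in streaming mode: members are read
# strictly in order, with large sequential reads from one file
with tarfile.open("data.tar", "r|") as archive:
    for member in archive:
        f = archive.extractfile(member)
        if f is not None:
            frames.append(pd.read_csv(f))
data = pd.concat(frames, ignore_index=True)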

strace -c -e trace=%desc python read_archive.py

  4. Random access

Say we need to read the files in randomized order. This is common in training machine learning models. In this case reading from the archive is not that helpful, since we cannot stream the contents.

Note

Tar is actually a bad format for this. A tar file is always read sequentially. But independent of the file format, reading files in random order is slow on a network file system.

Still, this is better than reading many small files.

strace -c -e trace=%desc python read_archive_random.py

This is not great. How would you avoid reading the files out of order?

In this case, the whole dataset fits in memory. Even if it didn't, it's usually good enough to read the archive in chunks and shuffle the chunks in memory.
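A rough sketch of that chunked approach (the chunk size is arbitrary and process() is a placeholder for whatever the analysis does):

import random
import tarfile

import pandas as pd

def process(frame):
    ...   # placeholder for the actual analysis

chunk = []
with tarfile.open("data.tar", "r|") as archive:
    for member in archive:
        f = archive.extractfile(member)
        if f is None:
            continue
        chunk.append(pd.read_csv(f))
        if len(chunk) == 500:       # keep 500 files in memory at a time
            random.shuffle(chunk)   # shuffle within the chunk
            for frame in chunk:
                process(frame)
            chunk = []

# handle the last, possibly smaller chunk
random.shuffle(chunk)
for frame in chunk:
    process(frame)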

strace -c -e trace=%desc python read_random_chunked.py

Note

The strace output is not very readable, and there are not many tools for parsing it into something more human-readable.

I/O Workflows

Shared and Network File Systems

  • How does a network file system work? What is Lustre? What happens when I ask for the contents of a file?

File System is Slow

  • Even a normal file system is generally much slower than RAM, CPUs or GPUs. Computations have to wait many cycles for each I/O operation.

  • Network file systems and shared file systems have even more latency. Performance also depends on what other users are doing.

  • Bad I/O hampers interactive use. Waiting for a file to load can be frustrating.

Common Issues

  • Order of operations: Reading a file many times because the function that reads it is called in a loop (see the sketch after this list).

    This is often hidden by a function call, maybe even to a library. Fixing it can be about understanding what libraries do and using them correctly.

  • Accumulation: A bad I/O pattern might not seem bad when a simulation is run on a single computer or a deep learning model is trained for one epoch (a single pass through all the data). But at a larger scale or in a longer run, inefficiencies and bad access patterns accumulate.

    Essentially, 10% of a big number is still pretty big. Since file systems are a shared resource and usually not reserved for a job, it’s possible to congest the whole system.

  • Carrying everything with you: All of the data is loaded when only part of it is needed.

    Everything is kept in RAM and takes up space. The job might not need all the resources it seems to need.

  • Wrong Format: Data format is chosen when the amount of data is small, or for inspection and plotting. The format is not optimal for the actual use case.

    A profiler can detect I/O patterns, which can be useful for identifying bottlenecks. However, this is mostly a workflow issue: thinking through the workflow steps and testing them in isolation is often the best approach.

    Human-readable data formats (CSV, .txt, .json) are good when a human reads the file contents with an editor. If the files are processed by code, using binary formats can improve efficiency.
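As an example of the first issue, here is a sketch of a repeated read hidden in a loop, next to a version that reads the file once (the file name and column are made up):

import pandas as pd

# Bad: the file is read from disk again on every call
def activity_for_day_rereads(day):
    data = pd.read_csv("data/activity.csv")
    return data[data["day"] == day]

results = [activity_for_day_rereads(day) for day in range(365)]

# Better: read once, then pass the dataframe around (or cache it)
def activity_for_day(data, day):
    return data[data["day"] == day]

data = pd.read_csv("data/activity.csv")
results = [activity_for_day(data, day) for day in range(365)]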

Local Disks and RAM Disks

Local Disks

  • Some systems have local disks on nodes. These are connected directly to the node and are much faster than network file systems.

  • Check your system documentation for the local disk path.

  • Local disks are usually not persistent. You need to copy data to the local disk at the beginning of a job and copy results back at the end.

    unzip -d /tmp/data data.zip
    
    python train_model.py --data /tmp/data
    
    cp -r /tmp/results results
    
  • Try creating and reading a large file locally and on Lustre

    time dd if=/dev/zero of=largefile bs=1024M count=50
    
  • Try reading the large file

    time md5sum largefile
    

Ramdisk

  • /dev/shm/ in Linux

  • A file system kept directly in random access memory. This is very fast, but limited by the available memory.

  • Reserve enough memory when queueing the job; files stored in the ramdisk take up memory just like the data your program holds.
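A minimal sketch of using the ramdisk from Python (the file names are made up): copy the input there at the start of the job, read it during the run, and clean up at the end.

import os
import shutil

# Copy the input data into the RAM-backed file system
shutil.copy("data.tar", "/dev/shm/data.tar")

# ... the analysis reads from /dev/shm/data.tar here ...

# Files under /dev/shm consume memory, so remove them when done
os.remove("/dev/shm/data.tar")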

Machine Learning and Large Data

Training large machine learning models requires a lot of data. Storing and accessing the data can easily become a bottleneck. It’s easy to starve the GPUs for data just because accessing the input files on disk is too slow.

This is usually further complicated by the fact that in each training epoch all of the data needs to be read in a random order. To work around this problem, different frameworks have created their own data formats, but they work in similar ways.

Typically large datasets are split into shards, where each shard contains some random piece of the full dataset. Shards can be up to multiple gigabytes in size.

When data is read during training, multiple threads are usually used to read the shards. Each thread reads a random shard sequentially and shuffles its data in memory. The data is then collected by a master thread, which assembles batches from the inputs of all the data loaders.

Figure 4: An example of a sharded data loading pipeline

Webdataset does this for PyTorch. It uses the POSIX tar format, making it easy to handle on most HPC systems.

Demo in the webdataset folder.

  1. Creating a dataset

python create_dataset.py

  2. Reading a sharded dataset

python imagenet.py

Note that the data does not need to be downloaded and stored locally for webdataset. The library can also handle http addresses directly, and has a protocol for general UNIX pipes.

wds.WebDataset("filename.tar")

is equivalent to

wds.WebDataset("pipe:cat filename.tar")

This makes webdataset very general and flexible. Unfortunately, though, the data needs to be stored in a tar file.
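A minimal sketch of a loading pipeline with webdataset, assuming shards named imagenet-000000.tar and so on, with samples stored under jpg and cls keys (the demo scripts contain the full version):

import webdataset as wds

# Brace notation selects the shards; shuffle(1000) keeps an in-memory
# buffer of 1000 samples and draws from it in random order
dataset = (
    wds.WebDataset("shards/imagenet-{000000..000009}.tar")
    .shuffle(1000)
    .decode("torchrgb")
    .to_tuple("jpg", "cls")
)

for image, label in dataset:
    ...   # feed the samples to the training loop

In practice the dataset is wrapped in a PyTorch DataLoader with several workers, so that each worker streams its own subset of shards, as in Figure 4.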

Summary

  • Reading many small files is slow, especially on a networked file system; packing the data into fewer, larger files helps.

  • I/O problems are mostly workflow problems: restructure so that data is read sequentially, only once, and only when needed.

  • Local disks and ramdisks can be much faster than the shared file system; copy data there at the start of a job if you need fast or random access.

  • For machine learning, sharded formats such as webdataset tar archives let you stream data sequentially while still shuffling it.
