Data Loading for Machine Learning

When we load data to train a neural network, we usually split it into training data and testing data.

The training data is then further split into batches. Batching reduces memory usage, because the model only processes one batch at a time rather than the whole dataset, and it speeds up training.
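To make the batching idea concrete, here is a minimal sketch in plain Python that computes which index ranges each batch would cover. The dataset size (100) and batch size (32) are assumed values chosen for illustration, and `batch_slices` is a hypothetical helper, not part of PyTorch.

```python
import math


def batch_slices(num_samples, batch_size):
    """Return (start, end) index pairs that cover num_samples in order."""
    return [
        (start, min(start + batch_size, num_samples))
        for start in range(0, num_samples, batch_size)
    ]


num_samples = 100  # assumed dataset size
batch_size = 32    # assumed batch size

slices = batch_slices(num_samples, batch_size)
num_batches = math.ceil(num_samples / batch_size)

# The last batch is smaller whenever batch_size does not divide num_samples.
print(num_batches, slices[-1])
```

Only one of these index ranges needs to be materialized at a time, which is where the memory saving comes from.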

Let’s see how to load data into a neural network model using PyTorch.

Dataset

PyTorch provides a Dataset class that represents a collection of samples. We can create a custom dataset by inheriting from the Dataset class and implementing the __len__ and __getitem__ methods.

Here is an example of a custom dataset class that loads data from a CSV file.

import pandas as pd
import torch

class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, file_path):
        self.data = pd.read_csv(file_path)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        x = torch.tensor(self.data.iloc[idx, :-1].values, dtype=torch.float32)
        y = torch.tensor(self.data.iloc[idx, -1], dtype=torch.float32)
        return x, y

This dataset class reads data from a CSV file and, for each row, returns the input values (every column except the last) and the output value (the last column) as tensors.
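As a quick check of how such a dataset behaves, the sketch below writes a tiny CSV file to a temporary directory and indexes into it. The class name `CsvDataset`, the column names, and the values are all assumptions made for this example. It follows the same pattern as the class above, with the last column treated as the target.

```python
import os
import tempfile

import pandas as pd
import torch
from torch.utils.data import Dataset


class CsvDataset(Dataset):
    """Illustrative CSV dataset: the last column is the target."""

    def __init__(self, file_path):
        self.data = pd.read_csv(file_path)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        x = torch.tensor(row.iloc[:-1].to_numpy(dtype="float32"))
        y = torch.tensor(row.iloc[-1], dtype=torch.float32)
        return x, y


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")  # hypothetical file name
    pd.DataFrame(
        {"f1": [1.0, 2.0, 3.0], "f2": [4.0, 5.0, 6.0], "target": [0.0, 1.0, 0.0]}
    ).to_csv(path, index=False)

    dataset = CsvDataset(path)
    num_rows = len(dataset)  # calls __len__
    x0, y0 = dataset[0]      # calls __getitem__(0)

print(num_rows, x0, y0)
```

Indexing with `dataset[0]` is all a DataLoader needs, which is why only `__len__` and `__getitem__` have to be implemented.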

Image Loading example

import os

import torch
from PIL import Image
from torchvision import transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, directory):
        self.directory = directory
        self.image_files = [f for f in os.listdir(directory) if f.endswith('.png')]

    def __len__(self):
        return len(self.image_files)

    def __getitem__(self, idx):
        image = Image.open(os.path.join(self.directory, self.image_files[idx]))
        image = transform(image)
        label = self.image_files[idx].split('_')[0]
        return image, label

A Dataset can also apply any required transformations as it loads each sample.
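Note that the image dataset above returns the label as a string taken from the file name. Loss functions such as cross-entropy expect integer class indices, so one possible approach is to build a mapping from label strings to indices. The file names and the helper `build_label_index` below are hypothetical and only illustrate the idea.

```python
def build_label_index(file_names):
    """Map each distinct label prefix (text before the first '_') to an integer."""
    labels = sorted({name.split('_')[0] for name in file_names})
    return {label: index for index, label in enumerate(labels)}


# Hypothetical file names following the "<label>_<id>.png" pattern.
file_names = ["cat_001.png", "dog_001.png", "cat_002.png"]

label_to_index = build_label_index(file_names)
targets = [label_to_index[name.split('_')[0]] for name in file_names]

print(label_to_index, targets)
```

Sorting the labels keeps the mapping stable across runs, regardless of the order in which files are listed.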

To split the data into training and testing sets, we can use the random_split function provided by PyTorch.

from torch.utils.data import random_split

dataset = CustomDataset('data.csv')
train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size

train_dataset, test_dataset = random_split(dataset, [train_size, test_size])
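Because random_split is random, the split changes from run to run unless a seeded generator is passed in. The sketch below uses a small synthetic TensorDataset (an assumption, standing in for the CSV dataset) and an arbitrary seed of 42 to show a reproducible 80/20 split.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Synthetic stand-in dataset with 10 samples.
dataset = TensorDataset(torch.arange(10).float().unsqueeze(1), torch.arange(10))

train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size


def split_with_seed(seed):
    """Split the dataset using a generator seeded with the given value."""
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [train_size, test_size], generator=generator)


train_dataset, test_dataset = split_with_seed(42)
train_again, test_again = split_with_seed(42)

# Each Subset stores the indices it selected from the original dataset.
train_indices = list(train_dataset.indices)
test_indices = list(test_dataset.indices)

print(len(train_dataset), len(test_dataset))
```

The two subsets never overlap, and reusing the same seed reproduces the same indices.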

DataLoader

The DataLoader class is used to load the data in batches. It provides options to shuffle the data and load the data in parallel.

Here is an example of how to use the DataLoader class to load the data.

from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
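To see what a DataLoader actually yields, the sketch below iterates over a synthetic dataset of 100 samples (an assumed size) and records the size of each batch. It also shows the `drop_last` option, which discards a final incomplete batch. Parallel loading is controlled by the `num_workers` argument, which is left at its default of 0 here so the example runs in a single process.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic data: 100 samples with 4 features each and 3 possible classes.
features = torch.randn(100, 4)
labels = torch.randint(0, 3, (100,))
dataset = TensorDataset(features, labels)

loader = DataLoader(dataset, batch_size=32, shuffle=True)
batch_sizes = [batch_features.shape[0] for batch_features, _ in loader]

# With drop_last=True the final batch of 4 samples is skipped.
loader_drop = DataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)
num_full_batches = len(loader_drop)

print(batch_sizes, num_full_batches)
```

Shuffling changes which samples land in each batch, but not the batch sizes.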

When training the model, we can iterate over the DataLoader object to get the data in batches.

for item, label in train_loader:
    output = model(item)
    loss = criterion(output, label)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
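The loop above assumes that `model`, `criterion`, and `optimizer` already exist. A minimal self-contained sketch is shown below. The linear model, cross-entropy loss, SGD optimizer, learning rate, and synthetic data are all assumptions chosen to keep the example small, not a recommendation for a real task.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic classification data: 64 samples, 4 features, 3 classes.
features = torch.randn(64, 4)
labels = torch.randint(0, 3, (64,))
train_loader = DataLoader(TensorDataset(features, labels), batch_size=16, shuffle=True)

model = nn.Linear(4, 3)                                  # assumed model
criterion = nn.CrossEntropyLoss()                        # assumed loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # assumed optimizer

model.train()
epoch_losses = []
for epoch in range(3):
    running_loss = 0.0
    for item, label in train_loader:
        output = model(item)
        loss = criterion(output, label)

        optimizer.zero_grad()  # clear gradients from the previous step
        loss.backward()        # compute gradients for this batch
        optimizer.step()       # update the parameters
        running_loss += loss.item()
    # Average the per-batch losses to get one number per epoch.
    epoch_losses.append(running_loss / len(train_loader))

print(epoch_losses)
```

Wrapping the batch loop in an outer epoch loop simply repeats the pass over the DataLoader, which reshuffles the data each time when `shuffle=True`.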

To calculate accuracy, we can use the test_loader.

correct = 0
val_loss = 0
total = 0

with torch.no_grad():
    for item, label in test_loader:
        output = model(item)

        loss = criterion(output, label)
        val_loss += loss.item()  # accumulate the loss over all test batches

        _, predicted = torch.max(output, 1)
        total += label.size(0)
        correct += (predicted == label).sum().item()

accuracy = correct / total  # total is an integer count of samples
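The evaluation steps can also be wrapped in a reusable function. The sketch below is one possible arrangement: `evaluate` is a hypothetical helper, and the tiny model with hand-set weights exists only so the result is easy to verify by hand. It calls `model.eval()` first, which matters for layers such as dropout and batch normalization, and it weights each batch loss by the batch size so the average is per sample.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate(model, loader, criterion):
    """Return (average loss per sample, accuracy) over all batches in loader."""
    model.eval()
    correct, total, total_loss = 0, 0, 0.0
    with torch.no_grad():
        for item, label in loader:
            output = model(item)
            total_loss += criterion(output, label).item() * label.size(0)
            predicted = output.argmax(dim=1)
            total += label.size(0)
            correct += (predicted == label).sum().item()
    return total_loss / total, correct / total


# Toy model whose weight matrix is the identity, so the predicted class
# is simply the index of the largest input feature.
model = nn.Linear(2, 2, bias=False)
with torch.no_grad():
    model.weight.copy_(torch.eye(2))

features = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
labels = torch.tensor([0, 1, 0, 0])  # the last label is deliberately a mismatch
test_loader = DataLoader(TensorDataset(features, labels), batch_size=2, shuffle=False)

avg_loss, accuracy = evaluate(model, test_loader, nn.CrossEntropyLoss())
print(avg_loss, accuracy)
```

Three of the four predictions match their labels here, so the accuracy is 0.75.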

Summary

The Dataset class is used to load data into a neural network model.
The DataLoader class is used to load the data in batches.
The data can be split into training and testing sets using the random_split function provided by PyTorch.
The DataLoader class provides options to shuffle the data and load the data in parallel.
The DataLoader class can be used to iterate over the data in batches when training the model.
The test split can be used to calculate the accuracy of the model.
