When we load data to train a neural network model, we usually split it into training data and testing data.
The training data is then further split into batches to train the model. Processing one batch at a time reduces memory usage and speeds up the training process.
Let’s see how to load data into a neural network model using PyTorch.
Dataset
PyTorch provides a Dataset class that represents a collection of samples. We can create a custom dataset by inheriting from the Dataset class and implementing the __len__ method, which returns the number of samples, and the __getitem__ method, which returns the sample at a given index.
Here is an example of a custom dataset class that loads data from a CSV file.
import pandas as pd
import torch

class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, file_path):
        # Load the whole CSV file into memory once
        self.data = pd.read_csv(file_path)

    def __len__(self):
        # Number of samples (rows) in the dataset
        return len(self.data)

    def __getitem__(self, idx):
        # All columns except the last are the input features
        x = torch.tensor(self.data.iloc[idx, :-1].values, dtype=torch.float32)
        # The last column is the target value
        y = torch.tensor(self.data.iloc[idx, -1], dtype=torch.float32)
        return x, y
This dataset class reads data from a CSV file and returns the input and output values as tensors.
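Looking up a row with iloc on every __getitem__ call can become slow for larger files. A minimal sketch of one alternative, assuming the whole CSV fits in memory and every column is numeric, converts the table to tensors once in the constructor. The class name TensorCSVDataset and the inline CSV text are hypothetical and only serve as an illustration.

```python
import io

import pandas as pd
import torch
from torch.utils.data import Dataset


class TensorCSVDataset(Dataset):
    """Hypothetical variant: convert the CSV to tensors once, up front."""

    def __init__(self, file_or_buffer):
        frame = pd.read_csv(file_or_buffer)
        values = torch.tensor(frame.to_numpy(), dtype=torch.float32)
        self.features = values[:, :-1]  # all columns except the last
        self.targets = values[:, -1]    # the last column is the target

    def __len__(self):
        return self.features.shape[0]

    def __getitem__(self, idx):
        return self.features[idx], self.targets[idx]


# Illustrative data: two feature columns and one target column.
csv_text = "a,b,y\n1.0,2.0,0\n3.0,4.0,1\n5.0,6.0,0\n"
dataset = TensorCSVDataset(io.StringIO(csv_text))
x, y = dataset[1]  # features and target of the second row
```

The trade-off is memory: the whole table is held as a tensor, which is fine for small tabular data but may not suit files that do not fit in RAM.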
Image Loading example
import os

import torch
from PIL import Image
from torchvision import transforms

# Convert images to tensors, then rescale pixel values from [0, 1] to [-1, 1]
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, directory):
        self.directory = directory
        # Keep only the PNG files, sorted so that the order is deterministic
        self.image_files = sorted(f for f in os.listdir(directory) if f.endswith('.png'))

    def __len__(self):
        return len(self.image_files)

    def __getitem__(self, idx):
        image = Image.open(os.path.join(self.directory, self.image_files[idx]))
        image = transform(image)
        # The label is the part of the file name before the first underscore
        label = self.image_files[idx].split('_')[0]
        return image, label
In this way, a Dataset class can apply any required transformations as it loads each sample.
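Note that the image example returns each label as a string taken from the file name, while classification losses such as nn.CrossEntropyLoss expect integer class indices. One possible way to bridge the gap, sketched here with made-up file names and a hypothetical helper called build_label_index, is to map each distinct label string to an index:

```python
import torch


def build_label_index(file_names):
    """Map the label prefix of each file name to an integer class index."""
    labels = sorted({name.split('_')[0] for name in file_names})
    return {label: index for index, label in enumerate(labels)}


# Hypothetical file names following the "<label>_<id>.png" pattern.
file_names = ['cat_001.png', 'dog_001.png', 'cat_002.png']
label_to_index = build_label_index(file_names)

# Integer targets that a classification loss can consume.
targets = torch.tensor([label_to_index[name.split('_')[0]] for name in file_names])
```

Sorting the label strings keeps the mapping stable between runs, so the same class always receives the same index.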
To split the data into training and testing sets, we can use the random_split function provided by PyTorch.
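from torch.utils.data import random_split

# Here CustomDataset is the CSV-based dataset class defined earlier
dataset = CustomDataset('data.csv')
# Use 80% of the samples for training and the rest for testing
train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size
train_dataset, test_dataset = random_split(dataset, [train_size, test_size])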
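Because random_split shuffles the indices, two runs generally produce different splits. If a reproducible split is wanted, a seeded torch.Generator can be passed through the generator argument. The sketch below uses a small stand-in TensorDataset, and the seed value 42 and the helper name split are arbitrary choices made for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset with 10 samples (illustrative only).
dataset = TensorDataset(torch.arange(10, dtype=torch.float32).unsqueeze(1))

train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size


def split(seed):
    """Split the dataset with a generator seeded for reproducibility."""
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [train_size, test_size], generator=generator)


# The same seed yields the same train/test indices on every call.
train_a, test_a = split(42)
train_b, test_b = split(42)
```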
DataLoader
The DataLoader class is used to load the data in batches. It provides options to shuffle the data (the shuffle argument) and to load the data in parallel using worker processes (the num_workers argument).
Here is an example of how to use the DataLoader class to load the data.
from torch.utils.data import DataLoader
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
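To see what batching does in practice, the following sketch builds a stand-in dataset of 100 random samples and inspects the batch sizes. The data is made up for illustration. num_workers is left at its default of 0, which loads data in the main process.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 100 samples with 4 features each and a binary label.
features = torch.randn(100, 4)
labels = torch.randint(0, 2, (100,))
dataset = TensorDataset(features, labels)

# 100 samples with batch_size=32 give three full batches and one of 4.
loader = DataLoader(dataset, batch_size=32, shuffle=False)
batch_sizes = [x.shape[0] for x, _ in loader]

# drop_last=True discards the final, smaller batch.
drop_loader = DataLoader(dataset, batch_size=32, shuffle=False, drop_last=True)
full_batch_sizes = [x.shape[0] for x, _ in drop_loader]
```

Dropping the last batch can be useful when a model or loss is sensitive to very small batches, at the cost of skipping a few samples in each epoch.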
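When training the model, we can iterate over the DataLoader object to get the data in batches.
# model, criterion and optimizer are assumed to be defined beforehand
for item, label in train_loader:
    output = model(item)              # forward pass
    loss = criterion(output, label)   # compute the loss for this batch
    optimizer.zero_grad()             # clear gradients from the previous step
    loss.backward()                   # backpropagate
    optimizer.step()                  # update the model parameters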
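The loop above leaves model, criterion and optimizer undefined. A self-contained sketch of the same pattern might look as follows, assuming a tiny linear regression problem on synthetic data. The model, loss, learning rate and number of epochs are all illustrative choices, not recommendations.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # make the synthetic data and initial weights repeatable

# Synthetic regression data: targets are a fixed linear function of the inputs.
inputs = torch.randn(64, 3)
true_weights = torch.tensor([[1.0], [-2.0], [0.5]])
targets = inputs @ true_weights

train_loader = DataLoader(TensorDataset(inputs, targets), batch_size=16, shuffle=True)

model = nn.Linear(3, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(20):
    model.train()
    running_loss = 0.0
    for item, label in train_loader:
        output = model(item)
        loss = criterion(output, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Weight each batch loss by its size to get a per-sample average.
        running_loss += loss.item() * item.size(0)
    epoch_losses.append(running_loss / len(train_loader.dataset))
```

Tracking the average loss per epoch, as epoch_losses does here, is a simple way to check that training is making progress.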
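For accuracy calculation, we can use the test_loader.
correct = 0
val_loss = 0
total = 0
model.eval()  # switch layers such as dropout and batch norm to evaluation mode
with torch.no_grad():  # gradients are not needed for evaluation
    for item, label in test_loader:
        output = model(item)
        loss = criterion(output, label)
        val_loss += loss.item()  # accumulate the loss of each batch
        # The predicted class is the index of the largest output value
        _, predicted = torch.max(output, 1)
        total += label.size(0)
        correct += (predicted == label).sum().item()
# total is already a count of samples, so divide by it directly
accuracy = correct / total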
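The evaluation steps can also be wrapped in a reusable function. The sketch below assumes a classification model whose outputs are per-class scores. The function name evaluate is hypothetical, and nn.Identity stands in for a trained model so that the inputs act directly as the scores.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate(model, loader, criterion):
    """Return the average loss per sample and the accuracy over a loader."""
    model.eval()
    correct, total, loss_sum = 0, 0, 0.0
    with torch.no_grad():
        for item, label in loader:
            output = model(item)
            # Weight the batch loss by batch size for a per-sample average.
            loss_sum += criterion(output, label).item() * label.size(0)
            predicted = output.argmax(dim=1)
            total += label.size(0)
            correct += (predicted == label).sum().item()
    return loss_sum / total, correct / total


# Stand-in "model outputs": the predicted classes are 0, 1, 0, 1.
logits = torch.tensor([[2.0, 1.0], [0.0, 3.0], [5.0, 0.0], [1.0, 2.0]])
labels = torch.tensor([0, 1, 1, 1])
loader = DataLoader(TensorDataset(logits, labels), batch_size=2)

# Three of the four predictions match the labels, so accuracy is 0.75.
avg_loss, accuracy = evaluate(nn.Identity(), loader, nn.CrossEntropyLoss())
```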
Summary
- The Dataset class is used to load data into a neural network model.
- The DataLoader class is used to load the data in batches.
- The data can be split into training and testing sets using the random_split function provided by PyTorch.
- The DataLoader class provides options to shuffle the data and load the data in parallel.
- The DataLoader class can be used to iterate over the data in batches when training the model.
- The testing split can be used to calculate the accuracy of the model.