A comparison of a Pets classifier using vanilla PyTorch and Fast.ai
Comparing the fine-tuning of a ResNet34-based Pets classifier written in vanilla PyTorch with one written using Fast.ai. The purpose of this blog is to demonstrate how `fastai` can help one get started with `deep learning`, and how it provides the right abstractions and encapsulation so that one can focus on research and modeling rather than on boilerplate code. `fastai` abstracts away a lot of inner details that we should eventually understand, but not necessarily when we're starting out or prototyping our research. The post below shows that what usually requires writing custom classes for `data-splitters`, `datasets` and `trainers` can be accomplished in very few lines of code with `fast.ai`. I start out by writing that code assuming we don't have access to `fast.ai`, and then I go and surprise myself by showing how easy it is to perform the same task using `fast.ai`.
- Introduction
- PyTorch Version
- Fast.ai Version
- Summary
- References
Disclaimer: I am fairly new to the library and this is just to show what I've observed so far. I may be missing a point or two, so don't treat this post as a comprehensive list of what `fastai` can do. This is just my attempt to keep learning and evolving.
Below is the plan most people follow when training a Machine Learning model:
- Load Data
- Inspect Data: Plot a few examples
- Create a DataLoader
- Define a model architecture.
- Write a training loop.
- Plot metrics.
There's also another step, which is
- Analyze Errors.
but we'll tackle this in a separate blog post once we've covered training the model.
import math
import os
import re
import tarfile
from collections import Counter, OrderedDict

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import torch
from PIL import Image
from sklearn.model_selection import train_test_split
from torch import nn
from torch import optim
from torch.nn import functional as F
from torch.utils.data import dataset, dataloader
from torchvision import datasets, transforms
from torchvision.models.resnet import resnet34
from tqdm.notebook import tqdm
The data is a `tar-gzip` archive. To extract data files from it we'll use the package called `tarfile`. But first we need to download the archive.
The code in the cell below is taken from: https://gist.github.com/devhero/8ae2229d9ea1a59003ced4587c9cb236#gistcomment-3775721.
def fetch_data(url, data_dir, download=False):
    # Stream the archive and extract it on the fly, but only when download=True.
    if download:
        response = requests.get(url, stream=True)
        file = tarfile.open(fileobj=response.raw, mode="r|gz")
        file.extractall(path=data_dir)
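The `"r|gz"` mode is what lets `tarfile` read from a forward-only stream such as `response.raw`. Here is a minimal sketch of that behaviour, assuming an in-memory archive in place of the real download; the member name `images/Abyssinian_1.jpg` and its contents are made up for illustration.

```python
import io
import os
import tarfile
import tempfile

# Build a tiny .tgz in memory; this stands in for the HTTP response body.
payload = b"not a real image"
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    info = tarfile.TarInfo(name="images/Abyssinian_1.jpg")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))
buffer.seek(0)

# "r|gz" reads the archive as a forward-only stream, so no seek() is needed.
target_dir = tempfile.mkdtemp()
with tarfile.open(fileobj=buffer, mode="r|gz") as stream:
    stream.extractall(path=target_dir)

extracted = os.path.join(target_dir, "images", "Abyssinian_1.jpg")
print(os.path.exists(extracted))
```

Because the stream is consumed as it is extracted, the archive never has to be written to disk in full before unpacking.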
In the interest of comparison, I'll first write the `Dataset` class and see how easy it gets when we use `fastai`. The url that we want to fetch the data from is here: Pets Dataset
pets_url = 'https://s3.amazonaws.com/fast-ai-imageclas/oxford-iiit-pet.tgz'
data_dir = os.path.join('gdrive', 'MyDrive', 'pets_data')
base_img_dir = os.path.join(data_dir, 'oxford-iiit-pet', 'images')
fetch_data(pets_url, data_dir)
We've extracted the data in the folder named `pets_data`. On inspection, it looks like the folder `pets_data/oxford-iiit-pet/images` contains all the images we want (some files need to be filtered out as they're not in JPEG format). The filenames carry the category label in the format `<CATEGORYNAME>_<NUMBER>.jpg`.
In fastbook, a `RegexLabeller` is used to extract the category name from the file names. We'll write a similar `LabelExtractor` (it's far less capable than fastai's `RegexLabeller`, but it does the job for now).
class RegexLabelExtractor():
    def __init__(self, pattern):
        self.pattern = pattern
        self._names = []

    def __call__(self, iterable):
        # Take the first regex match from every text in the iterable.
        return [re.findall(self.pattern, value)[0] for value in iterable]
As mentioned before, our version, `RegexLabelExtractor`, extracts the label from a text. It accepts a pattern during instantiation, and on `__call__` it expects an `iterable` of texts containing the labels. It returns all the label names in a Python list.
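As a quick illustration of what the pattern captures, here is a small sketch using only `re`. The file names are illustrative examples in the `<CATEGORYNAME>_<NUMBER>.jpg` format, and the dot before `jpg` is escaped here as a slight tightening of the pattern.

```python
import re

# Capture everything before the final "_<NUMBER>.jpg".
pattern = r'(.+)_\d+\.jpg$'
file_names = ['Abyssinian_1.jpg', 'great_pyrenees_173.jpg', 'yorkshire_terrier_9.jpg']

labels = [re.findall(pattern, name)[0] for name in file_names]
print(labels)
```

The greedy `(.+)` group is what keeps multi-word breeds such as `great_pyrenees` intact: it backtracks only as far as the last underscore that is followed by digits.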
Once we've defined a class for extracting the labels, we'd like to define a container that is responsible for maintaining a map of `CATEGORYNAME -> ID`, which we'll use to convert the labels to an integer format and vice versa.
Below we define a `LabelManager`. It exposes `id_for_label`* and `label_for_id`* methods along with `keys`, which returns the unique label names in our dataset (their count is our vocabulary size). We can also call `len` on a `LabelManager` object to know the number of output classes.
*These look up an `OrderedDict` (label to id) and a plain `dict` (id to label), respectively.
class LabelManager():
    def __init__(self, labels):
        # Assign ids in order of first appearance.
        self._label_to_idx = OrderedDict()
        for label in labels:
            if label not in self._label_to_idx:
                self._label_to_idx[label] = len(self._label_to_idx)
        self._idx_to_label = {v:k for k,v in self._label_to_idx.items()}

    @property
    def keys(self):
        return list(self._label_to_idx.keys())

    def id_for_label(self, label):
        return self._label_to_idx[label]

    def label_for_id(self, idx):
        return self._idx_to_label[idx]

    def __len__(self):
        return len(self._label_to_idx)
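To see the mapping in action without the class around it, here is a stripped-down sketch of the same idea. The label list is made up for illustration.

```python
from collections import OrderedDict

# Illustrative labels, in the order they might be read from disk.
labels = ['Abyssinian', 'beagle', 'Abyssinian', 'pug', 'beagle']

label_to_idx = OrderedDict()
for label in labels:
    if label not in label_to_idx:
        # The next free id is simply the current size of the map.
        label_to_idx[label] = len(label_to_idx)
idx_to_label = {idx: label for label, idx in label_to_idx.items()}

label_ids = [label_to_idx[label] for label in labels]
print(label_ids)
```

Ids are handed out in order of first appearance, so sorting the file names first (as we do later) keeps the mapping stable between runs.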
We'd also like to split our dataset into `train` and `validation` subsets. Although the dataset provides a list of `train` and `validation` splits, to be consistent with the book we'll write our own version of the `RandomSplitter` (which again is far less capable, but will do the job for the purposes of demonstration).
We'd like this `Splitter` to accept a `percentage` to split on and also a `seed` for reproducibility.
class Splitter():
    def __init__(self, valid_pct=0.2, seed = None):
        self.seed = seed
        self.valid_pct = valid_pct

    def __call__(self, dataset):
        # Returns (train, valid); the seeded RandomState makes the split repeatable.
        return train_test_split(dataset, test_size=self.valid_pct, random_state=np.random.RandomState(self.seed))
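To make the role of the seed concrete, here is a hedged sketch of a seeded random split using only the standard library. It swaps `train_test_split` for `random.Random(seed).shuffle`, so it illustrates the idea rather than reproducing scikit-learn's exact output, and `simple_split` is a hypothetical helper.

```python
import random

def simple_split(items, valid_pct=0.2, seed=None):
    # Shuffle a copy with a private RNG so the global random state is untouched.
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_pct)
    # The first n_valid shuffled items become validation, the rest training.
    return shuffled[n_valid:], shuffled[:n_valid]

train_a, valid_a = simple_split(range(10), valid_pct=0.2, seed=42)
train_b, valid_b = simple_split(range(10), valid_pct=0.2, seed=42)
print(len(train_a), len(valid_a), valid_a == valid_b)
```

Calling the function twice with the same seed gives the same validation set, which is what makes experiments comparable across runs.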
Now that we have a way to extract labels, maintain them in a map and split the data into `train` and `validation` splits, we'll define a `PetsDataset` (a PyTorch `Dataset`) which will be used by the PyTorch `DataLoader` to give us the data we need to train our model.
A note on PyTorch `Dataset`: a PyTorch dataset is a primitive provided by the library that stores the samples and their corresponding labels. In order to write a custom dataset, our class `PetsDataset` needs to implement three functions: `__init__`, `__len__`, and `__getitem__`.
class PetsDataset(dataset.Dataset):
    def __init__(self, data, tfms=None):
        super(PetsDataset, self).__init__()
        self.data = data
        self.transforms = tfms

    def __getitem__(self, idx):
        # Each item in data is a (path, label_id) pair; the image is opened lazily.
        X = Image.open(self.data[idx][0])
        if X.mode != 'RGB':
            X = X.convert('RGB')
        y = self.data[idx][1]
        if self.transforms:
            X = self.transforms(X)
        return (X, y)

    def __len__(self):
        return len(self.data)
Notice how we're opening the `Image` only when `__getitem__` is called. We also have to make sure that all the images have 3 input channels, hence the check `if X.mode != 'RGB'`. Some images in the dataset are not RGB, and if we don't convert them to 3 channels then the `DataLoader` won't be able to create a batch using `torch.stack`.
We're now ready to use these datasets, but we need to make sure that our global map of `CATEGORYNAME -> ID` is constructed using both the `train` and the `validation` splits. We'll have one more class build that map and hold our corresponding datasets.
class DatasetManager():
    def __init__(self, base_dir, paths, label_extractor, tfms=None, valid_pct=0.2, seed=None):
        self._labels = label_extractor(paths)
        self.tfms = tfms
        # Build the label map from all files before splitting, so both splits share it.
        self._label_manager = LabelManager(self._labels)
        self._label_ids = [self.label_manager.id_for_label(label) for label in self._labels]
        self.abs_paths = [os.path.join(base_dir, path) for path in paths]
        self.train_data, self.valid_data = Splitter(valid_pct=valid_pct, seed=seed)(list(zip(self.abs_paths, self._label_ids)))

    @property
    def label_manager(self):
        return self._label_manager

    @property
    def train_dataset(self):
        return PetsDataset(self.train_data, tfms=self.tfms)

    @property
    def valid_dataset(self):
        return PetsDataset(self.valid_data, tfms=self.tfms)
We'll now use all the helper classes we've created so far to wrap the `datasets` in a `dataloader`, and then look at the plan to choose an architecture and train it (almost there).
paths = [path for path in sorted(os.listdir(base_img_dir)) if path.endswith('.jpg')]
# Raw string with an escaped dot, so that only a literal ".jpg" suffix matches.
pattern = r'(.+)_\d+\.jpg$'
regex_label_extractor = RegexLabelExtractor(pattern)
dataset_manager = DatasetManager(base_img_dir, paths, regex_label_extractor,
                                 tfms=transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()]),
                                 seed=42)
train_dataset = dataset_manager.train_dataset
valid_dataset = dataset_manager.valid_dataset
Before we look at the model, let's, for the sake of sanity, look at the labels we're dealing with and plot a few images. This step is just to make sure that things are working as expected, that the `dataloader` will batch the data in the right way, and that our `Trainer` won't crash midway.
df = pd.DataFrame(dataset_manager.label_manager.keys, columns=['label_name'])
df.head(len(df))
Below is a method to plot one batch of data (inspired by `fastai` of course, but again a very curtailed version of what that function does). Notice how we're calling `transforms.ToPILImage()`: that's because we have objects of type `torch.Tensor` in our batch, and in order to plot them we need to convert them to a `PIL.Image`. Everything else is there just to make sure the images are aligned nicely across the panels.
def plot_one_batch(batch, max_images=9):
    # Pick a near-square grid that is large enough to hold max_images panels.
    nrows = int(math.sqrt(max_images))
    ncols = int(math.sqrt(max_images))
    if nrows * ncols != max_images:
        nrows = (max_images + ncols - 1) // ncols
    fig, ax = plt.subplots(nrows=nrows, ncols=ncols, figsize=(20, 10))
    X,Y = next(batch)
    for idx, x in enumerate(X[:max_images]):
        y = Y[idx]
        ax.ravel()[idx].imshow(transforms.ToPILImage()(x))
        ax.ravel()[idx].set_title(f'{y}/{dataset_manager.label_manager.label_for_id(y.item())}')
        ax.ravel()[idx].set_axis_off()
    plt.tight_layout()
    plt.show()
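The grid arithmetic at the top of `plot_one_batch` can be checked in isolation. The sketch below pulls it out into a hypothetical helper, `grid_shape`, which is not part of the original notebook.

```python
import math

def grid_shape(max_images):
    # Start from the largest square that fits, then add rows if it is too small.
    ncols = int(math.sqrt(max_images))
    nrows = ncols
    if nrows * ncols != max_images:
        nrows = (max_images + ncols - 1) // ncols  # ceiling division
    return nrows, ncols

print(grid_shape(9), grid_shape(20))
```

For a perfect square the grid stays square; otherwise the column count is kept and rows are added until every image has a panel.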
This function yields batches from a `dataloader` one at a time; it is just a Python generator.
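def generate_one_batch(dl):
    for batch in dl:
        yield batch
# train_dl is only created further down, so build a small loader just for this preview.
preview_dl = dataloader.DataLoader(train_dataset, batch_size=32, shuffle=True)
plot_one_batch(generate_one_batch(preview_dl), max_images=20)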
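To see how the generator hands out batches lazily, here is a small sketch in which a plain list of `(inputs, labels)` tuples stands in for a real `DataLoader`. The name `fake_dl` and its values are made up.

```python
def generate_one_batch(dl):
    for batch in dl:
        yield batch

# A list of (inputs, labels) tuples stands in for a real DataLoader here.
fake_dl = [([1, 2], [0, 1]), ([3, 4], [1, 0])]
batches = generate_one_batch(fake_dl)

first = next(batches)   # runs the generator up to the first yield
second = next(batches)  # resumes where it left off
print(first, second)
```

Each call to `next` advances the loop by exactly one batch, which is why `plot_one_batch` can grab a single batch without iterating over the whole dataset.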
Now we're ready to look at the model and make a few decisions about the architecture we want to use.
Here's our requirement: we want to extract features from an image and then use a classification head to get the output distribution over the number of classes (our `labels` from before). We'll define a loss and use it to optimize the network.
Because we're dealing with images, a Convolutional Neural Network (CNN) seems like a good start. In the literature as well as in `fastbook`, a `resnet` type architecture has been used, so let's use that and see what we can do with it.
Coding a `ResNet` is a separate blog post on its own, so we'll punt on that for now and use what's available to us in the form of a pretrained model.
model = resnet34(pretrained=True, progress=True)
Since this model is trained to give an output distribution over 1000 classes, we can replace its final fully connected layer with one sized for the classes in our dataset and then fine-tune that layer. To read more on fine-tuning refer to fastbook.
model.fc = nn.Linear(512,len(dataset_manager.label_manager), bias=True)
We'll freeze all the layers of the model except for the `fc` classification head we added above.
def make_fine_tunable(model):
    # Freeze everything, then re-enable gradients for the classification head only.
    for param in model.parameters():
        param.requires_grad = False
    for param in model.fc.parameters():
        param.requires_grad = True
    print("Tunable Layers: ")
    for (name, param) in model.named_parameters():
        if param.requires_grad:
            print(f'{name} -> {param.requires_grad}')
make_fine_tunable(model)
Now we come to the point where we have to write a training loop, and this is where things get even deeper into boilerplate territory. We shouldn't have to write this, but that's the point of this blog post: using `fastai` we can offload a lot of the boilerplate code to the library, use the goodies it offers to our advantage, and focus more on research/modeling.
We maintain an instance of `model`, a `criterion`, an `optimizer` and the `dataloaders`. We step through each batch during `train_epoch` and incur a `loss`. We use this `loss` to make a `backward` pass and let the `optimizer` take a step by updating the network parameters. We also have a `validate` function that calculates `loss` and `accuracy` on the validation `dataset` after every `epoch`.
class Trainer():
    def __init__(self, train_dataloader, model, criterion, optimizer, test_dataloader=None):
        self.train_dl = train_dataloader
        self.model = model
        self.test_dl = test_dataloader
        self.criterion = criterion
        self.optimizer = optimizer
        self.recorder = {'loss': {'train': {}, 'test': {}},
                         'accuracy': {'train': {}, 'test': {}}}

    def step_batch(self, X,y):
        # Forward pass on the GPU; returns the loss, raw logits and class probabilities.
        X = X.cuda()
        y = y.cuda()
        logits = self.model(X)
        loss = self.criterion(logits, y)
        probs = F.softmax(logits, dim=1)
        return loss, logits, probs

    def train_epoch(self, epoch):
        self.model.train()
        running_loss = 0
        for X,y in tqdm(self.train_dl, leave=False):
            self.optimizer.zero_grad()
            loss, _, _ = self.step_batch(X,y)
            # Detach so the running total does not hold on to the autograd graph.
            running_loss += loss.detach()
            loss.backward()
            self.optimizer.step()
        epoch_loss = running_loss / len(self.train_dl)
        self.recorder['loss']['train'][epoch] = epoch_loss
        return epoch_loss

    @torch.no_grad()
    def accuracy(self):
        self.model.eval()
        correct = 0
        total = 0
        for X,y in tqdm(self.test_dl):
            y = y.cuda()
            total += y.size(0)
            # Use the trainer's own model and move the inputs to the GPU.
            logits = self.model(X.cuda())
            probs = F.softmax(logits, dim=1)
            _, y_pred = torch.max(probs, dim=1)
            correct += (y_pred == y).cpu().sum()
        acc = correct / float(total)
        return acc

    @torch.no_grad()
    def validate(self, epoch):
        # Eval mode makes layers such as batch norm use their running statistics.
        self.model.eval()
        running_loss = 0
        total = 0
        correct = 0
        for X,y in tqdm(self.test_dl, leave=False):
            y = y.cuda()
            total += y.size(0)
            loss, logits, probs = self.step_batch(X,y)
            running_loss += loss
            _, y_pred = torch.max(probs, dim=1)
            correct += (y_pred == y).cpu().sum()
        acc = correct / float(total)
        epoch_loss = running_loss / len(self.test_dl)
        self.recorder['loss']['test'][epoch] = epoch_loss
        self.recorder['accuracy']['test'][epoch] = acc
        return epoch_loss, acc

    def train(self, num_epochs):
        for epoch in tqdm(range(num_epochs), leave=False):
            train_loss = self.train_epoch(epoch)
            test_loss, test_acc = self.validate(epoch)
            #print(f"Training Loss: {train_loss},\tTest Loss: {test_loss},\tTest Accuracy: {test_acc}")
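The accuracy bookkeeping in the trainer boils down to "pick the most probable class and compare it with the target". Here is a plain-Python sketch of that step; the logits and targets are made up, and the `softmax` and `argmax` helpers are hypothetical stand-ins for `F.softmax` and `torch.max`.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

# Made-up logits for a batch of three samples over four classes.
batch_logits = [[2.0, 0.5, -1.0, 0.1], [0.2, 0.1, 3.0, 0.0], [-1.0, 0.3, 0.2, 0.9]]
targets = [0, 2, 1]

preds = [argmax(softmax(row)) for row in batch_logits]
accuracy = sum(p == t for p, t in zip(preds, targets)) / len(targets)
print(preds, accuracy)
```

Since softmax preserves the ordering of the logits, taking the argmax of the raw logits gives the same predictions; the softmax is only needed when the probabilities themselves are of interest.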
Let's send the model over to the GPU for faster training.
model = model.cuda()
Let's define a configuration that will hold our hyper-parameters
class TrainConfig():
    def __init__(self, bs=32, lr=1e-2, seed=42, betas=(0.9, 0.999), num_workers=4):
        self.bs = bs
        self.lr = lr
        self.seed = seed
        self.betas = betas
        self.num_workers = num_workers
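As an aside, the same container could be written more compactly with a standard-library `dataclass`. This is just an alternative sketch, not what the rest of the post uses, and the name `TrainConfigSketch` is hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TrainConfigSketch:
    # Defaults mirror the hand-written class above.
    bs: int = 32
    lr: float = 1e-2
    seed: int = 42
    betas: Tuple[float, float] = (0.9, 0.999)
    num_workers: int = 4

config_sketch = TrainConfigSketch(bs=128)
print(config_sketch)
```

The dataclass generates `__init__` and a readable `__repr__` for free, which is handy when logging the hyper-parameters of a run.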
We set the seed for reproducibility, and instantiate `dataloader` objects. Notice how we're using more than one worker (`num_workers` > 1), which speeds up the data loading process.
config = TrainConfig(bs=128)
torch.manual_seed(config.seed)
train_dl = dataloader.DataLoader(train_dataset, batch_size=config.bs, shuffle=True, num_workers=config.num_workers)
valid_dl = dataloader.DataLoader(valid_dataset, batch_size=config.bs, shuffle=False, num_workers=config.num_workers)
We define our `criterion` as `nn.CrossEntropyLoss` and choose our `optimizer` to be an instance of `optim.Adam`. After that we instantiate our trainer object and train (fine-tune in our case) for a few epochs.
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
trainer = Trainer(train_dl, model, criterion, optimizer, test_dataloader=valid_dl)
trainer.train(10)
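For intuition about what the criterion computes, here is a plain-Python sketch of cross-entropy for a single sample, working directly on raw logits the way `nn.CrossEntropyLoss` does. The helper `cross_entropy` and the example logits are made up for illustration; the real loss also averages over the batch by default.

```python
import math

def cross_entropy(logits, target):
    # log-sum-exp with the max subtracted for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    # negative log-probability of the target class
    return log_sum_exp - logits[target]

confident = cross_entropy([4.0, 0.0, 0.0], target=0)
uniform = cross_entropy([0.0, 0.0, 0.0], target=0)
wrong = cross_entropy([4.0, 0.0, 0.0], target=1)
print(confident, uniform, wrong)
```

A confident correct prediction gives a loss close to zero, a uniform guess over `n` classes gives `log(n)`, and a confident wrong prediction is penalised the most.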
Helper functions for plotting our loss and accuracies (which we've recorded using our trainer)
def plot_losses(losses):
    train_loss = losses['train']
    test_loss = losses['test']
    plt.style.use('fivethirtyeight')
    fig, ax = plt.subplots(figsize=(7, 4))
    ax.plot(train_loss, color='blue', label='Training Loss')
    ax.plot(test_loss, color='green', label='Test Loss')
    ax.set(title="Loss over epochs", xlabel="Epochs", ylabel="Loss")
    ax.legend()
    fig.show()
    plt.style.use('default')

def plot_accuracy(accuracy):
    plt.style.use('fivethirtyeight')
    fig, ax = plt.subplots(figsize=(7, 4))
    ax.plot(accuracy, color='blue', label='Test Accuracy')
    ax.set(title="Accuracy over epochs", xlabel="Epochs", ylabel="Accuracy")
    ax.legend()
    fig.show()
    plt.style.use('default')
losses = { k: np.asarray([t.item() for t in v.values()]) for k,v in trainer.recorder['loss'].items() }
plot_losses(losses)
accuracies = { k: np.asarray([t.item() for t in v.values()]) for k,v in trainer.recorder['accuracy'].items()}
plot_accuracy(accuracy=accuracies['test'])
That covers every step of the plan we started with:
- Load Data
- Inspect Data: Plot a few examples
- Create a DataLoader
- Define a model architecture.
- Write a training loop.
- Plot metrics.
Now let's see how this can be done using `fastai`. One could treat the following cells as a completely different notebook altogether.
from fastcore.all import L
from fastai.vision.all import *
matplotlib.rc('image', cmap='Greys')
We'll first download the `Pets` data and `untar` it using the `untar_data` function. This function takes care of downloading the archive, extracting it somewhere on the disk for us and returning the path. It's helpful as I don't have to peek at the response object, parse it, untar it and then iterate through the directory myself. To know more about `untar_data`, please check out the documentation for `untar_data`.
path = untar_data(URLs.PETS)
Path.BASE_PATH = path
path.ls()
(path/"images").ls()
Let's construct a `DataBlock` object. A `DataBlock` object provides us encapsulation over many aspects of our data loading and arranging pipeline. It lets us:
- define the `blocks` which make up `X` and `y` in our dataset
  - This will also automatically convert the `labels` to `integer ids`
- extract the `label` from the `name` attribute of the file
- apply `Transformations` for us which can help us do `Data Augmentation` and resizing in one go.
- Randomly split the data into `training` and `validation` splits.
Notice how it does all the work (and more, since we didn't do any augmentation) of the classes `Splitter`, `RegexLabelExtractor`, `LabelManager` and `DatasetManager` defined above in just one call. Since it is well maintained, well tested, more generic and more performant, and offers much more functionality, we don't need to keep writing our own versions from scratch every time we are tasked with training a classifier.
pets = DataBlock(blocks = (ImageBlock, CategoryBlock),
                 get_items=get_image_files,
                 splitter=RandomSplitter(seed=42),
                 get_y=using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'),
                 item_tfms=Resize(460),
                 batch_tfms=aug_transforms(size=224, min_scale=0.75))
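To demystify the `get_y` argument, here is a rough plain-Python sketch of the idea behind combining `using_attr` with a regex labeller: read an attribute from each item, then apply the pattern to it. The helper `label_from_attr` is hypothetical and is not how fastai implements it; the file path is made up.

```python
import re
from pathlib import Path

def label_from_attr(pattern, attr):
    # Hypothetical helper: read `attr` from the object, then apply the regex.
    compiled = re.compile(pattern)
    def _label(obj):
        return compiled.search(getattr(obj, attr)).group(1)
    return _label

get_label = label_from_attr(r'(.+)_\d+.jpg$', 'name')
label = get_label(Path('images/great_pyrenees_173.jpg'))
print(label)
```

Working on the `name` attribute rather than the full path keeps directory names out of the match, which is the same reason our vanilla version listed bare file names.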
To check if everything will work fine, there's a pretty handy function called `summary` that we can call on the `DataBlock` object. It will show us the whole plan and give us a meaningful error message if there's an issue somewhere in our pipeline.
pets.summary(path/"images")
Let's define our dataloader.
dls = pets.dataloaders(path/"images")
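Let's fine-tune our model for ten epochs using `cnn_learner`. By default `fine_tune` first trains the new head for one epoch with the pretrained body frozen, then unfreezes and trains for the requested number of epochs. The model we'll use is `resnet34`.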
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(10)
Notice how we didn't have to worry about sending the data or the model over to the GPU.
And the cherry on top is the ability to do interpretation and analyze errors with a very neatly written function call.
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(12,12), dpi=60)
And we're done! Sure, the model can be improved upon from here, but the point is that I can now focus on precisely that bit right after getting started, and not worry about anything else. I'd advise reading chapter 5 of the `fastbook` next, as the last few lines have skipped over a few points such as `Data Augmentation`, finding the right `Learning Rate`, etc.