
Classifying Paintings through Deep Learning


Using ResNets to revive an old class project

Featured photo by Clem Onojeghuo

Back in college, I did a class project where I used computer vision techniques for the first time. I wanted to classify paintings into genres: abstract, realist, impressionist, baroque. It was the age before deep learning, so the features available were histograms, blobs, and edges. I hadn’t yet heard of SIFT or bag-of-words for images, so the project was very basic. But you have to start somewhere, and it was already very exciting for me. MATLAB was then the pinnacle of machine learning. Today, we have the NumPy stack and deep learning frameworks. Wouldn’t it be fun to revisit an old project?

You can check out the notebooks on GitHub here. You can also check the Kaggle Notebooks here.

The Challenge

We’re using the Historic Art dataset from Kaggle. It’s a large collection of artworks and their metadata. Other than paintings, there are sculptures, architecture, graphic design, and even ancient pottery and tapestry. In the paintings department, we have the works of different artists, from Donatello to Van Gogh. It spans several periods, from Medieval to Impressionism. There are 33,611 paintings from 5,565 painters. For this exercise, I keep only a few painters I’m familiar with. Note that this selection is completely arbitrary:

Period                 Artist               # of Samples
Early Renaissance      Fra Angelico         244
Northern Renaissance   Bruegel, the Elder   223
Impressionism          Van Gogh             420

Note that some of the samples are actually details of a larger work. For example, the Garden of Earthly Delights by Bosch has several zoomed-in samples in the data, depicting scenes within the piece.

I want to make a classifier that distinguishes the different periods. To do so, I will use PyTorch Lightning to create the model and training routines. I will compare ResNet18, 34, and 50 to see whether there is a difference in accuracy. As an added challenge, I also want to try out multi-task learning: predicting the period and the artist simultaneously, to hopefully steer the training in a positive direction. Finally, I want to look beyond the metrics and visualize what the network is doing. I want to see gradient maps to uncover what latent features the network processes.

Lastly, a disclaimer. I’m not an expert in art history or any academic history. I’m simply a museum-goer with a taste for these beautiful paintings. So I’m afraid I can’t explain all the nuances of the Renaissance, and the fun elements of the impressionist movement.

Code On

First, we start with the dataset class that feeds the DataLoader. I’m also defining a SquarePad transform, based on the one here, to keep the aspect ratio of the image when it is resized.

import os

import numpy as np
import torchvision
from PIL import Image
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import Dataset
from tqdm import tqdm


class SquarePad:
    """Pad the shorter side with zeros so that the image becomes (roughly) square."""

    def __call__(self, image):
        w, h = image.size
        max_wh = np.max([w, h])
        hp = int((max_wh - w) / 2)
        vp = int((max_wh - h) / 2)
        padding = (hp, vp, hp, vp)  # left, top, right, bottom
        return torchvision.transforms.functional.pad(image, padding, 0, 'constant')


class ArtPeriodDataSet(Dataset):
    def __init__(self, dataframe, img_dir, transform=None, target_transform=None):
        self.dataframe = dataframe.copy()
        self.img_dir = img_dir
        # eliminate non-existing files
        print("Eliminate non-existing files")
        for idx, row in tqdm(dataframe.iterrows(), total=len(dataframe)):
            path = f'{img_dir}/{row["ID"]}.jpg'
            if not os.path.exists(path):
                self.dataframe.drop(idx, inplace=True)
                print("Dropping", path)
        # encode the period names as integer labels
        self.label_encoder = LabelEncoder()
        labels_encoded = self.label_encoder.fit_transform(self.dataframe["period"])
        self.dataframe["period_encoded"] = labels_encoded
        self.classes = self.label_encoder.classes_
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        img_path = f'{self.img_dir}/{self.dataframe.iloc[idx]["ID"]}.jpg'
        # force three channels so grayscale or RGBA files match the network input
        image = Image.open(img_path).convert('RGB')
        label = self.dataframe.iloc[idx]["period_encoded"]
        if self.transform:
            image = self.transform(image)
        if self.target_transform:
            label = self.target_transform(label)
        return image, label
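To see what SquarePad does to the image dimensions, here is a small sketch of the same arithmetic in plain Python, without any image library. The helper names square_padding and padded_size are mine, made up for illustration.

```python
def square_padding(width, height):
    """Return (left, top, right, bottom) padding, mirroring the SquarePad arithmetic."""
    longest = max(width, height)
    hp = int((longest - width) / 2)
    vp = int((longest - height) / 2)
    return (hp, vp, hp, vp)


def padded_size(width, height):
    """Size of the image after the padding is applied."""
    left, top, right, bottom = square_padding(width, height)
    return (width + left + right, height + top + bottom)


print(square_padding(300, 500))  # (100, 0, 100, 0)
print(padded_size(300, 500))     # (500, 500)
print(padded_size(301, 500))     # (499, 500)
```

When the difference between the two sides is odd, the result is one pixel short of a perfect square. That should be harmless here, since the center crop that follows produces a square image anyway.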

My transformations and augmentations are standard. We use square padding, resize to 256, center crop to 224, then normalize via ImageNet mean and standard deviation. For the train, validation, and test splits, I have the following summary:

Period                 Train   Validation   Test   Total
Early Renaissance      388     30           30     448
Northern Renaissance   325     30           30     385
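For a feel of what the crop and normalization steps do numerically, here is a rough plain-Python sketch. The mean and standard deviation are the commonly used ImageNet values. The helper names are hypothetical, and the sketch assumes a 256 x 256 image going into the crop.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def center_crop_box(size, crop):
    """(left, top, right, bottom) of a centered crop x crop window in a size x size image."""
    offset = (size - crop) // 2
    return (offset, offset, offset + crop, offset + crop)


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


print(center_crop_box(256, 224))        # (16, 16, 240, 240)
print(normalize_pixel((124, 116, 104)))  # each channel lands close to 0
print(normalize_pixel((0, 0, 0)))        # black maps to negative values
```

A pixel near the ImageNet average color ends up near zero in every channel, which is the range the pre-trained weights expect.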

We are using pre-trained ResNet models from torchvision. The model definition and training routines use PyTorch Lightning, which, in my opinion, makes the code more beautiful and more organized.

import pytorch_lightning as pl
import torch
import torch.nn.functional as F
from torch import nn, optim
from torchvision import models


class LightningResNet(pl.LightningModule):
    def __init__(self, net_pretrained, device='cpu', criterion=F.cross_entropy,
                 num_classes=4, optimizer=None, scheduler=None):
        super().__init__()
        self.net = net_pretrained
        # set top to number of classes
        num_ftrs = self.net.fc.in_features
        self.net.fc = nn.Linear(num_ftrs, num_classes)
        self.criterion = criterion
        self.optimizer = optimizer
        self.scheduler = scheduler

    def forward(self, x):
        return self.net(x)

    def accuracy(self, outputs, labels):
        _, preds = torch.max(outputs, dim=1)
        return torch.tensor(torch.sum(preds == labels).item() / len(preds))

    def training_step(self, batch, batch_idx):
        loss, acc = self._shared_eval_step(batch, batch_idx)
        metrics = {'train_loss': loss, 'train_accuracy': acc}
        self.log_dict(metrics)
        return {'loss': loss, "train_accuracy": acc}

    def test_step(self, batch, batch_idx):
        with torch.no_grad():
            loss, acc = self._shared_eval_step(batch, batch_idx)
        metrics = {"test_acc": acc, "test_loss": loss}
        return metrics

    def validation_step(self, batch, batch_idx):
        with torch.no_grad():
            loss, acc = self._shared_eval_step(batch, batch_idx)
        metrics = {'val_loss': loss, 'val_accuracy': acc}
        self.log_dict(metrics, prog_bar=True)
        return metrics

    def _shared_eval_step(self, batch, batch_idx):
        images, labels = batch
        out = self(images)
        loss = self.criterion(out, labels)
        accu = self.accuracy(out, labels)
        return loss, accu

    def configure_optimizers(self):
        if not self.optimizer:
            # default: SGD with a scheduler that watches validation accuracy
            optimizer = optim.SGD(self.net.parameters(), lr=0.001, momentum=0.9)
            plateau_scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', patience=3, verbose=True)
            return {
                "optimizer": optimizer,
                "lr_scheduler": {
                    "scheduler": plateau_scheduler,
                    "monitor": "val_accuracy"
                }
            }
        # user-supplied optimizer and scheduler, watching validation loss
        return {
            "optimizer": self.optimizer,
            "lr_scheduler": {
                "scheduler": self.scheduler,
                "monitor": "val_loss"
            }
        }


# load data loaders and parameters
# …
# get pretrained
net = models.resnet18(pretrained=True)
# build the module first: it swaps in a new fc head, and the optimizer must see those parameters
clf_resnet18 = LightningResNet(net, num_classes=len(classes), device=device)
# create and load the optimizers
optimizer = optim.SGD(clf_resnet18.parameters(), lr=lr, momentum=momentum)
plateau_scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=plateau_lr_decrease, verbose=True)
clf_resnet18.optimizer = optimizer
clf_resnet18.scheduler = plateau_scheduler
# prepare callbacks
callbacks = [pl.callbacks.EarlyStopping("val_accuracy", mode='max', patience=patience)]
logger = pl.loggers.CSVLogger("logs", name="resnet18-a1-transfer-logs", version=0)
# prepare trainer and launch!
# (gpus=1 is the PyTorch Lightning 1.x argument; newer releases use devices=1)
trainer_resnet18 = pl.Trainer(accelerator="auto", gpus=1, callbacks=callbacks, max_epochs=max_epochs, log_every_n_steps=1, logger=logger)
trainer_resnet18.fit(clf_resnet18, train_dataloaders=dataloaders['train'], val_dataloaders=dataloaders['val'])
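One detail in the code above is easy to get wrong: the scheduler mode has to match the monitored metric ('max' for accuracy, 'min' for loss). The sketch below is a simplified stand-in that I wrote to illustrate the idea. It is not PyTorch's ReduceLROnPlateau implementation (it has no threshold or cooldown handling), and the accuracy values are made up.

```python
class PlateauTracker:
    """Simplified reduce-on-plateau rule: cut the LR after more than `patience` bad epochs."""

    def __init__(self, lr, mode, patience=3, factor=0.1):
        self.lr = lr
        self.mode = mode
        self.patience = patience
        self.factor = factor
        self.best = None
        self.bad_epochs = 0

    def step(self, metric):
        improved = (
            self.best is None
            or (self.mode == "max" and metric > self.best)
            or (self.mode == "min" and metric < self.best)
        )
        if improved:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


accuracies = [0.50, 0.55, 0.60, 0.65, 0.70, 0.72]  # made-up, steadily improving
right = PlateauTracker(lr=0.001, mode="max")
wrong = PlateauTracker(lr=0.001, mode="min")
for acc in accuracies:
    right.step(acc)
    wrong.step(acc)

print(right.lr)  # unchanged, because accuracy kept improving
print(wrong.lr)  # reduced, because 'min' treats rising accuracy as getting worse
```

With the wrong mode, a model that is steadily improving gets its learning rate cut anyway, which can quietly stall training.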

In one set of runs, I freeze the feature extraction layers and only train the heads of the models. This is to ensure the model does not ‘catastrophically forget’ the breadth of its earlier ImageNet training. I repeat the runs for ResNet34 and 50 to find out the best classifier. To squeeze out a little extra performance, I then enable the training of all layers, while keeping the learning rate very low.

In another set of runs, I try out multi-task classification. The first task is predicting the period and the second is the artist. To do this, you must modify the DataLoader to output two labels while also modifying the forward function of your PyTorch Lightning model. Here’s the relevant snippet:

# following
class LightningResNetMultiLabel(pl.LightningModule):
    def __init__(self, net, n_period, n_artists, criterion=F.cross_entropy, optimizer=None,
                 scheduler=None, dropout_p=0., lr=0.001, freeze_net=False):
        super().__init__()
        self.net = net
        # everything except the final fc layer acts as the shared feature extractor
        self.feature_extractor = nn.Sequential(*(list(self.net.children())[:-1]))
        if freeze_net:
            for param in self.net.parameters():
                param.requires_grad = False
        num_ftrs = net.fc.in_features
        # one linear head per task
        self.period_fc = nn.Sequential(
            nn.Linear(in_features=num_ftrs, out_features=n_period)
        )
        self.artist_fc = nn.Sequential(
            nn.Linear(in_features=num_ftrs, out_features=n_artists)
        )
        self.loss_func = criterion
        self.optimizer = optimizer
        self.scheduler = scheduler
        self.learning_rate = lr

    def criterion(self, loss_func, outputs, inputs):
        # total loss is the sum of the per-task losses
        losses = 0
        for key in outputs:
            losses += loss_func(outputs[key], inputs[f'{key}_label'])
        return losses

    def forward(self, x):
        x = self.feature_extractor(x)
        x = torch.flatten(x, 1)
        return {
            'period': self.period_fc(x),
            'artist': self.artist_fc(x),
        }

    def _shared_eval_step(self, batch, batch_idx):
        images = batch["image"]
        period_labels = batch["period_label"]
        artist_labels = batch["artist_label"]
        out = self(images)
        out_period = out["period"]
        out_artist = out["artist"]
        loss = self.criterion(self.loss_func, out, batch)
        period_accu = self.accuracy(out_period, period_labels)
        artist_accu = self.accuracy(out_artist, artist_labels)
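Since the multi-task criterion adds up one cross-entropy term per head, its values sit on a different scale from a single-task loss. Here is a small plain-Python sketch of that sum for a single sample. The class counts (4 periods, 6 artists) and the logits are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log softmax(logits)[target]."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target]


def multitask_loss(outputs, labels):
    """Sum the per-task losses, using the same '<task>_label' key convention."""
    return sum(cross_entropy(outputs[key], labels[f"{key}_label"]) for key in outputs)


# An undecided network: all periods and all artists look equally likely.
outputs = {"period": [0.0] * 4, "artist": [0.0] * 6}
labels = {"period_label": 2, "artist_label": 5}

print(multitask_loss(outputs, labels))  # ln(4) + ln(6)
print(math.log(4) + math.log(6))        # same value, for comparison
```

So a summed multi-task loss will typically be larger than a period-only loss, and the two are better compared through the per-task accuracies.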

It’s also helpful to know that PyTorch Lightning comes with its own learning rate finder. Simply pass auto_lr_find=True to the Trainer and call trainer_resnet18.tune. I used this in the multi-task experiments and in the runs that did not freeze the feature extractor layers.


Run   Model                                             Loss    Accuracy
1     ResNet 18 (11.4 M params)                         0.697   0.722
2     ResNet 34 (21.5 M params)                         0.720   0.722
3     ResNet 50 (23.9 M params)                         0.730   0.733
4     ResNet 18 – Train All Layers after top training
5     ResNet 18 – Train All Layers immediately          1.018   0.600
6     ResNet 18 – Multi-Task                            4.947   0.556
7     ResNet 34 – Multi-Task                            4.867   0.472
For Runs 1-3, the metrics are nearly the same, which suggests we can use the simplest model (11.4 M parameters counts as simplest in this case!). I reach the best metrics in Run 4. Run 5, which trains all the layers from the start, proves a little less accurate than the best run. Runs 6-7 are a letdown, and their accuracy on the artists is also dismal.

Confusion matrix of Run 4

The confusion matrix reveals some difficult examples of cross-over from adjacent periods. For example, early renaissance works are predicted as medieval, and romanticism is predicted most of the time as baroque. I’m not sure why some baroque pieces are predicted as medieval, however. Perhaps a gallery of errors will help.
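For readers who have not built one by hand, a confusion matrix is just a table of counts indexed by true and predicted class. Here is a minimal plain-Python sketch. The labels and predictions are made up for illustration and are not taken from my runs.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows are the true class, columns are the predicted class."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def per_class_recall(matrix):
    """Share of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


classes = ["baroque", "medieval", "romanticism"]
y_true = ["baroque", "baroque", "medieval", "romanticism", "romanticism", "romanticism"]
y_pred = ["baroque", "medieval", "medieval", "baroque", "baroque", "romanticism"]

matrix = confusion_matrix(y_true, y_pred, classes)
for name, row in zip(classes, matrix):
    print(name, row)
print(per_class_recall(matrix))
```

Reading along a row shows where the samples of one period end up. In this toy data, two of the three romanticism samples land in the baroque column.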

Got it. The romanticism errors are portraits, a genre that was very popular in the baroque period. As for the baroque pieces misclassified as medieval, I’m not sure. Perhaps the network sees something religious in the long hair. It should be noted, though, that these are mostly portraits of Titus, Rembrandt’s only son who reached adulthood.


Here, I use the best run’s parameters to visualize what the network is doing. After some fiddling around, I landed on using Guided Backpropagation. I have the following for consideration:

The model rightly focuses on the edges and the face of the lady on the left.
Here we see that the eyes are the sole focus.
In this portrait as well, the face lights up.
Here, we see houses as the focus, something that Bruegel is known for.
It’s interesting that the clothed maja has the gradients scattered, while the nude maja only has the face. The folds of clothing are a major element of classicism, which baroque was trying to revive.
We see the distinct brushstrokes of the impressionists. It appears that the network uses blob-like elements as features.

Update: Train on All Paintings

I did a follow-up set of experiments to see how well ResNet-18 works on the full set of about 33,000 paintings. I used stratified sampling. Going by the counts below, 20% of the samples are held out for the test set, and 20% of the remainder go to validation.

Period                 Train   Validation   Test   Total
Early Renaissance      2537    634          793    3964
Northern Renaissance   2068    517          646    3231
High Renaissance       1358    340          425    2123
Art Nouveau            155     39           49     243
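The counts in the table are consistent with a two-stage stratified split: 20% of each class goes to the test set, then 20% of what remains goes to validation (for Early Renaissance, 793 of 3964 and 634 of the remaining 3171). Below is a minimal plain-Python sketch of such a split. It is not the exact code from the notebook, the function name and the toy class sizes are made up, and libraries such as scikit-learn may round the per-class counts differently.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_frac=0.2, val_frac=0.2, seed=0):
    """Per class: hold out test_frac for test, then val_frac of the rest for validation."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    split = {"train": [], "val": [], "test": []}
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_frac)
        n_val = round((len(indices) - n_test) * val_frac)
        split["test"] += indices[:n_test]
        split["val"] += indices[n_test:n_test + n_val]
        split["train"] += indices[n_test + n_val:]
    return split


labels = ["Baroque"] * 100 + ["Art Nouveau"] * 25  # toy class sizes
split = stratified_split(labels)
print({name: len(idx) for name, idx in split.items()})  # {'train': 80, 'val': 20, 'test': 25}
```

Because the split is done class by class, even a small class like Art Nouveau keeps the same proportions in all three sets.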

For these runs, I freeze only a part of ResNet and let the rest train. Below, Run 1 means the first block is frozen and the rest are trained. Run 2 means that the first and second blocks are frozen, and so on. This form of transfer learning keeps the low-level features learned from the larger ImageNet data, while creating new hierarchies of features higher up the model. Here’s the Kaggle Notebook.

Run   Model                          Loss    Accuracy
1     ResNet 18 – Freeze up to 1st   2.092   0.380
2     ResNet 18 – Freeze up to 2nd   1.777   0.524
3     ResNet 18 – Freeze up to 3rd   1.966   0.552
4     ResNet 18 – Freeze up to 4th   2.612   0.069

There are significant differences between the runs, and in this case, freezing up to either the 2nd or the 3rd block works well. This means that the higher convolutional blocks are free to create features that are well suited for this domain.

Run 2’s confusion matrix

Like the confusion matrix of our first experiments, the misclassifications tend to fall on adjacent periods. In this case, there are classes with zero correct predictions. Neoclassicism, realism, and art nouveau are not recognized at all, while baroque is predicted everywhere. Perhaps this is expected, since the class distribution is heavily skewed toward baroque.


This has been a fun exercise. I’m quite proud of where I stand today, in a field that has become very active since the last time I used traditional computer vision features. I wish I could pat my younger self on the back and tell him to plow on, because the field gets very exciting!

Some learnings, to close:

  • The simplest model wins, but perhaps if I use the larger paintings data, then the capacity of ResNet50 wins.
  • For this exercise, I find that PyTorch Lightning is a big boost to your training. It makes your model and computation organized into tidy and reusable blocks.
  • Dropping the learned weights (or at least part of it), then retraining on all the images from all artists might also work well.
  • I did not get the multi-task learning to work well. Perhaps more data is required for this. Or more tuning.
  • For visualization, I’ve tried image generation, but the images did not make sense. For that to fly, GANs are the way to go for sure.

Thanks for reading!

By krsnewwave

I'm a software engineer and a data science guy on recommender systems, natural language processing, and computer vision.
