'Garbage In, Garbage Out' is a popular adage which says that the quality of the output is only as good as the quality of the input, i.e. your model is only as good as the data it is trained on. In this blog, we demonstrate this concept using Cleanlab, in particular Cleanlab Studio.
Motivation
To see is to believe, and we shall see, through experimentation, the importance of data quality in data science. TL;DR: we train a model for multi-task multi-class classification and find one of the tasks performing poorly. We suspect the data labels are ambiguous, so we fit the model again using the clean labels generated by Cleanlab Studio and compare the performance.
The Dataset
For the demonstration, we use CMU Face Images dataset.
- This data consists of 640 black and white face images of people.
- Each image can be characterized by the pose [straight, left, right, up], expression [neutral, happy, sad, angry], eyes [wearing sunglasses or not].
- This directory contains 20 subdirectories, one for each person, named by userid.
- There are 32 images for each person capturing every combination of features. 16 of the 640 images have glitches due to problems with the camera setup; these are the .bad images.
- Naming convention of images: `userid_pose_expression_eyes_scale.pgm`
- Images with scale 1 were selected, i.e. the full-resolution images (120x128), which follow the convention `userid_pose_expression_eyes.pgm` (see the parsing sketch below).
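As a quick illustration, here is a minimal sketch of how labels can be read off a filename under this convention (the example filename is hypothetical):

# Sketch: extracting labels from a scale-1 filename (hypothetical example)
fname = "an2i_left_happy_open.pgm"
userid, pose, expression, eyes = fname[:-len(".pgm")].split("_")
print(userid, pose, expression, eyes)  # an2i left happy open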
Vanilla Implementation
We create a custom dataset leveraging the PyTorch Dataset API for easy data loading, and use the naming convention to extract the labels.
# Code for custom dataset using PyTorch
import os
import glob

from PIL import Image
from torch.utils.data import Dataset


class CMUFaceDataset(Dataset):
    def __init__(self, data_dir, transform=None):
        self.data_dir = data_dir
        self.transform = transform
        # User IDs in the dataset
        self.user_ids = ['an2i', 'at33', 'boland', 'bpm', 'ch4f', 'cheyer', 'choon',
                         'danieln', 'glickman', 'karyadi', 'kawamura', 'kk49', 'megak',
                         'mitchell', 'night', 'phoebe', 'saavik', 'steffi', 'sz24', 'tammo']
        # Create user_id to index mapping for faster lookups
        self.user_to_idx = {user_id: idx for idx, user_id in enumerate(self.user_ids)}
        # Scan the dataset directory structure
        self._scan_dataset()

    def _scan_dataset(self):
        """Scan the dataset directory structure"""
        # Use glob to find all PGM files (more efficient than nested loops)
        pattern = os.path.join(self.data_dir, "*/*.pgm")
        all_image_paths = glob.glob(pattern)
        exp_size = len(all_image_paths)
        # Initialize data structures
        self.image_paths = [None] * exp_size
        self.pose = [None] * exp_size
        self.expression = [None] * exp_size
        self.eyes = [None] * exp_size
        self.images = [None] * exp_size
        # Process files sequentially to avoid thread safety issues
        valid_idx = 0
        for img_path in all_image_paths:
            # Extract user_id from path
            parts = os.path.normpath(img_path).split(os.sep)
            user_id = parts[-2]
            if user_id in self.user_to_idx:
                # Parse filename to extract metadata
                filename = os.path.basename(img_path)
                parts = filename.split('_')
                if len(parts) >= 4:
                    # Store metadata and path
                    self.image_paths[valid_idx] = img_path
                    self.pose[valid_idx] = parts[1]
                    self.expression[valid_idx] = parts[2]
                    self.eyes[valid_idx] = parts[3][:-4]
                    image = Image.open(img_path).convert('L')
                    if self.transform:
                        image = self.transform(image)
                    self.images[valid_idx] = image
                    valid_idx += 1
        # Trim lists to the valid size
        if valid_idx < exp_size:
            self.image_paths = self.image_paths[:valid_idx]
            self.pose = self.pose[:valid_idx]
            self.expression = self.expression[:valid_idx]
            self.eyes = self.eyes[:valid_idx]
            self.images = self.images[:valid_idx]

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        return self.images[idx], (self.pose[idx], self.expression[idx], self.eyes[idx])
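For reference, a minimal usage sketch (the directory name, resize target, and DataLoader settings are assumptions, not taken from the original post):

# Usage sketch: build the dataset and a DataLoader (paths and transform are assumptions)
from torch.utils.data import DataLoader
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize((64, 64)),  # assumed resize; the model's 128 * 8 * 8 flatten implies 64x64 inputs
    transforms.ToTensor(),
])
dataset = CMUFaceDataset("faces", transform=transform)  # "faces" is a hypothetical data directory
loader = DataLoader(dataset, batch_size=32, shuffle=True)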
We shall implement a fairly simple CNN-based architecture for the classification task. We train the model for 20 epochs with an 80:20 train-test split.
# Code for multitask face classifier using PyTorch
import torch.nn as nn
import torch.nn.functional as F


class MultitaskFaceClassifier(nn.Module):
    def __init__(self, num_poses=4, num_expressions=4, num_eye_states=2):
        super(MultitaskFaceClassifier, self).__init__()
        # Enhanced feature extraction with batch normalization
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        self.conv3 = nn.Conv2d(64, 128, 3, padding=1)
        self.bn3 = nn.BatchNorm2d(128)
        self.pool = nn.MaxPool2d(2, 2)
        # Shared features with batch norm
        self.fc_shared = nn.Linear(128 * 8 * 8, 512)
        self.bn_shared = nn.BatchNorm1d(512)
        self.dropout_shared = nn.Dropout(0.5)
        # Enhanced task-specific heads with intermediate layers
        # Pose head
        self.pose_layers = nn.Sequential(
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.4),
            nn.Linear(256, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, num_poses)
        )
        # Expression head
        self.expression_layers = nn.Sequential(
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.6),
            nn.Linear(256, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, num_expressions)
        )
        # Eyes head
        self.eyes_layers = nn.Sequential(
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.4),
            nn.Linear(256, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, num_eye_states)
        )

    def forward(self, x):
        x = self.pool(F.relu(self.bn1(self.conv1(x))))
        x = self.pool(F.relu(self.bn2(self.conv2(x))))
        x = self.pool(F.relu(self.bn3(self.conv3(x))))
        x = x.view(-1, 128 * 8 * 8)
        # Shared features with normalization
        features = self.fc_shared(x)
        features = self.bn_shared(features)
        features = F.relu(features)
        features = self.dropout_shared(features)
        # Task-specific outputs through enhanced heads
        pose_out = self.pose_layers(features)
        expression_out = self.expression_layers(features)
        eyes_out = self.eyes_layers(features)
        return {
            'pose': pose_out,
            'expression': expression_out,
            'eyes': eyes_out
        }, features
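The training loop is not shown in the post; the following is a minimal sketch of how the three task losses could be combined. The integer label encoding, equal loss weighting, Adam optimizer, and `train_loader` are assumptions, not details from the original post.

# Multi-task training sketch (assumptions: integer-encoded labels, equal loss weights, Adam optimizer)
import torch
import torch.nn as nn

model = MultitaskFaceClassifier(num_poses=4, num_expressions=4, num_eye_states=2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(20):
    for images, (pose, expression, eyes) in train_loader:  # train_loader assumed to yield integer-encoded labels
        optimizer.zero_grad()
        outputs, _ = model(images)
        # Sum the per-task cross-entropy losses with equal weights
        loss = (criterion(outputs['pose'], pose)
                + criterion(outputs['expression'], expression)
                + criterion(outputs['eyes'], eyes))
        loss.backward()
        optimizer.step()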
Following are the plots for epoch vs. loss, the confusion matrix, and a prediction sample. We consider accuracy as the evaluation metric.
Note: The loss curve for expression does not look right.
- The validation loss does not follow a decreasing pattern; in fact, it shows a slight increase. This raises a concern.
- The loss is ~1.42 after 10 epochs, which is as good as a random guess. (Why?)
Mathematical Reasoning
For a classification model with $K$ classes that performs as well as a random guess, the expected cross-entropy loss is $\log(K)$.
The cross-entropy loss for a single prediction is defined as,
$$H(y, \hat{y}) = -\sum_{i=1}^{K} y_i \log(\hat{y}_i)$$
where,
$y_i$ is the true probability for class $i$ ($y_i = 1$ for the correct class and $y_i = 0$ for all other classes)
$\hat{y}_i$ is the predicted probability for class $i$
For a random classifier with $K$ classes, the model assigns equal probability to each class, $\hat{y}_i = \frac{1}{K}$ for all classes $i$.
When the true class is class $j$, we have $y_j = 1$ and $y_i = 0$ for all $i \neq j$. Therefore,
$$H(y, \hat{y}) = -\sum_{i=1}^{K} y_i \log(\hat{y}_i) = -y_j \log(\hat{y}_j) = -1 \cdot \log\left(\frac{1}{K}\right)= \log(K)$$
Since this calculation holds for any true class $j$, and assuming a balanced dataset where each class appears with equal frequency $\frac{1}{K}$, the expected cross-entropy loss is given by,
$$\mathbb{E}[H(y, \hat{y})] = \sum_{j=1}^{K} P(\text{true class} = j) \cdot H(y_j, \hat{y}) = \sum_{j=1}^{K} \frac{1}{K} \cdot \log(K) = \log(K)$$
Thus, the expected cross-entropy loss for a random classifier with $4$ classes (taking the natural logarithm, as PyTorch's cross-entropy does) is $\mathbb{E}[H] = \ln(4) \approx 1.386$.
This result represents the baseline performance that any classification model should exceed. A model performing worse than this baseline is actually performing worse than random guessing.
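A quick numerical sanity check of this baseline (a sketch using PyTorch's cross-entropy on uniform logits):

# Sanity check: cross-entropy of a uniform predictor over 4 classes equals ln(4)
import torch
import torch.nn.functional as F

logits = torch.zeros(1, 4)   # equal logits -> uniform softmax probabilities
target = torch.tensor([0])   # any true class yields the same loss
print(F.cross_entropy(logits, target).item())  # ~1.3863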
The confusion matrix shows that pose and eyes achieve 97.6% accuracy, which is good. As expected, expression achieves only 16.8%, which shall be our focus going forward.
The labels in green indicate correct predictions while those in red indicate wrong predictions.
Cleanlab
Introduction
Cleanlab Studio is an AI-powered data curation tool used to improve the quality of data and of the resulting models/analytics. The Studio offers three workflows: a web interface, a Python API, and a command-line interface. We used the web interface for this demonstration.
The usage is pretty straightforward:
- Upload a dataset
- Create a project (their AI analyzes the data)
- Review detected data issues
- Export a cleanset (the cleaned dataset)
The Studio does not support our classification task, i.e. multi-task multi-class classification. But since we only treat the expression task, multi-class classification suffices for the demonstration.
Analysis of Report
We download the cleanset and use it to create new labels for the expression task. But before that, let's look at the ready-made analytics provided by the Studio.
It is easy to see that there is a strong inverse correlation between the percentage of label issues (label issue, outlier, ambiguous) combined per class and the accuracy per class.
Now, we use the cleanset to create new labels by following the steps below.
- Create two copies of the `cleanlab_suggested_label` column, i.e. `cleanlab_suggested_label_original` and `cleanlab_suggested_label_pred`, which essentially contain the suggestions made by Cleanlab.
- The column `cleanlab_suggested_label` has blank cells wherever [Condition 1] `cleanlab_action` is not `unresolved` (i.e. it is resolved), OR [Condition 2] at least one of the columns `cleanlab_is_outlier` or `cleanlab_is_ambiguous` is `TRUE` while `cleanlab_is_label_issue` is `FALSE`.
- Impute the original label for `expression` where [Condition 1] is TRUE, for both copies. (Why?)
- Impute the original label and `cleanlab_predicted_label` into `cleanlab_suggested_label_original` and `cleanlab_suggested_label_pred` respectively where [Condition 2] is TRUE.
# Make labels based on cleanset exported from Cleanlab Studio
import pandas as pd
df = pd.read_csv('cleanlab-faces-expr.csv')
# Make the copies of the suggested label column
df['cleanlab_suggested_label_original'] = df['cleanlab_suggested_label']
df['cleanlab_suggested_label_pred'] = df['cleanlab_suggested_label']
# [Condition 1]
# Replace values in 'cleanlab_suggested_label_original' and 'cleanlab_suggested_label_pred'
# where 'cleanlab_action' is not 'unresolved'
mask = df['cleanlab_action'] != 'unresolved'
df.loc[mask, 'cleanlab_suggested_label_original'] = df.loc[mask, 'expression']
df.loc[mask, 'cleanlab_suggested_label_pred'] = df.loc[mask, 'expression']
# [Condition 2]
# Replace values based on the specified conditions
not_out_amb = ~((df['cleanlab_is_outlier'] == False) & (df['cleanlab_is_ambiguous'] == False))
no_label_issue = df['cleanlab_is_label_issue'] == False
mask = not_out_amb & no_label_issue
df.loc[mask, 'cleanlab_suggested_label_original'] = df.loc[mask, 'expression']
df.loc[mask, 'cleanlab_suggested_label_pred'] = df.loc[mask, 'cleanlab_predicted_label']
# Save the modified dataframe back to CSV
df.to_csv('modified-cleanlab-faces-expr.csv', index=False)
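As a quick sanity check (a sketch, using the columns created above), we can count how many expression labels actually changed under each variant:

# Sketch: count how many expression labels differ from the originals after the imputation above
changed_original = (df['cleanlab_suggested_label_original'] != df['expression']).sum()
changed_pred = (df['cleanlab_suggested_label_pred'] != df['expression']).sum()
print(f"Labels changed (original variant): {changed_original}")
print(f"Labels changed (prediction variant): {changed_pred}")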
Hypothesis
Our hypothesis is that the labels for expression are too ambiguous. The way this data was created is: given a tuple (pose, expression, eyes), the person captured an image with the said expression, pose, and eye state. It is natural to suspect that if the pose is not straight or the eyes are behind sunglasses, the expression might not be captured faithfully. There is a clear directed dependence of expression on pose and eyes (not the other way around).
Re-Implementation
We use the new set of labels to fit our model again and compare the results. We create another custom dataset for the new labels. This dataset can also be used for the base case by not passing a dataframe, which makes the previous one redundant, but I kept both for the sake of demonstration.
# Code for custom dataset (with labels using Cleanlab) using PyTorch
class CMUFaceDatasetCleanLab(Dataset):
    def __init__(self, data_dir, transform=None, df=None, col='expression'):
        self.data_dir = data_dir
        self.transform = transform
        self.df = df
        # Create a mapping from image filename to new label if df is provided
        self.expression_mapping = {}
        if df is not None:
            # Convert dataframe to a dictionary for faster lookups
            self.expression_mapping = dict(zip(df['image'], df[col]))
        # User IDs in the dataset
        self.user_ids = ['an2i', 'at33', 'boland', 'bpm', 'ch4f', 'cheyer', 'choon',
                         'danieln', 'glickman', 'karyadi', 'kawamura', 'kk49', 'megak',
                         'mitchell', 'night', 'phoebe', 'saavik', 'steffi', 'sz24', 'tammo']
        # Create user_id to index mapping for faster lookups
        self.user_to_idx = {user_id: idx for idx, user_id in enumerate(self.user_ids)}
        # Scan the dataset directory structure
        self._scan_dataset()

    def _scan_dataset(self):
        """Scan the dataset directory structure"""
        # Use glob to find all PGM files (more efficient than nested loops)
        pattern = os.path.join(self.data_dir, "*/*.pgm")
        all_image_paths = glob.glob(pattern)
        exp_size = len(all_image_paths)
        # Initialize data structures
        self.image_paths = [None] * exp_size
        self.pose = [None] * exp_size
        self.expression = [None] * exp_size
        self.eyes = [None] * exp_size
        self.images = [None] * exp_size
        # Process files sequentially to avoid thread safety issues
        valid_idx = 0
        for img_path in all_image_paths:
            parts = os.path.normpath(img_path).split(os.sep)
            user_id = parts[-2]
            filename = os.path.basename(img_path)
            if user_id in self.user_to_idx:
                # Parse filename to extract metadata
                parts = filename.split('_')
                if len(parts) >= 4:
                    # Use the mapped expression if available, otherwise use the original
                    if filename in self.expression_mapping:
                        expression = self.expression_mapping[filename]
                    else:
                        expression = parts[2]
                    # Store metadata and path
                    self.image_paths[valid_idx] = img_path
                    self.pose[valid_idx] = parts[1]
                    self.expression[valid_idx] = expression
                    self.eyes[valid_idx] = parts[3][:-4]
                    image = Image.open(img_path).convert('L')
                    if self.transform:
                        image = self.transform(image)
                    self.images[valid_idx] = image
                    valid_idx += 1
        # Trim lists to the valid size
        if valid_idx < exp_size:
            self.image_paths = self.image_paths[:valid_idx]
            self.pose = self.pose[:valid_idx]
            self.expression = self.expression[:valid_idx]
            self.eyes = self.eyes[:valid_idx]
            self.images = self.images[:valid_idx]

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        return self.images[idx], (self.pose[idx], self.expression[idx], self.eyes[idx])
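A minimal usage sketch (the data directory, transform, and column choice are assumptions consistent with the code above):

# Sketch: build the dataset using the Cleanlab-suggested labels
import pandas as pd

# the cleanset CSV is assumed to contain an 'image' column with the original filenames
df_clean = pd.read_csv('modified-cleanlab-faces-expr.csv')
dataset_clean = CMUFaceDatasetCleanLab("faces", transform=transform,  # "faces" is a hypothetical directory
                                       df=df_clean, col='cleanlab_suggested_label_original')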
Following are the plots for epoch vs. loss, the confusion matrix, and a prediction sample.
Note: The loss curve for expression looks much better now in both cases. The validation loss follows a decreasing pattern, in fact a slight decrease. This is a good sign. Also, the loss is ~1.25 (< 1.386) after 10 epochs, which is an improvement over the random-guess baseline.
The following can be concluded from the confusion matrix:
- The accuracy for `pose` has dipped a little in both cases.
- The accuracy for `eyes` has remained almost unchanged in both cases.
- The accuracy for `expression` has improved by ~160%, from 16.8% to ~44%, in both cases.
Note: The distribution of expression classes has changed, with the neutral class now being over-represented. (Why?)
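One can verify this shift with a quick check of the label counts (a sketch using the dataframe created earlier):

# Sketch: compare expression class distributions before and after relabelling
print(df['expression'].value_counts())                         # original labels
print(df['cleanlab_suggested_label_original'].value_counts())  # Cleanlab-derived labels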
Results
| Label | pose | expression | eyes |
|---|---|---|---|
| Original | 97.6% | 16.8% | 97.6% |
| Cleanlab Original | 93.6% | 44.0% | 98.4% |
| Cleanlab Prediction | 94.4% | 44.8% | 97.6% |
Conclusion
Badly labelled data will take you only so far.