Himanshu

Cleanlab Saves The Day: An Implementation

May 2025

cleanlab data-quality

'Garbage in, garbage out' is a popular adage which says that the quality of the output is only as good as the quality of the input, i.e. your model is only as good as the data it is trained on. In this blog, we demonstrate this concept using Cleanlab, in particular Cleanlab Studio.

Motivation

To see is to believe, and we shall see through experimentation the importance of data quality in data science. TL;DR: we train a model for multi-task, multi-class classification and find that one of the tasks performs poorly. We suspect the data labels are ambiguous, so we fit the model again using the clean labels generated by Cleanlab Studio and compare the performance.

The Dataset

For the demonstration, we use the CMU Face Images dataset.

Sample images from the CMU Face Images dataset [pose; expression; eyes]

Vanilla Implementation

We create a custom dataset using the PyTorch Dataset API for easy data loading and rely on the file naming convention to extract the labels.

# Code for custom dataset using PyTorch
import os
import glob

from PIL import Image
from torch.utils.data import Dataset


class CMUFaceDataset(Dataset):
    def __init__(self, data_dir, transform=None):
        self.data_dir = data_dir
        self.transform = transform

        # User IDs in the dataset
        self.user_ids = ['an2i', 'at33', 'boland', 'bpm', 'ch4f', 'cheyer', 'choon',
                   'danieln', 'glickman', 'karyadi', 'kawamura', 'kk49', 'megak',
                   'mitchell', 'night', 'phoebe', 'saavik', 'steffi', 'sz24', 'tammo']

        # Create user_id to index mapping for faster lookups
        self.user_to_idx = {user_id: idx for idx, user_id in enumerate(self.user_ids)}

        # Scan the dataset directory structure
        self._scan_dataset()

    def _scan_dataset(self):
        """Scan the dataset directory structure"""
        # Use glob to find all PGM files (more efficient than nested loops)
        pattern = os.path.join(self.data_dir, "*/*.pgm")
        all_image_paths = glob.glob(pattern)

        exp_size = len(all_image_paths)

        # Initialize data structures
        self.image_paths = [None] * exp_size
        self.pose = [None] * exp_size
        self.expression = [None] * exp_size
        self.eyes = [None] * exp_size
        self.images = [None] * exp_size

        # Process files sequentially to avoid thread safety issues
        valid_idx = 0
        for img_path in all_image_paths:
            # Extract user_id from path
            parts = os.path.normpath(img_path).split(os.sep)
            user_id = parts[-2]

            if user_id in self.user_to_idx:
                # Parse filename to extract metadata
                filename = os.path.basename(img_path)
                parts = filename.split('_')

                if len(parts) >= 4:
                    # Store metadata and path
                    self.image_paths[valid_idx] = img_path
                    self.pose[valid_idx] = parts[1]
                    self.expression[valid_idx] = parts[2]
                    self.eyes[valid_idx] = parts[3][:-4]

                    image = Image.open(img_path).convert('L')
                    if self.transform:
                        image = self.transform(image)
                    self.images[valid_idx] = image
                    valid_idx += 1

        # Trim lists to the valid size
        if valid_idx < exp_size:
          self.image_paths = self.image_paths[:valid_idx]
          self.pose = self.pose[:valid_idx]
          self.expression = self.expression[:valid_idx]
          self.eyes = self.eyes[:valid_idx]
          self.images = self.images[:valid_idx]

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        return self.images[idx], (self.pose[idx], self.expression[idx], self.eyes[idx])

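A minimal usage sketch is shown below. The data directory name ("faces"), the 64x64 resize (which matches the 128 * 8 * 8 flattened size expected by the classifier defined next), and the batch size are illustrative assumptions, not the exact values from the original run.

# Sketch: load the dataset and wrap it in a DataLoader (assumed transform, path, and batch size)
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.Resize((64, 64)),  # assumed input size; gives 128 * 8 * 8 features after three poolings
    transforms.ToTensor(),
])

dataset = CMUFaceDataset("faces", transform=transform)  # "faces" is a placeholder data directory
loader = DataLoader(dataset, batch_size=32, shuffle=True)

images, (pose, expression, eyes) = next(iter(loader))
print(images.shape)  # torch.Size([32, 1, 64, 64])
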
We implement a fairly simple CNN-based architecture for the classification task and train the model for 20 epochs with an 80:20 train-test split; a sketch of the training loop follows the model definition below.

# Code for multitask face classifier using PyTorch
import torch.nn as nn
import torch.nn.functional as F


class MultitaskFaceClassifier(nn.Module):
    def __init__(self, num_poses=4, num_expressions=2, num_eye_states=2):
        super(MultitaskFaceClassifier, self).__init__()

        # Enhanced feature extraction with batch normalization
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        self.conv3 = nn.Conv2d(64, 128, 3, padding=1)
        self.bn3 = nn.BatchNorm2d(128)
        self.pool = nn.MaxPool2d(2, 2)

        # Shared features with batch norm
        self.fc_shared = nn.Linear(128 * 8 * 8, 512)
        self.bn_shared = nn.BatchNorm1d(512)
        self.dropout_shared = nn.Dropout(0.5)

        # Enhanced task-specific heads with intermediate layers
        # Pose head
        self.pose_layers = nn.Sequential(
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.4),
            nn.Linear(256, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, num_poses)
        )

        # Expression head
        self.expression_layers = nn.Sequential(
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.6),
            nn.Linear(256, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, num_expressions)
        )

        # Eyes head
        self.eyes_layers = nn.Sequential(
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.4),
            nn.Linear(256, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, num_eye_states)
        )

    def forward(self, x):
        x = self.pool(F.relu(self.bn1(self.conv1(x))))
        x = self.pool(F.relu(self.bn2(self.conv2(x))))
        x = self.pool(F.relu(self.bn3(self.conv3(x))))
        x = x.view(-1, 128 * 8 * 8)

        # Shared features with normalization
        features = self.fc_shared(x)
        features = self.bn_shared(features)
        features = F.relu(features)
        features = self.dropout_shared(features)

        # Task-specific outputs through enhanced heads
        pose_out = self.pose_layers(features)
        expression_out = self.expression_layers(features)
        eyes_out = self.eyes_layers(features)

        return {
            'pose': pose_out,
            'expression': expression_out,
            'eyes': eyes_out
        }, features

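For completeness, here is a minimal sketch of the training setup described above, continuing from the dataset and transform defined earlier. The label vocabularies, the optimizer, the equal weighting of the three task losses, and the choice of four expression classes (overriding the class default of two) are assumptions made for illustration; the original training code may differ.

# Sketch: multi-task training loop (assumed label vocabularies, optimizer, and equal loss weights)
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, random_split

# Map the string labels returned by CMUFaceDataset to class indices
POSE = {'straight': 0, 'left': 1, 'right': 2, 'up': 3}
EXPRESSION = {'neutral': 0, 'happy': 1, 'sad': 2, 'angry': 3}
EYES = {'open': 0, 'sunglasses': 1}

def encode(labels, vocab):
    return torch.tensor([vocab[label] for label in labels])

# 80:20 train-test split
train_size = int(0.8 * len(dataset))
train_set, test_set = random_split(dataset, [train_size, len(dataset) - train_size])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True, drop_last=True)

model = MultitaskFaceClassifier(num_poses=4, num_expressions=4, num_eye_states=2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(20):
    model.train()
    for images, (pose, expression, eyes) in train_loader:
        outputs, _ = model(images)
        # Sum of the three task losses (equal weighting assumed)
        loss = (criterion(outputs['pose'], encode(pose, POSE))
                + criterion(outputs['expression'], encode(expression, EXPRESSION))
                + criterion(outputs['eyes'], encode(eyes, EYES)))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
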
Following are the plots for epoch vs. loss, the confusion matrix, and a prediction sample. We use accuracy as the evaluation metric.

Epoch vs Loss curve for CNN Classifier

Note: The loss curve for expression does not look right.

Mathematical Reasoning

For a classification model with $K$ classes that performs only as well as a random guess, the expected cross-entropy loss is $\log(K)$.

The cross-entropy loss for a single prediction is defined as,

$$H(y, \hat{y}) = -\sum_{i=1}^{K} y_i \log(\hat{y}_i)$$

where,

$y_i$ is the true probability for class $i$ ($y_i = 1$ for the correct class and $y_i = 0$ for all other classes)

$\hat{y}_i$ is the predicted probability for class $i$

For a random classifier with $K$ classes, the model assigns equal probability to each class, i.e. $\hat{y}_i = \frac{1}{K}$ for all $i$.

When the true class is class $j$, we have $y_j = 1$ and $y_i = 0$ for all $i \neq j$. Therefore,

$$H(y, \hat{y}) = -\sum_{i=1}^{K} y_i \log(\hat{y}_i) = -y_j \log(\hat{y}_j) = -1 \cdot \log\left(\frac{1}{K}\right)= \log(K)$$

Since this calculation holds for any true class $j$, and assuming a balanced dataset where each class appears with equal frequency $\frac{1}{K}$, the expected cross-entropy loss is given by,

$$\mathbb{E}[H(y, \hat{y})] = \sum_{j=1}^{K} P(\text{true class} = j) \cdot \log(K) = \sum_{j=1}^{K} \frac{1}{K} \cdot \log(K) = \log(K)$$

Thus, the expected cross-entropy loss for a random classifier with $4$ classes is $\mathbb{E}[H] = \ln(4) \approx 1.386$.

This value is the baseline that any classification model should beat, i.e. its loss should fall below $\log(K)$. A model whose loss sits at or above this value is doing no better than random guessing.
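
As a quick sanity check (a small sketch, not part of the original experiment), the same number falls out of PyTorch's cross-entropy on uniform predictions:

# Sketch: cross-entropy of a uniform prediction over 4 classes equals log(4)
import math
import torch
import torch.nn.functional as F

logits = torch.zeros(1, 4)  # zero logits -> uniform softmax over the 4 classes
loss = F.cross_entropy(logits, torch.tensor([0]))  # the target class does not matter here
print(round(loss.item(), 3), round(math.log(4), 3))  # 1.386 1.386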

Confusion Matrix for CNN Classifier

The confusion matrix shows that pose and eyes achieve 97.6% accuracy, which is good. As suspected, expression achieves only 16.8%, and this is what we focus on going forward.

Prediction sample for CNN Classifier

The labels in green indicate correct predictions, while those in red indicate wrong predictions.

Cleanlab

Introduction

Cleanlab Studio is an AI-powered data curation tool used to improve the quality of data and, consequently, the resulting models and analytics. The Studio offers three workflows: a Web Interface, a Python API, and a Command Line. We used the Web Interface for this demonstration.

The usage is pretty straightforward.

The Studio does not support our exact classification task, i.e. multi-task, multi-class classification. But since we only treat the expression task, multi-class classification suffices for the demonstration.

Analysis of Report

We download the cleanset and use it to create new labels for the expression task. But before that, let's look at the ready-made analytics provided by the Studio.

Cleanlab Suggested Labels
Cleanlab Issue Count by Class

It is easy to see that there is a strong inverse correlation between the percentage of flagged examples per class (label issues, outliers, and ambiguous examples combined) and the accuracy per class.

Accuracy vs Label Issues (in %)
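
The per-class issue rate behind this plot can be recomputed directly from the exported cleanset. The sketch below uses the same file and boolean issue columns that the relabelling script further down relies on; whether those columns are parsed as booleans or 0/1 integers depends on the export, but either works here.

# Sketch: fraction of examples per expression class flagged as label issue, outlier, or ambiguous
import pandas as pd

df = pd.read_csv('cleanlab-faces-expr.csv')
flagged = (df['cleanlab_is_label_issue']
           | df['cleanlab_is_outlier']
           | df['cleanlab_is_ambiguous'])
issue_pct = flagged.groupby(df['expression']).mean().mul(100).round(1)
print(issue_pct)  # compare against the per-class accuracies from the confusion matrix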

Now, we use the cleanset to create new labels by following the steps below.

# Make labels based on cleanset exported from Cleanlab Studio
import pandas as pd
df = pd.read_csv('cleanlab-faces-expr.csv')

# Make the copies of the suggested label column
df['cleanlab_suggested_label_original'] = df['cleanlab_suggested_label']
df['cleanlab_suggested_label_pred'] = df['cleanlab_suggested_label']

# [Condition 1]
# Replace values in 'cleanlab_suggested_label_original' and 'cleanlab_suggested_label_pred' 
# where 'cleanlab_action' is not 'unresolved'
mask = df['cleanlab_action'] != 'unresolved'
df.loc[mask, 'cleanlab_suggested_label_original'] = df.loc[mask, 'expression']
df.loc[mask, 'cleanlab_suggested_label_pred'] = df.loc[mask, 'expression']

# [Condition 2]
# For rows flagged as an outlier or ambiguous but without a label issue:
# keep the original label in the '_original' variant and use Cleanlab's
# predicted label in the '_pred' variant
not_out_amb = ~((df['cleanlab_is_outlier'] == False) & (df['cleanlab_is_ambiguous'] == False))
no_label_issue = df['cleanlab_is_label_issue'] == False
mask = not_out_amb & no_label_issue

df.loc[mask, 'cleanlab_suggested_label_original'] = df.loc[mask, 'expression']
df.loc[mask, 'cleanlab_suggested_label_pred'] = df.loc[mask, 'cleanlab_predicted_label']

# Save the modified dataframe back to CSV
df.to_csv('modified-cleanlab-faces-expr.csv', index=False)
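
A quick way to see how many labels each variant actually changes relative to the original annotation (continuing with the same dataframe):

# Count how many expression labels differ from the original 'expression' column
changed_original = (df['cleanlab_suggested_label_original'] != df['expression']).sum()
changed_pred = (df['cleanlab_suggested_label_pred'] != df['expression']).sum()
print(f"'original' variant changes {changed_original} labels, "
      f"'pred' variant changes {changed_pred} labels")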

Hypothesis

Our hypothesis is that the labels for expression are too ambiguous. The way this data was created is that, given a tuple (pose, expression, eyes), the person captured an image with the said expression, pose, and eye state. It is natural to suspect that if the pose is not straight, or the eyes are covered by sunglasses, the expression might not be captured faithfully. There is a clear directed dependence of expression on pose and eyes (not the other way around).
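
One cheap way to probe this hypothesis is to break the flagged examples down by pose and eye state. Both can be parsed from the file name in the cleanset, assuming its 'image' column holds the original file name (as the CMUFaceDatasetCleanLab class below expects) and that file names follow the userid_pose_expression_eyes.pgm convention; this reuses the dataframe loaded in the earlier snippet.

# Sketch: issue rate broken down by pose and eye state parsed from the file name
flagged = (df['cleanlab_is_label_issue']
           | df['cleanlab_is_outlier']
           | df['cleanlab_is_ambiguous'])
parts = df['image'].str.replace('.pgm', '', regex=False).str.split('_')
pose_from_name = parts.str[1]
eyes_from_name = parts.str[3]
print(flagged.groupby(pose_from_name).mean().mul(100).round(1))
print(flagged.groupby(eyes_from_name).mean().mul(100).round(1))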

Re-Implementation

We use the new set of labels to fit our model again and compare the results. We create another custom dataset for the new labels. This dataset can also be used for the base case by simply not passing a data frame, which makes the previous one redundant, but I kept the previous version for the sake of demonstration.

# Code for custom dataset (with labels from Cleanlab) using PyTorch
# (reuses the imports from the CMUFaceDataset block above)
class CMUFaceDatasetCleanLab(Dataset):
    def __init__(self, data_dir, transform=None, df=None, col='expression'):
        self.data_dir = data_dir
        self.transform = transform
        self.df = df
        
        # Create a mapping from image filename to new label if df is provided
        self.expression_mapping = {}
        if df is not None:
            # Convert dataframe to a dictionary for faster lookups
            self.expression_mapping = dict(zip(df['image'], df[col]))

        # User IDs in the dataset
        self.user_ids = ['an2i', 'at33', 'boland', 'bpm', 'ch4f', 'cheyer', 'choon',
                   'danieln', 'glickman', 'karyadi', 'kawamura', 'kk49', 'megak',
                   'mitchell', 'night', 'phoebe', 'saavik', 'steffi', 'sz24', 'tammo']

        # Create user_id to index mapping for faster lookups
        self.user_to_idx = {user_id: idx for idx, user_id in enumerate(self.user_ids)}

        # Scan the dataset directory structure
        self._scan_dataset()

    def _scan_dataset(self):
        """Scan the dataset directory structure"""
        # Use glob to find all PGM files (more efficient than nested loops)
        pattern = os.path.join(self.data_dir, "*/*.pgm")
        all_image_paths = glob.glob(pattern)

        exp_size = len(all_image_paths)
        
        # Initialize data structures
        self.image_paths = [None] * exp_size
        self.pose = [None] * exp_size
        self.expression = [None] * exp_size
        self.eyes = [None] * exp_size
        self.images = [None] * exp_size

        # Process files sequentially to avoid thread safety issues
        valid_idx = 0
        for img_path in all_image_paths:

            parts = os.path.normpath(img_path).split(os.sep)
            user_id = parts[-2]
            filename = os.path.basename(img_path)

            if user_id in self.user_to_idx:
                # Parse filename to extract metadata
                parts = filename.split('_')

                if len(parts) >= 4:
                    
                    # Use the mapped expression if available, otherwise use the original
                    if filename in self.expression_mapping:
                        expression = self.expression_mapping[filename]
                    else:
                        expression = parts[2]
                        
                    # Store metadata and path
                    self.image_paths[valid_idx] = img_path
                    self.pose[valid_idx] = parts[1]
                    self.expression[valid_idx] = expression
                    self.eyes[valid_idx] = parts[3][:-4]

                    image = Image.open(img_path).convert('L')
                    if self.transform:
                        image = self.transform(image)
                    self.images[valid_idx] = image
                    valid_idx += 1

        # Trim lists to the valid size
        if valid_idx < exp_size:
          self.image_paths = self.image_paths[:valid_idx]
          self.pose = self.pose[:valid_idx]
          self.expression = self.expression[:valid_idx]
          self.eyes = self.eyes[:valid_idx]
          self.images = self.images[:valid_idx]

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        return self.images[idx], (self.pose[idx], self.expression[idx], self.eyes[idx])

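With the modified CSV in hand, the re-training run only differs in how the dataset is built. In the sketch below, the data directory and transform are the same placeholders as before, and the choice of suggested-label column corresponds to the "Cleanlab Original" and "Cleanlab Prediction" experiments respectively.

# Sketch: build the dataset with Cleanlab-derived expression labels
import pandas as pd

df_clean = pd.read_csv('modified-cleanlab-faces-expr.csv')

# 'cleanlab_suggested_label_original' -> "Cleanlab Original" run
# 'cleanlab_suggested_label_pred'     -> "Cleanlab Prediction" run
dataset_clean = CMUFaceDatasetCleanLab(
    "faces",                      # placeholder data directory
    transform=transform,          # same transform as in the baseline sketch
    df=df_clean,
    col='cleanlab_suggested_label_original',
)
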
Following are the plots for epoch vs. loss, the confusion matrices, and prediction samples.

Epoch vs Loss curve for CNN Classifier with Cleanlab Original Label
Epoch vs Loss curve for CNN Classifier with Cleanlab Prediction Label

Note: The loss curve for expression looks much better now in both cases. The validation loss follows a decreasing pattern (if only slightly), which is a good sign. Also, the loss is ~1.25 (< 1.386) after 10 epochs, which is an improvement over the random-guess baseline.

Confusion Matrix for CNN Classifier with Cleanlab Original Label
Confusion Matrix for CNN Classifier with Cleanlab Prediction Label

Note from the confusion matrices that the distribution of expression classes has changed, with the neutral class now being over-represented. (Why?)

Prediction sample for CNN Classifier with Cleanlab Original Label
Prediction sample for CNN Classifier with Cleanlab Prediction Label

Results

Label                 pose     expression   eyes
Original              97.6%    16.8%        97.6%
Cleanlab Original     93.6%    44.0%        98.4%
Cleanlab Prediction   94.4%    44.8%        97.6%
Table: Accuracy of the model for different labels

Conclusion

Badly labelled data will take you only so far.