Himanshu

Cleanlab Saves The Day: An Implementation

May 2025

cleanlab data-quality

'Garbage in, garbage out' is a popular adage which says that the quality of the output is only as good as the quality of the input, i.e. your model is only as good as the data it is trained on. In this blog, we demonstrate this concept using Cleanlab, in particular Cleanlab Studio.

Motivation

To see is to believe, and we shall see through experimentation the importance of data quality in data science. TL;DR: we train a model for multi-task, multi-class classification and find that one of the tasks performs poorly. We suspect the data labels are ambiguous, so we fit the model again using the clean labels generated by Cleanlab Studio and compare the performance.

The Dataset

For the demonstration, we use the CMU Face Images dataset.

Sample images from the CMU Face Images dataset [pose; expression; eyes]

Vanilla Implementation

We create a custom dataset using the PyTorch Dataset API for easy data loading and rely on the file naming convention to extract the labels.

# Code for custom dataset using PyTorch
import os
import glob

from PIL import Image
from torch.utils.data import Dataset


class CMUFaceDataset(Dataset):
    def __init__(self, data_dir, transform=None):
        self.data_dir = data_dir
        self.transform = transform

        # User IDs in the dataset
        self.user_ids = ['an2i', 'at33', 'boland', 'bpm', 'ch4f', 'cheyer', 'choon',
                   'danieln', 'glickman', 'karyadi', 'kawamura', 'kk49', 'megak',
                   'mitchell', 'night', 'phoebe', 'saavik', 'steffi', 'sz24', 'tammo']

        # Create user_id to index mapping for faster lookups
        self.user_to_idx = {user_id: idx for idx, user_id in enumerate(self.user_ids)}

        # Scan the dataset directory structure
        self._scan_dataset()

    def _scan_dataset(self):
        """Scan the dataset directory structure"""
        # Use glob to find all PGM files (more efficient than nested loops)
        pattern = os.path.join(self.data_dir, "*/*.pgm")
        all_image_paths = glob.glob(pattern)

        exp_size = len(all_image_paths)

        # Initialize data structures
        self.image_paths = [None] * exp_size
        self.pose = [None] * exp_size
        self.expression = [None] * exp_size
        self.eyes = [None] * exp_size
        self.images = [None] * exp_size

        # Process files sequentially to avoid thread safety issues
        valid_idx = 0
        for img_path in all_image_paths:
            # Extract user_id from path
            parts = os.path.normpath(img_path).split(os.sep)
            user_id = parts[-2]

            if user_id in self.user_to_idx:
                # Parse filename to extract metadata
                filename = os.path.basename(img_path)
                parts = filename.split('_')

                if len(parts) >= 4:
                    # Store metadata and path
                    self.image_paths[valid_idx] = img_path
                    self.pose[valid_idx] = parts[1]
                    self.expression[valid_idx] = parts[2]
                    self.eyes[valid_idx] = parts[3][:-4]

                    image = Image.open(img_path).convert('L')
                    if self.transform:
                        image = self.transform(image)
                    self.images[valid_idx] = image
                    valid_idx += 1

        # Trim lists to the valid size
        if valid_idx < exp_size:
          self.image_paths = self.image_paths[:valid_idx]
          self.pose = self.pose[:valid_idx]
          self.expression = self.expression[:valid_idx]
          self.eyes = self.eyes[:valid_idx]
          self.images = self.images[:valid_idx]

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        return self.images[idx], (self.pose[idx], self.expression[idx], self.eyes[idx])

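A minimal usage sketch is shown below. The data directory name ("faces"), the 64x64 resize (which matches the 128 * 8 * 8 flattened size expected by the classifier defined next), and the batch size are illustrative assumptions, not the exact values from the original run.

# Sketch: load the dataset and wrap it in a DataLoader (assumed transform, path, and batch size)
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.Resize((64, 64)),  # assumed input size; gives 128 * 8 * 8 features after three poolings
    transforms.ToTensor(),
])

dataset = CMUFaceDataset("faces", transform=transform)  # "faces" is a placeholder data directory
loader = DataLoader(dataset, batch_size=32, shuffle=True)

images, (pose, expression, eyes) = next(iter(loader))
print(images.shape)  # torch.Size([32, 1, 64, 64])
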
We implement a fairly simple CNN-based architecture for the classification task and train the model for 20 epochs with an 80:20 train-test split; a sketch of the training loop follows the model definition below.

# Code for multitask face classifier using PyTorch
import torch.nn as nn
import torch.nn.functional as F


class MultitaskFaceClassifier(nn.Module):
    def __init__(self, num_poses=4, num_expressions=2, num_eye_states=2):
        super(MultitaskFaceClassifier, self).__init__()

        # Enhanced feature extraction with batch normalization
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        self.conv3 = nn.Conv2d(64, 128, 3, padding=1)
        self.bn3 = nn.BatchNorm2d(128)
        self.pool = nn.MaxPool2d(2, 2)

        # Shared features with batch norm
        self.fc_shared = nn.Linear(128 * 8 * 8, 512)
        self.bn_shared = nn.BatchNorm1d(512)
        self.dropout_shared = nn.Dropout(0.5)

        # Enhanced task-specific heads with intermediate layers
        # Pose head
        self.pose_layers = nn.Sequential(
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.4),
            nn.Linear(256, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, num_poses)
        )

        # Expression head
        self.expression_layers = nn.Sequential(
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.6),
            nn.Linear(256, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, num_expressions)
        )

        # Eyes head
        self.eyes_layers = nn.Sequential(
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.4),
            nn.Linear(256, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, num_eye_states)
        )

    def forward(self, x):
        x = self.pool(F.relu(self.bn1(self.conv1(x))))
        x = self.pool(F.relu(self.bn2(self.conv2(x))))
        x = self.pool(F.relu(self.bn3(self.conv3(x))))
        x = x.view(-1, 128 * 8 * 8)

        # Shared features with normalization
        features = self.fc_shared(x)
        features = self.bn_shared(features)
        features = F.relu(features)
        features = self.dropout_shared(features)

        # Task-specific outputs through enhanced heads
        pose_out = self.pose_layers(features)
        expression_out = self.expression_layers(features)
        eyes_out = self.eyes_layers(features)

        return {
            'pose': pose_out,
            'expression': expression_out,
            'eyes': eyes_out
        }, features

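For completeness, here is a minimal sketch of the training setup described above, continuing from the dataset and transform defined earlier. The label vocabularies, the optimizer, the equal weighting of the three task losses, and the choice of four expression classes (overriding the class default of two) are assumptions made for illustration; the original training code may differ.

# Sketch: multi-task training loop (assumed label vocabularies, optimizer, and equal loss weights)
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, random_split

# Map the string labels returned by CMUFaceDataset to class indices
POSE = {'straight': 0, 'left': 1, 'right': 2, 'up': 3}
EXPRESSION = {'neutral': 0, 'happy': 1, 'sad': 2, 'angry': 3}
EYES = {'open': 0, 'sunglasses': 1}

def encode(labels, vocab):
    return torch.tensor([vocab[label] for label in labels])

# 80:20 train-test split
train_size = int(0.8 * len(dataset))
train_set, test_set = random_split(dataset, [train_size, len(dataset) - train_size])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True, drop_last=True)

model = MultitaskFaceClassifier(num_poses=4, num_expressions=4, num_eye_states=2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(20):
    model.train()
    for images, (pose, expression, eyes) in train_loader:
        outputs, _ = model(images)
        # Sum of the three task losses (equal weighting assumed)
        loss = (criterion(outputs['pose'], encode(pose, POSE))
                + criterion(outputs['expression'], encode(expression, EXPRESSION))
                + criterion(outputs['eyes'], encode(eyes, EYES)))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
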
Following are the plots for epoch vs. loss, the confusion matrix, and a prediction sample. We use accuracy as the evaluation metric.

Epoch vs Loss curve for CNN Classifier

Note: The loss curve for expression does not look right.

Mathematical Reasoning

For a classification model with $K$ classes that performs only as well as a random guess, the expected cross-entropy loss is $\log(K)$.

The cross-entropy loss for a single prediction is defined as,

$$H(y, \hat{y}) = -\sum_{i=1}^{K} y_i \log(\hat{y}_i)$$

where,

$y_i$ is the true probability for class $i$ ($y_i = 1$ for the correct class and $y_i = 0$ for all other classes)

$\hat{y}_i$ is the predicted probability for class $i$

For a random classifier with $K$ classes, the model assigns equal probability to each class, i.e. $\hat{y}_i = \frac{1}{K}$ for all $i$.

When the true class is class $j$, we have $y_j = 1$ and $y_i = 0$ for all $i \neq j$. Therefore,

$$H(y, \hat{y}) = -\sum_{i=1}^{K} y_i \log(\hat{y}_i) = -y_j \log(\hat{y}_j) = -1 \cdot \log\left(\frac{1}{K}\right)= \log(K)$$

Since this calculation holds for any true class $j$, and assuming a balanced dataset where each class appears with equal frequency $\frac{1}{K}$, the expected cross-entropy loss is given by,

$$\mathbb{E}[H(y, \hat{y})] = \sum_{j=1}^{K} P(\text{true class} = j) \cdot \log(K) = \sum_{j=1}^{K} \frac{1}{K} \cdot \log(K) = \log(K)$$

Thus, the expected cross-entropy loss for a random classifier with $4$ classes is $\mathbb{E}[H] = \ln(4) \approx 1.386$.

This value is the baseline that any classification model should beat, i.e. its loss should fall below $\log(K)$. A model whose loss sits at or above this value is doing no better than random guessing.
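
As a quick sanity check (a small sketch, not part of the original experiment), the same number falls out of PyTorch's cross-entropy on uniform predictions:

# Sketch: cross-entropy of a uniform prediction over 4 classes equals log(4)
import math
import torch
import torch.nn.functional as F

logits = torch.zeros(1, 4)  # zero logits -> uniform softmax over the 4 classes
loss = F.cross_entropy(logits, torch.tensor([0]))  # the target class does not matter here
print(round(loss.item(), 3), round(math.log(4), 3))  # 1.386 1.386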

Confusion Matrix for CNN Classifier

The confusion matrix shows that pose and eyes achieve 97.6% accuracy, which is good. As suspected, expression achieves only 16.8%, and this is what we focus on going forward.

Prediction sample for CNN Classifier

The labels in green indicate correct predictions, while those in red indicate wrong predictions.

Cleanlab

Introduction

Cleanlab Studio is an AI-powered data curation tool used to improve the quality of data and, consequently, the resulting models and analytics. The Studio offers three workflows: a Web Interface, a Python API, and a Command Line. We used the Web Interface for this demonstration.

The usage is pretty straightforward.

The Studio does not support our exact classification task, i.e. multi-task, multi-class classification. But since we only treat the expression task, multi-class classification suffices for the demonstration.

Analysis of Report

We download the cleanset and use it to create new labels for the expression task. But before that, let's look at the ready-made analytics provided by the Studio.

Cleanlab Suggested Labels
Cleanlab Issue Count by Class

It is easy to see that there is a strong inverse correlation between the percentage of flagged examples per class (label issues, outliers, and ambiguous examples combined) and the accuracy per class.

Accuracy vs Label Issues (in %)
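
The per-class issue rate behind this plot can be recomputed directly from the exported cleanset. The sketch below uses the same file and boolean issue columns that the relabelling script further down relies on; whether those columns are parsed as booleans or 0/1 integers depends on the export, but either works here.

# Sketch: fraction of examples per expression class flagged as label issue, outlier, or ambiguous
import pandas as pd

df = pd.read_csv('cleanlab-faces-expr.csv')
flagged = (df['cleanlab_is_label_issue']
           | df['cleanlab_is_outlier']
           | df['cleanlab_is_ambiguous'])
issue_pct = flagged.groupby(df['expression']).mean().mul(100).round(1)
print(issue_pct)  # compare against the per-class accuracies from the confusion matrix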

Now, we use the cleanset to create new labels by following the steps below.

# Make labels based on cleanset exported from Cleanlab Studio
import pandas as pd
df = pd.read_csv('cleanlab-faces-expr.csv')

# Make the copies of the suggested label column
df['cleanlab_suggested_label_original'] = df['cleanlab_suggested_label']
df['cleanlab_suggested_label_pred'] = df['cleanlab_suggested_label']

# [Condition 1]
# Replace values in 'cleanlab_suggested_label_original' and 'cleanlab_suggested_label_pred' 
# where 'cleanlab_action' is not 'unresolved'
mask = df['cleanlab_action'] != 'unresolved'
df.loc[mask, 'cleanlab_suggested_label_original'] = df.loc[mask, 'expression']
df.loc[mask, 'cleanlab_suggested_label_pred'] = df.loc[mask, 'expression']

# [Condition 2]
# For rows flagged as an outlier or ambiguous but without a label issue:
# keep the original label in the '_original' variant and use Cleanlab's
# predicted label in the '_pred' variant
not_out_amb = ~((df['cleanlab_is_outlier'] == False) & (df['cleanlab_is_ambiguous'] == False))
no_label_issue = df['cleanlab_is_label_issue'] == False
mask = not_out_amb & no_label_issue

df.loc[mask, 'cleanlab_suggested_label_original'] = df.loc[mask, 'expression']
df.loc[mask, 'cleanlab_suggested_label_pred'] = df.loc[mask, 'cleanlab_predicted_label']

# Save the modified dataframe back to CSV
df.to_csv('modified-cleanlab-faces-expr.csv', index=False)
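
A quick way to see how many labels each variant actually changes relative to the original annotation (continuing with the same dataframe):

# Count how many expression labels differ from the original 'expression' column
changed_original = (df['cleanlab_suggested_label_original'] != df['expression']).sum()
changed_pred = (df['cleanlab_suggested_label_pred'] != df['expression']).sum()
print(f"'original' variant changes {changed_original} labels, "
      f"'pred' variant changes {changed_pred} labels")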

Hypothesis

Our hypothesis is that the labels for expression are too ambiguous. The way this data was created is that, given a tuple (pose, expression, eyes), the person captured an image with the said expression, pose, and eye state. It is natural to suspect that if the pose is not straight, or the eyes are covered by sunglasses, the expression might not be captured faithfully. There is a clear directed dependence of expression on pose and eyes (not the other way around).
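
One cheap way to probe this hypothesis is to break the flagged examples down by pose and eye state. Both can be parsed from the file name in the cleanset, assuming its 'image' column holds the original file name (as the CMUFaceDatasetCleanLab class below expects) and that file names follow the userid_pose_expression_eyes.pgm convention; this reuses the dataframe loaded in the earlier snippet.

# Sketch: issue rate broken down by pose and eye state parsed from the file name
flagged = (df['cleanlab_is_label_issue']
           | df['cleanlab_is_outlier']
           | df['cleanlab_is_ambiguous'])
parts = df['image'].str.replace('.pgm', '', regex=False).str.split('_')
pose_from_name = parts.str[1]
eyes_from_name = parts.str[3]
print(flagged.groupby(pose_from_name).mean().mul(100).round(1))
print(flagged.groupby(eyes_from_name).mean().mul(100).round(1))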

Re-Implementation

We use the new set of labels to fit our model again and compare the results. We create another custom dataset for the new labels. This dataset can also be used for the base case by simply not passing a data frame, which makes the previous one redundant, but I kept the previous version for the sake of demonstration.

# Code for custom dataset (with labels from Cleanlab) using PyTorch
# (reuses the imports from the CMUFaceDataset block above)
class CMUFaceDatasetCleanLab(Dataset):
    def __init__(self, data_dir, transform=None, df=None, col='expression'):
        self.data_dir = data_dir
        self.transform = transform
        self.df = df
        
        # Create a mapping from image filename to new label if df is provided
        self.expression_mapping = {}
        if df is not None:
            # Convert dataframe to a dictionary for faster lookups
            self.expression_mapping = dict(zip(df['image'], df[col]))

        # User IDs in the dataset
        self.user_ids = ['an2i', 'at33', 'boland', 'bpm', 'ch4f', 'cheyer', 'choon',
                   'danieln', 'glickman', 'karyadi', 'kawamura', 'kk49', 'megak',
                   'mitchell', 'night', 'phoebe', 'saavik', 'steffi', 'sz24', 'tammo']

        # Create user_id to index mapping for faster lookups
        self.user_to_idx = {user_id: idx for idx, user_id in enumerate(self.user_ids)}

        # Scan the dataset directory structure
        self._scan_dataset()

    def _scan_dataset(self):
        """Scan the dataset directory structure"""
        # Use glob to find all PGM files (more efficient than nested loops)
        pattern = os.path.join(self.data_dir, "*/*.pgm")
        all_image_paths = glob.glob(pattern)

        exp_size = len(all_image_paths)
        
        # Initialize data structures
        self.image_paths = [None] * exp_size
        self.pose = [None] * exp_size
        self.expression = [None] * exp_size
        self.eyes = [None] * exp_size
        self.images = [None] * exp_size

        # Process files sequentially to avoid thread safety issues
        valid_idx = 0
        for img_path in all_image_paths:

            parts = os.path.normpath(img_path).split(os.sep)
            user_id = parts[-2]
            filename = os.path.basename(img_path)

            if user_id in self.user_to_idx:
                # Parse filename to extract metadata
                parts = filename.split('_')

                if len(parts) >= 4:
                    
                    # Use the mapped expression if available, otherwise use the original
                    if filename in self.expression_mapping:
                        expression = self.expression_mapping[filename]
                    else:
                        expression = parts[2]
                        
                    # Store metadata and path
                    self.image_paths[valid_idx] = img_path
                    self.pose[valid_idx] = parts[1]
                    self.expression[valid_idx] = expression
                    self.eyes[valid_idx] = parts[3][:-4]

                    image = Image.open(img_path).convert('L')
                    if self.transform:
                        image = self.transform(image)
                    self.images[valid_idx] = image
                    valid_idx += 1

        # Trim lists to the valid size
        if valid_idx < exp_size:
          self.image_paths = self.image_paths[:valid_idx]
          self.pose = self.pose[:valid_idx]
          self.expression = self.expression[:valid_idx]
          self.eyes = self.eyes[:valid_idx]
          self.images = self.images[:valid_idx]

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        return self.images[idx], (self.pose[idx], self.expression[idx], self.eyes[idx])

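With the modified CSV in hand, the re-training run only differs in how the dataset is built. In the sketch below, the data directory and transform are the same placeholders as before, and the choice of suggested-label column corresponds to the "Cleanlab Original" and "Cleanlab Prediction" experiments respectively.

# Sketch: build the dataset with Cleanlab-derived expression labels
import pandas as pd

df_clean = pd.read_csv('modified-cleanlab-faces-expr.csv')

# 'cleanlab_suggested_label_original' -> "Cleanlab Original" run
# 'cleanlab_suggested_label_pred'     -> "Cleanlab Prediction" run
dataset_clean = CMUFaceDatasetCleanLab(
    "faces",                      # placeholder data directory
    transform=transform,          # same transform as in the baseline sketch
    df=df_clean,
    col='cleanlab_suggested_label_original',
)
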
Following are the plots for epoch vs. loss, the confusion matrices, and prediction samples.

Epoch vs Loss curve for CNN Classifier with Cleanlab Original Label
Epoch vs Loss curve for CNN Classifier with Cleanlab Prediction Label

Note: The loss curve for expression looks much better now in both cases. The validation loss follows a decreasing pattern (if only slightly), which is a good sign. Also, the loss is ~1.25 (< 1.386) after 10 epochs, which is an improvement over the random-guess baseline.

Confusion Matrix for CNN Classifier with Cleanlab Original Label
Confusion Matrix for CNN Classifier with Cleanlab Prediction Label

Note from the confusion matrices that the distribution of expression classes has changed, with the neutral class now being over-represented. (Why?)

Prediction sample for CNN Classifier with Cleanlab Original Label
Prediction sample for CNN Classifier with Cleanlab Prediction Label

Results

Label                 pose     expression   eyes
Original              97.6%    16.8%        97.6%
Cleanlab Original     93.6%    44.0%        98.4%
Cleanlab Prediction   94.4%    44.8%        97.6%
Table: Accuracy of the model for different labels

Conclusion

Badly labelled data will take you only so far.