Face & Skin Condition Analysis — Fair, Explainable, Production-ready AI
Fairness-first face & skin AI: multi-label detection (acne, pigmentation, wrinkles) with per-skin-tone evaluation, Grad‑CAM explainability, and reproducible MLOps (MLflow + DVC).

TL;DR
Face & Skin Condition Analysis is a production-oriented computer vision system that detects acne, hyperpigmentation, and wrinkles while explicitly targeting fairness across skin tones (Fitzpatrick I–VI). This post covers the problem, architecture, fairness-first training strategies, results, and how to get started with the code and models.
Introduction
Medical AI systems often hide dangerous disparities behind high overall accuracy. In this project I built a multi-label model (EfficientNet-B3 / ViT) and an MLOps pipeline to prioritize equitable performance—especially for darker skin tones (Fitzpatrick V–VI). The goal: a practical, reproducible system that balances strong metrics with interpretability and fairness.
Why this project
1) Dermatology models frequently underperform on darker skin tones.
2) Clinical applications require transparent, explainable outputs.
3) Reproducibility is essential: experiments, data, and checkpoints should be versioned.
This repo demonstrates concrete strategies for measuring and reducing bias while keeping the system production-ready (ONNX inference, MLflow tracking, DVC).
Getting started (quick)
Prerequisites
1) Python 3.8+
2) GPU recommended for training (CUDA)
3) Install dependencies:
```bash
python -m pip install -r requirements.txt
```
Quick inference (example)
```bash
python inference/predict.py --image path/to/face.jpg \
    --checkpoint outputs/checkpoints/best_model.pth
```
Or use the ONNX export for faster inference in production: see `onnx/export.py`.
System overview
Pipeline:
Input Image → Face Detection → Tone-aware Preprocessing → Multi-label Model → Predictions + Confidence → Grad-CAM Explainability → Fairness Evaluation → ONNX Export
Key modules:
1) preprocessing/ — face detection, alignment, tone-aware augmentation
2) models/ — EfficientNet / ViT implementations and model factory
3) training/ — dataset, dataloader, losses, trainer
4) evaluation/ — metrics, bias analysis, per-group reporting
5) explainability/ — Grad-CAM utilities
Fairness-first training strategies
1) Balanced sampling: use `WeightedRandomSampler` to upsample underrepresented skin-tone groups during training; `training/data_loader.py` computes inverse-frequency weights per skin tone (see the concrete example below).
2) Per-group threshold optimization: compute optimal decision thresholds on validation sets per Fitzpatrick group.
3) Tone-preserving augmentation: prefer geometric and CLAHE-style transforms; avoid aggressive color jitter.
4) Stratified evaluation: track precision/recall/F1 and AUROC per skin-tone group (a minimal sketch follows this list).
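Per-group reporting is straightforward once validation predictions are stacked. A minimal sketch for one condition, assuming binary labels `y_val` and sigmoid scores `y_scores_val` as numpy arrays aligned with a validation metadata frame `val_meta` carrying a `fitzpatrick_group` column (these names are assumptions, not the repo's exact API):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score, roc_auc_score

def per_group_report(val_meta, y_true, y_scores, threshold=0.5):
    """Per-skin-tone-group F1/AUROC for one condition."""
    rows = []
    for group in sorted(val_meta['fitzpatrick_group'].unique()):
        mask = (val_meta['fitzpatrick_group'] == group).values
        rows.append({
            'group': group,
            'n': int(mask.sum()),
            'f1': f1_score(y_true[mask], y_scores[mask] >= threshold),
            # AUROC is undefined if a group contains only one class
            'auroc': roc_auc_score(y_true[mask], y_scores[mask]),
        })
    return pd.DataFrame(rows).set_index('group')

report = per_group_report(val_meta, y_val, y_scores_val)
print(report)
print('F1 gap (max - min):', report['f1'].max() - report['f1'].min())
```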
Results & limitations
- Overall AUROC: 0.97
- Pigmentation F1: 0.91
- Acne F1: 0.64
- Wrinkles F1: 0.33 (limited data: 27 samples)
Fairness gap (initial): Dark (V–VI) F1 = 0.48 vs Medium (III–IV) F1 = 0.75 → gap = 0.27. Current work aims to reduce this to <0.15 via balanced sampling and threshold tuning.
Balanced sampling (concrete example)
```python
import pandas as pd
import torch
from torch.utils.data import WeightedRandomSampler

# assume metadata.csv has a `fitzpatrick_group` column with values like 'I-II','III-IV','V-VI'
meta = pd.read_csv('data/processed/metadata.csv')
group_counts = meta['fitzpatrick_group'].value_counts().to_dict()

# inverse frequency weight per sample
weights = meta['fitzpatrick_group'].map(lambda g: 1.0 / group_counts[g]).values
sampler = WeightedRandomSampler(weights=torch.DoubleTensor(weights), num_samples=len(weights), replacement=True)

# pass `sampler` to DataLoader (`train_dataset` is the project's Dataset built from `meta`;
# note that `sampler` is mutually exclusive with shuffle=True)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, sampler=sampler, num_workers=4)
```
This keeps the dataset on disk unchanged while balancing exposure during training.
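A quick sanity check: draw one epoch of indices from the sampler and confirm the group mix is roughly uniform (this assumes `train_dataset` rows align 1:1 with the rows of `meta`):

```python
# One epoch of sampler indices; with inverse-frequency weights, each of the
# three Fitzpatrick groups should appear with probability close to 1/3.
sampled_idx = list(sampler)
print(meta['fitzpatrick_group'].iloc[sampled_idx].value_counts(normalize=True))
```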
Per-group threshold optimization
Find thresholds that maximize F1 on validation per skin-tone group (example uses sklearn):
```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_threshold_for_f1(y_true, y_scores):
    precisions, recalls, threshs = precision_recall_curve(y_true, y_scores)
    # precision_recall_curve returns one more (precision, recall) point than thresholds;
    # drop the final point so the F1 indices align with `threshs`
    f1 = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-8)
    best = np.nanargmax(f1)
    return threshs[best]

# compute per-group (`val_meta`, `y_val`, `y_scores_val` belong to the validation split)
group_thresholds = {}
for group in val_meta['fitzpatrick_group'].unique():
    idx = (val_meta['fitzpatrick_group'] == group).values
    group_thresholds[group] = best_threshold_for_f1(y_val[idx], y_scores_val[idx])

# during inference, apply group_thresholds[detected_group]
```
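Applying the stored thresholds at inference time is then a lookup keyed by the detected group. A sketch, where `detect_fitzpatrick_group` is a hypothetical helper (e.g., a tone estimate from the preprocessing stage):

```python
def predict_labels(scores, image):
    # `detect_fitzpatrick_group` is a hypothetical helper returning 'I-II', 'III-IV', or 'V-VI'
    group = detect_fitzpatrick_group(image)
    threshold = group_thresholds.get(group, 0.5)  # fall back to a default for unseen groups
    return scores >= threshold
```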
Tone-preserving augmentation
Aggressive color transforms can change apparent skin tone. Use geometric transforms and CLAHE for contrast instead. Example augmentation pipeline using torchvision plus a small custom CLAHE step with OpenCV:
```python
import cv2
import numpy as np
from PIL import Image
from torchvision import transforms

class CLAHETransform:
    """Apply CLAHE to the L channel in LAB space, boosting contrast without shifting hue."""
    def __call__(self, img: Image.Image) -> Image.Image:
        arr = np.array(img)
        if arr.ndim == 3:
            lab = cv2.cvtColor(arr, cv2.COLOR_RGB2LAB)
            l, a, b = cv2.split(lab)
            clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
            l2 = clahe.apply(l)
            lab = cv2.merge((l2, a, b))
            arr = cv2.cvtColor(lab, cv2.COLOR_LAB2RGB)
        return Image.fromarray(arr)

train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),
    transforms.RandomHorizontalFlip(),
    CLAHETransform(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```
Grad-CAM (usage snippet)
Using pytorch-grad-cam to produce explainability heatmaps (constructor shown for pytorch-grad-cam ≥ 1.4; older releases took `target_layer=` and `use_cuda=` instead):

```python
import numpy as np
from PIL import Image
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image

model.eval()
target_layers = [model.features[-1]]  # depends on architecture (shown for an EfficientNet-style backbone)
cam = GradCAM(model=model, target_layers=target_layers)

img = Image.open('path/to/face.jpg').convert('RGB')
inp = preprocess(img).unsqueeze(0).cuda()  # `preprocess` = the validation transform pipeline
grayscale_cam = cam(input_tensor=inp)[0]  # (H, W) heatmap for the first image in the batch

rgb_img = np.array(img.resize((224, 224))) / 255.0
visualization = show_cam_on_image(rgb_img, grayscale_cam, use_rgb=True)
Image.fromarray(visualization).save('outputs/explainability/cam_overlay.png')
```
MLflow logging (minimal)
```python
import mlflow

mlflow.set_experiment('face-skin-analysis')
with mlflow.start_run():
    mlflow.log_param('model', 'efficientnet-b3')
    mlflow.log_param('balance_skin_tones', True)
    mlflow.log_metric('val_auc', 0.97)
    mlflow.log_artifact('outputs/checkpoints/best_model.pth')
```
ONNX export & inference (quick)
```python
# export
import torch
dummy = torch.randn(1,3,224,224).cuda()
torch.onnx.export(model, dummy, 'outputs/onnx/model.onnx', opset_version=13, input_names=['input'], output_names=['output'], dynamic_axes={'input':{0:'batch'}, 'output':{0:'batch'}})
# inference with onnxruntime
import onnxruntime as ort
ort_sess = ort.InferenceSession('outputs/onnx/model.onnx')
outputs = ort_sess.run(None, {'input': img_tensor.cpu().numpy()})  # `img_tensor`: preprocessed (1, 3, 224, 224) batch
```
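Before deploying the ONNX graph, it is worth checking numerical parity against the PyTorch model. A minimal sketch reusing `model`, `img_tensor`, and `ort_sess` from above:

```python
import numpy as np

with torch.no_grad():
    torch_out = model(img_tensor.cuda()).cpu().numpy()
ort_out = ort_sess.run(None, {'input': img_tensor.cpu().numpy()})[0]

# allow small numerical drift between backends
np.testing.assert_allclose(torch_out, ort_out, rtol=1e-3, atol=1e-5)
```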
How this is organized for reproducibility
- MLflow for experiments (`mlruns/`) — compare runs and artifacts
- DVC for data and pipeline reproducibility (`dvc.yaml`, DVC-tracked data)
- Checkpoints and ONNX models in `outputs/`
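For example, DVC's Python API can stream a tracked file pinned to a specific git revision, so an experiment can be rerun against exactly the data it was trained on (path and revision below are illustrative):

```python
import dvc.api
import pandas as pd

# read the DVC-tracked metadata as it existed at a given revision,
# without checking the data out locally
with dvc.api.open('data/processed/metadata.csv', rev='main') as f:
    meta = pd.read_csv(f)
```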
Roadmap
Phase 1 — Fairness first (current): stabilize 3 conditions, reduce fairness gaps.
Phase 2 — Data expansion: add Fitzpatrick17k and other public datasets to improve representation.
Phase 3 — Feature expansion: severity scoring, more conditions, deployment-ready API.
Quick notes for contributors
- If you add dataset samples, please update `data/processed/metadata.csv` and run `dvc repro`.
- Use MLflow tags for run metadata: `skin_tone_group`, `balance_skin_tones`, `model_arch` (example below).
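For example, tags can be attached inside a run via `mlflow.set_tags` (tag values are illustrative):

```python
import mlflow

with mlflow.start_run():
    mlflow.set_tags({
        'skin_tone_group': 'V-VI',        # group a run targets, if any
        'balance_skin_tones': 'true',
        'model_arch': 'efficientnet-b3',
    })
```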
Conclusion
High accuracy is not enough for medical AI. This project shows a practical path to building ML systems that are fairer and more explainable while remaining production-ready and reproducible. Contributions and critiques welcome.