Data Bites

Public datasets worth keeping close.

A lightweight index of useful public datasets: what they contain, what they are good for, and what to check before using them in a project.

KolektorSDD2 poster image from Dataset Ninja

Manufacturing · Surface defect segmentation

KolektorSDD2

A surface-quality inspection dataset for studying defect segmentation when defects are sparse and heavily outnumbered by defect-free samples.

Use for: surface defect segmentation, manufacturing inspection, imbalanced defect detection experiments
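Scale: Dataset Ninja lists 3,336 images and 15,764 annotated defect objects; the original release reports 3,335 images (356 with visible defects, 2,979 without), so verify the count in your copy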
Watch out: Dataset Ninja lists CC BY-SA 4.0; check the original citation and dataset terms before redistribution.
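Before training on a dataset like this, it can help to quantify how imbalanced it is. The following is a minimal sketch, not part of any dataset tooling: `imbalance_summary` is a hypothetical helper that assumes you have already reduced each mask to a `(defect_pixels, total_pixels)` pair, and the numbers in the example are toy values rather than real dataset statistics.

```python
from typing import Dict, Iterable, Tuple


def imbalance_summary(mask_stats: Iterable[Tuple[int, int]]) -> Dict[str, float]:
    """Summarize imbalance from (defect_pixels, total_pixels) pairs, one per image."""
    n_images = n_defective = defect_px = total_px = 0
    for defect, total in mask_stats:
        n_images += 1
        n_defective += int(defect > 0)  # an image counts as defective if any pixel is marked
        defect_px += defect
        total_px += total
    return {
        "images": n_images,
        "defective_images": n_defective,
        # Share of images that contain at least one defect pixel.
        "defective_image_rate": n_defective / n_images if n_images else 0.0,
        # Share of all pixels that belong to a defect.
        "defect_pixel_rate": defect_px / total_px if total_px else 0.0,
    }


# Toy input (not real dataset statistics): four 100-pixel images, one with 5 defect pixels.
summary = imbalance_summary([(0, 100), (0, 100), (5, 100), (0, 100)])
```

The two rates usually differ by orders of magnitude, which is why image-level accuracy and pixel-level metrics can tell very different stories on defect data.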

MVTec AD teaser image from the dataset website

Industrial inspection · Unsupervised anomaly detection

MVTec AD

A real-world industrial anomaly detection dataset with defect-free training images, anomalous test images, and pixel-level annotations.

Use for: unsupervised anomaly detection, anomaly localization, industrial visual inspection baselines
Scale: Over 5,000 high-resolution images across 15 object and texture categories
Watch out: Commercial use is not allowed under the standard dataset license; contact MVTec if the use case may be commercial.
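The split into defect-free training images, anomalous test images, and pixel-level masks can be indexed with a few lines of standard-library code. This is a sketch under an assumed folder layout (`<category>/train/good`, `<category>/test/<defect_type>`, and `<category>/ground_truth/<defect_type>/<name>_mask.png`); check it against your extracted copy before relying on it. `index_category` is a hypothetical helper name.

```python
from pathlib import Path


def index_category(category_dir):
    """Index one category folder under the assumed train/test/ground_truth layout."""
    root = Path(category_dir)
    # Training data is assumed to contain defect-free images only.
    train = sorted((root / "train" / "good").glob("*.png"))
    test = []
    for img in sorted((root / "test").glob("*/*.png")):
        defect_type = img.parent.name
        # Masks are assumed to share the image stem plus a "_mask" suffix.
        mask = root / "ground_truth" / defect_type / f"{img.stem}_mask.png"
        test.append({
            "image": img,
            "defect_type": defect_type,
            "is_anomalous": defect_type != "good",
            "mask": mask if mask.exists() else None,  # defect-free test images have no mask
        })
    return train, test
```

Calling `index_category` on one category folder returns the training image paths plus one record per test image, which is enough to build both image-level and pixel-level evaluation sets.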

MedMNIST overview thumbnail from the dataset website

Biomedical imaging · Classification benchmark

MedMNIST

A standardized collection of 2D and 3D biomedical image classification datasets with MNIST-like size options for fast medical imaging experiments.

Use for: biomedical image classification, lightweight benchmarking, AutoML tests, 2D/3D model sanity checks
Scale: About 708K 2D images and 10K 3D images across 18 datasets (12 in 2D, 6 in 3D)
Watch out: Most subsets are CC BY 4.0, but DermaMNIST is CC BY-NC 4.0; the dataset is not intended for clinical use.
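Because the license differs by subset, a project with commercial constraints may want a first-pass filter before picking subsets. The sketch below is illustrative only: `SUBSET_LICENSES` lists just three subsets with the licenses summarized in the note above, and `is_noncommercial` and `commercially_usable` are hypothetical helpers, not part of any MedMNIST package. Confirm the terms on the dataset website before relying on the result.

```python
# Licenses as summarized in the "Watch out" note; only a few subsets are
# listed for illustration, and each should be verified against the official terms.
SUBSET_LICENSES = {
    "PathMNIST": "CC BY 4.0",
    "DermaMNIST": "CC BY-NC 4.0",
    "OrganMNIST3D": "CC BY 4.0",
}


def is_noncommercial(license_name: str) -> bool:
    """Return True if a Creative Commons license string carries the NC element."""
    parts = license_name.upper().replace(" ", "-").split("-")
    return "NC" in parts


def commercially_usable(licenses: dict) -> list:
    """Return subset names whose license string has no NC element, sorted by name."""
    return sorted(name for name, lic in licenses.items() if not is_noncommercial(lic))


usable = commercially_usable(SUBSET_LICENSES)
```

With the three entries above, `usable` contains `OrganMNIST3D` and `PathMNIST` and leaves out `DermaMNIST`.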