Blog
Company updates, tutorials, research, and more!
Assessing the Quality of Synthetic Data with Cleanlab Studio
07/12/2023
Use AI to measure the quality of LLM-generated data, automatically detecting unrealistic synthetic examples and underrepresented tails of the real data distribution.
- Elías Snorrason
Enhancing Product Analytics and E-commerce with Cleanlab Studio
07/06/2023
How using AI to analyze product listings for errors boosts the accuracy of product categorization and analytics efforts.
- Sanjana Garg
Beware of Unreliable Data in Model Evaluation: A LLM Prompt Selection case study with Flan-T5
06/29/2023
You may choose suboptimal prompts for your LLM (or make other suboptimal choices via model evaluation) unless you clean your test data.
- Chris Mauck
- Jonas Mueller
Improving Legal Judgement Prediction with Cleanlab Studio
06/27/2023
A legal sector case study using Cleanlab Studio to produce better models for making predictions (e.g. of final judgements) based on court case documents.
- Hui Wen Goh
Improving any OpenAI Language Model by Systematically Improving its Data
06/01/2023
Reduce LLM prediction error by 37% via data-centric AI.
- Chris Mauck
- Jonas Mueller
Datalab: A Linter for ML Datasets
05/16/2023
Catch issues in your data/labels. This unified audit uses your ML model to automatically detect various problems in real-world datasets that can be fixed to produce a better model.
- Elías Snorrason
- Sanjana Garg
- Hui Wen Goh
- Jesse Cummings
- Jonas Mueller
CleanVision: Audit your Image Data for better Computer Vision
03/22/2023
Introducing an open-source Python package to automatically identify common issues in image datasets.
- Sanjana Garg
- Ulyana Tkachenko
- Yiming Chen
- Elías Snorrason
- Jonas Mueller
ActiveLab: Active Learning with Data Re-Labeling
03/02/2023
ActiveLab helps you optimally choose which data to (re)label, lowering the cost to train an accurate ML model.
- Hui Wen Goh
- Jonas Mueller
Cleanlab: The History, Present, and Future
04/01/2022
How an MIT grad student project became a company with tech used by Google, Amazon, Tesla, Uber, Facebook, and companies around the world.
- Curtis Northcutt
Cleanlab Studio: Issues Found in Popular Datasets
The Cleanlab Studio Audit uses AI to auto-detect problems in given data. Explore all sorts of issues found in popular datasets by Cleanlab Studio!
Detecting Dataset Drift and Non-IID Sampling: A k-Nearest Neighbors approach that works for Image/Text/Audio/Numeric Data
05/30/2023
A simple method to determine if a dataset violates the IID assumption in common ways (e.g. temporal drift, or interaction between almost adjacent datapoints).
- Jesse Cummings
- Elías Snorrason
- Jonas Mueller
Effectively Annotate Text Data for Transformers via Active Learning + Re-labeling
05/22/2023
Use ActiveLab to efficiently choose which data to (re)label to train the best Transformer model.
- Chris Mauck
Training Transformer Networks in Scikit-Learn?!
03/08/2023
Learn how to easily make any TensorFlow/Keras model compatible with scikit-learn.
- Hui Wen Goh
cleanlab 2.3 adds support for Active Learning, Tensorflow/Keras models made sklearn-compatible, and highly scalable Label Error Detection
03/01/2023
Highlighting what's new in cleanlab 2.3
- Jonas Mueller
Handling Mislabeled Tabular Data to Improve Your XGBoost Model
02/06/2023
Learn how to reduce prediction errors by 70% using data-centric techniques with cleanlab.
- Chris Mauck
Automatic Error Detection for Image/Text Tagging and Multi-label Datasets
11/29/2022
Introducing new data quality algorithms for multi-label classification in cleanlab v2.2
- Aditya Thyagarajan
- Elías Snorrason
- Curtis Northcutt
- Jonas Mueller
Out-of-Distribution Detection via Embeddings or Predictions
10/19/2022
Introducing cleanlab's two new methods to detect outliers and how they perform on real image data.
- Ulyana Tkachenko
- Jonas Mueller
A Simple Adjustment Improves Out-of-Distribution Detection for Any Classifier
10/19/2022
Exploring new ways to identify outliers based on probabilistic predictions from a trained classifier.
- Ulyana Tkachenko
- Jonas Mueller
- Curtis Northcutt
Detecting Label Errors in Entity Recognition Data
10/12/2022
Understanding cleanlab's new methods for text-based token classification tasks.
- Wei-Chen (Eric) Wang
- Elías Snorrason
- Jonas Mueller
CROWDLAB: Simple and effective algorithms to handle data labeled by multiple annotators
10/05/2022
Understanding cleanlab's new methods for multi-annotator data and what makes them effective.
- Hui Wen Goh
- Ulyana Tkachenko
- Jonas Mueller
cleanlab 2.1 adds Multi-Annotator Analysis and Outlier Detection: toward a broad framework for Data-Centric AI
09/21/2022
Highlighting new features available in cleanlab 2.1
- Curtis Northcutt
- Jonas Mueller
How we built Cleanlab Vizzy
08/17/2022
How we built an in-browser visualization of Cleanlab's Confident Learning algorithm.
- Caleb Chiam
- Luke Mainwaring
- Yiming Chen
Handling Label Errors in Text Classification Datasets
05/10/2022
Learn how to find label issues in text datasets and improve NLP models.
- Wei Jing Lok
- Jonas Mueller
- Hui Wen Goh
Finding Label Issues in Audio Classification Datasets
04/27/2022
Learn how to find label issues in any audio classification dataset.
- Johnson Kuan
- Jonas Mueller
- Anish Athalye
Finding Label Issues in Image Classification Datasets
04/21/2022
Learn how to automatically find label issues in any image classification dataset.
- Wei Jing Lok
- Jonas Mueller
cleanlab 2.0: Automatically Find Errors in ML Datasets
04/21/2022
Announcing cleanlab 2.0: an open-source framework for machine learning and analytics with messy, real-world data.
- Curtis Northcutt
- Jonas Mueller
- Anish Athalye