Blog

Company updates, tutorials, research, and more!

Assessing the Quality of Synthetic Data with Cleanlab Studio

07/12/2023

Use AI to measure the quality of LLM-generated data, automatically detecting unrealistic synthetic examples and underrepresented tails of the real data distribution.

  • Elías Snorrason
Enhancing Product Analytics and E-commerce with Cleanlab Studio

07/06/2023

Using AI to analyze product listings for errors, and how this boosts the accuracy of product categorization and analytics efforts.

  • Sanjana Garg
Beware of Unreliable Data in Model Evaluation: A LLM Prompt Selection case study with Flan-T5

06/29/2023

You may choose suboptimal prompts for your LLM (or make other suboptimal choices via model evaluation) unless you clean your test data.

  • Chris Mauck
  • Jonas Mueller
Improving Legal Judgement Prediction with Cleanlab Studio

06/27/2023

A legal sector case study using Cleanlab Studio to produce better models for making predictions (e.g. of final judgements) based on court case documents.

  • Hui Wen Goh
Improving any OpenAI Language Model by Systematically Improving its Data

06/01/2023

Reduce LLM prediction error by 37% via data-centric AI.

  • Chris Mauck
  • Jonas Mueller
Datalab: A Linter for ML Datasets

05/16/2023

Catch issues in your data/labels. This unified audit uses your ML model to automatically detect various problems in real-world datasets that can be fixed to produce a better model.

  • Elías Snorrason
  • Sanjana Garg
  • Hui Wen Goh
  • Jesse Cummings
  • Jonas Mueller
CleanVision: Audit your Image Data for better Computer Vision

03/22/2023

Introducing an open-source Python package to automatically identify common issues in image datasets.

  • Sanjana Garg
  • Ulyana Tkachenko
  • Yiming Chen
  • Elías Snorrason
  • Jonas Mueller
ActiveLab: Active Learning with Data Re-Labeling

03/02/2023

ActiveLab helps you optimally choose which data to (re)label, lowering the cost to train an accurate ML model.

  • Hui Wen Goh
  • Jonas Mueller
Cleanlab: The History, Present, and Future

04/01/2022

How an MIT grad student project became a company with tech used by Google, Amazon, Tesla, Uber, Facebook, and companies around the world.

  • Curtis Northcutt
Cleanlab Studio: Issues Found in Popular Datasets

The Cleanlab Studio Audit uses AI to auto-detect problems in given data. Explore all sorts of issues found in popular datasets by Cleanlab Studio!

    Detecting Dataset Drift and Non-IID Sampling: A k-Nearest Neighbors approach that works for Image/Text/Audio/Numeric Data

    05/30/2023

    A simple method to determine if a dataset violates the IID assumption in common ways (e.g. temporal drift, or interaction between almost adjacent datapoints).

    • Jesse Cummings
    • Elías Snorrason
    • Jonas Mueller
    Effectively Annotate Text Data for Transformers via Active Learning + Re-labeling

    05/22/2023

    Use ActiveLab to efficiently choose which data to (re)label to train the best Transformer model.

    • Chris Mauck
    Training Transformer Networks in Scikit-Learn?!

    03/08/2023

    Learn how to easily make any TensorFlow/Keras model compatible with scikit-learn.

    • Hui Wen Goh
    cleanlab 2.3 adds support for Active Learning, Tensorflow/Keras models made sklearn-compatible, and highly scalable Label Error Detection
    Handling Mislabeled Tabular Data to Improve Your XGBoost Model

    02/06/2023

    Learn how to reduce prediction errors by 70% using data-centric techniques with cleanlab.

    • Chris Mauck
    Automatic Error Detection for Image/Text Tagging and Multi-label Datasets

    11/29/2022

    Introducing new data quality algorithms for multi-label classification in cleanlab v2.2.

    • Aditya Thyagarajan
    • Elías Snorrason
    • Curtis Northcutt
    • Jonas Mueller
    Out-of-Distribution Detection via Embeddings or Predictions

    10/19/2022

    Introducing cleanlab's two new methods to detect outliers and how they perform on real image data.

    • Ulyana Tkachenko
    • Jonas Mueller
    A Simple Adjustment Improves Out-of-Distribution Detection for Any Classifier

    10/19/2022

    Exploring new ways to identify outliers based on probabilistic predictions from a trained classifier.

    • Ulyana Tkachenko
    • Jonas Mueller
    • Curtis Northcutt
    Detecting Label Errors in Entity Recognition Data

    10/12/2022

    Understanding cleanlab's new methods for text-based token classification tasks.

    • Wei-Chen (Eric) Wang
    • Elías Snorrason
    • Jonas Mueller
    CROWDLAB: Simple and effective algorithms to handle data labeled by multiple annotators

    10/05/2022

    Understanding cleanlab's new methods for multi-annotator data and what makes them effective.

    • Hui Wen Goh
    • Ulyana Tkachenko
    • Jonas Mueller
    cleanlab 2.1 adds Multi-Annotator Analysis and Outlier Detection: toward a broad framework for Data-Centric AI

    09/21/2022

    Highlighting new features available in cleanlab 2.1.

    • Curtis Northcutt
    • Jonas Mueller
    How we built Cleanlab Vizzy

    08/17/2022

    How we built an in-browser visualization of Cleanlab's Confident Learning algorithm.

    • Caleb Chiam
    • Luke Mainwaring
    • Yiming Chen
    Handling Label Errors in Text Classification Datasets

    05/10/2022

    Learn how to find label issues in text datasets and improve NLP models.

    • Wei Jing Lok
    • Jonas Mueller
    • Hui Wen Goh
    Finding Label Issues in Audio Classification Datasets

    04/27/2022

    Learn how to find label issues in any audio classification dataset.

    • Johnson Kuan
    • Jonas Mueller
    • Anish Athalye
    Finding Label Issues in Image Classification Datasets

    04/21/2022

    Learn how to automatically find label issues in any image classification dataset.

    • Wei Jing Lok
    • Jonas Mueller
    cleanlab 2.0: Automatically Find Errors in ML Datasets

    04/21/2022

    Announcing cleanlab 2.0: an open-source framework for machine learning and analytics with messy, real-world data.

    • Curtis Northcutt
    • Jonas Mueller
    • Anish Athalye