Large-Scale Pre-Trained Language Models

DistilBERT

How can we compress BERT while keeping 97% of the performance?

Naoki
5 min read · Mar 6, 2022

In 2019, the team at Hugging Face released a BERT-based model that was 40% smaller and 60% faster while retaining 97% of BERT's language understanding capabilities. They called it DistilBERT.
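If you want to check the size claim yourself, here is a minimal sketch using the Hugging Face transformers library; it assumes the standard bert-base-uncased and distilbert-base-uncased checkpoints and simply counts trainable parameters in each model:

from transformers import AutoModel

bert = AutoModel.from_pretrained("bert-base-uncased")
distilbert = AutoModel.from_pretrained("distilbert-base-uncased")

def count_params(model):
    # Total number of trainable parameters in the model
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"BERT-base:  {count_params(bert) / 1e6:.0f}M parameters")
print(f"DistilBERT: {count_params(distilbert) / 1e6:.0f}M parameters")

Running this shows roughly 110M parameters for BERT-base versus roughly 66M for DistilBERT, which is where the "40% smaller" figure comes from.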
