Large-Scale Pre-Trained Language Models


How can we compress BERT while keeping 97% of the performance?

5 min read · Mar 6, 2022


In 2019, the team at Hugging Face released a model based on BERT that was 40% smaller and 60% faster than the original while retaining 97% of its language understanding capability. They called it DistilBERT.

