Large-Scale Pre-Trained Language Models

DistilBERT

How can we compress BERT while keeping 97% of the performance?

Naoki
5 min read · Mar 6, 2022

In 2019, the team at Hugging Face released a BERT-based model that was 40% smaller and 60% faster while retaining 97% of BERT's language understanding capabilities. They called it DistilBERT.
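If you want to check the size claim yourself, here is a minimal sketch using the Hugging Face transformers library; it assumes the standard bert-base-uncased and distilbert-base-uncased checkpoints and simply counts trainable parameters in each model:

from transformers import AutoModel

bert = AutoModel.from_pretrained("bert-base-uncased")
distilbert = AutoModel.from_pretrained("distilbert-base-uncased")

def count_params(model):
    # Total number of trainable parameters in the model
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"BERT-base:  {count_params(bert) / 1e6:.0f}M parameters")
print(f"DistilBERT: {count_params(distilbert) / 1e6:.0f}M parameters")

Running this shows roughly 110M parameters for BERT-base versus roughly 66M for DistilBERT, which is where the "40% smaller" figure comes from.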
