Quantization is a technique that can be used to speed up inference or training by storing tensors, and computing with them, at a lower bit width than 32-bit floating point. We at Labelf use PyTorch as our primary deep learning library, which supports int8 quantization. Since an int8 value takes 1 byte instead of the 4 bytes of an fp32 value, quantizing this way gives roughly a 4x reduction in model size and memory bandwidth compared to a regular fp32 model, and int8 compute is typically 2 to 4 times faster on hardware that supports it (NEAT).
So far so good, right? Quantizing your model does however come at a cost. You basically decrease the accuracy slightly due to approximations that occur when converting from fp32 to int8. So, whether you should quantize your model or not really depends on the scenario you’re facing. If you’re planning on running your model locally on a device with low memory resources, quantization might be a good trade-off.
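To make the "approximation" part concrete, here is a minimal pure-Python sketch of an int8 round trip. It assumes a simple per-tensor affine scheme (one scale and one zero point for the whole list); the helper names and the example weights are made up for illustration, and this is not PyTorch's actual kernel code.

```python
def quantize_int8(values):
    """Map floats to int8 codes using one scale and zero point (illustrative)."""
    # The representable range must include 0.0 so that zero maps to an exact code.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # 255 steps between the int8 extremes
    zero_point = round(-128 - lo / scale)
    # Round to the nearest code and clamp into the int8 range [-128, 127].
    codes = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_int8(codes, scale, zero_point):
    """Map int8 codes back to (approximate) floats."""
    return [(c - zero_point) * scale for c in codes]


weights = [-1.30, -0.42, 0.0, 0.07, 0.55, 2.10]  # made-up example values
codes, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(codes, scale, zero_point)

# The round trip is lossy: every value lands on the nearest representable step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print("scale:", scale, "zero point:", zero_point)
print("max round-trip error:", max_error)
```

The error per value stays within about one quantization step (the scale), which is why the accuracy drop is usually small but not zero.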
PyTorch supports a number of quantization options. An easy method is dynamic quantization, where the weights are converted to int8 ahead of time and the activations are converted dynamically, just before each computation. If you want to test out dynamic quantization, it’s this easy. You simply convert your model with a single call to torch.quantization.quantize_dynamic.
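A minimal sketch of what that conversion can look like. The small `nn.Sequential` model is a hypothetical stand-in for your own trained model, and only the `nn.Linear` layers are quantized here. Newer PyTorch releases also expose the same function under `torch.ao.quantization`.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in model; swap in your own trained nn.Module.
model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 4))
model.eval()  # dynamic quantization is meant for inference

# The one-liner: swap every nn.Linear for a dynamically quantized int8 version.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# The quantized model is called exactly like the original one.
with torch.no_grad():
    logits = quantized_model(torch.randn(8, 256))

print(quantized_model)
print(logits.shape)
```

The original `model` is left untouched, so you can keep both around and compare them.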
If you want to compare the model size, just serialize both models and compare the number of bytes.
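One way to do the size comparison is sketched below. It serializes each state dict into an in-memory buffer and counts the bytes; the helper name `state_dict_size_mb` and the toy model are assumptions for illustration, and writing to a file and checking its size works just as well.

```python
import io

import torch
import torch.nn as nn


def state_dict_size_mb(module):
    """Serialize the state dict into memory and return its size in megabytes."""
    buffer = io.BytesIO()
    torch.save(module.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1e6


# Hypothetical stand-in model with two large linear layers.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).eval()
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

fp32_mb = state_dict_size_mb(model)
int8_mb = state_dict_size_mb(quantized_model)
print(f"fp32: {fp32_mb:.2f} MB, int8: {int8_mb:.2f} MB, ratio: {fp32_mb / int8_mb:.2f}x")
```

Expect the ratio to land somewhat below 4x on real models, since biases and any layers you did not quantize stay in fp32.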
Finally, don’t forget to check the speedup and how the accuracy compares!
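A rough sketch of such a check, under a few assumptions: the model and the random batch are placeholders, `mean_latency_ms` is a made-up helper, and output drift against the fp32 model is used as a stand-in for accuracy. For a real comparison, run both models on your labeled validation set on the hardware you plan to deploy to.

```python
import time

import torch
import torch.nn as nn


def mean_latency_ms(module, batch, runs=20):
    """Average forward-pass time in milliseconds over `runs` calls."""
    with torch.no_grad():
        module(batch)  # warm-up call, not timed
        start = time.perf_counter()
        for _ in range(runs):
            module(batch)
    return (time.perf_counter() - start) / runs * 1e3


torch.manual_seed(0)
# Hypothetical stand-in classifier and input batch.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
batch = torch.randn(64, 512)

fp32_ms = mean_latency_ms(model, batch)
int8_ms = mean_latency_ms(quantized_model, batch)

with torch.no_grad():
    reference, approx = model(batch), quantized_model(batch)

# How far the quantized outputs drift, and how often the predicted class agrees.
max_abs_diff = (reference - approx).abs().max().item()
agreement = (reference.argmax(dim=1) == approx.argmax(dim=1)).float().mean().item()

print(f"fp32: {fp32_ms:.3f} ms, int8: {int8_ms:.3f} ms")
print(f"max abs output diff: {max_abs_diff:.4f}, prediction agreement: {agreement:.2%}")
```

On a model this small the timing difference can be noisy or even negative, so measure with your real model and batch sizes before drawing conclusions.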
For further reading, check out PyTorch’s awesome quantization docs.
There are also a lot of interesting things happening in research:
I-BERT: Integer-only BERT Quantization
BinaryBERT: Pushing the Limit of BERT Quantization