Speed Things Up with Quantization

Quantization is a technique for speeding up inference (and, in some setups, training) by storing and computing tensors at a lower bit width than 32-bit floating point. We at Labelf use PyTorch as our primary deep learning library, and it supports int8 quantization. Since an int8 value takes 8 bits instead of 32, quantizing this way gives roughly a 4x reduction in model size and memory bandwidth compared to a regular fp32 model, and int8 compute is typically around 2 to 4 times faster on hardware that supports it (NEAT).

So far so good, right? Quantizing your model does, however, come at a cost. You typically lose a little accuracy due to the approximations that occur when converting from fp32 to int8. So, whether you should quantize your model or not really depends on the scenario you’re facing. If you’re planning on running your model locally on a device with limited memory, quantization might be a good trade-off.
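
To make that cost concrete, here is a minimal sketch of the kind of rounding that happens when fp32 values are squeezed into int8. It uses a generic affine scheme (one shared scale and zero point) in plain Python. The helper names `quantize` and `dequantize` and the example weights are made up for illustration, and PyTorch’s own kernels may differ in the details.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers using one shared scale and zero point."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1  # -128..127 for int8
    # The represented range must include 0 so that 0.0 maps to an exact integer
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 if all values are 0
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map the integers back to (approximate) floats."""
    return [(x - zero_point) * scale for x in q]


# Illustrative values only, not taken from a real model
weights = [-1.2, -0.31, 0.0, 0.07, 0.5, 2.4]
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print("restored:   ", restored)
print("max error:  ", max_error, "(step size:", scale, ")")
```

The round trip does not land exactly on the original values. In this example each value can be off by up to about half a quantization step (`scale / 2`), and that rounding is where the small accuracy loss comes from.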

PyTorch supports a number of quantization options. An easy method is dynamic quantization, where the weights are converted to int8 ahead of time and the activations are converted dynamically, right before each computation. If you want to test out dynamic quantization, it’s this easy. You simply convert your model with a one-liner:


import torch
import torch.quantization

# `model` is your trained fp32 model; only its nn.Linear layers are quantized here
quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
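
If you don’t have a model at hand, the one-liner can be tried on a toy network. The sketch below is only an illustration: the small `nn.Sequential` model and the random input are made up, and it assumes a CPU build of PyTorch with a quantized backend available.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for your trained fp32 model
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
).eval()

# Same call as above: only the nn.Linear layers are swapped for int8 versions
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Printing the model shows which layers were replaced
print(quantized_model)

# Both models take ordinary fp32 inputs and return fp32 outputs
x = torch.randn(4, 128)
with torch.no_grad():
    max_diff = (model(x) - quantized_model(x)).abs().max().item()
print(f"Largest output difference: {max_diff:.4f}")
```

The original `model` is left untouched, since `quantize_dynamic` returns a converted copy by default.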

If you want to compare the model sizes, just run:


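import os
import torch

def get_name(variable, namespace):
    # Look up the name(s) bound to this object in the given namespace, e.g. globals()
    return [name for name in namespace if namespace[name] is variable]

def print_size(model):
    # Save the weights to a temporary file and report its size on disk
    torch.save(model.state_dict(), "temp.p")
    size = os.path.getsize("temp.p")
    print(f"Model: {get_name(model, globals())[0]}, Size: {size/1e3} kB")
    os.remove("temp.p")

print_size(model)
print_size(quantized_model)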

Finally, don’t forget to check the speedup and how the accuracy compares!
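
One way to run both checks is sketched below. It is a rough illustration under a few assumptions: `mean_latency_ms` and `agreement` are hypothetical helpers, the toy model and random inputs stand in for your own model and data, and agreement with the fp32 model’s predictions is used as a proxy for accuracy because no labels are involved. With a labeled validation set you would compare real accuracy instead.

```python
import time

import torch
import torch.nn as nn


def mean_latency_ms(model, example, warmup=5, runs=50):
    """Average time of one forward pass, in milliseconds."""
    with torch.no_grad():
        for _ in range(warmup):  # warm-up runs are not timed
            model(example)
        start = time.perf_counter()
        for _ in range(runs):
            model(example)
        elapsed = time.perf_counter() - start
    return elapsed / runs * 1e3


def agreement(model_a, model_b, inputs):
    """Fraction of inputs on which both models predict the same class."""
    with torch.no_grad():
        same = model_a(inputs).argmax(dim=1) == model_b(inputs).argmax(dim=1)
    return same.float().mean().item()


torch.manual_seed(0)

# Hypothetical stand-in for your trained fp32 model
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

batch = torch.randn(64, 512)  # random inputs, for illustration only
fp32_ms = mean_latency_ms(model, batch)
int8_ms = mean_latency_ms(quantized_model, batch)
match = agreement(model, quantized_model, batch)

print(f"fp32: {fp32_ms:.3f} ms, int8: {int8_ms:.3f} ms per batch")
print(f"Same prediction on {match:.0%} of inputs")
```

Timings vary between machines and runs, and small models may see little or no gain, so measure on the hardware you plan to deploy on.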

For further reading, check out PyTorch’s awesome docs:

https://pytorch.org/docs/stable/quantization.html

There are also a lot of interesting things happening in research:

I-BERT: Integer-only BERT Quantization

https://arxiv.org/abs/2101.01321

Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer

BinaryBERT: Pushing the Limit of BERT Quantization

https://arxiv.org/abs/2012.15701

Haoli Bai, Wei Zhang, Lu Hou, Lifeng Shang, Jing Jin, Xin Jiang, Qun Liu, Michael Lyu, Irwin King
