
Speed Things Up with Quantization

January 11, 2021
Dev

Quantization is a technique that can speed up inference (and in some cases training) by storing and computing tensors at a lower bit width than 32-bit floating point. We at Labelf use PyTorch as our primary deep learning library, and it supports int8 quantization. Quantizing this way gives roughly a 4x reduction in model size and memory use compared to a regular fp32 model, and usually a noticeable inference speed-up on top of that (neat!).

So far so good, right? Quantizing your model does, however, come at a cost: you lose a little accuracy due to the approximations that occur when converting from fp32 to int8. So whether you should quantize your model or not really depends on the scenario you're facing. If you're planning on running your model locally on a device with limited memory, quantization might be a good trade-off.
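To get a feel for where that approximation error comes from, here is a minimal sketch (the scale and zero point are picked arbitrarily, just for illustration) that quantizes a single fp32 tensor to int8 and converts it back:


import torch

# A small fp32 tensor we want to represent as int8
x = torch.randn(4)

# Map the fp32 values to 8-bit integers using a scale and a zero point
xq = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.qint8)

print(x)                # original fp32 values
print(xq.int_repr())    # the underlying int8 representation
print(xq.dequantize())  # back to fp32, note the small rounding error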

PyTorch supports a number of quantization options. An easy one is dynamic quantization, where the weights are converted to int8 ahead of time and the activations are quantized on the fly during inference. If you want to test out dynamic quantization, it's this easy. You simply convert your model with a one-liner:


import torch.quantization

# Replace the model's nn.Linear modules with dynamically quantized
# versions that store their weights as int8
quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
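The set passed as the second argument tells PyTorch which module types to quantize; here only the nn.Linear layers are converted, which tends to be where most of the compute sits in transformer-style models.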

If you want to compare the model sizes, just:


import os

def get_name(variable, namespace):
    # Look up the name(s) a given object is bound to in a namespace
    return [name for name in namespace if namespace[name] is variable]

def print_size(model):
    # Serialize the model to disk and report the file size
    torch.save(model.state_dict(), "temp.p")
    size = os.path.getsize("temp.p")
    print(f"Model: {get_name(model, globals())[0]}, Size: {size / 1e3} KB")
    os.remove("temp.p")

print_size(model)
print_size(quantized_model)

Finally, don't forget to check the speed-up and how the accuracy compares!
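As a rough starting point, a latency comparison can look something like the sketch below (time_model and example_input are just illustrative names, not part of any library; example_input should be a batch shaped like your real data):


import time
import torch

def time_model(model, example_input, runs=50):
    # Average forward-pass latency over a number of runs
    model.eval()
    with torch.no_grad():
        model(example_input)  # warm-up run
        start = time.time()
        for _ in range(runs):
            model(example_input)
    return (time.time() - start) / runs

print(f"fp32: {time_model(model, example_input):.4f} s/batch")
print(f"int8: {time_model(quantized_model, example_input):.4f} s/batch")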

For further reading, check out PyTorch's awesome docs:

https://pytorch.org/docs/stable/quantization.html

There's also a lot of interesting work happening in research:

I-BERT: Integer-only BERT Quantization

https://arxiv.org/abs/2101.01321

Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer

BinaryBERT: Pushing the Limit of BERT Quantization

https://arxiv.org/abs/2012.15701

Haoli Bai, Wei Zhang, Lu Hou, Lifeng Shang, Jing Jin, Xin Jiang, Qun Liu, Michael Lyu, Irwin King
