1. Introduction

BiBench: Benchmarking and Analyzing Network Binarization

Abstract. Neural network binarization is one of the most promising compression approaches, offering extraordinary computation and memory savings by minimizing the bit-width of weights and activations. However, despite being a general technique, recent works reveal that applying binarization in various practical scenarios, including multiple tasks, architectures, and hardware, is not trivial. Moreover, common challenges, such as severe degradation in accuracy and limited efficiency gains, suggest that specific attributes of binarization are not thoroughly studied and adequately understood. To comprehensively understand binarization methods, we present BiBench, a carefully engineered benchmark with in-depth analysis for network binarization. We first inspect the requirements of binarization in actual production settings. Then, for a fair and systematic study, we define the evaluation tracks and metrics. We also perform a comprehensive evaluation with a rich collection of milestone binarization algorithms. Our benchmark results show that binarization still faces severe accuracy challenges, and newer state-of-the-art binarization algorithms bring diminishing improvements, even at the expense of efficiency. Moreover, the actual deployment of certain binarization operations reveals a surprisingly large deviation from their theoretical consumption. Finally, based on our benchmark results and analysis, we suggest a paradigm for accurate and efficient binarization among existing techniques. We hope BiBench paves the way toward more extensive adoption of network binarization and serves as a fundamental work for future research.

Note: we are continuously integrating and polishing this repository and will publish a stable version upon acceptance.

2. Installation

2.1. Environment Preparation

  1. Create a conda virtual environment and activate it.

conda create -n bibench python=3.8 -y
conda activate bibench
  2. Install PyTorch and torchvision following the official instructions.

conda install pytorch={torch_version} torchvision cudatoolkit={cu_version} -c pytorch

E.g., install PyTorch 1.8.0 & CUDA 10.2.

conda install pytorch=1.8.0 torchvision cudatoolkit=10.2 -c pytorch

Important: Make sure that your compilation CUDA version and runtime CUDA version match. Besides, for RTX 30 series GPU, cudatoolkit>=11.0 is required.

  3. Install mmcv and other repositories for different tasks.

  • mmcv-full

We recommend installing the pre-built package as below.

For CPU:

pip install mmcv-full -f https://download.openmmlab.com/mmcv/dist/cpu/{torch_version}/index.html

Please replace {torch_version} in the URL with your desired version.

For GPU:

pip install "mmcv-full>=1.3.17,<=1.5.3" -f https://download.openmmlab.com/mmcv/dist/{cu_version}/{torch_version}/index.html

Please replace {cu_version} and {torch_version} in the URL with your desired versions.

For example, to install mmcv-full with CUDA 10.2 and PyTorch 1.8.0, use the following command:

pip install "mmcv-full>=1.3.17,<=1.5.3" -f https://download.openmmlab.com/mmcv/dist/cu102/torch1.8.0/index.html

See here for the MMCV versions compatible with different PyTorch and CUDA versions. For more version download links, refer to openmmlab-download.

Optionally, you can compile mmcv from source with the following commands:

git clone https://github.com/open-mmlab/mmcv.git -b v1.5.3
cd mmcv
MMCV_WITH_OPS=1 pip install -e .  # package mmcv-full, which contains cuda ops, will be installed after this step
# OR pip install -e .  # package mmcv, which contains no cuda ops, will be installed after this step
cd ..

Important: You need to run pip uninstall mmcv first if you have mmcv installed. If mmcv and mmcv-full are both installed, there will be a ModuleNotFoundError.

  • mmcls

pip install mmcls
  • bipc, bispeech, binlp

These repositories are included in the source code. You can move into each directory and run pip install -v -e . to install them.

  • mmdet (Optional)

pip install mmdet

2.2. Data Preparation

CIFAR-10 & ImageNet. We follow the dataset usage of MMClassification in this part. This implementation of CIFAR-10 is modified from this link. Since the ImageNet21k dataset is extremely large (it contains 21k+ classes and 1.4B files), the corresponding dataset class improves on the ImageNet class in several ways; to save memory, we enable the serialize_data option by default.

Pascal VOC & COCO. We follow the dataset usage of MMDetection in this part. Public datasets like Pascal VOC and COCO are available from official websites or mirrors. Note: In the detection task, Pascal VOC 2012 is an extension of Pascal VOC 2007 without overlap, and we usually use them together.

ModelNet40 and ShapeNet. The aligned ModelNet and ShapeNet datasets can be downloaded from link1 and link2, respectively, and should be saved in the corresponding folders.

GLUE. The original GLUE data can be accessed from this link. Put the original data (train.csv, dev.csv) and the augmented data (named train_${TASK_NAME}_aug_with_logits.csv) into ${GLUE_DIR}/${TASK_NAME}.

Speech Commands. The Google Speech Commands V1 dataset can be downloaded from the link in the reference documentation.

The dataset directory should be organized as follows.

BiBench
├── data
│   ├── datasets
│   │   ├── cifar10
│   │   ├── imagenet
│   │   ├── VOCdevkit
│   │   ├── coco
│   │   ├── ModelNet40
│   │   ├── ShapeNet
│   │   ├── GLUE
│   │   ├── SpeechCommands

2.3. Training

Training with a single / multiple GPUs

python tools/train.py ${CONFIG_FILE} ${WORK_DIR}

Example: using 1 GPU to train BiBench.

python tools/train.py ${CONFIG_FILE} ${WORK_DIR} --gpus 1

Training with Slurm

If you run BiBench on a cluster managed with Slurm, you can use the script slurm_train.sh.

./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR} ${GPU_NUM}

Common optional arguments include:

  • --resume-from ${CHECKPOINT_FILE}: Resume from a previous checkpoint file.

Example: using 8 GPUs to train BiBench on a slurm cluster.

./tools/slurm_train.sh my_partition my_job configs/acc_cifar10/resnet18_bnn_adam_1e-3_cosinelr.py work_dirs/acc_cifar10 8

You can check slurm_train.sh for full arguments and environment variables.

2.4. Add Custom Binarization Algorithms

With just 3 steps, researchers can define and evaluate custom binarization algorithms easily in BiBench:

Step 1. Operator definition: create a file for the custom binarization algorithm under bibench/models/layers, and complete the definitions of the binarized Conv1d, Conv2d, and Linear operators in it.

Step 2. Operator registration: register the binarized operators defined in Step 1 to CONV_LAYERS in bibench/models/layers/builder.py.

Step 3. Configuration definition: define the configuration for the learning task, neural architecture, or any track you would like to evaluate (existing configurations can be referred to).

Then you can get started with BiBench and evaluate your binarization algorithm!
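
For reference, below is a minimal sketch of Steps 1 and 2 for a binarized Conv2d (Conv1d and Linear follow analogously). The file and class names are illustrative, the identity STE is used only for brevity, and the sketch assumes CONV_LAYERS is an mmcv-style registry exposed in bibench/models/layers/builder.py:

# bibench/models/layers/my_binary_conv.py -- illustrative sketch, not the shipped code
import torch
import torch.nn as nn
import torch.nn.functional as F

from bibench.models.layers.builder import CONV_LAYERS  # registry from Step 2


@CONV_LAYERS.register_module()
class MyBiConv2d(nn.Conv2d):
    """Conv2d with 1-bit weights and activations."""

    @staticmethod
    def _binarize(x):
        # Forward: sign(x); backward: identity straight-through gradient.
        return (torch.sign(x) - x).detach() + x

    def forward(self, x):
        return F.conv2d(self._binarize(x), self._binarize(self.weight), self.bias,
                        self.stride, self.padding, self.dilation, self.groups)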

3. Binarization Algorithms

BNN. During the training process, BNN uses the straight-through estimator (STE) to calculate the gradient \(\boldsymbol{g_{x}}\), which takes the saturation effect into account:

\[\begin{split}\mathtt{sign}(\boldsymbol{x})= \begin{cases} +1,& \mathrm{if} \ \boldsymbol x \ge 0\\ -1,& \mathrm{otherwise} \end{cases}\qquad \boldsymbol{g_{x}}= \begin{cases} \boldsymbol{g_b},& \mathrm{if} \ \boldsymbol x \in \left(-1, 1\right)\\ 0,& \mathrm{otherwise}. \end{cases}\end{split}\]

And during inference, the computation process is expressed as

\[\boldsymbol o = \operatorname{sign}(\boldsymbol{a}) \circledast \operatorname{sign}(\boldsymbol{w}),\]

where \(\circledast\) indicates a convolutional operation using XNOR and bitcount operations.
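
As a rough PyTorch sketch (an illustration rather than the repository's exact implementation), the sign forward and the clipped STE gradient above can be written as a custom autograd function:

import torch


class BinarySign(torch.autograd.Function):
    """Forward: sign(x) with sign(0) = +1; backward: clipped straight-through estimator."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        # g_x = g_b if x in (-1, 1), otherwise 0 (accounts for saturation).
        return grad_output * ((x > -1) & (x < 1)).to(grad_output.dtype)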

The related code in our codebase refers to BinaryNet and the original paper.

XNOR-Net. XNOR-Net obtains the channel-wise scaling factors \(\boldsymbol \alpha=\frac{\|\boldsymbol{w}\|_{\ell 1}}{n}\) for the weight (where \(n\) is the number of weight elements), and \(\boldsymbol{K}\) contains scaling factors \(\beta\) for all sub-tensors in activation \(\boldsymbol{a}\). We can approximate the convolution between activation \(\boldsymbol{a}\) and weight \(\boldsymbol{w}\) mainly using binary operations:

\[\boldsymbol o = (\operatorname{sign}(\boldsymbol{a}) \circledast \operatorname{sign}(\boldsymbol{w})) \odot \boldsymbol{K} \boldsymbol \alpha,\]

where \(\boldsymbol{w} \in \mathbb{R}^{c \times w \times h}\) and \(\boldsymbol{a} \in \mathbb{R}^{c \times w_{\text {in }} \times h_{\text {in }}}\) denote the weight and input tensor, respectively. And the STE is also applied in the backward propagation of the training process.
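
A simplified sketch of this forward pass, with the binary convolution simulated by floating-point operations (actual deployment would use XNOR and bitcount kernels; see the referenced implementations):

import torch
import torch.nn.functional as F


def xnor_conv2d(a, w, stride=1, padding=0):
    """XNOR-Net style forward: (sign(a) conv sign(w)) * K * alpha."""
    # Channel-wise weight scaling factors: mean absolute value per output channel.
    alpha = w.abs().mean(dim=(1, 2, 3)).view(1, -1, 1, 1)
    # Activation scaling factors K: average |a| over channels, then convolve
    # with a uniform kernel of the same spatial size as the weight.
    A = a.abs().mean(dim=1, keepdim=True)
    k = torch.full((1, 1, w.size(2), w.size(3)), 1.0 / (w.size(2) * w.size(3)),
                   dtype=a.dtype, device=a.device)
    K = F.conv2d(A, k, stride=stride, padding=padding)
    # Binary convolution simulated with float sign tensors.
    o = F.conv2d(torch.sign(a), torch.sign(w), stride=stride, padding=padding)
    return o * K * alpha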

The related code in our codebase refers to XNOR-Net (1), XNOR-Net (2), and the original paper.

DoReFa-Net. DoReFa-Net applies the following function for \(1\)-bit weights and activations:

\[\boldsymbol o = (\operatorname{sign}(\boldsymbol{a}) \circledast \operatorname{sign}(\boldsymbol{w})) \odot \boldsymbol \alpha.\]

And the STE is also applied in the backward propagation with the full-precision gradient.
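
A minimal sketch of the forward pass, assuming the common choice of \(\boldsymbol \alpha\) as the mean absolute value of the weight tensor (check the referenced implementations for the exact form used in our codebase):

import torch
import torch.nn.functional as F


def dorefa_conv2d(a, w, stride=1, padding=0):
    """DoReFa-Net 1-bit forward: (sign(a) conv sign(w)) * alpha."""
    alpha = w.abs().mean()  # a single scaling factor for the whole weight tensor
    return alpha * F.conv2d(torch.sign(a), torch.sign(w),
                            stride=stride, padding=padding)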

The related code in our codebase refers to DoReFa-Net (1), DoReFa-Net (2), and the original paper.

Bi-Real Net. Bi-Real Net proposes a piece-wise polynomial function as the gradient approximation function:

\[\begin{split} \operatorname{bireal}\left(\boldsymbol{a}\right)=\left\{\begin{array}{lr} -1 & \text { if } \boldsymbol{a}<-1 \\ 2 \boldsymbol{a}+\boldsymbol{a}^2 & \text { if }-1 \leqslant \boldsymbol{a}<0 \\ 2 \boldsymbol{a}-\boldsymbol{a}^2 & \text { if } 0 \leqslant \boldsymbol{a}<1 \\ 1 & \text { otherwise } \end{array}, \quad \frac{\partial \operatorname{bireal}\left(\boldsymbol{a}\right)}{\partial \boldsymbol{a}}= \begin{cases}2+2 \boldsymbol{a} & \text { if }-1 \leqslant \boldsymbol{a}<0 \\ 2-2 \boldsymbol{a} & \text { if } 0 \leqslant \boldsymbol{a}<1 \\ 0 & \text { otherwise }\end{cases}\right. .\end{split}\]

And the forward propagation of Bi-Real Net is the same as DoReFa-Net.
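
A sketch of the piecewise-polynomial gradient approximation above as a PyTorch autograd function:

import torch


class BiRealSign(torch.autograd.Function):
    """Forward: sign(a); backward: derivative of the bireal(a) polynomial."""

    @staticmethod
    def forward(ctx, a):
        ctx.save_for_backward(a)
        return torch.sign(a)

    @staticmethod
    def backward(ctx, grad_output):
        a, = ctx.saved_tensors
        grad = torch.zeros_like(a)
        grad = torch.where((a >= -1) & (a < 0), 2 + 2 * a, grad)
        grad = torch.where((a >= 0) & (a < 1), 2 - 2 * a, grad)
        return grad_output * grad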

The related code in our codebase refers to Bi-Real Net and the original paper.

XNOR-Net++. XNOR-Net++ proposes to re-formulate XNOR-Net as:

\[\boldsymbol{o} = (\operatorname{sign}(\boldsymbol{a}) \circledast \operatorname{sign}(\boldsymbol{w})) \odot \boldsymbol \Gamma,\]

and we adopt \(\boldsymbol \Gamma\) in the following form in our experiments (which achieves the best performance in the original paper):

\[\boldsymbol \Gamma=\boldsymbol \alpha \otimes \boldsymbol \beta \otimes \boldsymbol \gamma, \quad \boldsymbol \alpha \in \mathbb{R}^{\boldsymbol{o}}, \boldsymbol \beta \in \mathbb{R}^{h_{\text {out }}}, \boldsymbol \gamma \in \mathbb{R}^{w_{\text {out }}},\]

where \(\boldsymbol \alpha\), \(\boldsymbol \beta\), and \(\boldsymbol \gamma\) are learnable during training.
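
A sketch of this factorized learnable scaling (the module and parameter names are illustrative, and the output spatial size must be known when the module is constructed):

import torch
import torch.nn as nn


class GammaScale(nn.Module):
    """Gamma = alpha (outer) beta (outer) gamma, applied to the binary conv output."""

    def __init__(self, out_channels, h_out, w_out):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(out_channels, 1, 1))  # R^o
        self.beta = nn.Parameter(torch.ones(1, h_out, 1))          # R^{h_out}
        self.gamma = nn.Parameter(torch.ones(1, 1, w_out))         # R^{w_out}

    def forward(self, o):
        # o: (N, C_out, H_out, W_out); broadcasting forms the rank-1 outer product.
        return o * (self.alpha * self.beta * self.gamma)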

The related code in our codebase refers to XNOR-Net++ and the original paper.

ReActNet. ReActNet defines an RSign as a binarization function with channel-wise learnable thresholds:

\[\begin{split}\boldsymbol{x}=\operatorname{rsign}\left(\boldsymbol{x}\right)=\left\{\begin{array}{ll} +1, & \text { if } \boldsymbol{x}>\boldsymbol \alpha \\ -1, & \text { if } \boldsymbol{x} \leq \boldsymbol \alpha \end{array} .\right.\end{split}\]

where \(\boldsymbol \alpha\) is a learnable coefficient controlling the threshold. And the forward propagation is

\[\boldsymbol o = (\operatorname{rsign}(\boldsymbol{a}) \circledast \operatorname{sign}(\boldsymbol{w})) \odot \boldsymbol \alpha.\]
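
A sketch of RSign with channel-wise learnable thresholds (an identity STE is used here for brevity in place of ReActNet's approximated gradient):

import torch
import torch.nn as nn


class RSign(nn.Module):
    """rsign(x): +1 if x > alpha, -1 otherwise, with a learnable per-channel alpha."""

    def __init__(self, channels):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        shifted = x - self.alpha
        # Forward: sign(shifted); backward: straight-through gradient.
        return (torch.sign(shifted) - shifted).detach() + shifted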

The related code in our codebase refers to ReActNet and the original paper.

ReCU. As described in their paper, ReCU is formulated as

\[\operatorname{recu}(\boldsymbol{w})=\max \left(\min \left(\boldsymbol{w}, Q_{(\tau)}\right), Q_{(1-\tau)}\right),\]

where \(Q_{(\tau)}\) and \(Q_{(1-\tau)}\) denote the \(\tau\) quantile and \(1-\tau\) quantile of \(\boldsymbol{w}\), respectively. And the other implementation details strictly follow the original paper and official code.
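
A simplified per-tensor sketch of the clamp above (the original method additionally standardizes the weights, and \(\tau\) is a hyper-parameter; the value below is only a placeholder):

import torch


def recu(w, tau=0.99):
    """Clamp w into [Q_(1-tau), Q_(tau)], the (1-tau)- and tau-quantiles of w."""
    q_hi = torch.quantile(w, tau)
    q_lo = torch.quantile(w, 1 - tau)
    return torch.maximum(torch.minimum(w, q_hi), q_lo)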

The related code in our codebase refers to ReCU and the original paper.

FDA. FDA computes the gradient with respect to the input \(\mathbf{t}\) in the backward propagation as:

\[\frac{\partial \ell}{\partial \mathbf{t}}=\frac{\partial \ell}{\partial \boldsymbol{o}} \boldsymbol{w}_2^{\top} \odot\left(\left(\mathbf{t} \boldsymbol{w}_1\right) \geq 0\right) \boldsymbol{w}_1^{\top} +\frac{\partial \ell}{\partial \boldsymbol{o}} \eta^{\prime}(\mathbf{t}) +\frac{\partial \ell}{\partial \boldsymbol{o}} \odot \frac{4 \omega}{\pi} \sum_{i=0}^n \cos ((2 i+1) \omega \mathbf{t}),\]

where \(\frac{\partial \ell}{\partial \boldsymbol{o}}\) is the gradient from the upper layers, \(\odot\) represents element-wise multiplication, and \(\frac{\partial \ell}{\partial \mathbf{t}}\) is the partial gradient on \(\mathbf{t}\) that is back-propagated to the preceding layer. And \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\) are the weights in the original model and the noise adaptation module, respectively. FDA updates them as

\[\frac{\partial \ell}{\partial \boldsymbol{w}_1}=\mathbf{t}^{\top} \frac{\partial \ell}{\partial \boldsymbol{o}} \boldsymbol{w}_2^{\top} \odot\left(\left(\mathbf{t} \boldsymbol{w}_1\right) \geq 0\right),\qquad \frac{\partial \ell}{\partial \boldsymbol{w}_2}=\sigma\left(\mathbf{t} \boldsymbol{w}_1\right)^{\top} \frac{\partial \ell}{\partial \boldsymbol{o}}.\]
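
The last term of \(\frac{\partial \ell}{\partial \mathbf{t}}\) is the derivative of the truncated Fourier series that FDA uses to approximate the sign function. A small sketch of that series and its derivative follows (the terms involving the noise-adaptation module weights \(\boldsymbol{w}_1\) and \(\boldsymbol{w}_2\) are omitted, and the truncation order is a placeholder):

import math

import torch


def fda_sign_series(t, omega=1.0, n=10):
    """Truncated Fourier series of a square wave, approximating sign(t)."""
    s = torch.zeros_like(t)
    for i in range(n + 1):
        k = 2 * i + 1
        s = s + torch.sin(k * omega * t) / k
    return 4.0 / math.pi * s


def fda_sign_series_grad(t, omega=1.0, n=10):
    """Derivative: (4 * omega / pi) * sum cos((2i+1) * omega * t), as in the last term above."""
    g = torch.zeros_like(t)
    for i in range(n + 1):
        g = g + torch.cos((2 * i + 1) * omega * t)
    return 4.0 * omega / math.pi * g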

The related code in our codebase refers to FDA and the original paper.

4. Learning Tasks

4.1. 2D Visual Tasks

The classification tasks' implementations in our codebase borrow from the related tasks in MMClassification, including the CIFAR-10 and ImageNet classification tasks and models.

CIFAR-10. The CIFAR-10 dataset (Canadian Institute For Advanced Research) is a collection of images commonly used to train machine learning and computer vision algorithms. This dataset is widely used for image classification tasks. There are 60,000 color images, each of which measures 32x32 pixels. All images are categorized into 10 different classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. Each class has 6000 images, where 5000 are for training and 1000 are for testing.

ImageNet. ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories.

The images are collected from the web and labeled by human labelers using a crowd-sourced image labeling service called Amazon Mechanical Turk. As part of the Pascal Visual Object Challenge, the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) was established in 2010. There are approximately 1.2 million training images, 50,000 validation images, and 150,000 testing images in total in ILSVRC. ILSVRC uses a subset of ImageNet, with about 1,000 images in each of its 1,000 categories. ImageNet also uses accuracy as the evaluation metric for the predicted results.

The object detection tasks' implementations in our codebase borrow from the related tasks in MMDetection, including the Pascal VOC07 and COCO17 detection tasks and models.

Pascal VOC07. The PASCAL Visual Object Classes 2007 (Pascal VOC07) dataset contains 20 object categories covering vehicles, household objects, animals, and others: airplane, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa, TV/monitor, bird, cat, cow, dog, horse, sheep, and person. As a benchmark for object detection, semantic segmentation, and object classification, this dataset contains pixel-level segmentation annotations, bounding box annotations, and object class annotations.

COCO17. The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images. According to community feedback, the training/validation split was changed from 83K/41K to 118K/5K in the 2017 release, while the images and annotations themselves remain the same. The 2017 test set is a subset of 41K images from the 2015 test set. Additionally, 123K images are included in the unannotated set.

4.2. 3D Visual Tasks

The 3D point cloud tasks' implementations in our codebase borrow from the related tasks in PointNet and BiPointNet, including the ModelNet40 classification and ShapeNet segmentation tasks and models.

ModelNet40. The ModelNet40 dataset contains point clouds of synthetic objects. As the most widely used benchmark for point cloud analysis, ModelNet40 is popular due to its diverse categories, clean shapes, and well-constructed data. In the original ModelNet40, 12,311 CAD-generated meshes are divided into 40 categories, of which 9,843 are for training and 2,468 are for testing. The point cloud data points are uniformly sampled from the mesh surfaces, then moved to the origin and scaled into a unit sphere.

ShapeNet. ShapeNet is a large-scale repository of 3D CAD models developed by researchers from Stanford University, Princeton University, and the Toyota Technological Institute at Chicago. Organized using WordNet hypernym-hyponym relationships, the repository contains over 3 million models, of which 220,000 are classified into 3,135 classes. There are 31,693 meshes in the ShapeNet Parts subset, divided into 16 categories of objects (e.g., tables, chairs, planes). Each shape contains 2-5 parts (with 50 part classes in total).

4.3. Natural Language Understanding Tasks

The natural language understanding tasks' implementations in our codebase borrow from the related tasks in DynaBERT and BiBERT, including the GLUE benchmark tasks and models.

GLUE. The General Language Understanding Evaluation (GLUE) benchmark is a collection of nine natural language understanding tasks, including the single-sentence tasks CoLA and SST-2, the similarity and paraphrasing tasks MRPC, STS-B, and QQP, and the natural language inference tasks MNLI, QNLI, RTE, and WNLI.

4.4. Speech Tasks

The speech tasks' implementations in our codebase borrow from the related tasks in FSMN and BiFSMN, including the Google Speech Commands classification tasks and models.

Google Speech Commands. The Google Speech Commands dataset (SpeechCom) provides a collection of audio recordings of spoken words for training and evaluation. Its primary goal is to provide a way to build and test small models that detect when a single word from a set of ten target words is spoken, with as few false positives as possible from background noise or unrelated speech.

5. Neural Architectures

5.1. CNNs

The CNNs’ implementations of our codebase borrows from MMClassification and MMDetection.

ResNet. Residual Networks, or ResNets, learn residual functions with reference to the layer inputs instead of learning unreferenced functions. Instead of making stacked layers directly fit a desired underlying mapping, residual nets let these layers fit a residual mapping. There is empirical evidence that these networks are easier to optimize and can achieve higher accuracy with considerably increased depth.

VGG. VGG is a classical convolutional neural network architecture. It was proposed based on an analysis of how to increase the depth of such networks. It is characterized by its simplicity: the network utilizes small 3×3 convolution filters, and the only other components are pooling layers and a fully connected layer.

MobileNetV2. MobileNetV2 is a convolutional neural network architecture that performs well on mobile devices. This model has an inverted residual structure with residual connections between the bottleneck layers. The intermediate expansion layer employs lightweight depthwise convolutions to filter features as a source of nonlinearity. In MobileNetV2, the architecture begins with an initial layer of 32 convolution filters, followed by 19 residual bottleneck layers.

Faster-RCNN. Faster R-CNN is an object detection model that improves Fast R-CNN by utilizing a region proposal network (RPN) with the CNN model. The RPN shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals. A fully convolutional network is used to predict the bounds and objectness scores of objects at each position simultaneously. RPNs use end-to-end training to produce region proposals of high quality and instruct the unified network where to search. Sharing their convolutional features allows RPN and Fast R-CNN to be combined into a single network. Faster R-CNN consists of two modules. The first module is a deep, fully convolutional network that proposes regions, and the second is the detector that uses the proposals for giving the final prediction boxes.

SSD. SSD is a single-stage object detection method that discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. During prediction, each default box is adjusted to better match the shape of the object based on its scores for each object category. In addition, the network automatically handles objects of different sizes by combining predictions from multiple feature maps with different resolutions.

5.2. Transformers

The transformers’ implementations of our codebase borrows from DynaBERT and BiBERT.

BERT. BERT, or Bidirectional Encoder Representations from Transformers, improves upon standard Transformers by removing the unidirectionality constraint using a masked language model (MLM) pre-training objective. By masking some tokens from the input, the masked language model attempts to estimate the original vocabulary id of the masked word based solely on its context. An MLM objective differs from a left-to-right language model in that it enables the representation to integrate the left and right contexts, which facilitates pre-training a deep bidirectional Transformer. Additionally, BERT uses a next-sentence prediction task that pre-trains text-pair representations along with the masked language model. Note that we replace the direct binarized attention with a bi-attention mechanism to prevent the model from completely crashing.

5.3. MLPs

The MLPs’ implementations of our codebase borrows from PointNet, BiPointNet, FSMN and BiFSMN.

PointNet. PointNet is a unified architecture for applications ranging from object classification and part segmentation to scene semantic parsing. The architecture directly takes point clouds as input and outputs either class labels for the entire input or per-point segment/part labels. PointNet-Vanilla is a variant of PointNet that drops the T-Net module. For all PointNet models, we apply EMA-Max as the aggregator, because directly using the max pooling aggregator causes the binarized PointNets to fail to converge.

FSMN. The feedforward sequential memory network (FSMN) is a neural network structure for modeling long-term dependencies in time series without using recurrent feedback. It is a standard fully connected feedforward neural network containing some learnable memory blocks. As a short-term memory mechanism, the memory blocks encode long context information using a tapped-delay line structure.

Deep-FSMN. The Deep-FSMN architecture is an improved feedforward sequential memory network (FSMN) with skip connections between memory blocks in adjacent layers. By utilizing skip connections, information can be transferred across layers, and thus the gradient vanishing problem can be avoided when building very deep structures.