Introduction
------------

**BiBench: Benchmarking and Analyzing Network Binarization**

**Abstract.** Neural network binarization is one of the most promising compression approaches, offering extraordinary computation and memory savings by minimizing the bit-width of weights and activations. However, despite being a general technique, recent works reveal that applying binarization in various practical scenarios, including multiple tasks, architectures, and hardware, is not trivial. Moreover, common challenges, such as severe accuracy degradation and limited efficiency gains, suggest that specific attributes of binarization are not thoroughly studied and adequately understood. To comprehensively understand binarization methods, we present **BiBench**, a carefully engineered benchmark with in-depth analysis for network binarization. We first inspect the requirements of binarization in actual production settings. Then, to ensure fairness and systematic evaluation, we define the evaluation tracks and metrics. We also perform a comprehensive evaluation of a rich collection of milestone binarization algorithms. Our benchmark results show that binarization still faces severe accuracy challenges, and that newer state-of-the-art binarization algorithms bring diminishing improvements, even at the expense of efficiency. Moreover, the actual deployment of certain binarization operations reveals a surprisingly large deviation from their theoretical consumption. Finally, based on our benchmark results and analysis, we suggest a paradigm for accurate and efficient binarization among existing techniques. We hope BiBench paves the way toward more extensive adoption of network binarization and serves as a foundation for future research.

*Note: we are continuously integrating and polishing this repository and will publish a stable version upon acceptance.*

Installation
------------

Environment Preparation
~~~~~~~~~~~~~~~~~~~~~~~

a. Create a conda virtual environment and activate it.

   .. code:: shell

      conda create -n bibench python=3.8 -y
      conda activate bibench

b. Install PyTorch and torchvision following the `official instructions `__.

   .. code:: shell

      conda install pytorch={torch_version} torchvision cudatoolkit={cu_version} -c pytorch

   E.g., install PyTorch 1.8.0 & CUDA 10.2:

   .. code:: shell

      conda install pytorch=1.8.0 torchvision cudatoolkit=10.2 -c pytorch

   **Important:** Make sure that your compilation CUDA version and runtime CUDA version match. Besides, for RTX 30 series GPUs, cudatoolkit>=11.0 is required.

c. Install mmcv and other repositories for different tasks.

   - mmcv-full

     We recommend installing the pre-built package as below.

     For CPU:

     .. code:: shell

        pip install mmcv-full -f https://download.openmmlab.com/mmcv/dist/cpu/{torch_version}/index.html

     Please replace ``{torch_version}`` in the URL with your desired one.

     For GPU:

     .. code:: shell

        pip install "mmcv-full>=1.3.17,<=1.5.3" -f https://download.openmmlab.com/mmcv/dist/{cu_version}/{torch_version}/index.html

     Please replace ``{cu_version}`` and ``{torch_version}`` in the URL with your desired ones.

     For example, to install mmcv-full with CUDA 10.2 and PyTorch 1.8.0, use the following command:

     .. code:: shell

        pip install "mmcv-full>=1.3.17,<=1.5.3" -f https://download.openmmlab.com/mmcv/dist/cu102/torch1.8.0/index.html

     See `here `__ for the versions of MMCV compatible with different PyTorch and CUDA versions. For more download links of other versions, refer to `openmmlab-download `__.

     Optionally, you can compile mmcv from source with the following commands:

     .. code:: shell

        git clone https://github.com/open-mmlab/mmcv.git -b v1.5.3
        cd mmcv
        MMCV_WITH_OPS=1 pip install -e .  # package mmcv-full, which contains cuda ops, will be installed after this step
        # OR
        pip install -e .  # package mmcv, which contains no cuda ops, will be installed after this step
        cd ..

     Important: You need to run ``pip uninstall mmcv`` first if you already have mmcv installed. If mmcv and mmcv-full are both installed, there will be a ``ModuleNotFoundError``.

   - mmcls

     .. code:: shell

        pip install mmcls

   - bipc, bispeech, binlp

     These repositories are included in the source code. Move to each directory and run ``pip install -v -e .`` to install them.

   - mmdet (Optional)

     .. code:: shell

        pip install mmdet
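After the steps above, a quick sanity check can confirm that the PyTorch, CUDA, and mmcv versions agree. This is a minimal sketch and only assumes the packages installed above:

.. code:: python

   # Optional sanity check for the environment prepared above.
   import torch
   import mmcv

   print("torch:", torch.__version__)            # e.g. 1.8.0
   print("cuda (runtime):", torch.version.cuda)  # should match the cudatoolkit chosen above
   print("cuda available:", torch.cuda.is_available())
   print("mmcv:", mmcv.__version__)              # should lie in the supported range, e.g. 1.3.17-1.5.3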
Data Preparation
~~~~~~~~~~~~~~~~

**CIFAR-10 & ImageNet**. We follow the dataset usage of `MMClassification `__ in this part. The implementation of CIFAR-10 is modified from this `link `__. Since the ImageNet21k dataset is extremely large, containing 21k+ classes and 1.4B files, the dataset class improves upon the ``ImageNet`` class in several respects; in particular, to save memory we enable the ``serialize_data`` option by default.

**Pascal VOC & COCO**. We follow the dataset usage of `MMDetection `__ in this part. Public datasets like `Pascal VOC `__ and `COCO `__ are available from their official websites or mirrors. Note: in the detection task, Pascal VOC 2012 is an extension of Pascal VOC 2007 without overlap, and we usually use them together.

**ModelNet40 and ShapeNet**. The aligned ModelNet and ShapeNet datasets can be downloaded at `link1 `__ and `link2 `__, respectively, and should then be saved in the corresponding folders.

**GLUE**. The original GLUE data can be accessed from this `link `__. Put the original data (``train.csv``, ``dev.csv``) and the augmented data (named ``train_${TASK_NAME}_aug_with_logits.csv``) under ``${GLUE_DIR}/${TASK_NAME}``.

**Speech Commands**. The Google Speech Commands V1 dataset can be downloaded from this `link `__.

The dataset directory should look like this:

::

   BiBench
   ├── data
   │   ├── datasets
   │   │   ├── cifar10
   │   │   ├── imagenet
   │   │   ├── VOCdevkit
   │   │   ├── coco
   │   │   ├── ModelNet40
   │   │   ├── ShapeNet
   │   │   ├── GLUE
   │   │   ├── SpeechCommands

Training
~~~~~~~~

**Training with a single / multiple GPUs**

::

   python tools/train.py ${CONFIG_FILE} ${WORK_DIR}

Example: using 1 GPU to train BiBench.

::

   python tools/train.py ${CONFIG_FILE} ${WORK_DIR} --gpus 1

**Training with Slurm**

If you run BiBench on a cluster managed with `slurm `__, you can use the script ``slurm_train.sh``.

::

   ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR} ${GPU_NUM}

Common optional arguments include:

- ``--resume-from ${CHECKPOINT_FILE}``: Resume from a previous checkpoint file.

Example: using 8 GPUs to train BiBench on a slurm cluster.

::

   ./tools/slurm_train.sh my_partition my_job configs/acc_cifar10/resnet18_bnn_adam_1e-3_cosinelr.py work_dirs/acc_cifar10 8

You can check ``slurm_train.sh`` for full arguments and environment variables.

Add Custom Binarization Algorithms
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With **just 3 steps**, researchers can easily define and evaluate custom binarization algorithms in BiBench:

*Step 1*. **Operator definition**: create a file for the custom binarization algorithm under ``bibench/models/layers``, and complete the definition of the binarized ``Conv1d``, ``Conv2d``, and ``Linear`` operators in it.

*Step 2*. **Operator registration**: register the binarized operators defined in *Step 1* to ``CONV_LAYERS`` in ``bibench/models/layers/builder.py``.

*Step 3*. **Configuration definition**: define the configuration for the learning task, neural architecture, or any other track you would like to evaluate (the existing configurations can be used as references).
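For reference, a minimal sketch of *Step 1* and *Step 2* is given below. The operator name ``MyBiConv2d`` is hypothetical, and the registration call assumes that ``CONV_LAYERS`` in ``bibench/models/layers/builder.py`` behaves like a standard mmcv-style registry; adapt it to the actual interface of the codebase. The binarized ``Conv1d`` and ``Linear`` operators follow the same pattern.

.. code:: python

   import torch
   import torch.nn as nn
   import torch.nn.functional as F

   from bibench.models.layers.builder import CONV_LAYERS  # registry mentioned in Step 2


   class BinarySign(torch.autograd.Function):
       """Sign binarization with a straight-through estimator (STE) backward."""

       @staticmethod
       def forward(ctx, x):
           ctx.save_for_backward(x)
           return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

       @staticmethod
       def backward(ctx, grad_output):
           (x,) = ctx.saved_tensors
           # Pass gradients through only where the input lies in (-1, 1).
           return grad_output * ((x > -1) & (x < 1)).to(grad_output.dtype)


   @CONV_LAYERS.register_module()  # Step 2: register the custom operator
   class MyBiConv2d(nn.Conv2d):
       """Step 1: a binarized Conv2d that binarizes both weights and activations."""

       def forward(self, x):
           bx = BinarySign.apply(x)
           bw = BinarySign.apply(self.weight)
           return F.conv2d(bx, bw, self.bias, self.stride,
                           self.padding, self.dilation, self.groups)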
Then you can get started with BiBench and evaluate your binarization algorithm!

Binarization Algorithms
-----------------------

**BNN**. During training, BNN uses the straight-through estimator (STE) to calculate the gradient :math:`\boldsymbol{g_{x}}`, which takes the saturation effect into account:

.. math::

   \mathtt{sign}(\boldsymbol{x})=
   \begin{cases}
   +1, & \mathrm{if} \ \boldsymbol x \ge 0\\
   -1, & \mathrm{otherwise}
   \end{cases}
   \qquad
   \boldsymbol{g_{x}}=
   \begin{cases}
   \boldsymbol{g_b}, & \mathrm{if} \ \boldsymbol x \in \left(-1, 1\right)\\
   0, & \mathrm{otherwise}.
   \end{cases}

During inference, the computation process is expressed as

.. math::

   \boldsymbol o = \operatorname{sign}(\boldsymbol{a}) \circledast \operatorname{sign}(\boldsymbol{w}),

where :math:`\circledast` indicates a convolution implemented with XNOR and bitcount operations. The related code in our codebase refers to `BinaryNet `__ and the original paper.

**XNOR-Net**. XNOR-Net obtains the channel-wise scaling factor :math:`\boldsymbol \alpha=\frac{\left\|\boldsymbol{w}\right\|}{\left|\boldsymbol{w}\right|}` for the weight, and :math:`\boldsymbol{K}` contains the scaling factors :math:`\beta` for all sub-tensors in the activation :math:`\boldsymbol{a}`. The convolution between the activation :math:`\boldsymbol{a}` and the weight :math:`\boldsymbol{w}` can then be approximated mainly with binary operations:

.. math::

   \boldsymbol o = (\operatorname{sign}(\boldsymbol{a}) \circledast \operatorname{sign}(\boldsymbol{w})) \odot \boldsymbol{K} \boldsymbol \alpha,

where :math:`\boldsymbol{w} \in \mathbb{R}^{c \times w \times h}` and :math:`\boldsymbol{a} \in \mathbb{R}^{c \times w_{\text{in}} \times h_{\text{in}}}` denote the weight and input tensor, respectively. The STE is also applied in the backward propagation during training. The related code in our codebase refers to `XNOR-Net (1) `__, `XNOR-Net (2) `__, and the original paper.

**DoReFa-Net**. DoReFa-Net applies the following function for :math:`1`-bit weights and activations:

.. math::

   \boldsymbol o = (\operatorname{sign}(\boldsymbol{a}) \circledast \operatorname{sign}(\boldsymbol{w})) \odot \boldsymbol \alpha.

The STE is also applied in the backward propagation with the full-precision gradient. The related code in our codebase refers to `DoReFa-Net (1) `__, `DoReFa-Net (2) `__, and the original paper.

**Bi-Real Net**. Bi-Real Net proposes a piece-wise polynomial function as the gradient approximation function:

.. math::

   \operatorname{bireal}\left(\boldsymbol{a}\right)=\left\{\begin{array}{lr}
   -1 & \text { if } \boldsymbol{a}<-1 \\
   2 \boldsymbol{a}+\boldsymbol{a}^2 & \text { if }-1 \leqslant \boldsymbol{a}<0 \\
   2 \boldsymbol{a}-\boldsymbol{a}^2 & \text { if } 0 \leqslant \boldsymbol{a}<1 \\
   1 & \text { otherwise }
   \end{array}\right. ,
   \quad
   \frac{\partial \operatorname{bireal}\left(\boldsymbol{a}\right)}{\partial \boldsymbol{a}}=
   \begin{cases}
   2+2 \boldsymbol{a} & \text { if }-1 \leqslant \boldsymbol{a}<0 \\
   2-2 \boldsymbol{a} & \text { if } 0 \leqslant \boldsymbol{a}<1 \\
   0 & \text { otherwise }
   \end{cases} .

The forward propagation of Bi-Real Net is the same as that of DoReFa-Net. The related code in our codebase refers to `Bi-Real Net `__ and the original paper.
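To make the gradient approximation above concrete, here is a minimal sketch of how such a piece-wise polynomial backward could be realized as a custom autograd function; this is an illustrative assumption, not necessarily the exact implementation in the codebase:

.. code:: python

   import torch


   class BiRealSign(torch.autograd.Function):
       @staticmethod
       def forward(ctx, a):
           ctx.save_for_backward(a)
           # The forward pass still produces binary values, as in DoReFa-Net.
           return torch.where(a >= 0, torch.ones_like(a), -torch.ones_like(a))

       @staticmethod
       def backward(ctx, grad_output):
           (a,) = ctx.saved_tensors
           # d bireal(a) / da: 2 + 2a on [-1, 0), 2 - 2a on [0, 1), 0 elsewhere.
           grad = torch.zeros_like(a)
           grad = torch.where((a >= -1) & (a < 0), 2 + 2 * a, grad)
           grad = torch.where((a >= 0) & (a < 1), 2 - 2 * a, grad)
           return grad_output * grad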
**XNOR-Net++**. XNOR-Net++ proposes to re-formulate XNOR-Net as:

.. math::

   \boldsymbol{o} = (\operatorname{sign}(\boldsymbol{a}) \circledast \operatorname{sign}(\boldsymbol{w})) \odot \boldsymbol \Gamma,

and in our experiments we adopt :math:`\boldsymbol \Gamma` in the following form (which achieves the best performance in the original paper):

.. math::

   \boldsymbol \Gamma=\boldsymbol \alpha \otimes \boldsymbol \beta \otimes \boldsymbol \gamma, \quad \boldsymbol \alpha \in \mathbb{R}^{\boldsymbol{o}}, \boldsymbol \beta \in \mathbb{R}^{h_{\text{out}}}, \boldsymbol \gamma \in \mathbb{R}^{w_{\text{out}}},

where :math:`\boldsymbol \alpha`, :math:`\boldsymbol \beta`, and :math:`\boldsymbol \gamma` are learnable during training. The related code in our codebase refers to `XNOR-Net++ `__ and the original paper.

**ReActNet**. ReActNet defines RSign as a binarization function with channel-wise learnable thresholds:

.. math::

   \boldsymbol{x}=\operatorname{rsign}\left(\boldsymbol{x}\right)=\left\{\begin{array}{ll}
   +1, & \text { if } \boldsymbol{x}>\boldsymbol \alpha \\
   -1, & \text { if } \boldsymbol{x} \leq \boldsymbol \alpha
   \end{array}\right. ,

where :math:`\boldsymbol \alpha` is a learnable coefficient controlling the threshold. The forward propagation is

.. math::

   \boldsymbol o = (\operatorname{rsign}(\boldsymbol{a}) \circledast \operatorname{sign}(\boldsymbol{w})) \odot \boldsymbol \alpha.

The related code in our codebase refers to `ReActNet `__ and the original paper.

**ReCU**. As described in its paper, ReCU is formulated as

.. math::

   \operatorname{recu}(\boldsymbol{w})=\max \left(\min \left(\boldsymbol{w}, Q_{(\tau)}\right), Q_{(1-\tau)}\right),

where :math:`Q_{(\tau)}` and :math:`Q_{(1-\tau)}` denote the :math:`\tau` quantile and the :math:`1-\tau` quantile of :math:`\boldsymbol{w}`, respectively. Other implementation details also strictly follow the original paper and official code. The related code in our codebase refers to `ReCU `__ and the original paper.

**FDA**. FDA computes the gradient of :math:`\boldsymbol{o}` in the backward propagation as:

.. math::

   \frac{\partial \ell}{\partial \mathbf{t}}=\frac{\partial \ell}{\partial \boldsymbol{o}} \boldsymbol{w}_2^{\top} \odot\left(\left(\mathbf{t} \boldsymbol{w}_1\right) \geq 0\right) \boldsymbol{w}_1^{\top} +\frac{\partial \ell}{\partial \boldsymbol{o}} \eta^{\prime}(\mathbf{t}) +\frac{\partial \ell}{\partial \boldsymbol{o}} \odot \frac{4 \omega}{\pi} \sum_{i=0}^n \cos ((2 i+1) \omega \mathbf{t}),

where :math:`\frac{\partial \ell}{\partial \boldsymbol{o}}` is the gradient from the upper layers, :math:`\odot` represents element-wise multiplication, and :math:`\frac{\partial \ell}{\partial \mathbf{t}}` is the partial gradient on :math:`\mathbf{t}` that back-propagates to the former layer. :math:`\boldsymbol{w}_1` and :math:`\boldsymbol{w}_2` are the weights in the original models and the noise adaptation modules, respectively. FDA updates them as

.. math::

   \frac{\partial \ell}{\partial \boldsymbol{w}_1}=\mathbf{t}^{\top} \frac{\partial \ell}{\partial \boldsymbol{o}} \boldsymbol{w}_2^{\top} \odot\left(\left(\mathbf{t} \boldsymbol{w}_1\right) \geq 0\right),\qquad \frac{\partial \ell}{\partial \boldsymbol{w}_2}=\sigma\left(\mathbf{t} \boldsymbol{w}_1\right)^{\top} \frac{\partial \ell}{\partial \boldsymbol{o}}.

The related code in our codebase refers to `FDA `__ and the original paper.
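Several of the methods above introduce small learnable re-scaling or re-shifting parameters around the sign function. As one concrete illustration, a minimal sketch of ReActNet's RSign with a channel-wise learnable threshold might look as follows; the module name and the 4D activation layout are illustrative assumptions, and a straight-through estimator would still be attached to the sign step in a trainable implementation:

.. code:: python

   import torch
   import torch.nn as nn


   class RSign(nn.Module):
       """Channel-wise learnable threshold, for activations of shape (N, C, H, W)."""

       def __init__(self, channels):
           super().__init__()
           self.alpha = nn.Parameter(torch.zeros(1, channels, 1, 1))  # one threshold per channel

       def forward(self, x):
           shifted = x - self.alpha
           # +1 if x > alpha, -1 otherwise.
           return torch.where(shifted > 0, torch.ones_like(shifted), -torch.ones_like(shifted))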
Learning Tasks
--------------

2D Visual Tasks
~~~~~~~~~~~~~~~

The **classification tasks' implementations** of our codebase borrow from the related tasks in `MMClassification `__, including the CIFAR-10 and ImageNet classification tasks and models.

**CIFAR-10**. The CIFAR-10 dataset (Canadian Institute For Advanced Research) is a collection of images commonly used to train machine learning and computer vision algorithms, and it is widely used for image classification tasks. There are 60,000 color images, each of which measures 32x32 pixels. All images are categorized into 10 different classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. Each class has 6,000 images: 5,000 for training and 1,000 for testing.

**ImageNet**. ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. The images are collected from the web and labeled by human labelers using Amazon Mechanical Turk, a crowd-sourced image labeling service. The ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) was established in 2010 as part of the Pascal Visual Object Challenge. ILSVRC uses a subset of ImageNet, with about 1,000 images in each of 1,000 categories; in total there are approximately 1.2 million training images, 50,000 validation images, and 150,000 testing images. ImageNet also uses accuracy to evaluate the predicted results.

The **object detection tasks' implementations** of our codebase borrow from the related tasks in `MMDetection `__, including the Pascal VOC07 and COCO17 detection tasks and models.

**Pascal VOC07**. The PASCAL Visual Object Classes 2007 (Pascal VOC07) dataset contains 20 object categories covering vehicles, household objects, animals, and others: airplane, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa, TV/monitor, bird, cat, cow, dog, horse, sheep, and person. As a benchmark for object detection, semantic segmentation, and object classification, this dataset contains pixel-level segmentation annotations, bounding box annotations, and object class annotations.

**COCO17**. The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images. Based on community feedback, the training/validation split was changed in the 2017 release from 83K/41K to 118K/5K, while the images and annotations themselves are the same. The 2017 test set is a subset of 41K images from the 2015 test set. Additionally, 123K images are included in the unannotated set.

.. _d-visual-tasks-1:

3D Visual Tasks
~~~~~~~~~~~~~~~

The **3D point cloud tasks' implementations** of our codebase borrow from the related tasks in `PointNet `__ and `BiPointNet `__, including the ModelNet40 classification and ShapeNet segmentation tasks and models.

**ModelNet40**. The ModelNet40 dataset contains point clouds of synthetic objects. As the most widely used benchmark for point cloud analysis, ModelNet40 is popular due to its diverse categories, clean shapes, and well-constructed data. In the original ModelNet40, 12,311 CAD-generated meshes are divided into 40 categories, of which 9,843 are for training and 2,468 are for testing. The point cloud data points are uniformly sampled from the mesh surfaces, then moved to the origin and scaled into a unit sphere.
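As a small illustration of that preprocessing step, the normalization could be sketched as follows (a minimal sketch, assuming a point cloud stored as an ``(N, 3)`` array):

.. code:: python

   import numpy as np


   def normalize_point_cloud(points: np.ndarray) -> np.ndarray:
       centered = points - points.mean(axis=0)           # move the centroid to the origin
       scale = np.max(np.linalg.norm(centered, axis=1))  # distance of the furthest point
       return centered / scale                           # fit the cloud into the unit sphere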
**ShapeNet**. ShapeNet is a large-scale repository of 3D CAD models developed by researchers from Stanford University, Princeton University, and the Toyota Technological Institute at Chicago, USA. The repository contains over 300M models organized using WordNet hypernym-hyponym relationships, with 220,000 of them classified into 3,135 classes. There are 31,693 meshes in the ShapeNet Parts subset, divided into 16 categories of objects (*e.g.*, tables, chairs, planes). Each shape contains 2-5 parts (with 50 part classes in total).

Natural Language Understanding Tasks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The **natural language understanding tasks' implementations** of our codebase borrow from the related tasks in `DynaBERT `__ and `BiBERT `__, including the GLUE benchmark tasks and models.

**GLUE**. The General Language Understanding Evaluation (GLUE) benchmark is a collection of nine natural language understanding tasks: the single-sentence tasks CoLA and SST-2, the similarity and paraphrasing tasks MRPC, STS-B, and QQP, and the natural language inference tasks MNLI, QNLI, RTE, and WNLI.

Speech Tasks
~~~~~~~~~~~~

The **speech tasks' implementations** of our codebase borrow from the related tasks in `FSMN `__ and `BiFSMN `__, including the Google Speech Commands classification tasks and models.

**Google Speech Commands**. The Google Speech Commands Classification dataset (SpeechCom) provides a collection of audio recordings of spoken words for training and evaluation. Its primary goal is to provide a way to build and test small models that detect when a single word from a set of ten target words is spoken, while producing as few false positives as possible from background noise or unrelated speech.

Neural Architectures
--------------------

CNNs
~~~~

The **CNNs' implementations** of our codebase borrow from `MMClassification `__ and `MMDetection `__.

**ResNet**. Residual Networks, or ResNets, learn residual functions with reference to the layer inputs instead of learning unreferenced functions. Instead of making stacked layers directly fit a desired underlying mapping, residual nets let these layers fit a residual mapping. There is empirical evidence that these networks are easier to optimize and can achieve higher accuracy with considerably increased depth.

**VGG**. VGG is a classical convolutional neural network architecture. It was proposed based on an analysis of how to increase the depth of such networks. The architecture is characterized by its simplicity: the network utilizes small 3×3 filters, and the only other components are pooling layers and a fully connected layer.

**MobileNetV2**. MobileNetV2 is a convolutional neural network architecture that performs well on mobile devices. The model has an inverted residual structure with residual connections between the bottleneck layers. The intermediate expansion layer employs lightweight depthwise convolutions to filter features as a source of nonlinearity. The architecture begins with an initial layer of 32 convolution filters, followed by 19 residual bottleneck layers.
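To make that structure concrete, a single inverted residual bottleneck block could be sketched as follows; the layer sizes are illustrative, and the BatchNorm/ReLU6 placement follows the common MobileNetV2 pattern rather than BiBench's exact code:

.. code:: python

   import torch.nn as nn


   class InvertedResidual(nn.Module):
       def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
           super().__init__()
           hidden = in_ch * expand_ratio
           self.use_residual = stride == 1 and in_ch == out_ch
           self.block = nn.Sequential(
               nn.Conv2d(in_ch, hidden, 1, bias=False),              # 1x1 expansion
               nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
               nn.Conv2d(hidden, hidden, 3, stride, 1,
                         groups=hidden, bias=False),                 # 3x3 depthwise
               nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
               nn.Conv2d(hidden, out_ch, 1, bias=False),             # 1x1 linear projection
               nn.BatchNorm2d(out_ch),
           )

       def forward(self, x):
           out = self.block(x)
           # Residual connection only when input and output shapes match.
           return x + out if self.use_residual else out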
**Faster-RCNN**. Faster R-CNN is an object detection model that improves on Fast R-CNN by utilizing a region proposal network (RPN) together with the CNN model. The RPN shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals. A fully convolutional network is used to simultaneously predict object bounds and objectness scores at each position. RPNs are trained end-to-end to produce high-quality region proposals and to instruct the unified network where to search. Sharing their convolutional features allows the RPN and Fast R-CNN to be combined into a single network. Faster R-CNN thus consists of two modules: the first is a deep, fully convolutional network that proposes regions, and the second is the detector that uses the proposals to produce the final prediction boxes.

**SSD**. SSD is a single-stage object detection method that discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. During prediction, each default box is adjusted to better match the shape of the object based on its scores for each object category. In addition, the network handles objects of different sizes by combining predictions from multiple feature maps with different resolutions.

Transformers
~~~~~~~~~~~~

The **transformers' implementations** of our codebase borrow from `DynaBERT `__ and `BiBERT `__.

**BERT**. BERT, or Bidirectional Encoder Representations from Transformers, improves upon standard Transformers by removing the unidirectionality constraint using a masked language model (MLM) pre-training objective. By masking some tokens of the input, the masked language model attempts to predict the original vocabulary id of each masked word based solely on its context. Unlike a left-to-right language model, the MLM objective enables the representation to integrate the left and right contexts, which facilitates pre-training a deep bidirectional Transformer. Additionally, BERT uses a next-sentence prediction task that pre-trains text-pair representations along with the masked language model. Note that we replace the directly binarized attention with a bi-attention mechanism to prevent the model from completely collapsing.

MLPs
~~~~

The **MLPs' implementations** of our codebase borrow from `PointNet `__, `BiPointNet `__, `FSMN `__, and `BiFSMN `__.

**PointNet**. PointNet is a unified architecture for applications ranging from object classification and part segmentation to scene semantic parsing. The architecture directly takes point clouds as input and outputs either class labels for the entire input or per-point segment/part labels. PointNet-Vanilla is a variant of PointNet that drops the T-Net module. For all PointNet models, we apply EMA-Max as the aggregator, because directly using the max pooling aggregator causes the binarized PointNets to fail to converge.

**FSMN**. The feedforward sequential memory network (FSMN) is a neural network structure for modeling long-term dependencies in time series without using recurrent feedback. It is a standard fully connected feedforward neural network augmented with learnable memory blocks. As a short-term memory mechanism, the memory blocks encode long context information using a tapped-delay line structure.

**Deep-FSMN**. The Deep-FSMN architecture is an improved feedforward sequential memory network (FSMN) with skip connections between the memory blocks of adjacent layers. The skip connections allow information to be transferred across layers, so the gradient vanishing problem can be avoided when building very deep structures.
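For intuition about the memory blocks mentioned above, here is a minimal sketch of a unidirectional, vectorized tapped-delay memory block realized as a depthwise convolution over time; the module name, the ``order`` parameter, and the skip connection are illustrative assumptions rather than the exact FSMN/Deep-FSMN implementation used in the codebase:

.. code:: python

   import torch
   import torch.nn as nn
   import torch.nn.functional as F


   class MemoryBlock(nn.Module):
       """Weighted sum over the current and previous ``order - 1`` hidden frames."""

       def __init__(self, hidden_dim, order=8):
           super().__init__()
           self.order = order
           # One learnable tap per channel and per delay step (depthwise over time).
           self.taps = nn.Conv1d(hidden_dim, hidden_dim, kernel_size=order,
                                 groups=hidden_dim, bias=False)

       def forward(self, h):
           # h: (batch, time, hidden_dim) -> (batch, hidden_dim, time) for Conv1d.
           x = h.transpose(1, 2)
           x = F.pad(x, (self.order - 1, 0))   # causal (left) padding over time
           m = self.taps(x).transpose(1, 2)    # memory output, same shape as h
           return h + m                        # skip connection back to the hidden states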