Introduction
------------
**BiBench: Benchmarking and Analyzing Network Binarization**
**Abstract.** Neural network binarization is one of the most promising
compression approaches with extraordinary computation and memory savings
by minimizing the bit-width of weights and activations. However, despite
being a general technique, recent works reveal that applying
binarization in various practical scenarios, including multiple tasks,
architectures, and hardware, is not trivial. Moreover, common
challenges, such as severe degradation in accuracy and limited
efficiency gains, suggest that specific attributes of binarization are
not thoroughly studied or adequately understood. To comprehensively
understand binarization methods, we present **BiBench**, a carefully
engineered benchmark with in-depth analysis for network binarization. We
first inspect the requirements of binarization in the actual production
setting. Then, for the sake of fairness and systematic evaluation, we define the
evaluation tracks and metrics. We also perform a comprehensive
evaluation with a rich collection of milestone binarization algorithms.
Our benchmark results show that binarization still faces severe accuracy
challenges, and newer state-of-the-art binarization algorithms bring
diminishing improvements, even at the expense of efficiency. Moreover,
the actual deployment of certain binarization operations reveals a
surprisingly large deviation from their theoretical consumption.
Finally, based on our benchmark results and analysis, we suggest
establishing a paradigm for accurate and efficient binarization among
existing techniques. We hope BiBench paves the way toward more extensive
adoption of network binarization and serves as a fundamental work for
future research.
*Note: we are continuously integrating and polishing this repository and
will publish a stable version upon acceptance.*
Installation
------------
Environment Preparation
~~~~~~~~~~~~~~~~~~~~~~~
a. Create a conda virtual environment and activate it.
.. code:: shell
conda create -n bibench python=3.8 -y
conda activate bibench
b. Install PyTorch and torchvision following the `official
instructions `__.
.. code:: shell
conda install pytorch={torch_version} torchvision cudatoolkit={cu_version} -c pytorch
E.g., to install PyTorch 1.8.0 with CUDA 10.2:
.. code:: shell
conda install pytorch=1.8.0 torchvision cudatoolkit=10.2 -c pytorch
**Important:** Make sure that your compilation CUDA version and runtime
CUDA version match. Besides, for RTX 30 series GPUs, ``cudatoolkit>=11.0`` is
required.
c. Install mmcv and other repositories for the different tasks.
- mmcv-full
We recommend installing the pre-built package as below.
For CPU:
.. code:: shell
pip install mmcv-full -f https://download.openmmlab.com/mmcv/dist/cpu/{torch_version}/index.html
Please replace ``{torch_version}`` in the URL with your desired version.
For GPU:
.. code:: shell
pip install "mmcv-full>=1.3.17,<=1.5.3" -f https://download.openmmlab.com/mmcv/dist/{cu_version}/{torch_version}/index.html
Please replace ``{cu_version}`` and ``{torch_version}`` in the URL with
your desired versions.
For example, to install mmcv-full with CUDA 10.2 and PyTorch 1.8.0, use
the following command:
.. code:: shell
pip install "mmcv-full>=1.3.17,<=1.5.3" -f https://download.openmmlab.com/mmcv/dist/cu102/torch1.8.0/index.html
See
`here `__
for the MMCV versions compatible with different PyTorch and CUDA
versions. For more download links, refer to
`openmmlab-download `__.
Optionally, you can compile mmcv from source with the following
commands:
.. code:: shell
git clone https://github.com/open-mmlab/mmcv.git -b v1.5.3
cd mmcv
MMCV_WITH_OPS=1 pip install -e . # package mmcv-full, which contains cuda ops, will be installed after this step
# OR pip install -e . # package mmcv, which contains no cuda ops, will be installed after this step
cd ..
**Important:** You need to run ``pip uninstall mmcv`` first if you have
mmcv installed. If mmcv and mmcv-full are both installed, there will be
a ``ModuleNotFoundError``.
- mmcls
.. code:: shell
pip install mmcls
- bipc, bispeech, binlp
These repositories are included in the source code. Move into each
directory and run ``pip install -v -e .`` to install them.
- mmdet (Optional)
.. code:: shell
pip install mmdet
Data Preparation
~~~~~~~~~~~~~~~~
**CIFAR-10 & ImageNet**. We follow the dataset usage of
`MMClassification `__
in this part. This implementation of CIFAR-10 is modified from this
`link `__.
Since the ImageNet21k dataset is extremely large, containing 21k+
classes and 1.4B files, this dataset class improves on ``ImageNet`` in
several respects; in particular, to save memory, we enable the
``serialize_data`` option by default.
**Pascal VOC & COCO**. We follow the dataset usage of
`MMDetection `__
in this part. Public datasets like `Pascal
VOC `__ and
`COCO `__ are available from their official
websites or mirrors. Note: In the detection task, Pascal VOC 2012 is an
extension of Pascal VOC 2007 without overlap, and we usually use them
together.
**ModelNet40 and ShapeNet**. The aligned ModelNet and ShapeNet datasets can be
downloaded at
`link1 `__
and
`link2 `__,
respectively, and then saved in the corresponding folders.
**GLUE**. The original GLUE data can be accessed from this
`link `__. Put the original data
(``train.csv``, ``dev.csv``) and the augmented data (named
``train_${TASK_NAME}_aug_with_logits.csv``) into
``${GLUE_DIR}/${TASK_NAME}``.
**Speech Commands**. The Google Speech Commands V1 dataset can be
downloaded from the reference document
`link `__.
The dataset directory should be organized as follows:
::
BiBench
├── data
│ ├── datasets
│ │ ├── cifar10
│ │ ├── imagenet
│ │ ├── VOCdevkit
│ │ ├── coco
│ │ ├── ModelNet40
│ │ ├── ShapeNet
│ │ ├── GLUE
│ │ ├── SpeechCommands
Training
~~~~~~~~
**Training with a single / multiple GPUs**
::
python tools/train.py ${CONFIG_FILE} ${WORK_DIR}
Example: using 1 GPU to train BiBench.
::
python tools/train.py ${CONFIG_FILE} ${WORK_DIR} --gpus 1
**Training with Slurm**
If you run BiBench on a cluster managed with
`slurm `__, you can use the script
``slurm_train.sh``.
::
./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR} ${GPU_NUM}
Common optional arguments include:
- ``--resume-from ${CHECKPOINT_FILE}``: Resume from a previous
checkpoint file.
Example: using 8 GPUs to train BiBench on a slurm cluster.
::
./tools/slurm_train.sh my_partition my_job configs/acc_cifar10/resnet18_bnn_adam_1e-3_cosinelr.py work_dirs/acc_cifar10 8
You can check ``slurm_train.sh`` for full arguments and environment
variables.
Add Custom Binarization Algorithms
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With **just 3 steps**, researchers can define and evaluate custom
binarization algorithms easily in BiBench:
*Step 1*. **Operator definition**: create a file for the custom
binarization algorithm under ``bibench/models/layers``, and complete the
definition of binarized ``Conv1d``, ``Conv2d``, and ``Linear`` operators
in it.
*Step 2*. **Operator registration**: register the binarized operators
defined in *Step 1* to ``CONV_LAYERS`` in
``bibench/models/layers/builder.py``.
*Step 3*. **Configuration definition**: define the configuration for the
learning task, neural architecture, or any track you would like to
evaluate (see the existing configurations for reference).
Then you can get started with BiBench and evaluate your binarization
algorithm!
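
For illustration, a minimal sketch of *Step 1* and *Step 2* might look
like the following. Note that ``MyBinaryConv2d``, the registry name
``mybnn``, and the commented registration call are hypothetical
placeholders; refer to the existing operators under
``bibench/models/layers`` for the exact interfaces.

.. code:: python

   import torch
   import torch.nn as nn
   import torch.nn.functional as F

   class MyBinaryConv2d(nn.Conv2d):
       """A Conv2d whose weights and activations are binarized by sign()."""

       def forward(self, x):
           # Straight-through estimator: sign() in the forward pass,
           # identity gradient in the backward pass.
           bx = x + (torch.sign(x) - x).detach()
           bw = self.weight + (torch.sign(self.weight) - self.weight).detach()
           return F.conv2d(bx, bw, self.bias, self.stride,
                           self.padding, self.dilation, self.groups)

   # Step 2 (hypothetical): register the operator so that configurations
   # can refer to it by its type name.
   # from bibench.models.layers.builder import CONV_LAYERS
   # CONV_LAYERS.register_module('mybnn', module=MyBinaryConv2d)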
Binarization Algorithms
-----------------------
**BNN**. During the training process, BNN uses the straight-through
estimator (STE) to calculate the gradient :math:`\boldsymbol{g_{x}}`,
which takes the saturation effect into account:
.. math::
\mathtt{sign}(\boldsymbol{x})=
\begin{cases}
+1,& \mathrm{if} \ \boldsymbol x \ge 0\\
-1,& \mathrm{otherwise}
\end{cases}\qquad
\boldsymbol{g_{x}}=
\begin{cases}
\boldsymbol{g_b},& \mathrm{if} \ \boldsymbol x \in \left(-1, 1\right)\\
0,& \mathrm{otherwise}.
\end{cases}
And during inference, the computation process is expressed as
.. math:: \boldsymbol o = \operatorname{sign}(\boldsymbol{a}) \circledast \operatorname{sign}(\boldsymbol{w}),
where :math:`\circledast` indicates a convolutional operation using XNOR
and bitcount operations.
The related code in our codebase refers to
`BinaryNet `__ and the
original paper.
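
As a minimal PyTorch sketch (not the exact BiBench implementation), the
sign function with the saturating STE above can be written as a custom
autograd function:

.. code:: python

   import torch

   class BinarySign(torch.autograd.Function):
       """sign() forward with the saturating STE backward used by BNN."""

       @staticmethod
       def forward(ctx, x):
           ctx.save_for_backward(x)
           # sign(x) with the convention sign(0) = +1.
           return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

       @staticmethod
       def backward(ctx, grad_output):
           (x,) = ctx.saved_tensors
           # Pass the gradient g_b only where x is in (-1, 1); zero elsewhere.
           return grad_output * ((x > -1) & (x < 1)).float()

``BinarySign.apply`` can then be used in place of ``torch.sign`` during
training.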
**XNOR-Net**. XNOR-Net obtains the channel-wise scaling factors
:math:`\boldsymbol \alpha=\frac{\left\|\boldsymbol{w}\right\|_{\ell 1}}{n}`
(the :math:`\ell 1`-norm of the weight averaged over its :math:`n`
elements) for the weight, and a matrix :math:`\boldsymbol{K}` containing
scaling factors :math:`\beta` for all sub-tensors in the activation
:math:`\boldsymbol{a}`.
We can approximate the convolution between activation
:math:`\boldsymbol{a}` and weight :math:`\boldsymbol{w}` mainly using
binary operations:
.. math:: \boldsymbol o = (\operatorname{sign}(\boldsymbol{a}) \circledast \operatorname{sign}(\boldsymbol{w})) \odot \boldsymbol{K} \boldsymbol \alpha,
where :math:`\boldsymbol{w} \in \mathbb{R}^{c \times w \times h}` and
:math:`\boldsymbol{a} \in \mathbb{R}^{c \times w_{\text {in }} \times h_{\text {in }}}`
denote the weight and input tensor, respectively. And the STE is also
applied in the backward propagation of the training process.
The related code in our codebase refers to `XNOR-Net
(1) `__, `XNOR-Net
(2) `__, and the original paper.
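
A sketch of the channel-wise weight scaling is shown below (the
activation scaling matrix :math:`\boldsymbol{K}` is omitted for
brevity):

.. code:: python

   import torch

   def xnor_binarize_weight(w):
       """w: (out_ch, in_ch, kH, kW) -> (binary weight, per-channel alpha)."""
       # alpha is the l1-norm of each output channel divided by its size.
       alpha = w.abs().mean(dim=(1, 2, 3), keepdim=True)
       bw = torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))
       return bw, alpha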
**DoReFa-Net**. DoReFa-Net applies the following function for
:math:`1`-bit weights and activations:
.. math:: \boldsymbol o = (\operatorname{sign}(\boldsymbol{a}) \circledast \operatorname{sign}(\boldsymbol{w})) \odot \boldsymbol \alpha.
And the STE is also applied in the backward propagation with the
full-precision gradient.
The related code in our codebase refers to `DoReFa-Net
(1) `__, `DoReFa-Net
(2) `__,
and the original paper.
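
A sketch of the 1-bit weight quantizer, assuming the scalar scaling
factor :math:`\boldsymbol \alpha` is taken as the mean absolute value of
the full-precision weights:

.. code:: python

   import torch

   def dorefa_binarize_weight(w):
       # A single scalar scaling factor: the mean of |w|.
       alpha = w.abs().mean()
       bw = torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))
       return bw * alpha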
**Bi-Real Net**. Bi-Real Net proposes a piece-wise polynomial function
as the gradient approximation function:
.. math::

   \operatorname{bireal}\left(\boldsymbol{a}\right)=
   \begin{cases}
   -1 & \text { if } \boldsymbol{a}<-1 \\
   2 \boldsymbol{a}+\boldsymbol{a}^2 & \text { if }-1 \leqslant \boldsymbol{a}<0 \\
   2 \boldsymbol{a}-\boldsymbol{a}^2 & \text { if } 0 \leqslant \boldsymbol{a}<1 \\
   1 & \text { otherwise }
   \end{cases}, \qquad
   \frac{\partial \operatorname{bireal}\left(\boldsymbol{a}\right)}{\partial \boldsymbol{a}}=
   \begin{cases}
   2+2 \boldsymbol{a} & \text { if }-1 \leqslant \boldsymbol{a}<0 \\
   2-2 \boldsymbol{a} & \text { if } 0 \leqslant \boldsymbol{a}<1 \\
   0 & \text { otherwise }
   \end{cases} .
And the forward propagation of Bi-Real Net is the same as DoReFa-Net.
The related code in our codebase refers to `Bi-Real
Net `__ and the original
paper.
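
A sketch of the piecewise polynomial gradient as a custom autograd
function:

.. code:: python

   import torch

   class BiRealSign(torch.autograd.Function):
       @staticmethod
       def forward(ctx, a):
           ctx.save_for_backward(a)
           return torch.where(a >= 0, torch.ones_like(a), -torch.ones_like(a))

       @staticmethod
       def backward(ctx, grad_output):
           (a,) = ctx.saved_tensors
           # d bireal(a)/da: 2 + 2a on [-1, 0), 2 - 2a on [0, 1), 0 elsewhere.
           grad = torch.zeros_like(a)
           grad = torch.where((a >= -1) & (a < 0), 2 + 2 * a, grad)
           grad = torch.where((a >= 0) & (a < 1), 2 - 2 * a, grad)
           return grad_output * grad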
**XNOR-Net++**. XNOR-Net++ proposes to re-formulate XNOR-Net as:
.. math:: \boldsymbol{o} = (\operatorname{sign}(\boldsymbol{a}) \circledast \operatorname{sign}(\boldsymbol{w})) \odot \boldsymbol \Gamma,
and we adopt the following form of :math:`\boldsymbol \Gamma` in our
experiments (it achieves the best performance in the original paper):
.. math:: \boldsymbol \Gamma=\boldsymbol \alpha \otimes \boldsymbol \beta \otimes \boldsymbol \gamma, \quad \boldsymbol \alpha \in \mathbb{R}^{\boldsymbol{o}}, \boldsymbol \beta \in \mathbb{R}^{h_{\text {out }}}, \boldsymbol \gamma \in \mathbb{R}^{w_{\text {out }}},
where :math:`\boldsymbol \alpha`, :math:`\boldsymbol \beta`, and
:math:`\boldsymbol \gamma` are learnable during training.
The related code in our codebase refers to
`XNOR-Net++ `__
and the original paper.
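
A sketch of the rank-1 factorized scaling, realized by broadcasting
over an output of shape ``(N, C_out, H_out, W_out)`` (the class name is
a placeholder, and the output spatial size is assumed known at
construction time):

.. code:: python

   import torch
   import torch.nn as nn

   class XnorPlusPlusScale(nn.Module):
       """Gamma = alpha (x) beta (x) gamma applied to the binary conv output."""

       def __init__(self, out_channels, h_out, w_out):
           super().__init__()
           self.alpha = nn.Parameter(torch.ones(1, out_channels, 1, 1))
           self.beta = nn.Parameter(torch.ones(1, 1, h_out, 1))
           self.gamma = nn.Parameter(torch.ones(1, 1, 1, w_out))

       def forward(self, o):
           # Broadcasting multiplies out the outer product implicitly.
           return o * self.alpha * self.beta * self.gamma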
**ReActNet**. ReActNet defines RSign as a binarization function with
channel-wise learnable thresholds:

.. math::

   \boldsymbol{x}^{b}=\operatorname{rsign}\left(\boldsymbol{x}\right)=\left\{\begin{array}{ll}
   +1, & \text { if } \boldsymbol{x}>\boldsymbol \alpha \\
   -1, & \text { if } \boldsymbol{x} \leq \boldsymbol \alpha
   \end{array}\right. ,

where :math:`\boldsymbol \alpha` is a learnable coefficient controlling
the threshold. And the forward propagation is
.. math:: \boldsymbol o = (\operatorname{rsign}(\boldsymbol{a}) \circledast \operatorname{sign}(\boldsymbol{w})) \odot \boldsymbol \alpha.
The related code in our codebase refers to
`ReActNet `__ and the original
paper.
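
A sketch of RSign with a channel-wise learnable threshold, using the
STE for the backward pass:

.. code:: python

   import torch
   import torch.nn as nn

   class RSign(nn.Module):
       def __init__(self, channels):
           super().__init__()
           # One learnable threshold per channel, broadcast over (N, C, H, W).
           self.alpha = nn.Parameter(torch.zeros(1, channels, 1, 1))

       def forward(self, x):
           shifted = x - self.alpha
           b = torch.where(shifted > 0, torch.ones_like(shifted),
                           -torch.ones_like(shifted))
           # Straight-through estimator around the shifted sign; gradients
           # flow to both x and alpha through `shifted`.
           return shifted + (b - shifted).detach()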
**ReCU**. As described in their paper, ReCU is formulated as
.. math:: \operatorname{recu}(\boldsymbol{w})=\max \left(\min \left(\boldsymbol{w}, Q_{(\tau)}\right), Q_{(1-\tau)}\right),
where :math:`Q_{(\tau)}` and :math:`Q_{(1-\tau)}` denote the
:math:`\tau` quantile and :math:`1-\tau` quantile of
:math:`\boldsymbol{w}`, respectively. Other implementation details also
strictly follow the original paper and official code.
The related code in our codebase refers to
`ReCU `__ and the original paper.
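
A sketch of the clipping step, assuming :math:`\tau \in (0.5, 1]` so
that :math:`Q_{(\tau)}` is the upper bound and :math:`Q_{(1-\tau)}` the
lower bound (the default ``tau`` value below is an assumed example, not
prescribed by this codebase):

.. code:: python

   import torch

   def recu(w, tau=0.85):
       # tau and (1 - tau) quantiles of the weight tensor.
       q_hi = torch.quantile(w.flatten(), tau)
       q_lo = torch.quantile(w.flatten(), 1 - tau)
       # max(min(w, Q_tau), Q_(1-tau)), exactly as in the formula above.
       return torch.maximum(torch.minimum(w, q_hi), q_lo)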
**FDA**. In the backward propagation, FDA computes the gradient with
respect to the input :math:`\mathbf{t}` as:
.. math::
\frac{\partial \ell}{\partial \mathbf{t}}=\frac{\partial \ell}{\partial \boldsymbol{o}} \boldsymbol{w}_2^{\top} \odot\left(\left(\mathbf{t} \boldsymbol{w}_1\right) \geq 0\right) \boldsymbol{w}_1^{\top}
+\frac{\partial \ell}{\partial \boldsymbol{o}} \eta^{\prime}(\mathbf{t})
+\frac{\partial \ell}{\partial \boldsymbol{o}} \odot \frac{4 \omega}{\pi} \sum_{i=0}^n \cos ((2 i+1) \omega \mathbf{t}),
where :math:`\frac{\partial \ell}{\partial \boldsymbol{o}}` is the
gradient from the upper layers, :math:`\odot` represents element-wise
multiplication, and :math:`\frac{\partial \ell}{\partial \mathbf{t}}` is
the partial gradient on :math:`\mathbf{t}` that backward propagates to
the former layer. And :math:`\boldsymbol{w}_1` and
:math:`\boldsymbol{w}_2` are weights in the original models and the
noise adaptation modules respectively. FDA updates them as
.. math::
\frac{\partial \ell}{\partial \boldsymbol{w}_1}=\mathbf{t}^{\top} \frac{\partial \ell}{\partial \boldsymbol{o}} \boldsymbol{w}_2^{\top} \odot\left(\left(\mathbf{t} \boldsymbol{w}_1\right) \geq 0\right),\qquad
\frac{\partial \ell}{\partial \boldsymbol{w}_2}=\sigma\left(\mathbf{t} \boldsymbol{w}_1\right)^{\top} \frac{\partial \ell}{\partial \boldsymbol{o}}.
The related code in our codebase refers to
`FDA `__
and the original paper.
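
A sketch of the last (Fourier-series) term of
:math:`\frac{\partial \ell}{\partial \mathbf{t}}`, i.e. the gradient of
the truncated sine-series approximation of the sign function (the
noise-adaptation branch :math:`\eta^{\prime}(\mathbf{t})` is omitted):

.. code:: python

   import math
   import torch

   def fda_series_grad(t, omega, n):
       # (4w / pi) * sum_{i=0}^{n} cos((2i + 1) * w * t)
       grad = torch.zeros_like(t)
       for i in range(n + 1):
           grad = grad + torch.cos((2 * i + 1) * omega * t)
       return 4 * omega / math.pi * grad

This factor multiplies :math:`\frac{\partial \ell}{\partial \boldsymbol{o}}`
element-wise, as in the equation above.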
Learning Tasks
--------------
2D Visual Tasks
~~~~~~~~~~~~~~~
The **classification tasks’ implementations** of our codebase borrow
from related tasks in
`MMClassification `__,
including CIFAR-10 and ImageNet classification tasks and models.
**CIFAR-10**. The CIFAR-10 dataset (Canadian Institute For Advanced
Research) is a collection of images commonly used to train machine
learning and computer vision algorithms. This dataset is widely used for
image classification tasks. There are 60,000 color images, each
measuring 32×32 pixels. All images are categorized into 10 different
classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships,
and trucks. Each class has 6,000 images, of which 5,000 are for training
and 1,000 for testing.
**ImageNet**. ImageNet is a dataset of over 15 million labeled
high-resolution images belonging to roughly 22,000 categories.
The images are collected from the web and labeled by human labelers
using a crowd-sourced image labeling service called Amazon Mechanical
Turk. The ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)
was established in 2010 as part of the Pascal Visual Object Challenge.
There are approximately 1.2 million training images, 50,000
validation images, and 150,000 testing images in total in ILSVRC. ILSVRC
uses a subset of ImageNet, with about 1000 images in each of the 1000
categories. ImageNet also uses accuracy to evaluate the predicted
results.
The **object detection tasks’ implementations** of our codebase borrow
from related tasks in
`MMDetection `__, including
Pascal VOC07 and COCO17 detection tasks and models.
**Pascal VOC07**. The PASCAL Visual Object Classes 2007 (Pascal VOC07)
dataset contains 20 object categories spanning vehicles, household
objects, animals, and others: airplane, bicycle, boat, bus, car, motorbike, train,
bottle, chair, dining table, potted plant, sofa, TV/monitor, bird, cat,
cow, dog, horse, sheep, and person. As a benchmark for object detection,
semantic segmentation, and object classification, this dataset contains
pixel-level segmentation annotations, bounding box annotations, and
object class annotations.
**COCO17**. The MS COCO (Microsoft Common Objects in Context) dataset is
a large-scale object detection, segmentation, key-point detection, and
captioning dataset. The dataset consists of 328K images. According to
community feedback, in the 2017 release, the training/validation split
was changed from 83K/41K to 118K/5K, while the images and annotations
themselves remain the same. The 2017 test set is a subset of 41K images from the 2015 test
set. Additionally, 123K images are included in the unannotated dataset.
.. _d-visual-tasks-1:
3D Visual Tasks
~~~~~~~~~~~~~~~
The **3D point cloud tasks’ implementations** of our codebase borrow
from related tasks in
`PointNet `__ and
`BiPointNet `__, including
ModelNet40 classification and ShapeNet segmentation tasks and models.
**ModelNet40**. The ModelNet40 dataset contains point clouds of
synthetic objects. As the most widely used benchmark for point cloud
analysis, ModelNet40 is popular due to the diversity of categories,
clean shapes, and well-constructed dataset. In the original ModelNet40,
12,311 CAD-generated meshes are divided into 40 categories, where 9,843
are for training, and 2,468 are for testing. The point cloud data points
are sampled by a uniform sampling method from mesh surfaces and then
scaled into a unit sphere by moving to the origin.
**ShapeNet**. ShapeNet is a large-scale repository for 3D CAD models
developed by researchers from Stanford University, Princeton University,
and the Toyota Technological Institute in Chicago, USA. Using WordNet
hypernym-hyponym relationships, the repository contains over 3 million
models, with 220,000 classified into 3,135 classes. There are 31,693
meshes in the ShapeNet Parts subset, divided into 16 categories of
objects (*i.e.*, tables, chairs, planes, *etc*.). Each shape contains
2-5 parts (with 50 part classes in total).
Natural Language Understanding Tasks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The **natural language understanding tasks’ implementations** of our
codebase borrow from related tasks in
`DynaBERT `__
and `BiBERT `__, including the GLUE
benchmark tasks and models.
**GLUE**. General Language Understanding Evaluation (GLUE) benchmark is
a collection of nine natural language understanding tasks, including
single-sentence tasks CoLA and SST-2, similarity and paraphrasing tasks
MRPC, STS-B and QQP, and natural language inference tasks MNLI, QNLI,
RTE, and WNLI.
Speech Tasks
~~~~~~~~~~~~
The **speech tasks’ implementations** of our codebase borrow from
related tasks in
`FSMN `__ and
`BiFSMN `__, including the Google
Speech Commands classification tasks and models.
**Google Speech Commands**. For training and evaluation, Google Speech
Commands Classification (SpeechCom) provides a collection of audio
recordings of spoken words. Its primary goal is to provide a way to
build and test small models that detect when a single word from a set
of ten target words is spoken, with as few false positives as possible
from background noise or unrelated speech.
Neural Architectures
--------------------
CNNs
~~~~
The **CNNs’ implementations** of our codebase borrow from
`MMClassification `__
and `MMDetection `__.
**ResNet**. Residual Networks, or ResNets, learn residual functions
with reference to the layer inputs, instead of learning unreferenced functions.
Instead of making stacked layers directly fit a desired underlying
mapping, residual nets let these layers fit a residual mapping. There is
empirical evidence that these networks are easier to optimize and can
achieve higher accuracy with considerably increased depth.
**VGG**. VGG is a classical convolutional neural network architecture.
It was proposed through an analysis of how to increase the depth of
such networks. The architecture is characterized by its simplicity: the
network utilizes small 3×3 filters, and the only other components are
pooling layers and a fully connected layer.
**MobileNetV2**. MobileNetV2 is a convolutional neural network
architecture that performs well on mobile devices. This model has an
inverted residual structure with residual connections between the
bottleneck layers. The intermediate expansion layer employs lightweight
depthwise convolutions to filter features as a source of nonlinearity.
In MobileNetV2, the architecture begins with an initial layer of 32
convolution filters, followed by 19 residual bottleneck layers.
**Faster-RCNN**. Faster R-CNN is an object detection model that improves
Fast R-CNN by utilizing a region proposal network (RPN) with the CNN
model. The RPN shares full-image convolutional features with the
detection network, enabling nearly cost-free region proposals. A fully
convolutional network is used to predict the bounds and objectness
scores of objects at each position simultaneously. RPNs use end-to-end
training to produce region proposals of high quality and instruct the
unified network where to search. Sharing their convolutional features
allows RPN and Fast R-CNN to be combined into a single network. Faster
R-CNN consists of two modules. The first module is a deep, fully
convolutional network that proposes regions, and the second is the
detector that uses the proposals for giving the final prediction boxes.
**SSD**. SSD is a single-stage object detection method that discretizes
the output space of bounding boxes into a set of default boxes over
different aspect ratios and scales per feature map location. During
prediction, each default box is adjusted to better match the shape of
the object based on its scores for each object category. In addition,
the network automatically handles objects of different sizes by
combining predictions from multiple feature maps with different
resolutions.
Transformers
~~~~~~~~~~~~
The **transformers’ implementations** of our codebase borrow from
`DynaBERT `__
and `BiBERT `__.
**BERT**. BERT, or Bidirectional Encoder Representations from
Transformers, improves upon standard Transformers by removing the
unidirectionality constraint using a masked language model (MLM)
pre-training objective. By masking some tokens from the input, the
masked language model attempts to estimate the original vocabulary id of
the masked word based solely on its context. An MLM objective differs
from a left-to-right language model in that it enables the
representation to integrate the left and right contexts, which
facilitates pre-training a deep bidirectional Transformer. Additionally,
BERT uses a next-sentence prediction task that pre-trains text-pair
representations along with the masked language model. Note that we
replace the direct binarized attention with a bi-attention mechanism to
prevent the model from completely crashing.
MLPs
~~~~
The **MLPs’ implementations** of our codebase borrow from
`PointNet `__,
`BiPointNet `__,
`FSMN `__ and
`BiFSMN `__.
**PointNet**. PointNet is a unified architecture for applications
ranging from object classification and part segmentation to scene
semantic parsing. The architecture directly receives point clouds as
input and outputs either class labels for the entire input or point
segment/part labels. PointNet-Vanilla is a variant of PointNet that
drops the T-Net module. For all PointNet models, we apply EMA-Max as
the aggregator, because directly using the max pooling aggregator
causes the binarized PointNets to fail to converge.
**FSMN**. The feedforward sequential memory network (FSMN) is a novel
neural network structure for modeling long-term dependencies in time
series without using recurrent feedback.
feedforward neural network containing some learnable memory blocks. As a
short-term memory mechanism, the memory blocks encode long context
information using a tapped-delay line structure.
**Deep-FSMN**. The Deep-FSMN architecture is an improved feedforward
sequential memory network (FSMN) with skip connections between memory
blocks in adjacent layers. By utilizing skip connections, information
can be transferred across layers, and thus the gradient vanishing
problem can be avoided when building very deep structures.