Model Compression in Deep Vision Networks (Part 1)
Compressing CNNs using Pruning and Mixed Precision Quantization
This article is the first half of a review of four research papers read as part of the course Practical Deep Learning Systems Performance (COMSE6998) at Columbia University in Fall 2022, under the guidance of Prof. Parijat Dube. In this blog I summarize two of those papers; you can refer to Ria’s article for the other two.
Introduction
Large-scale CNNs (Convolutional Neural Networks) are the backbone of many state-of-the-art computer vision algorithms, achieving high accuracy on a wide variety of tasks. These networks tend to be very deep and contain a huge number of parameters, often millions or billions, with high redundancy among the weights, and they demand tremendous compute resources. Owing to these compute and hardware requirements, it is very difficult to deploy such models on embedded and mobile devices, which have limited memory and computational resources. To address this problem, research efforts have explored reducing model size with techniques such as Pruning, Neural Architecture Search, Quantization, and Knowledge Distillation. This blog covers two specific approaches: pruning, from the Learning to Prune Filters in CNNs paper, and quantization, from the Mixed Precision Quantization paper, to reduce model size and computation.
Paper 1: Learning to Prune Filters in CNNs [1]
Prior to the technique proposed in this paper, existing methods pruned filters based on the magnitude of their weights or their L1 norms. Pruning individual weights of a filter increases the sparsity of the CNN, which then requires specially designed software or hardware to exploit. This paper introduces a data-driven way to prune entire filters based on their contribution to the overall accuracy of the model. The authors formulate this as a “try-and-learn” task in which a pruning agent, modeled by a neural network, takes the filter weights as input and outputs a binary decision (keep/prune) for each filter.
Each pruning agent is trained using a novel reward function (Eq 1), the product of an accuracy term and an efficiency term. This reward provides easy control over the tradeoff between network performance and scale. The accuracy term (Eq 3) contains a drop-bound parameter “b” which ensures that the accuracy drop after pruning does not exceed this threshold, while also rewarding higher accuracy. The efficiency term is the log of the ratio of the total number of filters in the model to the number of filters kept by the pruning agent (Eq 2); this favors more aggressive pruning. Because the reward function is not differentiable with respect to the parameters of the pruning agent, a policy gradient update is used to train the agent.
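As a concrete illustration, here is a minimal sketch of such a reward, assuming the accuracy term is the accuracy drop normalized by the drop-bound b and the efficiency term is the log filter ratio described above. The function name and the exact normalization are my own, not the paper’s.

```python
import math

def pruning_reward(base_acc, pruned_acc, total_filters, kept_filters, b):
    """Sketch of reward = accuracy_term * efficiency_term (cf. Eq 1-3).

    Assumes the accuracy term is (b - drop) / b, which stays positive while
    the accuracy drop is under the drop-bound b and penalizes larger drops;
    the paper's exact normalization may differ.
    """
    acc_drop = base_acc - pruned_acc
    accuracy_term = (b - acc_drop) / b                         # rewards small drops
    efficiency_term = math.log(total_filters / kept_filters)   # rewards pruning more
    return accuracy_term * efficiency_term

# Example: baseline 92.0% accuracy, pruned 91.2%, 512 of 4224 filters kept, b = 2
print(pruning_reward(92.0, 91.2, 4224, 512, b=2.0))
```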
The authors evaluated their method on two tasks: visual recognition and semantic segmentation. For the visual recognition task, they performed the experiments in two ways: first training a pruning agent for each layer individually, and then training agents for all the convolutional layers together. The results of individual-layer pruning on a VGG-16 network trained on the CIFAR-10 dataset with accuracy drop-bound b = 2 are shown in Fig 1. From the figure, we can observe that some layers can be pruned by up to 99.4% with a very small (0.5%) accuracy drop (layers 10 and 11).
To prune filters layer by layer, they start with a baseline model (f) that is already trained on the dataset and prune it according to the actions chosen by the neural pruning agent for each filter. The agent’s actions are initialized at random and are learned over iterations by evaluating the reward function on the actions and the resulting accuracy drop, using the policy gradient method. After this surgery on the baseline network, the pruned network is fine-tuned on the dataset to adjust to the pruning. This process is repeated for each layer of the network.
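The policy-gradient step can be sketched with a REINFORCE-style update as below. The toy agent architecture and the `evaluate_reward` callback (which would prune the baseline network with the sampled actions, fine-tune briefly, and return the reward) are placeholders of my own, not the paper’s exact setup.

```python
import torch
import torch.nn as nn

class PruningAgent(nn.Module):
    """Toy agent: maps a layer's flattened filter weights to keep-probabilities."""
    def __init__(self, num_filters, filter_size):
        super().__init__()
        self.fc = nn.Linear(num_filters * filter_size, num_filters)

    def forward(self, filter_weights):             # (num_filters, filter_size)
        return torch.sigmoid(self.fc(filter_weights.flatten()))

def reinforce_step(agent, optimizer, filter_weights, evaluate_reward):
    """One 'try-and-learn' iteration via policy gradient (REINFORCE)."""
    probs = agent(filter_weights)
    dist = torch.distributions.Bernoulli(probs)
    actions = dist.sample()                         # 1 = keep filter, 0 = prune it
    reward = evaluate_reward(actions)               # non-differentiable reward (Eq 1)
    loss = -reward * dist.log_prob(actions).sum()   # maximize expected reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```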
When all layers are pruned together, we start with L pruning agents for an L-layer network, one for each convolutional layer, plus a baseline network. The individual-layer pruning method is then applied layer by layer: the baseline network is not re-initialized as we go deeper, so the network pruned in the earlier layers is carried forward and pruned further in the deeper layers. After each layer is pruned, the model is fine-tuned on the training set to compensate for the performance drop. Fig 2 below shows the results of this process for VGG-16 on the CIFAR-10 dataset with a drop-bound of 2. We can observe that each layer is pruned a little less than in its single-layer pruning counterpart, but the pruning is still aggressive. From Fig 1 and Fig 2 we can conclude that deeper layers have higher redundancy in their filters and can be pruned without hurting accuracy much.
The drop-bound parameter “b” controls the tradeoff between network performance and scale. Experiments with different values of b are described in Fig 3 and the results are shown in Fig 4. We can see that a larger drop bound yields a higher pruning ratio, more saved FLOPs (floating point operations), higher GPU and CPU speedup, and a larger accuracy drop on the test set. We can also see that, at the same prune ratio, data-driven pruning has a lower accuracy drop than magnitude-based pruning on the VGG-16 network.
Similar experiments were performed on a ResNet-18 network with the CIFAR-10 dataset; ResNet-18 is a deeper and more complex architecture than VGG-16. The observed pruning ratios were smaller (though still significant) than those of VGG-16, which was expected since ResNet-18 is already a more efficient architecture. The results also show that, within a residual block, the first conv layer is easier to prune than the second; this can be seen in Fig 5 (the 7th/9th/12th/14th layers are first layers, the 8th/10th/13th/15th are second layers).
For the semantic segmentation task, experiments were performed on an FCN-32s network with the Pascal VOC dataset and a SegNet network with the CamVid dataset. Segmentation networks perform pixel-level labeling, which requires more weights and representational capacity, making them more challenging to prune. The evaluation metric was global pixel accuracy and the drop bound b was set to 2. From Fig 6, we can see that the SegNet network has a negative accuracy drop, i.e., an increase in accuracy, which is favored by the accuracy term in the reward function. This happened because pruning reduced the network’s overfitting.
An interesting result can be seen in Fig 7, which shows the pruning results for SegNet on the CamVid dataset: even though the network is architecturally symmetric, the deeper half has more redundant filters to prune. 49.2% of the filters were removed from the second half, compared to 26.9% from the first half.
Paper 2: Mixed Precision Quantization via Differentiable Neural Architecture Search [2]
Quantization is another technique to compress a model by storing the weights and activations of the convolution layers with fewer bits, i.e., lower precision (< 32 bits). Work before this paper used the same precision for all the activations and weights in every layer of the CNN; the idea of assigning different precisions (bit-widths) to different layers was first introduced here. Starting with a CNN of N layers and M candidate precisions per layer, we want to find the precision assignment that reduces model size, memory footprint, and the number of computations while keeping the accuracy drop minimal. A brute-force search over all assignments takes O(M^N) time, which is exponential (for example, with M = 4 precisions and N = 20 layers there are already about 10^12 configurations); hence the authors introduce a new approach to solve this problem efficiently.
The authors introduce the Differentiable Neural Architecture Search (DNAS) framework, which represents the architecture search space as a stochastic “super net”: a DAG (Directed Acyclic Graph) whose nodes represent the intermediate data tensors (feature maps) of the CNN and whose edges represent the operators (convolution layers) of the network, as shown in Fig 1. Any candidate architecture is then a child subgraph of this super net. Each edge of the graph is executed stochastically, with its execution probability parameterized by architecture parameters. Thus, we need to find the architecture parameters that give the most promising accuracy, and the final network can be sampled from these optimal parameters. The optimal architecture parameters are found by training the super net with SGD (Stochastic Gradient Descent) with respect to both the network weights and the architecture parameters.
In the DNAS framework, a data tensor (node vⱼ) is computed by summing the outputs of all incoming edges from previous nodes (vᵢ’s), each weighted by a mask variable, as shown in Eq 1 above. The mask m is an “edge mask” used to decide which edge is executed in the candidate network. The paper starts with discrete edge masks taking values in {0, 1}, where all masks into a particular node sum to 1; this ensures that exactly one edge is executed. The super net is made stochastic by executing edges according to the probability distribution given in Eq 2. This reduces the architecture search problem to finding the architecture distribution parameter θ that minimizes the expected loss; given θ, a candidate architecture can be obtained by sampling from P_θ.
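A sketch of this node computation is shown below, with the candidate operators and mask handling simplified; the class and variable names are illustrative, not from the paper.

```python
import torch.nn as nn

class SuperNetNode(nn.Module):
    """Computes v_j = sum_i m_i * op_i(v_prev), i.e. Eq 1 in sketch form."""
    def __init__(self, candidate_ops):
        super().__init__()
        # e.g. the same conv layer quantized at several candidate precisions
        self.ops = nn.ModuleList(candidate_ops)

    def forward(self, v_prev, mask):                # mask has one entry per op
        # With a one-hot (hard) mask only one edge contributes to the sum;
        # with a Gumbel-Softmax (soft) mask every edge contributes a weighted share.
        return sum(m * op(v_prev) for m, op in zip(mask, self.ops))
```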
To solve for the optimal architecture parameter θ, we estimate the gradient of the loss function with the Monte Carlo method (Eq 3). This estimate has high variance, because the architecture space is orders of magnitude larger than any feasible batch size B, and high gradient variance makes SGD difficult to converge. The problem is solved by using the Gumbel Softmax to control edge selection: instead of “hard-sampling”, we use “soft-sampling” to choose the edges executed between two nodes. This makes the mask variable continuous and directly differentiable with respect to θ (Eq 4).
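PyTorch ships a Gumbel-Softmax implementation, so the soft-sampling of an edge mask can be sketched as follows; the tensor shape and temperature values here are illustrative.

```python
import torch
import torch.nn.functional as F

# Architecture parameters theta for one layer with 4 candidate precisions
theta = torch.zeros(4, requires_grad=True)

# Soft-sampled masks, differentiable w.r.t. theta (cf. Eq 4)
mask_early = F.gumbel_softmax(theta, tau=5.0)   # high tau -> near-uniform mask
mask_late = F.gumbel_softmax(theta, tau=0.1)    # low tau  -> near-one-hot mask
```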
The temperature coefficient τ is exponentially decayed. At the start of training we want gradients with low variance, which is favoured by high values of τ that make the distribution of m nearly uniform. As training progresses, τ decreases and the distribution of m becomes more and more discrete; near the end of training, the higher variance that comes with (almost) discrete edge selection is acceptable and even favourable. This annealing of the temperature makes the pipeline behave like DARTS (train the entire super net together) at the beginning and like ENAS (sample child networks to be trained independently) towards the end. There are two sets of parameters to train, w and θ, and they are trained alternately and separately: w optimizes all the candidate edges, while θ shifts probability towards the edges with better performance. Hence w is trained for a few extra epochs (N_warmup) so that it is sufficiently trained before θ updates begin. A sketch of this schedule follows.
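The alternating schedule could look roughly like the loop below. The assumption that `supernet(x, tau)` soft-samples its edges internally, the optimizer choices, and the two data splits are my own simplifications, not the paper’s exact training code.

```python
import torch

def train_dnas(supernet, w_params, theta_params, w_loader, theta_loader,
               epochs, n_warmup, tau0=5.0, decay=0.95):
    """Alternating w / theta optimization with exponential temperature decay.

    Assumes supernet(x, tau) soft-samples its edges at temperature tau and
    returns class logits; the two loaders are disjoint splits of the train set.
    """
    w_opt = torch.optim.SGD(w_params, lr=0.1, momentum=0.9)
    theta_opt = torch.optim.Adam(theta_params, lr=1e-3)
    criterion = torch.nn.CrossEntropyLoss()
    tau = tau0
    for epoch in range(epochs):
        # Operator weights w are updated in every epoch.
        for x, y in w_loader:
            loss = criterion(supernet(x, tau), y)
            w_opt.zero_grad()
            loss.backward()
            w_opt.step()
        # Architecture parameters theta are updated only after the warm-up epochs.
        if epoch >= n_warmup:
            for x, y in theta_loader:
                loss = criterion(supernet(x, tau), y)
                theta_opt.zero_grad()
                loss.backward()
                theta_opt.step()
        tau *= decay   # anneal the temperature
```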
For the mixed precision quantization problem, the DNAS framework creates a super net with the same macro structure as the target network (number of layers and number of filters in each layer). The loss function is given in Eq 5 below, where Cost(a) denotes the cost of the candidate architecture (based on the chosen precisions of weights and activations) and C(·) balances the cross-entropy term against the cost (it can favour one or the other depending on the hyperparameters chosen).
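One way to write such a cost-aware loss is sketched below, with a simple multiplicative balancing function C(cost) = beta * cost^gamma chosen purely for illustration; the paper’s exact form of C may differ.

```python
import torch.nn.functional as F

def dnas_loss(logits, labels, arch_cost, beta=0.05, gamma=1.2):
    """Cross-entropy scaled by a cost term, in the spirit of Eq 5.

    arch_cost is the (expected) cost of the sampled architecture, e.g. total
    weight/activation bits; beta and gamma trade accuracy against cost.
    """
    return F.cross_entropy(logits, labels) * beta * arch_cost ** gamma
```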
Experiments were performed on two datasets (CIFAR-10 and ImageNet) with different network architectures. ResNet{20,56,110} were quantized on weights only, keeping full precision for the activations, and the search was carried out at the block level, so all layers in a block get the same precision. This experiment used the CIFAR-10 dataset and, by convention, the first and last layers were not quantized. The candidate precisions ranged from 0 to 32 bits, where 0 means the block is skipped (its input is passed through as its output) and 32 means full precision is used for that block. The results of this experiment are shown in Fig 2 below.
From the results above, we can observe that the DNAS framework enables a more efficient search, as even the full-precision results are generally better. The Most Accurate models reach higher accuracy by up to 0.37% while compressing the model 11.6x to 12.5x, and the Most Efficient models achieve 16.6x to 20.3x model size compression with an accuracy drop of less than 0.39%. From Fig 3 we can see that, in the most efficient network, the third block (in group 1) has precision 0, which means it is skipped.
ResNet{18,34} were quantized on the ImageNet dataset. Two types of experiments were performed: one to compress model size, where only the weights are quantized, and one to compress computational cost, where both weights and activations are quantized. The results are shown in Fig 4 for model-size compression and Fig 5 for computation compression. DNAS is very efficient: the search for ResNet-18 on ImageNet takes less than 5 hours on 8 V100 GPUs.
Fig 4 tells a story very similar to Fig 2: the DNAS models outperform their full-precision counterparts, and the Most Accurate models reach higher accuracy than their fixed-precision counterparts with 2-bit (TTQ) and 3-bit (ADMM) weights. From Fig 5 we can see that the most accurate architecture (arch-1) has accuracy very similar to the full-precision model at a compression rate of about 40x. Arch-2 has a compression rate similar to PACT, DoReFa, and QIP but higher accuracy, and arch-3 has both a higher compression rate and higher accuracy than GroupNet.
Observations
From the two papers discussed above, we can observe that:
- The deeper a neural network is, the more parameters it has, the larger it is, and the more computation it requires. From the first paper we see that, as we go deeper into a ConvNet, the filters learn increasingly redundant information, and deeper layers can be pruned by more than 90% without losing much accuracy on the task at hand, although the pruning search takes a lot of time.
- From the second paper, we observe that storing the weights and activations of every layer at full precision is overkill: the experiments show that DNAS alone (at full precision) already improves model accuracy, and mixed precision quantization yields better accuracy in the most accurate models while dropping very little in the most efficient ones.
Conclusion
- There are many different ways to do model compression, such as Model Pruning, Quantization, Neural Architecture Search, and Knowledge Distillation, and choosing among them is a very experimental task.
- DNAS is a general architecture search framework that is not limited to the mixed precision quantization problem and can be utilized in different applications.
- Data-driven pruning performs better than pruning based on the magnitude of the weights or the L1 norm.
- The “try-and-learn” algorithm is not very efficient on its own: it is currently a long-running sequential task, and we need a way to formulate pruning of the entire network as one learning task for higher automation.
References
- [1] Learning to Prune Filters in Convolutional Neural Networks [2018]
- [2] Mixed Precision Quantization of ConvNets via Differentiable Neural Architecture Search [2018]
- [3] Pruning Filters for Efficient ConvNets [2016]
- [4] DARTS: Differentiable Architecture Search [2018]
- [5] Efficient Neural Architecture Search via Parameter Sharing [2018]