Learning Compressed Transforms with Low Displacement Rank
Introduction
- Structured representations for weight matrices help achieve good compression and acceleration while maintaining generalization properties.
- Popular examples of such structured matrices include block-circulant and Toeplitz matrices.
- These fixed constructions tend to underperform on general fully-connected layers.
- The proposed approach leverages the LDR (low displacement rank) framework, which encodes structure through two sparse displacement operators and a low-rank residual term.
- The displacement operators are learned jointly with the low-rank residual.
- Achieves higher accuracy than general unstructured layers while using an order of magnitude fewer parameters (see the rough parameter-count sketch after this list).
- Provides guaranteed compression and matrix-vector multiplication algorithms that are quasi-linear in the number of parameters.
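To make the parameter-savings claim concrete, here is a rough back-of-the-envelope count (an illustrative sketch only, assuming each sub-diagonal operator contributes about n parameters and the rank-r residual factors G, H contribute 2nr; the paper's exact accounting may differ):

```python
# Illustrative parameter count for an LDR-SD layer vs. a dense layer.
# Assumption: ~n parameters per sub-diagonal operator, plus G, H in R^{n x r}.
n, r = 1024, 4

dense_params = n * n                   # unstructured fully-connected layer
ldr_sd_params = 2 * n + 2 * n * r      # operators A, B + residual factors G, H

print(dense_params)                    # 1048576
print(ldr_sd_params)                   # 10240
```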
Displacement Operators and LDR Framework
- Represent a matrix M through two displacement operators (A, B)
- with a residual R so that A M - M B = R.
- M can be manipulated through the compressed representation (A, B, R).
- Assuming A and B have disjoint eigenvalues, M can be recovered from (A, B, R).
- The rank of R is called the displacement rank of M w.r.t. (A, B).
- Toeplitz-like matrices have low displacement rank w.r.t. shift or cyclic-shift operators.
- The standard formulation takes A = Z_1 and B = Z_{-1}, where Z_f is the f-unit-circulant matrix (ones on the subdiagonal, f in the top-right corner); the displacement equation Z_1 M - M Z_{-1} = R is the displacement equation for a Toeplitz-like matrix.
- The shift-invariant structure of a Toeplitz matrix leads to a residual of rank at most 2 (verified numerically in the sketch after this list).
- Some examples of matrices that satisfy a displacement property: Hankel, Toeplitz, Cauchy, etc.
- Certain combinations of LDR matrices (e.g., sums, products, inverses) preserve low displacement rank.
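A minimal NumPy check of the Toeplitz facts above (a sketch, not the paper's code; Z(n, f) below builds the f-unit-circulant operator described in this list):

```python
import numpy as np
from scipy.linalg import toeplitz, solve_sylvester

def Z(n, f):
    """f-unit-circulant: ones on the subdiagonal, f in the top-right corner."""
    M = np.diag(np.ones(n - 1), k=-1)
    M[0, -1] = f
    return M

n = 8
rng = np.random.default_rng(0)
T = toeplitz(rng.standard_normal(n), rng.standard_normal(n))  # random Toeplitz matrix

A, B = Z(n, 1), Z(n, -1)
R = A @ T - T @ B                        # Sylvester displacement A T - T B = R
print(np.linalg.matrix_rank(R))          # 2: rank is at most 2 (low displacement rank)

# A and B have disjoint eigenvalues (n-th roots of +1 vs. -1), so T is
# recoverable from (A, B, R) by solving the Sylvester equation A X - X B = R.
T_rec = solve_sylvester(A, -B, R)        # solve_sylvester solves A X + X B = Q
print(np.allclose(T_rec, T))             # True
```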
Learning Displacement Operators
- Two new classes of displacement operators: sub-diagonal (LDR-SD) and tri-diagonal (LDR-TD) matrices.
- These matrices can model rich structure and subsume many types of linear transformations used in machine learning
- Given the compressed representation, we need to compute the matrix-vector product Mx in each forward pass.
- Several schemes for explicitly reconstructing M from its displacement parameters are known for specific cases, but they do not always apply to general operators.
- The parameters are instead used to implicitly construct a slightly different matrix with at most double the displacement rank, which is simpler to work with.
- The stored parameters are (A, B, G, H), where A and B are either sub-diagonal or tri-diagonal operators and G, H are n × r factors of the low-rank residual R = G H^T.
- Near-linear-time multiplication algorithms for LDR-TD and LDR-SD matrices were recently shown to exist.
- However, complete algorithms were not provided, as those results relied on theoretical tools such as the transposition principle.
- The paper provides a near-linear-time algorithm for LDR-SD matrices.
- For LDR-TD, M is explicitly constructed from its parameters and then standard matrix-vector multiplication is used (see the dense sketch after this list).
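A minimal dense sketch of the LDR-TD path described above, assuming generic tri-diagonal operators with disjoint spectra so that M can be recovered with an off-the-shelf Sylvester solve; this is only an O(n^3) illustration, not the paper's explicit construction or its near-linear-time LDR-SD algorithm:

```python
import numpy as np
from scipy.linalg import solve_sylvester

n, r = 64, 3
rng = np.random.default_rng(0)

def tridiag(main, lower, upper):
    """Build a tri-diagonal operator from its three (learnable) diagonals."""
    return np.diag(main) + np.diag(lower, k=-1) + np.diag(upper, k=1)

# Stored parameters of an LDR-TD layer: the diagonals of A and B plus the
# low-rank residual factors G, H (a sketch; in the paper these are trained
# jointly by backpropagation).
A = tridiag(rng.standard_normal(n), rng.standard_normal(n - 1), rng.standard_normal(n - 1))
B = tridiag(rng.standard_normal(n), rng.standard_normal(n - 1), rng.standard_normal(n - 1))
G = rng.standard_normal((n, r))
H = rng.standard_normal((n, r))

# Explicitly reconstruct M from A M - M B = G H^T (random A, B have disjoint
# spectra almost surely), then apply the layer with a standard matvec.
M = solve_sylvester(A, -B, G @ H.T)
x = rng.standard_normal(n)
y = M @ x
print(y.shape)  # (n,)
```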
Theoretical Properties of Structured Matrices
- The matrix classes used are unusual in that the parameters interact multiplicatively (namely through products and powers of the operators A and B) to implicitly define the actual layer.
- It can be proved that this does not change the learnability of these classes.
- Displacement rank and equivariance: related to building equivariant feature representations that transform in predictable ways when the input is transformed. An equivariant feature map φ satisfies φ(Bx) = Aφ(x).
- Perturbing the input by a transformation B before passing through the map is equivalent to first finding the features φ(x) and then transforming them by A.
- LDR matrices are a suitable choice for modeling approximately equivariant linear maps (see the shift-equivariance sketch after this list).
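A small numeric illustration of the equivariance connection (a sketch with an assumed toy example: a circulant matrix M commutes with the cyclic shift Z_1, i.e., it has displacement rank 0 w.r.t. (Z_1, Z_1), so φ(x) = Mx is exactly shift-equivariant; small nonzero displacement rank relaxes this to approximate equivariance):

```python
import numpy as np
from scipy.linalg import circulant

n = 8
rng = np.random.default_rng(1)

Z1 = np.diag(np.ones(n - 1), k=-1)
Z1[0, -1] = 1                                  # cyclic shift operator
M = circulant(rng.standard_normal(n))          # circulant => Z1 M - M Z1 = 0

phi = lambda v: M @ v                          # linear feature map
x = rng.standard_normal(n)

# Shifting the input and then applying phi equals applying phi and then shifting:
print(np.allclose(phi(Z1 @ x), Z1 @ phi(x)))   # True
```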
Results
- The proposed method is tested on a single-hidden-layer neural network and on the fully-connected layer of a CNN.
- Datasets used: two variants of MNIST (bg-rot, noise), CIFAR-10 (grayscale), and NORB.
- A table in the paper compares the proposed approach to several baselines that use dense structured matrices to compress the hidden layer of a single-hidden-layer neural network.
- The paper also shows results on a language modeling task by replacing the weight matrices in a single-layer LSTM.
- It further reports results from replacing a five-layer CNN (convolutional channels, max pooling, and fully-connected layers) with two generic LDR matrices.
References:
- Thomas, A., Gu, A., Dao, T., Rudra, A., and Ré, C. Learning compressed transforms with low displacement rank. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 9066–9078. Curran Associates, Inc., 2018