Links: [arXiv] [code]

### Introduction

• Structured representations for weight help achieve good compression and acceleration while maintaining generalization properties.

• Some popular examples of low-rank matrices are: Block circulant, Toeplitz matrices, etc

• These constructions tend to underperform on general fully-connected layers

• The proposed approach leverages a LDR (low displacement rank) framework, which encodes structure through two sparse displacement operators and a low-rank residual term.

• Displacement operators are jointly learned with the low rank residual.

• Achieves higher accuracy than the general unstructured layers while using an order of magnitude fewer parameters.

• Provides guaranteed compression, matrix-vector multiplication algorithms that are quasi-linear in number of parameters.

### Displacement Operators and LDR Framework

• Represent a matrix $\mathbf{M} \in \mathbb{R}^{m \times n}$ through displacement operators $\left(\mathbf{A} \in \mathbb{R}^{m \times m}, \mathbf{B} \in \mathbb{R}^{n \times n}\right)$

• $\nabla_{\mathbf{A}, \mathbf{B}} : \mathbf{M} \mapsto \mathbf{A M}-\mathbf{M B}$ with a residual $\mathbf{R}$ so that $\mathbf{AM - MB = R}$

• $M$ can be manipulated through compressed representation $\mathbf{(A, B, R)}$

• Assuming $A$ and $B$ have disjoint eigenvalues, $\mathbf{M}$ can be recovered from $\mathbf{(A, B, R)}$

• Rank $\mathbf{R}$ is called the displacement rank of $\mathbf{M}$ w.r.t $\mathbf{(A, B)}$

• Toeplitz-like matrices have LDR w.r.t shift or cycle operators.

• Standard formulation for $\mathbf{A}$ and $\mathbf{B}$

$\mathbf{A}= \mathbf{Z}_{1}$, $\mathbf{B}= \mathbf{Z}_{-1}$ where $\mathbf{Z}_{f}= \begin{bmatrix} \hspace{-0.3cm} 0_{1 \hspace{0.05cm} \times \hspace{0.05cm} (n-1)} \hspace{0.6cm} f \\ \hspace{0.1cm} I_{n-1} \hspace{0.6cm} 0_{(n-1) \hspace{0.05cm} \times \hspace{0.05cm} 1} \end{bmatrix}$

• Displacement equation for Toeplitz like matrix. • Shift invariant structure leads to a residual rank of atmost 2.

• Some examples of matrices that satisfy displacement property:Hankel, Toeplitz, Cauchy, etc.

• Certain combinations of LDR matrices preserve low displacement rank ### Learning Displacement Operators

• Two classes of new displacement operators: Sub-diagonal and Tri-Diagonal • These matrices can model rich structure and subsume many types of linear transformations used in machine learning • Given $\mathbf{A, B, R}$, we need to do a matrix-vector multiplication.

• Several schemes for explicitly reconstructing $\mathbf{M}$ from its displacement parameters are known for specific cases but do not always apply to general operators.

• $\mathbf{(A, B, R)}$ are used to implicitly construct a slightly different matrix with at most double the displacement rank, which is simpler to work with. • $\mathbf{(A, B, G, H)}$ are stored, $\mathbf{(A, B)}$ are either sub-diagonal or tri-diagonal operators and $\mathbf{(G, H) \in \mathbb{R}^{n \times r}}$

• Near linear time algorithms for LDR-TD and LDR-SD are recently shown to exist.

• Complete algorithms were not provided, as they relied on theoretical results such as the transposition principle.

• Paper provides a near-linear time algorithm for LDR-SD matrices.

• For LDR-TD, $\mathcal{K}\left(\mathbf{A}, \mathbf{g}_{i}\right) \text { and } \mathcal{K}\left(\mathbf{B}^{T}, \mathbf{h}_{i}\right)$ are explicitly constructed and then standard matrix vector multiplication is used. ### Theoretical Properties of Structured Matrices

• The matrices used are unusual in that the parameters interact multiplicatively (namely in $\mathbf{A_i}$, $\mathbf{B_i}$) to implicitly define the actual layer.

• It can be proved that this does not change the learnability of classes. • Displacement rank and equivariance: Related to building equivariant feature representations that transform in predictable ways when the input is transformed. An equivariant feature map $\Phi$ satisfies:

$\Phi(B(x))=A(\Phi(x))$

• Perturbing the input by a transformation B before passing through the map $\Phi$ is equivalent to first finding the features $\Phi$, then transforming by $A$.

• LDR matrices are a suitable choice for modeling approximately equivariant linear maps.

### Results

• The proposed method is tested on a single hidden layer neural network and the fully-connected layer of a CNN.

• Datasets used: Two variants of MNIST (bg-rot, noise), CIFAR-10 (grayscale), and NORB.

• The table shown below compares the proposed approach to several baselines using dense, structured matrices to compress the hidden layer of a single hidden layer neural network. • Paper also shows results for language modeling task by replacing the weights in a single layer LSTM • Results on replacing a five-layer CNN consisting of convolutional channels, max pooling, FC layers with two generic LDR matrices. 1. Thomas, A., Gu, A., Dao, T., Rudra, A., and Ré, C. Learning compressed transforms with low displacement rank. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 9066–9078. Curran Associates, Inc., 2018