Learning Compressed Transforms with Low Displacement Rank
Introduction

Structured representations of weight matrices help achieve good compression and acceleration while maintaining generalization properties.

Some popular examples of such structured matrices: block-circulant matrices, Toeplitz matrices, etc.

These fixed constructions tend to underperform on general fully-connected layers.

The proposed approach leverages the low displacement rank (LDR) framework, which encodes structure through two sparse displacement operators and a low-rank residual term.

The displacement operators are learned jointly with the low-rank residual.

Achieves higher accuracy than general unstructured layers while using an order of magnitude fewer parameters.

Provides guaranteed compression and matrix-vector multiplication algorithms that are quasi-linear in the number of parameters.
Displacement Operators and LDR Framework

Represent a matrix $M$ through a pair of displacement operators $(A, B)$

with a residual $R$ so that $\nabla_{A,B}[M] := AM - MB = R$

$M$ can be manipulated through the compressed representation $(A, B, R)$, with the low-rank factorization $R = GH^T$

Assuming $A$ and $B$ have disjoint eigenvalues, $M$ can be recovered from $(A, B, R)$ (the Sylvester equation $AM - MB = R$ then has a unique solution)

The rank $r$ of $R$ is called the displacement rank of $M$ w.r.t. $(A, B)$
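As a concrete illustration (a minimal NumPy/SciPy sketch with illustrative sizes, not the paper's code): operators with disjoint spectra plus a low-rank residual pin down $M$ through the Sylvester equation.

```python
import numpy as np
from scipy.linalg import solve_sylvester

n, r = 6, 1
rng = np.random.default_rng(1)

# Operators with disjoint eigenvalues (diagonal here for simplicity).
A = np.diag(np.arange(1.0, n + 1))       # eigenvalues 1, ..., n
B = np.diag(-np.arange(1.0, n + 1))      # eigenvalues -1, ..., -n
R = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-r residual

# Unique M satisfying A M - M B = R  (solve_sylvester solves A X + X B = Q).
M = solve_sylvester(A, -B, R)
assert np.allclose(A @ M - M @ B, R)
```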

Toeplitz-like matrices have low displacement rank w.r.t. the shift or cycle operators.

Standard formulation: $A = Z_1$ and $B = Z_{-1}$, where $Z_f$ is the unit-$f$-circulant matrix (ones on the subdiagonal, $f$ in the top-right corner). The displacement equation for a Toeplitz-like matrix is $Z_1 M - M Z_{-1} = GH^T$.

The shift-invariant structure leads to a residual of rank at most 2.
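A quick NumPy check of this rank bound (a sketch; $Z_f$ here denotes the unit-$f$-circulant with ones on the subdiagonal and $f$ in the top-right corner, matching the formulation above):

```python
import numpy as np

def Z(f, n):
    """Unit-f-circulant: ones on the subdiagonal, f in the top-right corner."""
    M = np.diag(np.ones(n - 1), -1)
    M[0, -1] = f
    return M

n = 8
v = np.random.randn(2 * n - 1)                      # Toeplitz entries indexed by i - j
T = np.array([[v[i - j + n - 1] for j in range(n)] for i in range(n)])
R = Z(1, n) @ T - T @ Z(-1, n)
print(np.linalg.matrix_rank(R))                     # at most 2
```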

Some examples of matrices that satisfy the displacement property: Hankel, Toeplitz, Cauchy, etc.

Certain combinations of LDR matrices (e.g., sums, products, and inverses) preserve low displacement rank.
Learning Displacement Operators
Two new classes of displacement operators are introduced: subdiagonal (LDR-SD) and tridiagonal (LDR-TD).
These operators can model rich structure and subsume many types of linear transformations used in machine learning.
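A minimal NumPy sketch of what these operator classes look like (a simplification of the paper's parameterization; possible extra entries such as corner terms are omitted here):

```python
import numpy as np

def subdiagonal_operator(subdiag):
    """Operator parameterized only by its subdiagonal entries (LDR-SD style)."""
    return np.diag(subdiag, k=-1)

def tridiagonal_operator(sub, main, sup):
    """Operator parameterized by its sub-, main-, and super-diagonals (LDR-TD style)."""
    return np.diag(sub, k=-1) + np.diag(main) + np.diag(sup, k=1)

n = 6
A = subdiagonal_operator(np.random.randn(n - 1))    # ~n learnable parameters
B = tridiagonal_operator(np.random.randn(n - 1),
                         np.random.randn(n),
                         np.random.randn(n - 1))    # ~3n learnable parameters
```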

Given $M$ in its compressed form $(A, B, G, H)$, we need to compute the matrix-vector product $Mx$.

Several schemes for explicitly reconstructing $M$ from its displacement parameters are known for specific cases, but they do not always apply to general operators.

The displacement parameters are used to implicitly construct a slightly different matrix with at most double the displacement rank, which is simpler to work with.

The parameters $A, B, G, H$ are stored; $A$ and $B$ are either subdiagonal or tridiagonal operators, and $G, H \in \mathbb{R}^{n \times r}$ are the low-rank residual factors.
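A rough parameter count for one $n \times n$ layer (illustrative sizes, assuming the operators carry only their diagonal bands as parameters) shows the compression relative to a dense layer:

```python
n, r = 1024, 4                                  # illustrative layer size and rank
dense_params  = n * n                           # unstructured fully-connected layer
ldr_sd_params = 2 * (n - 1) + 2 * n * r         # two subdiagonal operators + G, H
ldr_td_params = 2 * (3 * n - 2) + 2 * n * r     # two tridiagonal operators + G, H
print(dense_params, ldr_sd_params, ldr_td_params)   # 1048576 10238 14332
```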

Near-linear-time multiplication algorithms for LDR-TD and LDR-SD matrices were recently shown to exist.

However, complete algorithms were not provided, as those results relied on theoretical tools such as the transposition principle.

The paper provides a complete near-linear-time algorithm for LDR-SD matrices.

For LDR-TD, $M$ is explicitly reconstructed from its displacement parameters and then standard matrix-vector multiplication is used.
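A minimal sketch of this reconstruct-then-multiply route for LDR-TD (assuming $A$ and $B$ have disjoint spectra so the Sylvester equation has a unique solution; an $O(n^2)$ illustration, not the paper's optimized code):

```python
import numpy as np
from scipy.linalg import solve_sylvester

def ldr_td_matvec(A, B, G, H, x):
    """Recover M from A M - M B = G H^T, then apply it to x."""
    M = solve_sylvester(A, -B, G @ H.T)   # solve_sylvester solves A X + X B = Q
    return M @ x

n, r = 8, 2
rng = np.random.default_rng(0)

def tridiag():
    return (np.diag(rng.standard_normal(n - 1), -1)
            + np.diag(rng.standard_normal(n))
            + np.diag(rng.standard_normal(n - 1), 1))

A, B = tridiag(), tridiag()
G, H = rng.standard_normal((n, r)), rng.standard_normal((n, r))
y = ldr_td_matvec(A, B, G, H, rng.standard_normal(n))
```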
Theoretical Properties of Structured Matrices

The matrices used here are unusual in that the parameters interact multiplicatively (e.g., through the product $GH^T$ and the learned operators $A$ and $B$) to implicitly define the actual layer.

It can be proved that this does not change the learnability of these classes; the paper bounds the VC dimension of multi-layer networks whose weight matrices are LDR.
Displacement rank and equivariance: related to building equivariant feature representations that transform in predictable ways when the input is transformed. An equivariant feature map $\phi$ satisfies $\phi(Bx) = A\phi(x)$.

Perturbing the input by a transformation $B$ before passing through the map is equivalent to first computing the features $\phi(x)$ and then transforming them by $A$.

LDR matrices are a suitable choice for modeling approximately equivariant linear maps.
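This can be seen directly from the displacement equation: rearranging $AM - MB = R$ gives

$$
M(Bx) \;=\; A(Mx) \;-\; Rx ,
$$

so when $R = GH^T$ has rank $r \ll n$, applying $M$ after the input transformation $B$ matches applying $A$ after $M$, up to a rank-$r$ correction.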
Results

The proposed method is tested on a single-hidden-layer neural network and on the fully-connected layer of a CNN.

Datasets used: two variants of MNIST (bg-rot, noise), CIFAR-10 (grayscale), and NORB.

The paper compares the proposed approach to several baselines that use dense and structured matrices to compress the hidden layer of a single-hidden-layer neural network.
The paper also shows results on a language modeling task by replacing the weight matrices in a single-layer LSTM.
Results are also shown for replacing a five-layer CNN (convolutional channels, max pooling, and FC layers) with two generic LDR matrices.
References:
Thomas, A., Gu, A., Dao, T., Rudra, A., and Ré, C. Learning compressed transforms with low displacement rank. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 9066–9078. Curran Associates, Inc., 2018.