Links: [arXiv] [code]

Introduction

  • Structured representations of weight matrices help achieve good compression and acceleration while maintaining generalization properties.

  • Popular examples of such structured classes include low-rank, block-circulant, and Toeplitz matrices.

  • These fixed constructions, however, tend to underperform general unstructured fully-connected layers.

  • The proposed approach leverages an LDR (low displacement rank) framework, which encodes structure through two sparse displacement operators and a low-rank residual term.

  • Displacement operators are jointly learned with the low rank residual.

  • Achieves higher accuracy than the general unstructured layers while using an order of magnitude fewer parameters.

  • Provides guaranteed compression and matrix-vector multiplication algorithms that are quasi-linear in the number of parameters (a rough parameter count is sketched below).
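
As a rough back-of-envelope check on the compression (my own sketch, not a figure from the paper; it assumes an LDR layer stores two learnable length-(n-1) subdiagonals for the operators plus rank-r residual factors G, H of size n × r):

```python
# Back-of-envelope parameter counts: dense layer vs. an LDR layer that stores
# two learnable subdiagonals plus rank-r residual factors (assumed parametrization).
n, r = 784, 1                          # e.g. a 784x784 hidden layer, displacement rank 1
dense_params = n * n                   # unstructured fully-connected layer
ldr_params = 2 * (n - 1) + 2 * n * r   # subdiag(A), subdiag(B), G, H
print(dense_params, ldr_params)        # 614656 vs. 3134 parameters at this rank
```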

Displacement Operators and LDR Framework

  • Represent a matrix M through a pair of displacement operators (A, B)

  • with a residual R so that A M - M B = R (the Sylvester displacement equation)

  • M can then be manipulated entirely through the compressed representation (A, B, R)

  • Assuming A and B have disjoint eigenvalues, M can be recovered from (A, B, R)

  • The rank of R is called the displacement rank of M w.r.t. (A, B)

  • Toeplitz-like matrices have low displacement rank w.r.t. the shift or cyclic shift operators.

  • Standard formulation for A and B:

          A = Z_1 and B = Z_{-1}, where Z_f is the unit f-circulant shift matrix (ones on the subdiagonal and f in the top-right corner)

  • Displacement equation for a Toeplitz-like matrix: Z_1 M - M Z_{-1} = R.

  • The shift-invariant structure leads to a residual of rank at most 2 (a small numerical check is sketched at the end of this section).

  • Some examples of matrices that satisfy the displacement property: Hankel, Toeplitz, Cauchy, etc.

  • Certain combinations of LDR matrices preserve low displacement rank
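
A minimal NumPy sketch (my own check, not the paper's code) verifying the Toeplitz case above: the Sylvester displacement of a Toeplitz matrix w.r.t. the shift operators Z_1 and Z_{-1} has rank at most 2.

```python
import numpy as np
from scipy.linalg import toeplitz

def Z(n, f):
    """Unit f-circulant shift operator: ones on the subdiagonal, f in the top-right corner."""
    op = np.diag(np.ones(n - 1), -1)
    op[0, -1] = f
    return op

n = 8
M = toeplitz(np.random.randn(n), np.random.randn(n))  # a random Toeplitz matrix
R = Z(n, 1) @ M - M @ Z(n, -1)                        # Sylvester displacement Z_1 M - M Z_{-1}
print(np.linalg.matrix_rank(R))                       # prints 2: R is nonzero only in one row and one column
```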

Learning Displacement Operators

  • Two new classes of displacement operators: sub-diagonal (LDR-SD) and tri-diagonal (LDR-TD)

  • These matrices can model rich structure and subsume many types of linear transformations used in machine learning

  • Given M in this compressed form, we need to compute the matrix-vector product M x.

  • Several schemes for explicitly reconstructing M from its displacement parameters are known for specific cases but do not always apply to general operators.

  • Instead, the displacement parameters are used to implicitly construct a slightly different matrix with at most double the displacement rank, which is simpler to work with.

  • The parameters (A, B, G, H) are stored, where A and B are either sub-diagonal or tri-diagonal operators and G, H are n × r factors of the low-rank residual (R = G H^T)

  • Near-linear time algorithms for LDR-TD and LDR-SD matrices were recently shown to exist.

  • Complete algorithms were not provided, however, as those results relied on theoretical tools such as the transposition principle.

  • The paper provides a near-linear time algorithm for LDR-SD matrices.

  • For LDR-TD, M is explicitly constructed from its parameters and then standard matrix-vector multiplication is used (a simplified sketch follows this list).
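
A simplified sketch of this explicit-reconstruction route (my own illustration, not the paper's implementation): plain tri-diagonal operators and dense recovery of M via SciPy's Sylvester solver, which requires A and B to have disjoint eigenvalues (this holds generically for random operators).

```python
import numpy as np
from scipy.linalg import solve_sylvester

def tridiagonal(diag, lower, upper):
    """Build a tri-diagonal operator from its three (learnable) bands."""
    return np.diag(diag) + np.diag(lower, -1) + np.diag(upper, 1)

rng = np.random.default_rng(0)
n, r = 16, 2                                     # layer size and displacement rank
A = tridiagonal(rng.standard_normal(n), rng.standard_normal(n - 1), rng.standard_normal(n - 1))
B = tridiagonal(rng.standard_normal(n), rng.standard_normal(n - 1), rng.standard_normal(n - 1))
G, H = rng.standard_normal((n, r)), rng.standard_normal((n, r))  # low-rank residual factors

# Recover M from the displacement equation A M - M B = G H^T.
# scipy's solve_sylvester solves A X + X B = Q, so pass -B as the second argument.
M = solve_sylvester(A, -B, G @ H.T)

x = rng.standard_normal(n)
y = M @ x                                        # standard dense matrix-vector product
print(np.linalg.matrix_rank(A @ M - M @ B))      # displacement rank of M: prints r (= 2)
```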

Theoretical Properties of Structured Matrices

  • The matrices used are unusual in that the parameters interact multiplicatively (namely in the displacement equation A M - M B = G H^T) to implicitly define the actual layer.

  • It can be proved that this does not change the learnability of these classes (the paper gives VC-dimension and generalization bounds comparable to those for unconstrained weights).

  • Displacement rank and equivariance: related to building equivariant feature representations that transform in predictable ways when the input is transformed. An equivariant feature map φ satisfies:

                  φ(B x) = A φ(x)

  • Perturbing the input by a transformation B before passing it through the map is equivalent to first computing the features φ(x) and then transforming them by A.

  • LDR matrices are therefore a suitable choice for modeling approximately equivariant linear maps, as the short derivation below suggests.
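
A short derivation (my own paraphrase, assuming a linear feature map φ(x) = Mx and the operators A, B from the displacement equation) of why low displacement rank corresponds to approximate equivariance:

```latex
\begin{align*}
\phi(Bx) &= A\,\phi(x) \quad \forall x && \text{exact equivariance, with } \phi(x) = Mx \\
MBx &= AMx \quad \forall x             && \text{substitute the linear map} \\
AM - MB &= 0                           && \text{i.e.\ displacement rank } 0 \\
\operatorname{rank}(AM - MB) &\le r    && \text{relaxation: approximately equivariant for small } r
\end{align*}
```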

Results

  • The proposed method is tested on a single hidden layer neural network and the fully-connected layer of a CNN.

  • Datasets used: Two variants of MNIST (bg-rot, noise), CIFAR-10 (grayscale), and NORB.

  • The table shown below compares the proposed approach to several baselines that use dense structured matrices to compress the hidden layer of a single hidden layer neural network.

  • The paper also shows results on a language modeling task, obtained by replacing the weight matrices in a single-layer LSTM

  • Results are also reported for replacing a five-layer CNN (convolutional channels, max pooling, and fully-connected layers) with just two generic LDR matrices.

References:

  1. Thomas, A., Gu, A., Dao, T., Rudra, A., and Ré, C. Learning compressed transforms with low displacement rank. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 9066–9078. Curran Associates, Inc., 2018