An eigenspace view reveals how predictor networks and stopgrads provide implicit variance regularization

Published in the NeurIPS 2022 Workshop on Self-Supervised Learning

Self-supervised learning (SSL) learns useful representations from unlabelled data by training networks to be invariant across pairs of augmented versions of the same input. Non-contrastive methods avoid collapse either by directly regularizing the covariance matrix of the network outputs or through asymmetric loss architectures, two seemingly unrelated approaches. Here, building on DirectPred, we lay out a theoretical framework that reconciles these two views. We derive analytical expressions for the representational learning dynamics in linear networks. By expressing these dynamics in the eigenspace of the embedding covariance matrix, where the solutions decouple, we reveal the mechanism and the conditions that provide implicit variance regularization. These insights allow us to formulate a new isotropic loss function that equalizes the contribution of each eigenmode and renders learning more robust. Finally, we show empirically that our findings translate to nonlinear networks trained on CIFAR-10 and STL-10.
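As a concrete illustration of the asymmetric architecture analyzed in the paper, here is a minimal sketch of a non-contrastive update with a linear encoder, a linear predictor, and a stop-gradient on the target branch. The MSE objective, plain SGD, batch size, and Gaussian-noise "augmentations" are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch (not the paper's code) of a BYOL/SimSiam-style update with a
# linear encoder W, a linear predictor W_P, and a stop-gradient on the target.
import torch

N, M = 15, 10                                      # input and embedding dimensions
W = torch.nn.Parameter(0.1 * torch.randn(M, N))    # linear encoder
W_P = torch.nn.Parameter(torch.eye(M) + 0.01 * torch.randn(M, M))  # linear predictor
opt = torch.optim.SGD([W, W_P], lr=1e-2)

for step in range(1000):
    x = torch.randn(64, N)                  # stand-in for a data batch
    x1 = x + 0.1 * torch.randn_like(x)      # two "augmented" views of the same input
    x2 = x + 0.1 * torch.randn_like(x)      # (additive noise as a hypothetical augmentation)
    z1, z2 = x1 @ W.T, x2 @ W.T             # embeddings of both views
    p1 = z1 @ W_P.T                         # predictor applied to the online branch
    loss = ((p1 - z2.detach()) ** 2).mean() # stop-gradient (detach) on the target branch
    opt.zero_grad()
    loss.backward()
    opt.step()
```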

Figure: Top and middle rows show the neural updates in different settings, in dimension $M=2$ for visualization. The bottom row shows the evolution of the eigenvalues of $W_\mathrm{P}$ during training in the settings corresponding to the top row, but in dimensions $N=15$ and $M=10$. a) Omitting the stop-grad leads to representational collapse. b) Applying the stop-grad on the wrong side also leads to collapse, with potentially diverging eigenmodes. c) Optimizing the BYOL/SimSiam loss leads to isotropic representations. d) Optimizing the isotropic loss has the same effect, but acts uniformly on all eigenvalues.
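To reproduce diagnostics like the bottom row of the figure, one can track the eigenvalues of $W_\mathrm{P}$ and the eigenbasis of the embedding covariance during training; in the linear analysis, the learning dynamics decouple along these eigendirections. The helper functions below are a hypothetical monitoring sketch, not code from the paper.

```python
# Illustrative monitoring utilities (an assumption, not taken from the paper).
import torch

def predictor_eigenvalues(W_P: torch.Tensor) -> torch.Tensor:
    """Eigenvalues of the (square) predictor weight matrix W_P."""
    return torch.linalg.eigvals(W_P).real

def covariance_eigenbasis(z: torch.Tensor):
    """Eigendecomposition of the embedding covariance C = Cov(z)."""
    z_centered = z - z.mean(dim=0, keepdim=True)
    C = z_centered.T @ z_centered / (z.shape[0] - 1)
    evals, evecs = torch.linalg.eigh(C)   # C is symmetric, so eigh applies
    return evals, evecs

# Example usage during training (z1, W_P as in the sketch above):
#   evals, U = covariance_eigenbasis(z1.detach())
#   W_P_rotated = U.T @ W_P.detach() @ U   # predictor expressed in the covariance eigenbasis
```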

Download paper here