NeurIPS 2019 Paper Review
Andrew Simmons – Quantitative Researcher
Giacomo De Palma, Bobak Toussi Kiani, Seth Lloyd
Deep neural networks often have more parameters than datapoints in their training set, so one might naively expect them to be very overfit, and have poor generalisation properties. They are fully capable of learning randomly-labelled data by memorising the dataset, so why should we expect them to perform at all well when given a real dataset? In practice we often find that they generalise “unreasonably well”. One proposed explanation for this is that they are biased towards parametrising simple functions, and this paper identifies a sense in which this is formally true.
Vaishnavh Nagarajan, J. Zico Kolter
Many recent attempts to quantify how overparametrised neural networks are surprisingly good at generalisation has been related to uniform convergence of the empirical risk to the population risk. This paper, which won the “Outstanding new directions award” demonstrates a setting in which previous bounds based on uniform convergence yield trivial generalisation bounds, showing that uniform convergence is not a contributing factor to generalisation ability. They also demonstrate that some of these bounds in fact get worse as the training set size increases.
Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, Xifeng Yan
The transformer architecture has produced state-of-the-art results in many tasks, primarily in sequence-to-sequence tasks such as machine translation. In this paper, the architecture is applied to the task of time series forecasting. In this paper Li et al demonstrate a novel approach that has a lower memory requirement than the original architecture and is more robust in the presence of local anomalies.
James Colless – Senior Quantitative Researcher
Aaron R. Voelker, Ivana Kajic, Chris Eliasmith
The authors propose a novel memory cell that makes efficient use of resources to store information across long time windows. In particular, they project their input history window onto a linear combination of scale-invariant Legendre polynomials, orthogonalizing it’s continuous-time history. One nice aspect of this ‘first-principles’ derivation of an RNN cell is that it allows the authors to provide theoretical guarantees for learning long-range dependencies. They demonstrate the cell outperforming LSTMs in memory capacity (unsurprising given how the cell was derived) but also on somewhat more relevant chaotic time-series prediction tasks. One potential drawback is that by construction this cell will have constant frequency resolution across a given window, which may not be optimal for very long window lengths, however this can be easily fixed using either stacked layers of LMUs with different window sizes or using a different set of basis functions with the desired frequency behaviour.
Stéphane d’Ascoli, Levent Sagun, Joan Bruna, Giulio Biroli
The authors attempt to explain why convolutional neural networks experimentally outperform fully connected networks on spatially structured data, through the lens of the loss landscape. They introduce a method that maps a CNN to its equivalent FCN which will have very sparse (due to the CNNs locality) and redundant (due to the CNNs weight sharing) set of weights. They then implement a new training regime which starts by training a CNN, embedding it into the equivalent FCN at some point (i.e. relaxing the priors about the best underlying architecture structure), and then resulting training in the FCN space. Interestingly when choosing the correct ‘relax-point’ they find some regimes where the eFCN outperforms both the CNN and the FCN trained from scratch, demonstrating that there can be a benefit in combining the prior information of the CNN and the expressivity of the FCN in a complementary way. Out obvious caveat to this approach from a practical standpoint is the massive memory overhead of the eFCN which can be several order of magnitude larger than the underlying CNN.
Adam Gaier, David Ha
In this work, the authors question to what extent neural network architectures alone, without learning any weight parameters, can encode solutions for a given task. They propose a search method for neural network architectures that can perform a task without any explicit weight training. To evaluate such networks they populate the connections with a single shared weight parameters (sampled from a uniform random distribution), and measure the networks performance. They show that in a supervised learning domain, neural network architectures can be found that achieve much higher than chance accuracy on MNIST using random weights.
Andreea Anghel, Nikolas Ioannou, Thomas Parnell, Nikolaos Papandreou, Celestine Mendler-Dünner, Haris Pozidis
The authors improve the run-time of training random forests on CPU architectures using a novel hybrid BFS-DFS algorithm as well as a grab-bag of other optimizations to reduce systems level bottlenecks at train time. They demonstrate an average speed-up of ~8 times when compared to H20 or xgboost when averaged over different datasets, forest parameters and hardware systems.
Michael Kratzer – Senior Quantitative Researcher
Eran Malach, Shai Shalev-Shwartz
Recent work has shown that gradient descent-based training can reach only a small subset of the functions expressible by deep neural networks and that function complexity gradually increases during training.
This paper provides an additional insight into the training process of deep neural networks: it shows (for a class of fractal functions) that complex but expressible functions can only be learned if there is a “path” of good approximations by simpler functions.
Florian Schäfer, Anima Anandkumar
Gradient descent is well-known not to convergence for minmax problems in general and various modifications have been proposed.
The authors present a second-order method with an elegant interpretation: they solve a regularized bilinear local approximation of the underlying game.
Compared to Newton’s method, this approach has several advantages including lower regularity requirements.
An implementation using automatic differentiation and iterative equation solvers is competitive with first-order methods despite the higher cost per step.
Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, Percy Liang
This paper considers the task of generating a working computer program from a high-level pseudocode description and a few test cases. As it is quite difficult to even produce a valid program, the success rate of standard approaches is quite low. The authors achieve a significant improvement with a search strategy that makes use of the compiler output from failed attempts.
Paul Crewe – Quant Research Manager
Harm Van Seijen, Mehdi Fatemi, Arash Tavakoli
The hyperparameter gamma, the discount factor, affects the optimisation process in reinforcement learning in several ways. This paper sets out a series of small experiments to study some of these effects in isolation and proposes the hypothesis that poor performance of (too) small discount factors is driven by the large range of the action gap (the difference in value estimates between best and next-best actions in a state) across the state space in these cases. This motivates optimising in log-value space and the paper shows how to do this with some standard convergence theorems
Koen Helwegen, James Widdicombe, Lukas Geiger, Zechun Liu, Kwang-Ting Cheng, Roeland Nusselder
Binary neural nets (where all the weights and activations are +/-1) have the potential to vastly reduce the cost of inference on the right hardware. Previous approaches achieved good accuracy by viewing the weights in BNNs as approximations to real-valued latent weights. This paper proposes an alternative interpretation for why these approaches worked: that the latent weights play a role analogous to the momentum term in optimisers like Adam. With this in mind they develop a new optimiser for BNNs that learns the binary weights ‘natively’ and achieves competitive performance with fewer hyperparameters.
Youssef Mroueh, Tom Sercu, Mattia Rigotti, Inkit Padhi, Cicero Nogueira dos Santos
Feature selection is a constant thorn in the side of data scientists. This paper suggests the tantalising prospect that a neural net may be able to learn to select features directly in an unsupervised way. The idea is to train a network with inputs (x, y) to output large values if x and y were sampled from their joint distribution and small values if they were sampled independently. A sparsity-inducing gradient penalty acts like a non-linear equivalent of the LASSO penalty for linear models, encouraging the model to use only a subset of the features to guess the distribution. An extra variational parameter is introduced to smooth the penalty term and this turns out to naturally correspond to feature importance.
Hugh Salimbeni – Quantitative Researcher
Sanjeev Arora, Simon Du, Wei Hu · Zhiyuan Li, Russ Salakhutdinov, Ruosong Wang
This paper explored the connection between neural networks and Gaussian processes. There were actually 4 papers on this topic at the main conference (and at least three more at workshops) so it’s clear the connection has sparked a great deal of interest. The key idea is that a trained wide network is equivalent to a Gaussian process (as the width goes to infinity and the learning rate goes to zero). A similar result for untrained networks has been known for while, but the result for a trained network has been only recently explored. This paper provides insights for both Gaussian process and the deep learning communities.
Haibin YU, Yizhou Chen, Bryan Kian Hsiang Low, Patrick Jaillet, Zhongxiang Dai
This paper developed the model that occupied much of my PhD: the Deep Gaussian Processes. The new innovation is an implicit posterior: that is, a distribution that can provide samples but not a closed-form density. The implicit posterior is very flexible – much more so than the simple Gaussian ones that I developed in my work – but presents a significant difficulty that the objective function is not tractable. To get around this difficulty the authors use an adversarial approach: they use introduce an additional model that is trained to estimate the intractable term. The results they present demonstrate that the method works well in practice.
Wesley J Maddox, Pavel Izmailov, Timur Garipov, Dmitry Vetrov, Andrew Gordon Wilson
The Stochastic Weight Averaging (SWA) method approach is a form of ensembling where the averaging is done in weight space rather than in the prediction space, using iterates from the gradient descent process. There was a paper at the conference discussing why this might be an especially good idea (see paper below). This paper extends this idea by fitting a (low rank) Gaussian to the iterates rather than just taking the mean. The approach appears to work well and is very cheap to compute, so might be very effective in practice.
Haowei He, Gao Huang, Yang Yuan
This paper is the latest in a long line of works that try to charactierze which local minima might be best for generalization. Previous work has investigated ‘sharp’ and ‘flat’ minima with the intuition idea that sharp ones generalize less well as the loss changes dramatically for small changes in the parameters, whereas flat ones are better as the loss only changes slightly for perturbations of the parameters. This paper makes the observation that the real situation is more subtle: there are minima that are sharp on one side and flat on the other (of course, in high dimensions the notation of ‘side’ can be rather counterintuitive). The demonstration of asymmetric minima leads to the conclusion that biasing the learning procedure to favour the flat side should work better. Stochastic Weight Averaging (SWA: see paper above for more details) is just such a method, as on average the SGD iterates spend longer on the flat side than the steep side.
Alessandro Ticchi – Quant Research Manager
Sindy Löwe, Peter O’Connor, Bastiaan Veeling
This is hands down my favourite 2019 NeuRIPS paper. The authors propose a local self-supervised representation learning approach to deep learning that doesn’t require backProp, and yet achieves results comparable to other state-of-the-art supervised methods. Yes, that is correct, deep learning without backProp. According to their approach, every layer of the network is trained completely independently of any other layer using some contrastive loss similar to the one previously proposed by Aaron van den Oord in “Representation Learning with Contrastive Predictive Coding” (which by itself is a very inspiring work). The contribution of this paper is to hierarchically train each layer to represent information contained in the previous layer in a more abstract way, i.e. a way that is maximally useful to predict future samples (assuming that the abstract structure of the world is slow changing and self-predictive, while noisy details change fast in stochastic ways). Using no lables, no backprop, no end-to-end training the authors are able to produce very good results for all sort of domains: speech, images, text and reinforcement learning in 3d environments. This approach solves the problem of vanishing/exploding gradients for very deep networks, and it allows completely parallel and distributed training of different layers in very large architectures. For all those reasons, I think this work could be a turning point for DL research, possibly comparable to what “attention” has been in the last few years. But what excites me the most is the Neuroscience implications of this work. In fact, we know that brain is by structure incapable of implementing end-to-end backprop, yet it is quite good at representation learning. Being able to break free from backprop for the first time might throw some light how the brain actually works.
Dimitris Kalimeris, Gal Kaplun, Preetum Nakkiran, Benjamin Edelman, Tristan Yang, Boaz Barak, Haofeng Zhang
The paper gives some insight into something we all experienced at some point: it doesn’t matter how over parametrised our network is, if we stop early enough we can still get some pretty decent results. This is quite intuitive, for example an arbitrarily deep network with sigmoid activation will effectively be linear in the beginning of training as long as weights are small enough. With this work, the authors systematically explore the capacity of arbitrary networks through training, providing us a better understanding of this effect, and exploring how it affects generalization and overfitting.
Miika Aittala, Prafull Sharma, Lukas Murmann, Adam Yedidia, Gregory Wornell, Bill Freeman, Fredo Durand
This is one of those fun crazy ideas that you would never believe it could work, until you’re told it actually does. The authors played some video on a screen and pointed a camera towards one of the walls, in a randomly arranged room containing all sort of objects. Then, they asked a neural network to reconstruct what is playing in the video by only observing indirect illumination, and the network actually does it quite successfully. It should be noted that the network has never been trained before on the given scene, everything is happening in a completely unsupervised fashion. The key idea behind this work is closely related and inspired by recent research on Deep Image Prior: The observed video is factored into a matrix product between the unknown hidden scene video and an unknown light transport matrix. The two given matrices are both initialized convolutional neural networks trained in a one-off manner, and this results in decompositions that reflect the true motion in the hidden scene, allowing to reconstruct the general patterns in the video played in the screen (don’t expect high resolution results, blobs of light moving around is all you’ll see, but that’s still impressive in my opinion). At the poster session the authors suggested the idea of learning this prior empirically, using supervised learning in the presence of randomly generated ambients, and then using the learned prior for new unseen environments. Looking forward to see what they’re going to present us next year.
Pawel Zaczkowski – Quant Research Manager
David Janz, Jiri Hron, Przemysław Mazur, Katja Hofmann, José Miguel Hernández-Lobato, Sebastian Tschiatschek
One of the big open problems in reinforcement learning is how to balance effective exploration and exploitation. A very principled approach to the problem is posterior sampling reinforcement learning, where one (1) puts a Bayesian prior over a distribution of rewards and transition dynamics and (2) uses posterior sampling from the posterior transition distribution in order to compute the optimal policy. The method is very general and performs very well on relatively small tabular problems. However, due to its generality, it is very difficult to scale it up to larger problems.
The recent approaches to scale PSRL mostly included randomised value functions (RVF), where one puts a Bayesian prior over the Q-functions of the problem. This is computationally more tractable, however it suffers some theoretical drawbacks – in general, one loses control over the propagation of uncertainty between the distribution of true and approximate Q-functions, which can lead to very suboptimal behaviour.
The authors propose an approach which is within the RVF frameworks, however they put a Gaussian distribution over an embedding of the Q-function, calling it a method of successor uncertainties (SU). This allows for having explicit control over the error propagation, which then turns out to perform well on deep sparse-rewards problems.
They present state-of the art results on a tree MDP as well as Deep Sea MDP, together with very good performances on Atari 2600 games.
Matthew Fellows, Anuj Mahajan, Tim Rudner, Shimon Whiteson
The paper considers the actor-critic framework for tackling reinforcement learning problems. Future rewards are modelled as a parametrised likelihood function over the rewards. Rather than treating the parameters as model parameters, they are treated in a variational way, converting the problem into a variational inference problem. This approach has been taken in previous Sergei Levine’s work. While in itself a great step forwards, it had drawbacks such as (1) risk-seeking behaviour (2) generally finding it hard to identify deterministic optimal policies.
Variational Inference Reinforcement Learning (VIREL) tries to overcome some of the problems by implementing the KL-divergence optimization step in the Critic part of the algorithm by swapping the two distributions around. This leads to an algorithm which reduces the actor-critic framework to an EM-algorithm, with policy improvement equivalent to the E-step, and policy evaluation equivalent to the M-step.
The results are quite promising; the method outperforms the state-of-the art method on a number of tasks in the OpenAI gym.
Maximilian Igl, Kamil Ciosek, Yingzhen Li, Sebastian Tschiastschek, Cheng Zhang, Katja Hofmann
One of the goals in the future of RL is the ability to generalize to new environments. It is tempting to attempt porting similar techniques from the field of supervised learning such as Drop Out and Batch Normalization. However, this approach is tricky since, in RL, training data depends on the model, meaning stochastic regularization can have adverse effects. Furthermore, in general, data-distribution can be non-stationary in general RL problems. The authors therefore present a method of selectively injecting the noise (SNI) into the problem. This is further improved by applying Information Bottleneck Actor Critic (a technique which overcomes the low signal-to-noise ratio in the beginning of training).
The method works well in practice. It finds more general features in low-data regime and this translates to improved generalization. Particular experiments are performed on the Multiroom task, as well as the relatively new Coinrun task.
Simon Lyddon – Quantitative Researcher
Sunil Thulasidasan, Gopinath Chennupati, Jeff Bilmes, Tanmoy Bhattacharya, Sarah Michalak
Mixup training is a simple data augmentation strategy that involves generating pseudo-data from random convex combinations of observations. It’s an example of vicinal risk minimization, where a model is trained in the vicinity of the training data, as opposed to just on the training data itself. In previous work Mixup was shown to improve generalisation for a number of deep learning benchmarks, reducing network memorisation and increasing robustness.
In this paper the authors demonstrate a further benefit of Mixup, in that it leads to models that are less overconfident: the probabilistic outputs of classification models trained in this way are better calibrated to reality. Focussing on image classification, the authors show that this property holds for out of distribution and random noise data, and that much of this benefit comes from the ‘softening’ of classification labels. When predictive uncertainty is important, Mixup would seem to provide a cheap yet effective means of obtaining a robust yet performant model.
Kazuki Osawa, Siddharth Swaroop, Anirudh Jain, Runa Eschenhagen, Richard E. Turner, Rio Yokota, Mohammad Emtiyaz Khan
There was lots of interesting work presented at NeurIPS involving Bayesian ideas applied to deep learning. Such approaches can have the potential to equip state of the art prediction models with desirable Bayesian properties such as a coherent quantification of uncertainty. This might help lead to their adoption in higher risk applications where a black box is not a viable option. However, full Bayesian computation for such models is a hard problem.
In this paper the authors show how a recently developed Bayesian variational method can be used to perform practical deep learning that is scalable to large models and data, while incorporating techniques such as batch normalisation and momentum. Empirically the benefits of such an approach are shown to be similarly accurate predictions to the original model, but with improved uncertainty quantification and continual learning properties.
Co-author Khan presented an insightful tutorial on ‘Deep Learning with Bayesian Principles’ which makes many further connections between deep and Bayesian learning; it’s definitely worth a watch: https://slideslive.com/38921489/deep-learning-with-bayesian-principles
David Eriksson, Michael Pearce, Jacob Gardner, Ryan Tuner, Matthias Poloczek
Bayesian optimisation (BO) can be a useful tool for optimising black box functions in a principled manner, however it generally performs less well in large, high dimensional spaces where the number of trials is also large. The authors of this paper argue that Bayesian optimisation struggles to exploit good sub-regions enough, instead focussing on exploring a highly uncertain space. In this paper a global optimisation method called trust region Bayesian optimisation, or TuRBO for short, is developed that improves on global Bayesian optimisation’s shortcomings. The idea is to run a number of local Bayesian optimisations on smaller, adaptive ‘trust regions’ of the space, and assign trials to each region using a bandit framework. This seems like a pragmatic approach to this problem that the authors show outperforms across a number of benchmarks.