HIGHLIGHTS
Explaining Neural Scaling Laws and A Neural Scaling Law from the Dimension of the Data Manifold (Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma) (summarized by Rohin): We’ve seen lots of empirical work on scaling laws (AN #87), but can we understand theoretically why these arise? This paper suggests two different models for how power-law scaling laws could arise, variance-limited and resolution-limited scaling, and argues that neural nets are typically trained in the resolution-limited setting. In both cases, we have versions that occur when the dataset size D is large and the number of parameters P is low (parameter-limited), and when D is low and P is large (data-limited).

Recall that a scaling law is a power-law equation that predicts the test loss L as a function of P and D. In this paper, we consider cases where only one of the resources is the bottleneck, so that our power laws are of the form L = kP^(-α) or L = kD^(-α), for constants k and α. (For simplicity, we’re assuming that the minimum value of our loss function is zero.)

Resolution-limited scaling happens when either the dataset is too small to “resolve” (capture) the true underlying function, or when the model doesn’t have enough capacity to “resolve” (fit) the training dataset. In this case, we’re going to take the common ML assumption that while our observation space might be high-dimensional, the data itself comes from a low-dimensional manifold with dimension d, called the intrinsic dimension. We’ll model our neural net as transforming the input space into a roughly d-dimensional representation of the manifold, which is then used in further processing by later layers. Thus the output of the network is a simple function over this low-dimensional representation.

Let’s first consider the case where P is sufficiently large, so that we perfectly fit the training data, but D is limited. We can think of the training data as a “net” of points covering the true d-dimensional manifold.
Intuitively, to halve the distance between the points (making the net “twice as fine”), we need ~2^d times as many points. Some simple algebraic manipulation tells us that the distance between points would then scale as D^(-1/d): if the spacing between neighboring points is s, then D is proportional to s^(-d), so s is proportional to D^(-1/d).

How can we translate this to the test loss? Let’s assume a simple nearest neighbor classifier where, given a test data point, we simply predict the value associated with the nearest training data point. This is equivalent to assuming that our neural net learns a piecewise constant function. In this case, for a test data point drawn from the same distribution as the training set, that data point will be “near” some training data point and our model will predict the same output as for the training data point.

Under the assumption that our test loss is sufficiently “nice”, we can do a Taylor expansion of the test loss around this nearest training data point and take just the first nonzero term. Since we have perfectly fit the training data, at the training data point, the loss is zero; and since the loss is minimized, the gradient is also zero. Thus, the first nonzero term is the second-order term, which is proportional to the square of the distance. So, we expect that our scaling law will look like kD^(-2/d), that is, α = 2/d.

The above case assumes that our model learns a piecewise constant function. However, neural nets with ReLU activations learn piecewise linear functions. For this case, we can argue that since the neural network is interpolating linearly between the training points, the deviation between the true value and the predicted value should scale as D^(-2/d) instead of D^(-1/d), since the linear term is being approximated by the neural network. In this case, for loss functions like the L2 loss, which are quadratic in this deviation, we get that α = 4/d.

Note that it is possible that scaling could be even faster, e.g.
because the underlying manifold is simple or has some nice structure that the neural network can quickly capture. So in general, we might expect α >= 2/d and for L2 loss α >= 4/d.

What about the case when P is the bottleneck? Well, in this case, since the training data is not the bottleneck, it is presumably a sufficiently good approximation to the underlying function; and so we are just seeing whether the learned model can match the dataset. Once again, we make the assumption that the learned model gives a piecewise linear approximation, which by the same argument suggests a scaling law of X^(-α), with α >= 2/d (and α >= 4/d for the case of L2 loss), where X is the number of “parts” in the approximation. In the case of linear models, we should have X = P, but for neural networks I believe the authors suggest that we should instead have X = w, the width of the network. (One motivation is that in the infinite-width limit, neural networks behave like linear models.)

In variance-limited scaling for D, the scaling bottleneck is the randomness inherent in the sampling of the dataset from the underlying distribution. We can view the dataset as a random variable, implying that the gradient is also a random variable since it is a function of the training dataset. We can then consider the “error term” δG = G - G_inf, which is the difference between the finite-dataset gradients and the gradients for infinite data. We’ll make the assumption that you’re equally likely to be wrong in all directions: if there’s a dataset that makes you a bit more likely to predict A, then there’s also a corresponding equally likely dataset that makes you a bit less likely to predict A. In that case, in expectation δG is zero, since on average the errors all cancel out. Since D is assumed to be large, we can apply the law of large numbers to deduce that the variance of δG will scale as 1/D.

Let us then consider the test loss as a function of the gradients.
The test loss we actually get is L(G) = L(G_inf + δG). We can now Taylor expand this, which tells us that the quantity we care about, L(G) - L(G_inf), is of the form AδG + B(δG)^2, where A and B are constants that depend on derivatives of the test loss in the infinite dataset case. We had already concluded that E[δG] = 0, and E[(δG)^2] is just the variance and so scales as 1/D, which implies that α = 1.

Here’s a slightly less mathematical and more conceptual argument for the same thing (though note that this feels like a sketchier argument overall):

1. Variance of the gradient scales as 1/D by the law of large numbers.
2. Thus standard deviation scales as 1/√D.
3. Thus the deviation of the empirical estimate of the gradients scales as 1/√D.
4. Thus the deviation of the neural net parameters scales as 1/√D.
5. Thus the deviation of the output of the final layer scales as 1/√D.
6. Any linear dependence on this deviation would cancel out in expectation, since the deviation could either increase or decrease the test loss. However, quadratic dependences would add together. These would scale as (1/√D)^2, that is, 1/D.
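
As a rough sanity check of the 1/D claim, here is a minimal sketch that is not from the paper. It assumes a toy setting chosen purely for illustration: ordinary least squares with a small, fixed number of parameters, Gaussian inputs, and Gaussian label noise, so that the only thing separating the fitted model from the infinite-data solution is sampling noise. The names `excess_test_loss` and `fit_exponent` are hypothetical helpers, and the sizes, noise level, and trial counts are arbitrary choices.

```python
import numpy as np


def excess_test_loss(D, P=5, noise=1.0, trials=200, seed=0):
    """Average excess test loss of a least-squares fit on D noisy samples."""
    rng = np.random.default_rng(seed)
    theta_true = rng.normal(size=P)  # assumed ground-truth parameters
    losses = []
    for _ in range(trials):
        X = rng.normal(size=(D, P))
        y = X @ theta_true + noise * rng.normal(size=D)
        theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        # For standard Gaussian test inputs, the test MSE in excess of the
        # infinite-data solution equals ||theta_hat - theta_true||^2.
        losses.append(np.sum((theta_hat - theta_true) ** 2))
    return float(np.mean(losses))


def fit_exponent(xs, losses):
    """Fit L = k * x^(-alpha) by least squares in log-log space; return alpha."""
    slope, _ = np.polyfit(np.log(xs), np.log(losses), 1)
    return -slope


sizes = np.array([100, 400, 1600, 6400])
losses = np.array([excess_test_loss(D) for D in sizes])
alpha_variance = fit_exponent(sizes, losses)
print(f"fitted alpha = {alpha_variance:.2f}")  # should come out close to 1
```

The deviation of the fitted parameters plays the role of the 1/√D deviation in the argument above, and the excess loss is quadratic in it, so the fitted exponent should land near 1 in this toy setting.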
The authors also suggest that a similar argument can be applied to argue that for parameters, the loss scales as 1/w, where w is the width of the network. This is variance-limited scaling for P. This again relies on previous results showing that neural networks behave like linear models in the limit of infinite width.

The authors use this theory to make a bunch of predictions which they can then empirically test. I’ll only go through the most obvious test: independently measuring the scaling exponent α and the intrinsic dimension d, and checking whether α >= 4/d. In most cases, they find that it is quite close to equality. In the case of language modeling with GPT, they find that α is significantly larger than 4/d, which is still in accordance with the inequality (though α is still relatively small; language models just have a high intrinsic dimension). Variance-limited scaling is even easier to identify: we simply measure the scaling exponent α and check whether it is 1.
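
To make the "measure α, compare against d" test concrete, here is a small sketch on synthetic data; it is not the authors' experiment. It assumes a nearest-neighbor predictor on points drawn uniformly from a d-dimensional cube with a smooth toy target. Since a nearest-neighbor predictor is piecewise constant, the resolution-limited argument predicts α ≈ 2/d for squared error here, rather than the 4/d that applies to piecewise linear models. The names `toy_target`, `nn_test_loss`, and `fit_exponent` are hypothetical helpers, and the dataset sizes are arbitrary.

```python
import numpy as np


def toy_target(x):
    """Smooth toy target function on the cube (an assumption for illustration)."""
    return np.sin(x).sum(axis=1)


def nn_test_loss(D, d, n_test=1000, seed=0):
    """Test MSE of a 1-nearest-neighbor predictor trained on D points."""
    rng = np.random.default_rng(seed)
    x_train = rng.uniform(size=(D, d))
    x_test = rng.uniform(size=(n_test, d))
    # Squared distances via ||a||^2 + ||b||^2 - 2 a.b, shape (n_test, D).
    sq_dist = (
        (x_test ** 2).sum(axis=1)[:, None]
        + (x_train ** 2).sum(axis=1)[None, :]
        - 2.0 * x_test @ x_train.T
    )
    nearest = sq_dist.argmin(axis=1)
    pred = toy_target(x_train)[nearest]  # piecewise constant prediction
    return float(np.mean((pred - toy_target(x_test)) ** 2))


def fit_exponent(xs, losses):
    """Fit L = k * x^(-alpha) by least squares in log-log space; return alpha."""
    slope, _ = np.polyfit(np.log(xs), np.log(losses), 1)
    return -slope


sizes = np.array([64, 256, 1024, 4096])
alphas = {}
for d in (1, 2):
    losses = np.array([nn_test_loss(D, d) for D in sizes])
    alphas[d] = fit_exponent(sizes, losses)
    print(f"d = {d}: fitted alpha = {alphas[d]:.2f}, predicted 2/d = {2 / d:.2f}")
```

In higher dimensions the same check needs far more data before the asymptotic slope shows up, because boundary effects dominate when there are only a few points per side of the cube; that is one reason the sketch stops at d = 2.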
