Grokking Machine Learning

Directly training this model, $\text{ReLU}\left(\mathbf{a}_{\text{one-hot}} \mathbf{W}_{\text{input}} + \mathbf{b}_{\text{one-hot}} \mathbf{W}_{\text{input}}\right) \mathbf{W}_{\text{output}}$, does not actually result in generalization on modular arithmetic, even with the addition of weight decay. At least one of the matrices has to be factored.
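
To make the setup concrete, the sketch below trains the unfactored model on modular addition with weight decay applied directly to the two matrices. It is our illustration rather than the original experiment: the modulus, hidden width, initialization scale, optimizer settings, and the use of PyTorch are all assumptions.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    p, hidden = 67, 128  # assumed modulus and hidden width

    # The two matrices of ReLU(a_onehot @ W_input + b_onehot @ W_input) @ W_output.
    W_input = torch.nn.Parameter(0.02 * torch.randn(p, hidden))
    W_output = torch.nn.Parameter(0.02 * torch.randn(hidden, p))

    def forward(a, b):
        a_onehot = F.one_hot(a, p).float()
        b_onehot = F.one_hot(b, p).float()
        return torch.relu(a_onehot @ W_input + b_onehot @ W_input) @ W_output

    # All (a, b) pairs with labels (a + b) mod p, split into train and test halves.
    a, b = torch.meshgrid(torch.arange(p), torch.arange(p), indexing="ij")
    a, b = a.reshape(-1), b.reshape(-1)
    y = (a + b) % p
    perm = torch.randperm(p * p)
    train, test = perm[: p * p // 2], perm[p * p // 2:]

    @torch.no_grad()
    def accuracy(idx):
        return (forward(a[idx], b[idx]).argmax(-1) == y[idx]).float().mean().item()

    # Weight decay acts directly on W_input and W_output via AdamW.
    opt = torch.optim.AdamW([W_input, W_output], lr=1e-3, weight_decay=1.0)
    for step in range(10_000):
        opt.zero_grad()
        loss = F.cross_entropy(forward(a[train], b[train]), y[train])
        loss.backward()
        opt.step()
        if step % 1_000 == 0:
            print(step, accuracy(train), accuracy(test))

    # One plausible reading of "factored" (our assumption, not the article's exact
    # parameterization): replace W_input with a product of two trainable matrices,
    # e.g. W_input = W_embed @ W_proj, so that weight decay acts on the factors.

Logging both splits makes the claim checkable: if the model only memorizes, training accuracy saturates while test accuracy stays near chance.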

The periodic patterns suggest the model is learning some sort of mathematical structure; the fact that it happens when the model starts to solve the test examples hints that it's related to the model generalizing. But why does the model move away from the memorizing solution? And what is the generalizing solution?

Generalizing With 1s and 0s

The model's architecture is similarly simple: $\text{ReLU}\left(\mathbf{a}_{\text{one-hot}} \mathbf{W}_{\text{input}} + \mathbf{b}_{\text{one-hot}} \mathbf{W}_{\text{input}}\right) \mathbf{W}_{\text{output}}$.
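
One way to look for those periodic patterns is to examine each hidden neuron's column of $\mathbf{W}_{\text{input}}$ in the frequency domain. The check below is our own, reusing W_input from the sketch above; the 0.9 threshold for calling a neuron "periodic" is an arbitrary choice.

    @torch.no_grad()
    def dominant_frequency_share(W):
        # Treat column j of W (shape p x hidden) as a function of the input token
        # 0..p-1 and take its real FFT along that axis.
        energy = torch.fft.rfft(W, dim=0).abs().pow(2)
        nonzero = energy[1:]  # drop the constant (zero-frequency) term
        # Fraction of each neuron's energy held by its single strongest frequency;
        # values near 1 mean the column is close to a pure sinusoid.
        return (nonzero.max(dim=0).values / (nonzero.sum(dim=0) + 1e-12)).tolist()

    shares = dominant_frequency_share(W_input)
    print(f"{sum(s > 0.9 for s in shares)} of {len(shares)} hidden neurons look periodic")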

There have also been promising results in predicting grokking before it happens. Though some approaches require knowledge of the generalizing solution or of the overall data domain, others rely solely on analysis of the training loss and might also apply to larger models. Hopefully we'll be able to build tools and techniques that can tell us when a model is parroting memorized information and when it's using richer models.

Does grokking happen in larger models trained on real-world tasks?

Earlier observations reported the grokking phenomenon on algorithmic tasks in small transformers and MLPs. Grokking has subsequently been found in more complex tasks involving images, text, and tabular data, within certain ranges of hyperparameters. It's also possible that the largest models, which are able to do many types of tasks, may be grokking many things at different speeds during training.

Do more complex models also suddenly generalize after they're trained longer? Large language models can certainly seem like they have a rich understanding of the world, but they might just be regurgitating memorized bits of the enormous amount of text they've been trained on. How can we tell if they're generalizing or memorizing?

We observed that the generalizing solution is sparse after taking the discrete Fourier transform, but the collapsed matrices have high norms. This suggests that direct weight decay on $\mathbf{W}_{\text{output}}$ and $\mathbf{W}_{\text{input}}$ doesn't provide the right inductive bias for the task.
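
One way to see why, under the assumption that weight decay penalizes the squared Frobenius norm of each matrix: the discrete Fourier transform is unitary up to scaling, so that norm is the same whether a matrix's energy is spread over many frequencies or concentrated in a few, and the penalty cannot prefer the sparse, periodic solution. The sketch below, again ours and reusing W_input and W_output from earlier, reports the two quantities being contrasted.

    @torch.no_grad()
    def sparsity_and_norm(W, name):
        # Fraction of Fourier energy in the two strongest token frequencies,
        # alongside the Frobenius norm that weight decay actually penalizes.
        energy = torch.fft.rfft(W, dim=0, norm="ortho").abs().pow(2).sum(dim=1)
        top2 = energy.topk(2).values.sum() / energy.sum()
        print(f"{name}: top-2 frequency share {top2.item():.2f}, "
              f"Frobenius norm {W.norm().item():.2f}")

    sparsity_and_norm(W_input, "W_input")
    sparsity_and_norm(W_output.T, "W_output")  # transpose so dim 0 indexes output tokens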


