How Well Can Transformers Emulate In-context Newton's Method?
CoRR (2024)
Abstract
Transformer-based models have demonstrated remarkable in-context learning
capabilities, prompting extensive research into their underlying mechanisms.
Recent studies have suggested that Transformers can implement first-order
optimization algorithms for in-context learning, and even second-order ones in
the case of linear regression. In this work, we study whether Transformers can
perform higher-order optimization methods beyond the case of linear
regression. We establish that linear attention Transformers with ReLU layers
can approximate second-order optimization algorithms for the task of logistic
regression, achieving ϵ error using a number of additional layers that is only
logarithmic in the error. As a by-product, we show that even linear
attention-only Transformers can implement a single step of Newton's iteration
for matrix inversion with merely two layers. These results suggest the ability
of the Transformer architecture to implement complex algorithms beyond
gradient descent.
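
For reference, the Newton iteration for matrix inversion mentioned in the abstract is the classical Newton-Schulz update X_{k+1} = X_k (2I - A X_k), whose residual contracts quadratically. The sketch below is a minimal NumPy illustration of that iteration itself, not of the paper's Transformer construction; the test matrix and the standard initialization X_0 = A^T / (||A||_1 ||A||_inf) are illustrative choices.

```python
import numpy as np

def newton_inverse(A, num_iters=20):
    """Approximate A^{-1} via the Newton-Schulz iteration X <- X (2I - A X)."""
    n = A.shape[0]
    # Standard initialization that guarantees the initial residual ||I - A X_0|| < 1.
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    I = np.eye(n)
    for _ in range(num_iters):
        # One Newton step; the residual I - A X is squared at each step.
        X = X @ (2 * I - A @ X)
    return X

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 4)) + 4 * np.eye(4)  # well-conditioned test matrix
    X = newton_inverse(A)
    print(np.max(np.abs(A @ X - np.eye(4))))  # residual should be near machine precision
```

The paper's result is that a single such update step can be expressed with two linear attention layers; the loop above corresponds to stacking that construction depth-wise.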