Time-, Memory- and Parameter-Efficient Visual Adaptation
CoRR (2024)
Abstract
As foundation models become more popular, there is a growing need to
efficiently finetune them for downstream tasks. Although numerous adaptation
methods have been proposed, they are designed to be efficient only in terms of
how many parameters are trained. However, they typically still require
backpropagating gradients throughout the model, meaning that their
training time and memory cost are not reduced as significantly.
We propose an adaptation method which does not backpropagate gradients
through the backbone. We achieve this by designing a lightweight network in
parallel that operates on features from the frozen, pretrained backbone. As a
result, our method is efficient not only in terms of parameters, but also in
training time and memory usage. Our approach achieves state-of-the-art
accuracy-parameter trade-offs on the popular VTAB benchmark, and we further
show that it outperforms prior works with respect to training time and memory
usage too. We also demonstrate the training efficiency and scalability of
our method by adapting a vision transformer backbone of 4 billion parameters
for the computationally demanding task of video classification, without any
intricate model parallelism. Here, with the same GPU and less training time,
we outperform a prior adaptor-based method, which could only scale to a
1-billion-parameter backbone, as well as full finetuning of a smaller backbone.
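To make the mechanism concrete, below is a minimal PyTorch-style sketch of the idea described in the abstract: the pretrained backbone is frozen and run without gradient tracking, while a small trainable network in parallel consumes its detached intermediate features, so backpropagation never touches the backbone. All names here (ParallelAdapter, SideBlock, ToyBackbone) and the particular fusion scheme are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class ToyBackbone(nn.Module):
    """Stand-in for a large pretrained backbone (e.g. a ViT) that
    returns a list of per-layer token features."""

    def __init__(self, dim: int, depth: int):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        feats = []
        for layer in self.layers:
            x = torch.relu(layer(x))
            feats.append(x)
        return feats


class SideBlock(nn.Module):
    """Lightweight trainable block that refines a running state
    using one feature map from the frozen backbone."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, state: torch.Tensor, feat: torch.Tensor) -> torch.Tensor:
        # Residual fusion of the side network's state with the backbone feature.
        return state + self.mlp(state + feat)


class ParallelAdapter(nn.Module):
    """Frozen backbone plus a small parallel network; gradients
    never flow through the backbone."""

    def __init__(self, backbone: nn.Module, dim: int, depth: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)  # frozen: no gradients, no optimizer state
        self.blocks = nn.ModuleList(SideBlock(dim, dim // 4) for _ in range(depth))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # backbone acts as a pure feature extractor
            feats = self.backbone(x)
        state = torch.zeros_like(feats[0])
        for block, feat in zip(self.blocks, feats):
            state = block(state, feat.detach())  # detach: backprop stops here
        return self.head(state.mean(dim=1))  # pool over tokens, then classify


# Usage: only the side network and head receive gradients.
model = ParallelAdapter(ToyBackbone(dim=64, depth=4), dim=64, depth=4, num_classes=10)
opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-3)
logits = model(torch.randn(2, 16, 64))  # (batch, tokens, dim) -> (batch, classes)
```

Because the backbone runs under torch.no_grad() and its outputs are detached, none of its activations need to be stored for the backward pass, which is where the training-time and memory savings claimed in the abstract come from; only the small side network's activations and parameters incur training cost.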