Big Transfer (BiT): General Visual Representation Learning

Big Transfer (BiT) is a simple yet effective recipe for visual representation learning through large-scale pre-training and fine-tuning, achieving state-of-the-art performance across diverse datasets.

1. Introduction

Deep learning typically requires substantial task-specific data and computational resources, which can be prohibitively expensive for new tasks. Transfer learning offers a solution by replacing task-specific requirements with a pre-training phase. In this approach, a network is first trained on a large, generic dataset, and its weights are then used to initialize subsequent tasks, enabling effective learning with fewer data points and reduced computational demands. This paper revisits the simple paradigm of pre-training on large supervised source datasets and fine-tuning the model weights on target tasks. Rather than introducing novel components or complexity, the authors aim to provide a minimalistic recipe that leverages carefully selected existing techniques to achieve excellent performance across a wide range of tasks. This recipe is termed "Big Transfer" (BiT).
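
As a minimal illustration of this paradigm (not the BiT recipe itself), the sketch below initializes a backbone from generic pre-trained weights, swaps in a new task head, and fine-tunes on the target task; the backbone, weights, and hyperparameters are placeholder choices.

```python
import torch
import torch.nn as nn
import torchvision

# Start from weights learned on a large generic dataset (here an ImageNet-trained
# ResNet-50 as a stand-in) and replace the classification head for the new task.
backbone = torchvision.models.resnet50(
    weights=torchvision.models.ResNet50_Weights.IMAGENET1K_V2
)
backbone.fc = nn.Linear(backbone.fc.in_features, 37)  # e.g. 37 classes in Oxford-IIIT Pet

# Fine-tune all weights with plain SGD; far less data and compute are needed than
# training from scratch because the representation is already learned.
optimizer = torch.optim.SGD(backbone.parameters(), lr=0.003, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def fine_tune_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One fine-tuning step on a batch from the downstream task."""
    optimizer.zero_grad()
    loss = criterion(backbone(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```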

The BiT approach involves pre-training networks on datasets of varying scales, with the largest model, BiT-L, trained on the JFT-300M dataset containing 300 million noisily labeled images. The transferred models are evaluated on diverse tasks, including ImageNet's ILSVRC-2012, CIFAR-10/100, Oxford-IIIT Pet, Oxford Flowers-102, and the Visual Task Adaptation Benchmark (VTAB), which comprises 19 diverse datasets. BiT-L achieves state-of-the-art performance on many of these tasks and demonstrates remarkable effectiveness even when very limited downstream data is available. Additionally, the BiT-M model, pre-trained on the public ImageNet-21k dataset, shows significant improvements over popular ILSVRC-2012 pre-training. A key advantage of BiT is that it requires only one pre-training phase, and subsequent fine-tuning to downstream tasks is computationally inexpensive, unlike other state-of-the-art methods that need extensive training on support data conditioned on specific tasks.

2. Big Transfer Methodology

The Big Transfer (BiT) methodology is built on a few carefully selected components that are essential for creating an effective network for transfer learning. These components are categorized into upstream (used during pre-training) and downstream (used during fine-tuning) elements.

2.1 Upstream Components

Large-Scale Pre-training: BiT leverages large-scale supervised datasets for pre-training. The largest model, BiT-L, is trained on the JFT-300M dataset, which contains 300 million images with noisy labels. Another model, BiT-M, is trained on the ImageNet-21k dataset. The use of such extensive datasets allows the model to learn rich and general visual representations that are transferable to various downstream tasks.
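
Since the BiT-M weights were released publicly, downstream users can start from them directly rather than repeating the pre-training. A hedged usage sketch via the timm library follows; the exact model name is an assumption and may differ across library versions.

```python
import timm
import torch

# Load a BiT-M (ImageNet-21k pre-trained) backbone with a fresh 10-way head.
# The model name is an assumption; run timm.list_models('*bit*') to see the
# identifiers available in your installed timm version.
model = timm.create_model("resnetv2_50x1_bitm", pretrained=True, num_classes=10)

model.eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 10])
```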

Architecture and Training Hyperparameters: The authors emphasize the importance of selecting appropriate architectures and training hyperparameters. They explore the interplay between model scale, architecture choices, and hyperparameter settings to optimize pre-training performance. Detailed analysis is conducted to identify the key factors that contribute to high transfer performance, ensuring that the model can effectively capture and generalize visual features.
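
One concrete upstream choice reported in the BiT paper is replacing Batch Normalization with Group Normalization combined with Weight Standardization, which scales well to the very large batches used during pre-training. The sketch below is a minimal rendering of a weight-standardized convolution followed by GroupNorm, as an illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StdConv2d(nn.Conv2d):
    """Conv2d whose filters are standardized to zero mean and unit variance."""
    def forward(self, x):
        w = self.weight
        # Standardize each filter over its (in_channels, kH, kW) dimensions.
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        w = (w - mean) / std
        return F.conv2d(x, w, self.bias, self.stride, self.padding, self.dilation, self.groups)

def conv_gn_relu(in_ch, out_ch, groups=32):
    """A weight-standardized conv followed by GroupNorm and ReLU (illustrative block)."""
    return nn.Sequential(
        StdConv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.GroupNorm(groups, out_ch),
        nn.ReLU(inplace=True),
    )

# Example: a small image batch through one block.
x = torch.randn(2, 3, 224, 224)
block = conv_gn_relu(3, 64)
print(block(x).shape)  # torch.Size([2, 64, 224, 224])
```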

2.2 Downstream Components

Fine-tuning Protocol: After pre-training, the model is fine-tuned on the target task. BiT employs a simple and efficient fine-tuning protocol that requires minimal hyperparameter tuning. The authors propose a heuristic for setting hyperparameters during transfer, which works robustly across their diverse evaluation suite. This heuristic simplifies the adaptation process and reduces the computational cost associated with hyperparameter optimization for each new task.
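
The paper calls this heuristic the BiT-HyperRule: it selects the training schedule length, input resolution, and whether to apply MixUp from simple properties of the downstream dataset. The sketch below approximates such a rule; the thresholds and values are illustrative and not necessarily the authors' exact settings.

```python
from dataclasses import dataclass

@dataclass
class FineTuneConfig:
    train_steps: int
    resolution: int
    use_mixup: bool
    lr: float = 0.003      # base learning rate for SGD with momentum
    batch_size: int = 512

def hyperrule(num_examples: int, image_size: int) -> FineTuneConfig:
    """Pick fine-tuning hyperparameters from dataset size and image size (illustrative)."""
    # Longer schedules and MixUp only pay off when enough downstream data is available.
    if num_examples < 20_000:
        steps, mixup = 500, False
    elif num_examples < 500_000:
        steps, mixup = 10_000, True
    else:
        steps, mixup = 20_000, True
    # Small source images are fine-tuned at a lower resolution than large ones.
    resolution = 128 if image_size <= 96 else 384
    return FineTuneConfig(train_steps=steps, resolution=resolution, use_mixup=mixup)

print(hyperrule(num_examples=50_000, image_size=32))       # e.g. a CIFAR-scale dataset
print(hyperrule(num_examples=1_281_167, image_size=224))   # e.g. an ILSVRC-2012-scale dataset
```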

Handling Diverse Data Regimes: BiT is designed to perform well across a wide range of data regimes, from few-shot learning scenarios with as few as one example per class to large-scale datasets with up to 1 million total examples. The methodology includes strategies for effective fine-tuning in data-scarce environments, ensuring that the model maintains high performance even with limited labeled data.
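
For the few-shot end of this range, fine-tuning operates on a small balanced subset of the downstream training set. The sketch below builds such a k-examples-per-class subset; the function and sampling details are illustrative rather than taken from the paper.

```python
import random
from collections import defaultdict

def few_shot_indices(labels, k, seed=0):
    """Return indices selecting k examples per class from a list of integer labels (illustrative)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    chosen = []
    for y, idxs in by_class.items():
        rng.shuffle(idxs)
        chosen.extend(idxs[:k])  # keep at most k examples for this class
    return sorted(chosen)

# Example: a 10-example-per-class split over a toy label list.
labels = [i % 10 for i in range(5_000)]
subset = few_shot_indices(labels, k=10)
print(len(subset))  # 100 indices, 10 per class
```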

3. Experimental Results

The BiT models are evaluated on a variety of benchmarks to demonstrate their effectiveness in transfer learning. The experiments cover multiple datasets and data regimes, highlighting the robustness and versatility of the approach.

Key results for BiT-L include:

  • ILSVRC-2012: 87.5% top-1 accuracy on the full dataset and 76.8% with only 10 examples per class.
  • CIFAR-10: 99.4% accuracy on the full dataset and 97.0% with 10 examples per class.
  • CIFAR-100: strong accuracy in both the full-data and few-shot settings.
  • VTAB: 76.3% accuracy on the 19-task Visual Task Adaptation Benchmark using only 1,000 examples per task.

3.1 Performance on Few-Shot Learning

BiT excels in few-shot learning scenarios, where only a limited number of labeled examples are available per class. For instance, on ILSVRC-2012 with 10 examples per class, BiT-L achieves 76.8% accuracy, significantly outperforming baseline models. Similarly, on CIFAR-10 with 10 examples per class, it reaches 97.0% accuracy. These results underscore the model's ability to generalize from limited data, making it suitable for applications where collecting large labeled datasets is challenging.

3.2 Comparison with State-of-the-Art

BiT-L sets new state-of-the-art results on several benchmarks. When compared to previous generalist representation methods that are pre-trained independently of the final task, BiT-L demonstrates superior performance. For example, on the Oxford-IIIT Pets and Oxford Flowers-102 datasets, BiT-L achieves high accuracy in both full-data and few-shot settings, as illustrated in Figure 1 of the original paper. The model also outperforms ResNet-50 baselines pre-trained on ILSVRC-2012 across all evaluated tasks.

3.3 Efficiency and Scalability

One of the key advantages of BiT is its efficiency. The pre-trained models require only a short fine-tuning phase for each new task, reducing the computational burden. Additionally, the heuristic for hyperparameter setting eliminates the need for extensive tuning, further enhancing practicality. The scalability of BiT is evident from its performance on large-scale datasets like JFT-300M and ImageNet-21k, as well as its effectiveness on smaller datasets.

4. Key Insights

The success of Big Transfer can be attributed to several critical factors identified through detailed analysis:

  • Scale of Pre-training: Larger pre-training datasets, such as JFT-300M, lead to more robust and generalizable representations, which translate to better performance on downstream tasks.
  • Architecture Choices: The selection of appropriate network architectures and their scaling (e.g., depth and width) plays a vital role in capturing complex visual features.
  • Hyperparameter Heuristic: The proposed heuristic for setting fine-tuning hyperparameters ensures consistent performance across diverse tasks without the need for task-specific tuning.
  • Minimalistic Recipe: By focusing on a minimal set of well-chosen components, BiT avoids unnecessary complexity while achieving state-of-the-art results.
  • Broad Applicability: BiT's effectiveness across various data regimes and task types highlights its versatility as a general-purpose visual representation learning method.

5. Conclusion

Big Transfer (BiT) presents a simple yet powerful recipe for visual representation learning through large-scale pre-training and efficient fine-tuning. By leveraging carefully selected components and a straightforward transfer heuristic, BiT achieves state-of-the-art performance on a wide range of datasets and data regimes. The method's ability to perform well with limited downstream data, combined with its computational efficiency, makes it a practical solution for real-world applications. The release of pre-trained models, such as BiT-M trained on ImageNet-21k, provides valuable resources for the research community. Future work may explore further scaling, additional architectural innovations, and applications to emerging vision tasks.