Beginner's Guide to Fine-Tuning Vision Transformers #7130

ellie-sleightholm · 2024-08-28T17:59:57Z

Fine-Tuning Vision Transformer Models - A Beginner's Guide

Introduction

I recently wrote an article on fine-tuning your own Vision Transformer (ViT) models, after searching the web for resources and struggling to find any. This discussion summarises that article so that beginners, and anyone else looking to fine-tune a Vision Transformer, can do so with ease!

For the full code and a guided walk-through, visit this article.

1. Load a Dataset

To perform fine-tuning, we will use a small image classification dataset: microsoft/cats_vs_dogs, a collection of cat and dog images.

from datasets import load_dataset

# Load the dataset from the Hugging Face Hub using its full repository id
ds = load_dataset('microsoft/cats_vs_dogs')

# Look at one example and at the label feature
entry = ds['train'][0]
image = entry['image']
labels = ds['train'].features['labels']

2. Preparing the Images - ViT Image Processor

from transformers import ViTImageProcessor

model_name_or_path = 'google/vit-base-patch16-224-in21k'
# ViTImageProcessor is the current name for the deprecated ViTFeatureExtractor class
vit_feature_extractor = ViTImageProcessor.from_pretrained(model_name_or_path)

# Process an image by passing it through the image processor
vit_feature_extractor(image, return_tensors='pt')
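
To make the processor's output less opaque, here is a minimal sketch of the arithmetic behind it. It assumes the preprocessing suggested by the model name (a 224x224 input cut into 16x16 patches) and a per-channel normalisation with mean 0.5 and standard deviation 0.5. These constants and the helper names are assumptions for illustration, so check the processor's own configuration for the authoritative values.

```python
# Illustrative arithmetic only; the constants are assumptions, not read from the checkpoint.
IMAGE_SIZE = 224      # assumed input resolution (the "224" in the model name)
PATCH_SIZE = 16       # assumed patch size (the "patch16" in the model name)
MEAN, STD = 0.5, 0.5  # assumed per-channel normalisation constants


def num_patches(image_size: int = IMAGE_SIZE, patch_size: int = PATCH_SIZE) -> int:
    """Number of non-overlapping square patches the image is cut into."""
    per_side = image_size // patch_size
    return per_side * per_side


def normalise_pixel(value: int, mean: float = MEAN, std: float = STD) -> float:
    """Map an 8-bit pixel value (0-255) to the normalised range."""
    rescaled = value / 255.0        # rescale to [0, 1]
    return (rescaled - mean) / std  # centre and scale


print(num_patches())         # 196 patches (14 x 14)
print(normalise_pixel(0))    # -1.0
print(normalise_pixel(255))  # 1.0
```

Under these assumptions, the 'pixel_values' tensor for a single image would have shape (1, 3, 224, 224) with values between -1 and 1.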

3. Processing the Dataset

# Process a single example: returns the pixel values plus the label
def process_single_entry(entry):
    processed = vit_feature_extractor(entry['image'], return_tensors='pt')
    processed['labels'] = entry['labels']
    return processed

# Function to transform a batch of examples on the fly
def transformation(entry_batch):
    # Convert to RGB so that every image has three channels
    transformed = vit_feature_extractor([x.convert('RGB') for x in entry_batch['image']], return_tensors='pt')
    transformed['labels'] = entry_batch['labels']
    return transformed

import random
from datasets import DatasetDict

# Set seed for reproducibility
random.seed(42)

# Generate random indices for train and validation datasets
all_indices = list(range(len(ds['train'])))
train_indices = random.sample(all_indices, 1000)
remaining_indices = list(set(all_indices) - set(train_indices))
validation_indices = random.sample(remaining_indices, 200)

# Select the subsets
train_ds = ds['train'].select(train_indices)
validation_ds = ds['train'].select(validation_indices)

# Create a DatasetDict with the new splits
small_ds = DatasetDict({
    'train': train_ds,
    'validation': validation_ds
})

# Apply the transformation
prepared_ds = small_ds.with_transform(transformation)
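
The index sampling used for the split can be tried out without downloading anything. The following is a minimal sketch using only the standard library; the function name and the dataset size of 5000 are hypothetical. It uses a local random generator and keeps the leftover indices in their original order, so the result does not depend on set iteration order.

```python
import random


def sample_disjoint_indices(n_total, n_train, n_val, seed=42):
    """Draw two non-overlapping random index lists from range(n_total)."""
    rng = random.Random(seed)  # local generator, leaves the global seed untouched
    all_indices = list(range(n_total))
    train = rng.sample(all_indices, n_train)
    train_set = set(train)
    # Keep the remaining indices in ascending order so the draw is reproducible
    remaining = [i for i in all_indices if i not in train_set]
    val = rng.sample(remaining, n_val)
    return train, val


# 5000 is a made-up dataset size for illustration
train_idx, val_idx = sample_disjoint_indices(5000, 1000, 200)
print(len(train_idx), len(val_idx))        # 1000 200
print(set(train_idx).isdisjoint(val_idx))  # True
```

Because the validation indices are drawn only from what is left after the training draw, no image can appear in both subsets.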

4. Training and Fine-Tuning

import torch 

# Data collator function
def collate_fn(batch):
    return {
        'pixel_values': torch.stack([x['pixel_values'] for x in batch]),
        'labels': torch.tensor([x['labels'] for x in batch])
    }

import numpy as np
import evaluate

# Metric computation function
# (load_metric has been removed from recent datasets releases; the evaluate library provides the same metric)
metric = evaluate.load("accuracy")
def compute_metrics(p):
    return metric.compute(predictions=np.argmax(p.predictions, axis=1), references=p.label_ids)
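
For readers new to this step, here is a plain-Python sketch of what compute_metrics works out: take the highest-scoring class for each image and count how often it matches the true label. The function name and the example logits are made up for illustration; this is not the library's implementation.

```python
def accuracy_from_logits(logits, label_ids):
    """Fraction of rows whose highest-scoring class matches the true label."""
    # argmax per row: index of the largest logit
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, label_ids))
    return {"accuracy": correct / len(label_ids)}


# Hypothetical logits for four images and two classes
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
label_ids = [0, 1, 0, 0]
print(accuracy_from_logits(logits, label_ids))  # {'accuracy': 0.75}
```

Here the third image is predicted as class 1 while its label is 0, so three of four predictions are correct.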

from transformers import ViTForImageClassification

# Initialize the model with the correct number of labels
num_labels = len(ds['train'].features['labels'].names)

model = ViTForImageClassification.from_pretrained(
    model_name_or_path,
    num_labels=num_labels
)
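
An optional extra, not part of the original walk-through: the model reports predictions as integer ids, so it can help to keep a mapping between ids and class names. The sketch below builds both directions from a list of names; the names "cat" and "dog" and the helper function are assumptions for illustration.

```python
def build_label_maps(names):
    """Return (id2label, label2id) dictionaries for a list of class names."""
    id2label = {i: name for i, name in enumerate(names)}
    label2id = {name: i for i, name in id2label.items()}
    return id2label, label2id


names = ["cat", "dog"]  # assumed class names, in label-id order
id2label, label2id = build_label_maps(names)
print(id2label)  # {0: 'cat', 1: 'dog'}
print(label2id)  # {'cat': 0, 'dog': 1}
```

These dictionaries could then be passed to from_pretrained as id2label and label2id, alongside num_labels, so that the saved model configuration carries readable class names.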

from transformers import TrainingArguments

# Training arguments
training_args = TrainingArguments(
    output_dir="./vit-cat-dogs-demo",
    per_device_train_batch_size=16,
    evaluation_strategy="steps",  # renamed to eval_strategy in newer transformers releases
    num_train_epochs=2,
    fp16=True,  # mixed precision; set to False when no supported GPU is available
    save_steps=10,
    eval_steps=10,
    logging_steps=10,
    learning_rate=2e-4,
    save_total_limit=2,
    remove_unused_columns=False,  # keep the 'image' column, which the transform needs
    push_to_hub=False,
    report_to='tensorboard',
    load_best_model_at_end=True,
)
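
To get a feel for what these settings mean in practice, the sketch below works out how many optimisation steps and evaluations the run would involve. It assumes a single device, no gradient accumulation, and the 1000-image training subset from the previous section; the function name is hypothetical.

```python
import math


def training_schedule(n_samples, batch_size, epochs, eval_steps):
    """Return (steps per epoch, total steps, number of evaluations)."""
    steps_per_epoch = math.ceil(n_samples / batch_size)  # the last partial batch is kept
    total_steps = steps_per_epoch * epochs
    n_evals = total_steps // eval_steps  # one evaluation every eval_steps steps
    return steps_per_epoch, total_steps, n_evals


print(training_schedule(n_samples=1000, batch_size=16, epochs=2, eval_steps=10))
# (63, 126, 12)
```

With eval_steps=10 the model is evaluated often relative to the length of the run, which suits a short demo; for longer runs a larger value would reduce the overhead.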

from transformers import Trainer

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    compute_metrics=compute_metrics,
    train_dataset=prepared_ds["train"],
    eval_dataset=prepared_ds["validation"],
    # Saved together with the model; newer transformers releases call this argument processing_class
    tokenizer=vit_feature_extractor,
)

# Train the model
train_results = trainer.train()
trainer.save_model()
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)
trainer.save_state()
