[Computer Vision] Object Detection - Getting started with locating objects in images

Lab Introduction

AWS experience: Beginner
Time to complete: 20 minutes
Cost to complete: Free Tier
Services used: Google Colab

About this lab

In this lab, we will use the Pascal VOC 2012 dataset, one of the foundational and most widely used datasets in Computer Vision, especially for object detection and image segmentation.

🧾 Pascal VOC 2012 overview

PASCAL VOC (PAttern Analysis, Statistical modelling and Computational Learning Visual Object Classes) is a research-community initiative that ran for several years to benchmark computer vision models. VOC 2012 is the final and most complete edition.

Release year: 2012
Number of images: ~11,540
Number of objects: ~27,450
Number of classes: 20 object classes
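
For reference, the 20 class names can be written down as a plain Python list. This is a minimal sketch: the names are the labels that appear in the VOC annotation files, while `VOC_CLASSES` and `CLASS_TO_INDEX` are hypothetical helper names introduced here for illustration (many detection models also reserve an extra index for "background", which this sketch does not do).

```python
# The 20 Pascal VOC object classes, listed alphabetically
VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle",
    "bus", "car", "cat", "chair", "cow",
    "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]

# Hypothetical mapping from class name to an integer label (0..19)
CLASS_TO_INDEX = {name: idx for idx, name in enumerate(VOC_CLASSES)}

print(len(VOC_CLASSES))
print(CLASS_TO_INDEX["dog"])
```

A mapping like this is typically what you would use later to turn the `name` field of each annotated object into a numeric label for training.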

📁 Supported tasks

VOC 2012 supports several computer vision tasks:

1. Classification: Determine which object classes are present in an image.

2. Object Detection: Determine the class and location of each object in an image using bounding boxes.

3. Semantic Segmentation: Label every pixel of the image as belonging to an object class or to the background.

4. Person Layout: Locate the parts of a person's body: head, hands, feet.

1. Create a Google Colab file

Go to the Google Drive folder that holds your Google Colab projects.

To create a new Colab notebook, right-click => More => Colaboratory

Then open the newly created file to see the notebook interface. You can rename it to make it easier to manage (for example: ExerciseComputerVision).

2. Write the code step by step

1. Install the torchvision library

!pip install torchvision

torchvision is a companion library to PyTorch, designed specifically for Computer Vision tasks. Google Colab usually ships with it preinstalled, in which case the command above simply confirms that it is available.

Torchvision provides:

Built-in datasets: MNIST, CIFAR, ImageNet, Pascal VOC, COCO, CelebA, and more.
Image preprocessing (transforms): Resize, CenterCrop, RandomHorizontalFlip, Normalize, and more.
Pretrained deep learning models: well-known CNN architectures such as ResNet, VGG, MobileNet, EfficientNet, and DenseNet.
Image utilities (utils): visualization helpers such as make_grid, save_image, draw_bounding_boxes, and draw_segmentation_masks.

2. Import the required libraries

import torchvision 
from torchvision.datasets import VOCDetection
from torchvision import transforms
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np

📌 Explanation:

torchvision: Computer Vision utility library for PyTorch.
VOCDetection: Dataset class for working with the Pascal VOC dataset.
transforms: Used to preprocess input images (resize, convert to tensor, normalize, ...).
DataLoader: Used to load data in batches (imported here, but only used in the optional extension).
matplotlib.pyplot & patches: Display images and draw bounding boxes.
numpy: Array processing.

3. Prepare the image transform

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor()
])

📌 Explanation:

transforms.Resize: Resizes the image to 224x224, the input size commonly used by CNN models such as ResNet and VGG. A fixed square size does not preserve the original aspect ratio.
transforms.ToTensor(): Converts a PIL image to a tensor of shape [C, H, W] with values in [0, 1].
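
To make the effect of ToTensor concrete, here is a rough NumPy-only imitation of its two main steps (rescaling to [0, 1] and moving the channel axis to the front). The function name `to_tensor_like` and the fake image are assumptions made for illustration; the real transform returns a PyTorch tensor rather than a NumPy array.

```python
import numpy as np

def to_tensor_like(image_hwc_uint8):
    """Imitate ToTensor's layout and value-range change using NumPy only."""
    # uint8 values 0..255 become float32 values 0.0..1.0
    scaled = image_hwc_uint8.astype(np.float32) / 255.0
    # [H, W, C] -> [C, H, W]
    return np.transpose(scaled, (2, 0, 1))

# Fake 224x224 RGB image that is pure red
fake_image = np.zeros((224, 224, 3), dtype=np.uint8)
fake_image[..., 0] = 255

chw = to_tensor_like(fake_image)
print(chw.shape)                   # channel axis is now first
print(chw[0].max(), chw[1].max())  # red channel vs. green channel
```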

4. Load the Pascal VOC dataset

dataset = VOCDetection(root='./data', year='2012', image_set='train', download=True, transform=transform)

📌 Explanation:

root='./data': Directory where the data is stored.
year='2012', image_set='train': Load the 2012 training split.
download=True: Download the data if it is not already present.
transform=...: Apply the resize and tensor conversion to the image (the annotation is left unchanged).

5. Get one sample from the dataset

img_tensor, target = dataset[0]
img = img_tensor.permute(1, 2, 0).numpy()

📌 Explanation:

dataset[0]: Gets the first sample, consisting of the image and its annotation.
img_tensor: Tensor of shape [C, H, W].
permute(1, 2, 0): Reorders the axes to [H, W, C], the layout matplotlib expects.
numpy(): Converts the tensor to a NumPy array so it can be shown with imshow.

6. Display the image with matplotlib
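
The axis reordering can be checked on a tiny array. This sketch uses `np.transpose`, which takes the same axis order as the tensor's `permute(1, 2, 0)`; the array contents are arbitrary illustration data.

```python
import numpy as np

# A tiny "image" in [C, H, W] layout: 3 channels, 2 rows, 4 columns
chw = np.arange(3 * 2 * 4, dtype=np.float32).reshape(3, 2, 4)

# Same axis order as permute(1, 2, 0): result is [H, W, C]
hwc = np.transpose(chw, (1, 2, 0))

print(hwc.shape)
# The value of channel 0 at row 1, column 2 is the same in both layouts
print(chw[0, 1, 2] == hwc[1, 2, 0])
```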

fig, ax = plt.subplots(1, figsize=(8, 8))
ax.imshow(img)

➡️ Creates a canvas for the image; ax is used later to add the bounding boxes.

7. Process the annotation (objects in the image)

objects = target['annotation']['object']
if isinstance(objects, dict):
    objects = [objects]

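📌 Explanation:

The annotation is a dict in which object can be:
- A single dict if the image contains only one object.
- A list if the image contains several objects.
So it needs to be normalized to always be a list that can be iterated over. Which form you get can depend on the torchvision version; the check is harmless either way.

8. Compute the scale factors (the image was resized, so the bounding boxes must be rescaled)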
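
The normalization described above can be sketched as a small standalone helper. The function name `as_object_list` and the sample dicts are hypothetical, made up for illustration; only the `object` and `name` keys mirror the annotation layout used in this lab.

```python
def as_object_list(annotation):
    """Return the 'object' entry of a VOC-style annotation dict as a list."""
    objects = annotation.get("object", [])
    if isinstance(objects, dict):
        # A single object stored as a bare dict: wrap it
        return [objects]
    return list(objects)

# Hypothetical annotations: one with a single object, one with two
single = {"object": {"name": "dog"}}
multi = {"object": [{"name": "person"}, {"name": "horse"}]}

print([obj["name"] for obj in as_object_list(single)])
print([obj["name"] for obj in as_object_list(multi)])
```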

orig_size = (int(target['annotation']['size']['width']),
             int(target['annotation']['size']['height']))
scale_x = 224 / orig_size[0]
scale_y = 224 / orig_size[1]

📌 Explanation:

The annotation (bounding box coordinates) is based on the original image size.
But the image has been resized to 224x224.
So the original coordinates must be multiplied by scale_x and scale_y to map them onto the resized image.

9. Draw the bounding boxes and object labels
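
As a worked example of this rescaling, the sketch below wraps the arithmetic in a helper and applies it to a made-up box. The function name `scale_box`, the 500x375 image size, and the box coordinates are all assumptions chosen for illustration; the coordinates are strings because that is how they appear in the parsed annotation.

```python
def scale_box(bbox, orig_width, orig_height, new_width=224, new_height=224):
    """Map a VOC-style box from original-image pixels to resized-image pixels."""
    scale_x = new_width / orig_width
    scale_y = new_height / orig_height
    return (
        int(float(bbox["xmin"]) * scale_x),
        int(float(bbox["ymin"]) * scale_y),
        int(float(bbox["xmax"]) * scale_x),
        int(float(bbox["ymax"]) * scale_y),
    )

# Hypothetical box on a 500x375 image
example_box = {"xmin": "100", "ymin": "75", "xmax": "300", "ymax": "225"}
scaled = scale_box(example_box, orig_width=500, orig_height=375)
print(scaled)  # (44, 44, 134, 134)
```

Because the width and height are scaled by different factors (0.448 and about 0.597 here), a box that was wider than tall in the original image becomes square after resizing.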

for obj in objects:
    name = obj['name']
    bbox = obj['bndbox']
    xmin = int(float(bbox['xmin']) * scale_x)
    ymin = int(float(bbox['ymin']) * scale_y)
    xmax = int(float(bbox['xmax']) * scale_x)
    ymax = int(float(bbox['ymax']) * scale_y)
    width = xmax - xmin
    height = ymax - ymin

    rect = patches.Rectangle((xmin, ymin), width, height, linewidth=2, edgecolor='red', facecolor='none')
    ax.add_patch(rect)
    ax.text(xmin, ymin - 5, name, color='red', fontsize=12, weight='bold')

📌 Explanation:

obj['name']: Name of the object (such as person, dog, car, ...).
bbox: The box coordinates (xmin, ymin, xmax, ymax).
Multiply them by the scale factors to match the resized image.
Use patches.Rectangle to draw the box.
Use ax.text(...) to show the object label just above the box.

10. Hide the axes and show the final image
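
One detail worth spelling out: the annotation stores two corners, while `patches.Rectangle` takes one corner plus a width and a height. A minimal sketch of that conversion follows; the helper name `corners_to_xywh` is hypothetical and the numbers are illustration values.

```python
def corners_to_xywh(xmin, ymin, xmax, ymax):
    """Convert (xmin, ymin, xmax, ymax) to (x, y, width, height)."""
    return xmin, ymin, xmax - xmin, ymax - ymin

# Corner format, e.g. a box already scaled to the resized image
x, y, width, height = corners_to_xywh(44, 44, 134, 134)
print(x, y, width, height)  # 44 44 90 90
```

The resulting values are what would be passed as `patches.Rectangle((x, y), width, height, ...)`.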

plt.axis('off')
plt.show()

➡️ Removes the axes (visual clutter), then displays the image.

Full code:

import torchvision
from torchvision.datasets import VOCDetection
from torchvision import transforms
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np

# Prepare the image transform
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor()
])

# Load VOC 2012 with annotations
dataset = VOCDetection(root='./data', year='2012', image_set='train', download=True, transform=transform)

# Get one sample
img_tensor, target = dataset[0]
img = img_tensor.permute(1, 2, 0).numpy()  # Convert [C,H,W] -> [H,W,C]

# Prepare the figure
fig, ax = plt.subplots(1, figsize=(8, 8))
ax.imshow(img)

# Get the object list
objects = target['annotation']['object']
if isinstance(objects, dict):
    objects = [objects]  # Case with a single object

# Scale factors for the boxes: the image was resized, but the annotation uses the original image size
orig_size = (int(target['annotation']['size']['width']),
             int(target['annotation']['size']['height']))
scale_x = 224 / orig_size[0]
scale_y = 224 / orig_size[1]

# Draw the boxes and labels
for obj in objects:
    name = obj['name']
    bbox = obj['bndbox']
    xmin = int(float(bbox['xmin']) * scale_x)
    ymin = int(float(bbox['ymin']) * scale_y)
    xmax = int(float(bbox['xmax']) * scale_x)
    ymax = int(float(bbox['ymax']) * scale_y)
    width = xmax - xmin
    height = ymax - ymin

    rect = patches.Rectangle((xmin, ymin), width, height, linewidth=2, edgecolor='red', facecolor='none')
    ax.add_patch(rect)
    ax.text(xmin, ymin - 5, name, color='red', fontsize=12, weight='bold')

plt.axis('off')
plt.show()

3. Test the result

Click the Run Cell button (the triangle icon). Wait a few minutes and the image will be displayed; the first run is slower because the dataset has to be downloaded.

4. Extension (Optional)

Use DataLoader to load images and annotations in batches.
DataLoader is an important tool when training a model, because you typically train on thousands of images in batches.
Display multiple images with their bounding boxes.
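
Batching detection data usually needs a custom collate function, because each image can have a different number of objects and the default batching logic tries to merge the annotation fields across samples. The sketch below shows the idea in plain Python with made-up samples; `collate_keep_lists` and `fake_batch` are hypothetical names, and strings stand in for image tensors.

```python
def collate_keep_lists(batch):
    """Group a list of (image, target) pairs into two parallel lists."""
    images, targets = zip(*batch)
    return list(images), list(targets)

# Hypothetical batch: the two samples have different numbers of objects
fake_batch = [
    ("img0", {"annotation": {"object": [{"name": "dog"}]}}),
    ("img1", {"annotation": {"object": [{"name": "person"}, {"name": "horse"}]}}),
]

images, targets = collate_keep_lists(fake_batch)
print(images)
print([len(t["annotation"]["object"]) for t in targets])
```

Keeping the targets as a plain list of dicts leaves each annotation untouched, at the cost of the images not being stacked into a single tensor.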

Full code:

import torch
from torchvision.datasets import VOCDetection
from torchvision import transforms
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np

# 1. Define the image transform
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor()
])

# 2. Load the VOC 2012 dataset
dataset = VOCDetection(root='./data', year='2012', image_set='train', download=True, transform=transform)

# 3. Define a collate_fn to handle batches whose targets are dicts
def voc_collate_fn(batch):
    images = []
    targets = []
    for img, target in batch:
        images.append(img)
        targets.append(target)
    return images, targets

# 4. Create a DataLoader with batch_size = 4 and the custom collate_fn
dataloader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=voc_collate_fn)

# 5. Function that displays a batch of images with bounding boxes
def show_batch(images, targets):
    fig, axs = plt.subplots(2, 2, figsize=(12, 12))
    axs = axs.flatten()

    for i in range(len(images)):
        img = images[i].permute(1, 2, 0).numpy()
        target = targets[i]
        ax = axs[i]
        ax.imshow(img)

        # Scale the bounding boxes because the image was resized
        orig_size = (
            int(target['annotation']['size']['width']),
            int(target['annotation']['size']['height'])
        )
        scale_x = 224 / orig_size[0]
        scale_y = 224 / orig_size[1]

        objects = target['annotation']['object']
        if isinstance(objects, dict):  # If there is only one object
            objects = [objects]

        for obj in objects:
            name = obj['name']
            bbox = obj['bndbox']
            xmin = int(float(bbox['xmin']) * scale_x)
            ymin = int(float(bbox['ymin']) * scale_y)
            xmax = int(float(bbox['xmax']) * scale_x)
            ymax = int(float(bbox['ymax']) * scale_y)
            width = xmax - xmin
            height = ymax - ymin

            rect = patches.Rectangle((xmin, ymin), width, height, linewidth=2, edgecolor='red', facecolor='none')
            ax.add_patch(rect)
            ax.text(xmin, ymin - 5, name, color='red', fontsize=10, weight='bold')

        ax.axis('off')

    plt.tight_layout()
    plt.show()

# 6. Display the first batch
for images, targets in dataloader:
    show_batch(images, targets)
    break  # only show one batch
