Doximex.Com: TRECVID 2007

Monday, July 13, 2009

TRECVID 2007 - Introduction

TRECVID is currently the de facto standard benchmark (that is, a standard accepted in practice rather than officially) for people working on video indexing and video retrieval. The main reason is that TRECVID brings together the world's leading research groups (IBM, CMU, Columbia University, Microsoft Research Asia (MSRA), UvA, etc.) and is very active in exchanging and sharing research results among them. (See some similar benchmarks here.) It is almost an unwritten rule: if you submit a video retrieval paper to top conferences such as ACM Multimedia, MIR, or CIVR, on topics such as bridging the semantic gap, concept detection, scene classification, or video search, reviewers will almost certainly want to know whether you evaluated and compared against what has been published at TRECVID (understandable, since the reviewers take part in TRECVID too :-)).

TRECVID runs a competition every year. That is, given the same data set and the same task, participating groups submit the results of the systems they have developed, and NIST evaluates them against a common, fair standard. If a group finishes in the top 3, you will usually see its paper appear at the top conferences the following year. Moreover, because research groups from the same community all over the world take part, a good result at TRECVID means recognition from everyone.

As usual, this year's TRECVID has the following main tasks:

1. Shot Boundary Detection

Video data is very large. Just imagine: a 30-minute video has about 54K frames (at 30 fps), and even compressed as MPEG-1 it averages more than 300 MB. So, to make processing manageable, the first step is to decompose the input video into smaller segments. Shots are among the smallest units of such segments.
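
As a quick sanity check of those numbers, here is a small sketch. The 1.5 Mbit/s bitrate is my assumption (a typical MPEG-1 rate), not a figure from the TRECVID data, and `video_stats` is just an illustrative name.

```python
def video_stats(minutes, fps=30.0, bitrate_mbps=1.5):
    """Return (frame_count, size_in_megabytes) for a clip.

    bitrate_mbps is an assumed average bitrate in megabits per second.
    """
    seconds = minutes * 60
    frames = int(seconds * fps)
    # megabits -> megabytes: 8 bits per byte
    size_mb = seconds * bitrate_mbps / 8
    return frames, size_mb


frames, size_mb = video_stats(30)
print(frames, "frames, about", size_mb, "MB")
```

Under that assumed bitrate the result is consistent with the rough figures above.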

Shots are fundamental units of video, useful for higher-level processing. The task is as follows: identify the shot boundaries with their location and type (cut or gradual) in the given video clip(s).

The hardest part of this task is probably finding the gradual transitions. Still, current SBD results are quite high: the best systems reach over 90% in both precision and recall.
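
To make the task and its two measures concrete, here is a minimal sketch, not any participant's actual system. It declares a hard cut when the grey-level histograms of two consecutive frames differ by more than a threshold, then scores the detections against a reference list. The frame representation (flat lists of pixel values), the threshold, and the exact-frame matching rule are all simplifying assumptions, and gradual transitions are not handled at all.

```python
def histogram(frame, bins=8, max_value=256):
    """Normalised grey-level histogram of a frame (flat list of pixel values)."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // max_value] += 1
    total = len(frame)
    return [c / total for c in counts]


def detect_cuts(frames, threshold=0.5):
    """Return indices i where a cut is declared between frame i-1 and frame i."""
    cuts = []
    previous = histogram(frames[0])
    for i in range(1, len(frames)):
        current = histogram(frames[i])
        # L1 distance between histograms, halved so it lies in [0, 1]
        distance = sum(abs(a - b) for a, b in zip(previous, current)) / 2
        if distance > threshold:
            cuts.append(i)
        previous = current
    return cuts


def precision_recall(detected, reference):
    """Precision and recall of detected boundaries, matched by exact frame index."""
    detected, reference = set(detected), set(reference)
    hits = len(detected & reference)
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Synthetic clip: 3 dark frames, 3 bright frames, 3 mid-grey frames
dark, bright, grey = [10] * 64, [240] * 64, [128] * 64
clip = [dark] * 3 + [bright] * 3 + [grey] * 3

cuts = detect_cuts(clip)
precision, recall = precision_recall(cuts, reference=[3, 6])
print(cuts, precision, recall)
```

On real video a fixed threshold like this tends to fail on gradual transitions, which is why they are the hard part of the task.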

2. High Level Feature Extraction

One of the challenging issues in video/image retrieval today is bridging the semantic gap, that is, the discrepancy in meaning between what a computer understands and what a human understands. For example, looking at a picture of a rose, a human may interpret it as a rose, romantic love, etc., whereas a computer can only understand it at the level of color, shape, and texture. That is why people are so interested in models and learning algorithms that let a computer understand, for example, that this image contains US President Bush, that one contains an airplane, and another is about sports. The HLFs in TRECVID are exactly semantic concepts of this kind.

Various high-level semantic features, concepts such as "Indoor/Outdoor", "People", "Speech" etc., occur frequently in video databases. The proposed task will contribute to work on a benchmark for evaluating the effectiveness of detection methods for semantic concepts.
The task is as follows: given the feature test collection, the common shot boundary reference for the feature extraction test collection, and the list of feature definitions (see below), participants will return for each feature the list of at most 2000 shots from the test collection, ranked according to the highest possibility of detecting the presence of the feature. Each feature is assumed to be binary, i.e., it is either present or absent in the given reference shot.

This is a truly challenging task and it attracts a great deal of attention from research groups. Last year, 2006, Tsinghua University had the best result, with a Mean Average Precision of 19.2%. Very roughly, this number can be read as follows: for concepts such as weather (find shots about weather forecasts) or Flag-US (find shots showing the US flag), on average fewer than 200 out of 1,000 returned results are correct (relevant). Strictly speaking, MAP also rewards placing relevant shots near the top of the list, so this is only an intuition. Note also that, according to Tsinghua University's report, the estimated training time for the 39 concept detectors was about 600 days (close to 2 years) on a single PC; fortunately they ran on parallel machines, which brought it down to just 10 days.
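
For readers who have not met the measure before, here is a minimal sketch of textbook (non-interpolated) average precision and its mean over several concepts. It is meant only to show what the number captures; it is not NIST's evaluation code, and the official procedure may differ in details such as how judgments are sampled. The shot ids are made up.

```python
def average_precision(ranked_shots, relevant_shots):
    """Average of precision@k taken at each rank k where a relevant shot appears."""
    relevant = set(relevant_shots)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, shot in enumerate(ranked_shots, start=1):
        if shot in relevant:
            hits += 1
            precision_sum += hits / rank
    # Dividing by all relevant shots penalises relevant shots that were never returned
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_shots, relevant_shots) pairs, one per concept."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical ranked lists for two concepts
weather = (["s1", "s2", "s3", "s4"], {"s1", "s3"})
flag_us = (["a", "b"], {"b", "c"})

ap_weather = average_precision(*weather)
ap_flag = average_precision(*flag_us)
map_score = mean_average_precision([weather, flag_us])
print(ap_weather, ap_flag, map_score)
```

Two ranked lists with the same number of relevant shots can get different scores depending on where those shots sit in the ranking, which is why the "200 out of 1,000" reading is only approximate.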

3. Search

This is probably the hardest task, because it requires behaving like a real video search engine in which the user types queries such as "Find shots with one or more people leaving or entering a vehicle" or "Find shots of one or more people reading a newspaper". Systems fall into three types: fully automatic, manual, and interactive. Fully automatic means the system must process the whole query as raw text, as above, to find results. Doing so requires preprocessing steps such as query parsing, query understanding, etc. Manual means the user helps parse the query into components the system can understand, for example by picking keywords out of the query. Once that help has been given, the computer does everything else on its own and returns the results. Interactive means the user and the computer interact to reach the best result. Beyond the help given at the manual level, an interactive user gives feedback on the results the computer returns, and the computer uses that feedback to refine its processing and return new results. At this level, the interaction time is limited.
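
As a toy illustration of the keyword-selection step (done by a person in a manual run, and by the system itself in a fully automatic run), here is a sketch that simply drops filler words from the query. The stopword list is hypothetical and hand-picked for these two example queries; real systems do far more than this.

```python
# Hypothetical stopword list, chosen by hand for TRECVID-style topic sentences
STOPWORDS = {"find", "shots", "of", "with", "a", "an", "the",
             "or", "and", "one", "more"}


def extract_keywords(query):
    """Lower-case the query, strip punctuation, and drop stopwords."""
    words = [w.strip('.,"').lower() for w in query.split()]
    return [w for w in words if w and w not in STOPWORDS]


keywords_vehicle = extract_keywords(
    "Find shots with one or more people leaving or entering a vehicle")
keywords_paper = extract_keywords(
    "Find shots of one or more people reading a newspaper")
print(keywords_vehicle)
print(keywords_paper)
```

The surviving words could then be matched against transcripts or against the names of available concept detectors.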

Search is a high-level task which includes at least query-based retrieval and browsing. The search task models that of an intelligence analyst or analogous worker, who is looking for segments of video containing persons, objects, events, locations, etc. of interest. These persons, objects, etc. may be peripheral or accidental to the original subject of the video. The task is as follows: given the search test collection, a multimedia statement of information need (topic), and the common shot boundary reference for the search test collection, return a ranked list of at most 1000 common reference shots from the test collection, which best satisfy the need.

Last year's best results showed a MAP of just under 10% for fully automatic systems.

4. BBC Rushes Summarization

This is a fairly new task, introduced only in the last year or two. The goal is to study summarization algorithms, which could be very useful in search engines. For example, if the result returned for a query about Brad Pitt action movies is, say, Mr and Mrs Smith, then instead of playing the whole movie to find out what it is about, people may only need to play a much shorter summary clip to decide whether it is interesting enough to keep watching.

Because of copyright issues, the video data for this task consists only of rushes, which roughly means footage that has been shot but not yet edited for use. Examples are a movie scene shot over and over again, or an amateur cameraman's footage of the September 11 attacks. Before they can be used, rushes have to be edited and cut down to 1/20 to 1/40 of the original length.

Rushes are the raw material (extra video, B-rolls footage) used to produce a video. 20 to 40 times as much material may be shot as actually becomes part of the finished product. Rushes usually have only natural sound. Actors are only sometimes present. So very little if any information is encoded in speech. Rushes contain many frames or sequences of frames that are highly repetitive, e.g., many takes of the same scene redone due to errors (e.g. an actor gets his lines wrong, a plane flies over, etc.), long segments in which the camera is fixed on a given scene or barely moving, etc. A significant part of the material might qualify as stock footage - reusable shots of people, objects, events, locations, etc. Rushes may share some characteristics with "ground reconnaissance" video.

The system task in rushes summarization will be, given a video from the rushes test collection, to automatically create an MPEG-1 summary clip less than or equal to a maximum duration (to be determined) that shows the main objects (animate and inanimate) and events in the rushes video to be summarized. The summary should minimize the number of frames used and present the information in ways that maximize the usability of the summary and speed of objects/event recognition.
Such a summary could be returned with each video found by a video search engine much as text search engines return short lists of keywords (in context) for each document found - to help the searcher decide whether to explore a given item further without viewing the whole item. It might be input to a larger system for filtering, exploring and managing rushes data.
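
Since rushes are so repetitive, one naive way to approach the task is to drop anything that looks like something already kept. The sketch below does this greedily at the frame level, under a frame budget standing in for the maximum duration. It is only an illustration of the idea, not the method any group actually uses; the frame representation, the distance measure, and the `min_distance` value are all assumptions.

```python
def frame_distance(a, b):
    """Mean absolute pixel difference between two equal-length frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def summarize(frames, max_frames, min_distance=20.0):
    """Greedily keep indices of frames that differ enough from every kept frame."""
    kept = []
    for index, frame in enumerate(frames):
        if len(kept) >= max_frames:
            break  # frame budget (stand-in for the maximum duration) is used up
        if all(frame_distance(frame, frames[k]) >= min_distance for k in kept):
            kept.append(index)
    return kept


# Synthetic rushes: the same three-frame take shot twice, then one new frame
take = [[10] * 16, [12] * 16, [200] * 16]
rushes = take + take + [[100] * 16]

summary = summarize(rushes, max_frames=5)
short_summary = summarize(rushes, max_frames=2)
print(summary, short_summary)
```

The second take is skipped entirely because every one of its frames is close to a frame that was already kept.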

This year NII plans to take part in TRECVID 2007 in two main tasks: High Level Feature Extraction and BBC Rushes Summarization. This is also my post-doc work. The submission deadline for BBC Rushes Summarization is 11 May, earlier than in previous years because there will be a workshop for this task at the ACM Multimedia conference in October. For the HLF task, the deadline is 10 Aug. TRECVID's annual workshop is in early November at NIST, Maryland, USA.

Last year I took part in TRECVID and got a fairly clear picture of the state of the art in each task. Last year's result on the HLF task (coded and tested in two weeks) only reached about the median (that is, near the top of the bottom half of the ranking :-) ). Hopefully this year, with more time, the result will be better.

Lê Đình Duy

Read the full post at http://ledduy.blogspot.com/2009/07/trecvid-2007-introduction.html