ImagEVAL assesses the image-processing technology needed to sort, find and describe still images held in large databases. The assessment focuses on features reflecting how collection holders expect images to be used, as described by a panel of participants from the defence, industrial and cultural sectors.
Description of the tasks
Detailed description of the tasks [pdf]
Presentation of ImagEVAL during the ImageCLEF workshop [pdf] (Alicante, 19/09/2006)
Presentation at the TechnoVision meeting (in French), March 2006 [pdf]
RECOGNISING TRANSFORMED IMAGES
The invariance of indexing techniques to certain geometric or chromatic transformations is fundamental to protecting the intellectual property of visual content and to sorting multimedia streams. The aim is to find all images in the database that are derived from the queried original image. A set of N images was automatically transformed using various algorithms:
To make the task more complex, combinations of transformations were used. The training set is composed of about 4,500 transformed images and the official database of about 45,000 images (N=2,500). Two tasks were organized:
Queries: 50 queries were proposed for the first task and 60 for the second. The images were chosen to represent the diversity of the database. To make the task harder, visually similar images were also included in the database.
COMBINED TEXT/IMAGE SEARCH
Many published images are accompanied by text (internet pages, for example). Such images play an illustrative role and fall within a semantic context that can be analysed through a linguistic study of the textual data. The aim is to assess how text and image techniques work together to improve similarity-based image search, within the framework of information retrieval on text/image data. As this task is “search on the web” oriented, the database was created by extracting pages from the Web, mainly from Wikipedia for copyright reasons. The web pages were found using classical search engines (Google and AllTheWeb), and an automatic segmentation of each page into text and images was performed. For this first edition, the web pages are in French. The link between each image and its position in the text is kept. The database is composed of a list of 700 URLs and the corresponding text and image files. Participants were also invited to use their own web-page segmentation tools. Using Wikipedia, we focused on more “encyclopaedic” and “picturable” topics:
The goal of the task is to find all images answering a query composed of keywords and a few positive example images. Queries: a query is a combination of keywords (for instance “Tour Eiffel”) and a few relevant images (which did not come from the database). 25 queries have been selected: bee, avocado, tennis ball, lemon, ladybird, Ethiopian flag, European flag, Picasso's Guernica, la Joconde (the Mona Lisa), lava flow, Delacroix's “Liberté guidant le peuple”, Great Wall of China, Percé Rock, clownfish, Siamese cat, tennis court, Ayers Rock, zebra, Eiffel Tower, Statue of Liberty, Niagara Falls, teddy bear, screwdriver, poplar tree, map of Norway. MAP is the principal metric.
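One common way to make text and image techniques work together, as this task evaluates, is late fusion: each candidate image receives a textual relevance score and a visual similarity score, and the two are combined before ranking. The sketch below is a hypothetical illustration; the function name and the weight `alpha` are assumptions, not part of the campaign protocol:

```python
def fuse_scores(text_scores, image_scores, alpha=0.5):
    """Rank images by a weighted sum of textual and visual scores.

    text_scores / image_scores map image ids to similarity values in [0, 1].
    alpha is an illustrative weight; a real system would tune it on
    training queries. Returns image ids, best first.
    """
    keys = set(text_scores) | set(image_scores)
    fused = {k: alpha * text_scores.get(k, 0.0)
                + (1 - alpha) * image_scores.get(k, 0.0)
             for k in keys}
    return sorted(fused, key=fused.get, reverse=True)
```

For example, an image whose surrounding text barely matches the keywords can still rank first if it is visually very close to the positive example images.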
DETECTING TEXTUAL ZONES IN AN IMAGE
Textual data can sometimes be found within an image or superimposed on it. This text is an important source of information for identifying people, places or, on a wider scale, the context of the image. The aim of this task is to locate, extract and then identify textual elements in an image. Because of the very high heterogeneity of the database (old postcards and news photos), this task proved more difficult than expected; we believe the challenge posed by ImagEVAL was greater than in a specialised campaign like ICDAR. The task focused only on text-area detection, not on the recognition of the character strings. For this task, the choice of metric was a hard problem, which can be seen as a classical object-detection evaluation problem. The principal metric retained was proposed by Christian Wolf and Jean-Michel Jolion [LIRIS INSA/CNRS]. A complementary metric is the more classical Precision and Recall, adapted from ICDAR 2003. The Wolf and Jolion metric enables a better evaluation of over- and under-segmentation problems. The main idea is to consider different types of matching between ground-truth bounding boxes and participants' bounding boxes: one-to-one, one-to-many and many-to-one, as presented in the following figure:
See the DetEVAL web pages for information and available tools for this metric. The database is composed of 500 images. Each image contains a legend (postcard) or a text area that is part of the scene.
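As an illustration of this kind of box-level matching, the sketch below tests a one-to-one match between a ground-truth box and a detected box using area-recall and area-precision thresholds. The threshold values and function names are illustrative assumptions, not the exact DetEVAL implementation (which also handles the one-to-many and many-to-one cases):

```python
def inter_area(a, b):
    """Intersection area of two axis-aligned (x1, y1, x2, y2) boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

def match_one_to_one(gt, det, tr=0.8, tp=0.4):
    """One-to-one match: the detection covers enough of the ground truth
    (area recall >= tr) without being too loose (area precision >= tp).
    tr and tp are illustrative threshold values."""
    inter = inter_area(gt, det)
    return inter / area(gt) >= tr and inter / area(det) >= tp
```

Counting one-to-many matches (one ground-truth box split across several detections) with a partial reward is what lets the metric penalise over-segmentation less brutally than strict one-to-one Precision/Recall.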
DETECTING OBJECTS
Identifying objects in any sort of image is at the heart of most technological breakthroughs in image processing, as the technique applies to every area where image processing is involved (defence, civil, retail, surveillance, etc.). In this task, the participant must recognise, in a photographic database, the photos that represent a given object, by modelling the object from an image base representing it. This task therefore involves object detection using a limited learning database. Ten objects or classes of objects are considered:
The main objective of the task is to evaluate a system's ability to detect a particular object (American flag) or a class of objects (car) using limited learning data. Thus, the dictionary database is composed of about 100 images per entity. Participants must use only these data for the first run, but may use complementary data (from the Internet or personal image collections) for other runs. The learning database is composed of 743 images:
The test database is composed of 14,000 images with high but realistic complexity (scales, compositions, occlusions, etc.), mainly coming from the HACHETTE Photos database [see]. Each image contains one, several or no objects. About 5,000 images containing none of the ten objects complete the database.
RECOGNISING ATTRIBUTES
The automatic classification of images is an important challenge. Analysing an image makes it possible to recognise certain highly useful attributes, so that images can be sorted or classified in a fully automatic way. In this task, participants must identify all images containing certain attributes. These attributes relate to the very nature of the image as well as to its context and composition. This task can be seen as a classification or automatic annotation task; the semantics to be extracted from unannotated images involve the nature and the context of the image. Ten attributes are considered:
The attributes are organized in a shallow semantic tree:
The first run must return results obtained using only the given learning images, but other runs may use any supplementary data. The queries are an attribute or a series of attributes. MAP and Recall/Precision measures are used for the evaluation. The learning database is composed of 5,474 images:
A ground-truth file is provided, giving the membership of each image in the different attributes. Queries: 13 queries have been proposed:
PROTOCOLS AND ORGANISATION
ImagEVAL is organized in 3 steps:
The main metric (for ranking) is the MEAN AVERAGE PRECISION. For each task in ImagEVAL, the protocol, organisation and listing of the different files are explained in the document ImagEVAL_info_tasks_eng.pdf.
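Mean Average Precision is the mean, over all queries, of the average of the precision values at each rank where a relevant image appears. A minimal sketch (function and variable names are illustrative):

```python
def average_precision(ranked, relevant):
    """AP for one query: mean precision at each rank holding a relevant image."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over several queries; each run is (ranked_list, relevant_set)."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For instance, a system returning a relevant image at ranks 1 and 3 out of 2 relevant images gets an AP of (1/1 + 2/3) / 2 ≈ 0.83 for that query.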