Nicu Sebe, University of Trento, Italy
sebe@disi.unitn.it
Hamid Aghajan, Stanford University, USA
aghajan@stanford.edu
SYNOPSIS
This tutorial aims to offer its audience a new perspective on the opportunities for employing contextual data in human-centric
video-based applications. The tutorial covers topics and case studies in algorithm design for different applications in smart
environments and examines the use of contextual data in various forms to improve the efficiency and reliability of vision processing
or to adapt to user preferences. Examples of inferring the user’s activity, facial expression, eye gaze, gesture, emotion,
and intention, as well as object recognition based on user interactions, support the presented topics.
MOTIVATION
Multi-camera systems and multimodal sensor networks have recently been studied as the physical embodiments of data acquisition
and processing systems for creating novel smart environments and ambient intelligence applications. Areas such as smart homes,
elderly and patient care, human-computer interaction, ambience and lighting control, comfort and well-being, multimedia and
gaming, and avatar-based social networks have been discussed as examples of such emerging applications. These applications
share two key characteristics: they are enabled by real-time video (and other sensor data) processing and their objective is to serve
a human (an individual user). The convergence of these two characteristics offers a unique opportunity for algorithm designers to employ
a context-driven approach to vision-based inference. A research area rich with potential thus emerges where the demands of real-time
vision processing meet access to contextual data, acquired in various ways or accumulated over time from the
environment and the human user. This tutorial covers key components and methods in the design of a human-centered vision system
that employs context as a resource.
The course focuses on four aspects of context-driven information fusion in video processing for human-centric applications (a minimal fusion sketch follows the list):
(1) Sources of contextual information and case studies in multi-camera networks;
(2) Interfacing vision processing with high-level data fusion to build up knowledge bases and behavior models;
(3) Human pose, gaze, activity, facial expression, preferences, behavior modeling, and user feedback as sources of
human-centric context;
(4) Case studies of incorporating vision-based activity and expression recognition algorithms into adaptive systems that
learn user preferences and adjust their services accordingly.
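As a minimal illustration of the fusion idea above, the sketch below combines a soft vision-based activity estimate with a contextual prior via Bayes' rule; the activity labels, likelihood values, and prior values are hypothetical assumptions, not drawn from the course material.

    import numpy as np

    # Hypothetical activity labels for a smart-home setting.
    ACTIVITIES = ["cooking", "watching_tv", "sleeping"]

    def fuse(vision_likelihood, context_prior):
        # Bayes' rule: posterior is proportional to likelihood times prior.
        posterior = np.asarray(vision_likelihood) * np.asarray(context_prior)
        return posterior / posterior.sum()

    # Soft decision from a hypothetical activity classifier: vision alone
    # is ambiguous between cooking and watching TV.
    vision_likelihood = [0.45, 0.40, 0.15]

    # Assumed contextual prior for early evening: cooking is common,
    # sleeping is unlikely.
    context_prior = [0.60, 0.30, 0.10]

    for label, p in zip(ACTIVITIES, fuse(vision_likelihood, context_prior)):
        print(f"{label}: {p:.2f}")

Here the context resolves an ambiguity that the vision module cannot settle on its own, which is the role contextual data plays throughout the case studies.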
The course topics and case studies are supported by a large collection of implemented examples covering
various layers of processing, from early vision feature extraction, through intermediate soft decisions in multi-camera processing and
latent-space activity recognition, to high-level inference of semantics based on visual cues; a minimal sketch of such a layered pipeline follows.
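To make the layering concrete, here is a schematic three-layer pipeline; the feature measure, classifier, and camera data are illustrative assumptions rather than the course's actual implementations.

    import numpy as np

    def extract_features(frame):
        # Early vision layer: a coarse motion-energy measure stands in
        # for real features such as silhouettes, flow, or pose keypoints.
        return float(np.mean(np.abs(frame)))

    def soft_decision(feature):
        # Intermediate layer: a soft probability over {idle, active}
        # rather than a hard label, so that evidence from several
        # cameras can still be combined downstream.
        p_active = 1.0 / (1.0 + np.exp(-10.0 * (feature - 0.5)))
        return np.array([1.0 - p_active, p_active])

    def infer_semantics(per_camera_decisions):
        # High-level layer: pool the per-camera soft decisions and map
        # the result to a semantic label.
        fused = np.mean(per_camera_decisions, axis=0)
        return ("active" if fused[1] > fused[0] else "idle"), fused

    # Two hypothetical frame-difference images from a two-camera setup.
    frames = [np.random.rand(4, 4) * 0.2, np.random.rand(4, 4) * 0.9]
    label, fused = infer_semantics(
        [soft_decision(extract_features(f)) for f in frames])
    print(label, fused)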
BENEFITS & LIST OF TOPICS
The tutorial will provide the participants with an understanding of the key concepts, state-of-the-art techniques, new application
opportunities, and open issues in the areas described above. The course is organized into the following syllabus:
(1) Introduction and motivation;
(2) Use of context in video processing;
(3) Interface of vision and high-level inference;
(4) Human-centric inference;
(5) Adaptive systems;
(6) Conclusions and new frontiers.
INTENDED AUDIENCE
The short course is intended for PhD students, scientists, engineers, application developers, computer vision specialists, and others interested in the
areas of information retrieval and human-computer interaction. A basic understanding of image processing and machine learning is a prerequisite.