Tutorial CVPR10

Human-centered Vision Systems:

Context-driven Algorithm Design

Location: San Francisco, USA

June 14, 2010

Nicu Sebe, University of Trento, Italy

Hamid Aghajan, Stanford University, USA


This tutorial aims to offer its audience a new perspective on the opportunities for employing contextual data in human-centric 
video-based applications. The tutorial covers topics and case studies in algorithm design for different applications in smart 
environments and examines the use of contextual data in various forms to bring efficiency and reliability to the vision processing 
operation, or adaptation to user preferences. Examples of inferring the user's activity, facial expression, eye gaze, gesture, emotion, 
and intention, as well as object recognition based on user interactions, are used to support the presented topics.
Multi-camera systems and multimodal sensor networks have recently been studied as the physical embodiments of data acquisition 
and processing systems for creating novel smart environments and ambient intelligence applications. Areas such as smart homes, 
elderly and patient care, human-computer interaction, ambience and lighting control, comfort and well-being, multimedia and 
gaming, and avatar-based social networks have been discussed as examples of such emerging applications. These applications 
share two key characteristics: they are enabled by real-time processing of video (and other sensor data), and their objective is to serve 
a human (an individual user). The convergence of these two characteristics offers a unique opportunity for algorithm designers to employ 
a context-driven approach to vision-based inference. A research area rich in potential thus emerges when real-time 
vision processing demands meet access to contextual data acquired in various possible ways, or accumulated over time from the 
environment and the human user. This tutorial covers key components and methods in the design of a human-centered vision system 
that employs context as a resource. 
The course focuses on four aspects of context-driven information fusion in video processing for human-centric applications: 
(1) Sources of contextual information and case studies in multi-camera networks; 
(2) Interfacing vision processing with high-level data fusion to build up knowledge bases and behavior models; 
(3) Human pose, gaze, activity, facial expression, preferences, behavior modeling, and user feedback as sources of human-centric context; 
(4) Case studies of incorporating vision-based activity and expression recognition algorithms into adaptive systems that learn user preferences and adjust their services accordingly. 
The course topics and case studies are supported by a large collection of implemented examples covering 
various layers of processing, from early vision extraction, to intermediate soft decisions in multi-camera processing or latent-space 
activity recognition, to high-level inference of semantics based on visual cues.
The tutorial will provide the participants with an understanding of the key concepts, state-of-the-art techniques, new application 
opportunities, and open issues in the areas described above. The course is organized into the following syllabus:
(1) Introduction and motivation
(2) Use of context in video processing
(3) Interface of vision and high-level inference
(4) Human-centric inference
(5) Adaptive systems
(6) Conclusions and new frontiers
The short course is intended for PhD students, scientists, engineers, application developers, computer vision specialists, and others interested in the 
areas of information retrieval and human-computer interaction. A basic understanding of image processing and machine learning is a prerequisite.