Nicu Sebe, University of Trento, Italy
sebe@disi.unitn.it
Hamid Aghajan, Stanford University, USA
aghajan@stanford.edu
SYNOPSIS
This tutorial aims to offer its audience a new perspective on the opportunities for employing contextual data in human-centric
video-based applications. The tutorial covers topics and case studies in algorithm design for different applications in smart
environments and examines the use of contextual data in various forms to improve the efficiency and reliability of vision processing
or to adapt to user preferences. Examples of inferring the user’s activity, facial expression, eye gaze, gesture, emotion,
and intention, as well as object recognition based on user interactions, support the presented topics.
MOTIVATION
Multi-camera systems and multimodal sensor networks have recently been studied as the physical embodiments of data acquisition
and processing systems for creating novel smart environments and ambient intelligence applications. Areas such as smart homes,
elderly and patient care, human-computer interaction, ambience and lighting control, comfort and well-being, multimedia and
gaming, and avatar-based social networks have been discussed as examples of such emerging applications. These applications
share two key characteristics: they are enabled by real-time video (and other sensor data) processing and their objective is to serve
a human (an individual user). The convergence of these two characteristics offers a unique opportunity for algorithm designers to employ
a context-driven approach to vision-based inference. A research area rich with potential thus emerges where the demands of real-time
vision processing meet access to contextual data, acquired in various ways or accumulated over time from the
environment and the human user. This tutorial covers key components and methods in the design of a human-centered vision system
that employs context as a resource.
The course focuses on four aspects of context-driven information fusion in video processing for human-centric applications (a minimal fusion sketch follows the list):
(1) Sources of contextual information and case studies in multi-camera networks;
(2) Interfacing vision processing with high-level data fusion to build up knowledge bases and behavior models;
(3) Human pose, gaze, activity, facial expression, preferences, behavior modeling, and user feedback as sources of
human-centric context;
(4) Case studies of incorporating vision-based activity and expression recognition algorithms into adaptive systems that
learn user preferences and adjust their services accordingly.
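As a minimal illustration of the fusion idea above, the sketch below combines a soft vision-based activity estimate with a contextual prior via Bayes' rule; the activity labels, likelihood values, and prior values are hypothetical assumptions, not drawn from the course material.

    import numpy as np

    # Hypothetical activity labels for a smart-home setting.
    ACTIVITIES = ["cooking", "watching_tv", "sleeping"]

    def fuse(vision_likelihood, context_prior):
        # Bayes' rule: posterior is proportional to likelihood times prior.
        posterior = np.asarray(vision_likelihood) * np.asarray(context_prior)
        return posterior / posterior.sum()

    # Soft decision from a hypothetical activity classifier: vision alone
    # is ambiguous between cooking and watching TV.
    vision_likelihood = [0.45, 0.40, 0.15]

    # Assumed contextual prior for early evening: cooking is common,
    # sleeping is unlikely.
    context_prior = [0.60, 0.30, 0.10]

    for label, p in zip(ACTIVITIES, fuse(vision_likelihood, context_prior)):
        print(f"{label}: {p:.2f}")

Here the context resolves an ambiguity that the vision module cannot settle on its own, which is the role contextual data plays throughout the case studies.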
The course topics and case studies are supported by a large collection of implemented examples covering
various layers of processing, from early vision feature extraction, through intermediate soft decisions in multi-camera processing and
latent-space activity recognition, to high-level inference of semantics based on visual cues; a minimal sketch of such a layered pipeline follows.
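To make the layering concrete, here is a schematic three-layer pipeline; the feature measure, classifier, and camera data are illustrative assumptions rather than the course's actual implementations.

    import numpy as np

    def extract_features(frame):
        # Early vision layer: a coarse motion-energy measure stands in
        # for real features such as silhouettes, flow, or pose keypoints.
        return float(np.mean(np.abs(frame)))

    def soft_decision(feature):
        # Intermediate layer: a soft probability over {idle, active}
        # rather than a hard label, so that evidence from several
        # cameras can still be combined downstream.
        p_active = 1.0 / (1.0 + np.exp(-10.0 * (feature - 0.5)))
        return np.array([1.0 - p_active, p_active])

    def infer_semantics(per_camera_decisions):
        # High-level layer: pool the per-camera soft decisions and map
        # the result to a semantic label.
        fused = np.mean(per_camera_decisions, axis=0)
        return ("active" if fused[1] > fused[0] else "idle"), fused

    # Two hypothetical frame-difference images from a two-camera setup.
    frames = [np.random.rand(4, 4) * 0.2, np.random.rand(4, 4) * 0.9]
    label, fused = infer_semantics(
        [soft_decision(extract_features(f)) for f in frames])
    print(label, fused)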
BENEFITS & LIST OF TOPICS
The tutorial will provide the participants with an understanding of the key concepts, state-of-the-art techniques, new application
opportunities, and open issues in the areas described above. The course is organized into the following syllabus:
(1) Introduction and motivation;
(2) Use of context in video processing;
(3) Interface of vision and high-level inference;
(4) Human-centric inference;
(5) Adaptive systems;
(6) Conclusions and new frontiers.
INTENDED AUDIENCE
The short course is intended for PhD students, scientists, engineers, application developers, computer vision specialists, and others interested in the
areas of information retrieval and human-computer interaction. A basic understanding of image processing and machine learning is a prerequisite.