Selective Search for Object Localisation
Segmentation traditionally tries to find a single partitioning of the image into its unique objects before any recognition. As this is extremely hard if not impossible (see figure below), researchers resorted to localising objects through recognition by performing an exhaustive search within the image (i.e. the sliding-window method). But this ignores all useful information in low-level cues. Therefore we propose to combine the best of both worlds into a data-driven Selective Search: we exploit the structure of the image as in segmentation, and we aim to generate all possible object locations as in exhaustive search.
We propose to diversify the sampling techniques to account for as many image conditions as possible:
- We use a hierarchical grouping to deal with all possible object scales
- We use a diverse set of grouping strategies and vary:
  - The colour space of the image, to deal with different invariance properties
  - The region-based similarity function, to deal with the diverse nature of objects. In particular we use as similarities colour, texture, size, and/or a measure of insideness.
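The hierarchical grouping above can be sketched as a greedy agglomerative merge: start from an initial over-segmentation, repeatedly merge the most similar pair of regions, and record the bounding box of every intermediate region as a candidate object location. The sketch below is a minimal illustration, not the released Matlab implementation; the region representation (box plus a normalised colour histogram) and the histogram-intersection similarity are simplified assumptions standing in for the full set of similarity measures.

```python
import itertools

def merge_boxes(a, b):
    # Union of two bounding boxes given as (x1, y1, x2, y2).
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def similarity(r1, r2):
    # Toy colour similarity: histogram intersection of normalised histograms.
    return sum(min(p, q) for p, q in zip(r1["hist"], r2["hist"]))

def hierarchical_grouping(regions):
    """Greedily merge the most similar region pair; every intermediate
    region's bounding box becomes a candidate object location."""
    regions = list(regions)
    proposals = [r["box"] for r in regions]
    while len(regions) > 1:
        # Find the most similar pair of current regions.
        i, j = max(itertools.combinations(range(len(regions)), 2),
                   key=lambda ij: similarity(regions[ij[0]], regions[ij[1]]))
        merged = {
            "box": merge_boxes(regions[i]["box"], regions[j]["box"]),
            "hist": [(p + q) / 2
                     for p, q in zip(regions[i]["hist"], regions[j]["hist"])],
        }
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        proposals.append(merged["box"])
    return proposals
```

Starting from n initial regions this yields 2n-1 candidate boxes, covering object scales from the initial segments up to the whole image; running the grouping under several colour spaces and similarity functions, as described above, diversifies the candidate set further.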
The final algorithm is fast and accurate: within 4 seconds it can generate 2,134 boxes with an Average Best Pascal Overlap score of 0.804. The small set of good quality boxes allows us to do object localisation using Bag-of-Words. With this system we won the ImageNet Large Scale Detection challenge 2011 and the Pascal VOC Detection Challenge 2012.
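The Average Best Overlap score quoted above averages, over all annotated objects, the Pascal overlap (intersection-over-union) of the best-matching generated box. A minimal sketch of this evaluation, with boxes assumed to be (x1, y1, x2, y2) tuples:

```python
def pascal_overlap(a, b):
    # Pascal overlap (intersection-over-union) of two boxes (x1, y1, x2, y2).
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def average_best_overlap(ground_truth, proposals):
    # For each annotated object, take the best-overlapping proposal; average.
    return sum(max(pascal_overlap(g, p) for p in proposals)
               for g in ground_truth) / len(ground_truth)
```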
Software
- Download Matlab (p)code for Selective Search. This is the updated version from our journal paper and yields slightly better results than the old version; it may also return more or fewer boxes than the previous version. Read and run demo.m and demoPascal2007.m for instructions.
- The deprecated Selective Search code can be found here. Sample boxes generated with this code are available for Pascal VOC 2007 trainval and 2007 test.
Papers
Main
[2] | J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, A.W.M. Smeulders, "Selective Search for Object Recognition", In International Journal of Computer Vision, 2013. [bibtex] [pdf] [doi] |
[1] | K.E.A. van de Sande, J.R.R. Uijlings, T. Gevers, A.W.M. Smeulders, "Segmentation as Selective Search for Object Recognition", In ICCV, 2011. [bibtex] [pdf] [doi] |
Used in
[5] | V. Yanulevskaya, J.R.R. Uijlings, J.M. Geusebroek, "Salient object detection: From pixels to segments", In Image and Vision Computing, 2013. [bibtex] [pdf] [doi] |
[4] | V. Yanulevskaya, J.R.R. Uijlings, J.M. Geusebroek, N. Sebe, "A proto-object-based computational model for visual saliency", In Journal of Vision, 2013. (to appear) [bibtex] |
[3] | D.T. Le, J.R.R. Uijlings, R. Bernardi, "Exploiting Language Models for Visual Recognition", In EMNLP, 2013. [bibtex] [pdf] |
[2] | D.T. Le, R. Bernardi, J.R.R. Uijlings, "Exploiting Language Models to Recognize Unseen Actions", In ICMR, 2013. [bibtex] [pdf] [doi] |
[1] | J. Stöttinger, J.R.R. Uijlings, A.K. Pandey, N. Sebe, F. Giunchiglia, "(Unseen) Event Recognition via Semantic Compositionality", In CVPR, 2012. [bibtex] [pdf] [doi] |
Builds upon
[1] | J.R.R. Uijlings, A.W.M. Smeulders, R.J.H. Scha, "Real-Time Visual Concept Classification", In IEEE Transactions on Multimedia, 2010. [bibtex] [pdf] [doi] |
The Visual Extent of an Object
While Bag-of-Words is widely used, its exact workings are less understood. In this project we perform a theoretical investigation on the visual extent of an object and on the role of context. For this analysis, we develop a technique to backproject the classification evidence of the Bag-of-Words method back into the image to measure and visualize how this method classifies images. Additionally, we create a confusion matrix for Average Precision. Using these tools, we perform our theoretical investigation from two angles: (a) Not knowing the object location, we determine where in the image support for object classification resides. (b) Assuming an ideal box around the object we evaluate the relative contribution of the object interior, object border, and surround.
In (a) we find that the surroundings contribute significantly to object classification, while for boat the object area even contributes negatively. In (b) we find that the surroundings no longer contribute, confirming a long-standing fact in psychology. Unsurprisingly, comparing (a) and (b), we find that good object localisation offers a considerable gain in accuracy.
Additionally, we varied the amount of context around each object to measure its visual extent. We found that an object's visual extent is determined by its category: well-defined rigid objects have the object itself as the preferred spatial extent; non-rigid objects have an unbounded spatial extent, where all spatial extents produce equally good results; and objects primarily categorised by their function have the whole image as their spatial extent.
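The backprojection technique described above exploits the fact that a linear Bag-of-Words classifier score decomposes into a sum of contributions from individual local descriptors: each descriptor is assigned to its nearest visual word, and that word's classifier weight can be written back at the descriptor's image position to form a per-pixel evidence map. The sketch below illustrates this decomposition under simplified assumptions (a tiny codebook, one descriptor per position, and a plain linear weight per word); it is not the implementation used in the papers.

```python
import numpy as np

def backproject_evidence(descriptors, positions, codebook, word_weights, image_shape):
    """Write each local descriptor's contribution to a linear Bag-of-Words
    classifier score back at its (y, x) image position."""
    evidence = np.zeros(image_shape)
    for d, (y, x) in zip(descriptors, positions):
        # Hard-assign the descriptor to its nearest visual word.
        word = np.argmin(np.linalg.norm(codebook - d, axis=1))
        # The word's classifier weight is this descriptor's evidence.
        evidence[y, x] += word_weights[word]
    return evidence
```

Summing the resulting map recovers (up to the bias term) the classifier score, so the map shows exactly where in the image the support for a classification decision resides.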
Papers
Main
[2] | J.R.R. Uijlings, A.W.M. Smeulders, R.J.H. Scha, "The Visual Extent of an Object", In International Journal of Computer Vision, 2012. [bibtex] [pdf] [doi] |
[1] | J.R.R. Uijlings, A.W.M. Smeulders, R.J.H. Scha, "What is the Spatial Extent of an Object?", In CVPR, 2009. [bibtex] [pdf] [doi] |
Used in
[1] | J.R.R. Uijlings, A.W.M. Smeulders, "Visualising Bag-of-Words", In demo at ICCV, 2011. [bibtex] [pdf] |
Builds upon
[1] | J.R.R. Uijlings, A.W.M. Smeulders, R.J.H. Scha, "Real-Time Visual Concept Classification", In IEEE Transactions on Multimedia, 2010. [bibtex] [pdf] [doi] |
Action/Event Recognition using Language Models
In human action recognition and event recognition, one problem is that the number of actions and events is staggering. Each object can be manipulated using many verbs, resulting in a high number of possible human actions. There are already many words describing events, and adjectives can modify them further: for example, an Indian wedding is (visually) different from a European wedding. So the number of both actions and events is enormous.
Most visual recognition systems need visual training examples for all classes, which requires a prohibitive human annotation effort. In this project we instead propose to perform visual recognition on the individual components of actions/events, and to use other sources to learn how actions and events are recognised through their components.
In our ICMR paper we aim to recognise human actions through visual recognition and localisation of an object, and learn from language the most plausible action for each object. We created a new dataset annotating human actions for Pascal VOC 2007, resulting in an action dataset which is restricted to the 20 object categories but unbiased in terms of the frequency of actions that occur with a single object (unlike most action recognition datasets, which try to collect an equal number of examples per category).
In this framework we compared the part-based visual recognition model of Felzenszwalb et al. with our own Selective Search based Bag-of-Words recognition model and found that ours worked better. Additionally, we compared two language models, LDA-R and TypeDM, and found that TypeDM gives the best results. Finally, we show that the combination of localised objects and the language model yields better results than a state-of-the-art Bag-of-Words implementation.
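The combination of localised objects and a language model can be sketched as follows: the detector supplies a confidence per localised object class, the language model supplies a plausibility p(action | object), and each action is scored by its best-supporting object. This is a hedged illustration only; the combination rule (a product, taking the maximum over objects) and the toy scores are assumptions, not the exact formulation of the papers.

```python
def score_actions(object_scores, plausibility):
    """Score actions by combining visual object detection confidence with a
    language-model plausibility p(action | object).

    object_scores: {object class: detector confidence}
    plausibility:  {object class: {action: p(action | object)}}
    """
    scores = {}
    for obj, conf in object_scores.items():
        for action, p in plausibility.get(obj, {}).items():
            # Each action keeps the score of its best-supporting object.
            scores[action] = max(scores.get(action, 0.0), conf * p)
    return scores
```

With this decomposition, an action never seen in visual training data can still be recognised, as long as its object is detectable and the language model deems the action plausible for that object.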
In our CVPR paper we annotated events for the Pascal VOC 2007 dataset using Faceted Analysis Synthesis Theory, developed in Library and Information Science to organise vast collections of knowledge. The resulting events are perpetually genuine and can be viewed as a subset of universal knowledge. We show the promise of a compositional approach and show that it gives reasonable results for unseen event recognition.
Papers
Main
[3] | D.T. Le, J.R.R. Uijlings, R. Bernardi, "Exploiting Language Models for Visual Recognition", In EMNLP, 2013. [bibtex] [pdf] |
[2] | D.T. Le, R. Bernardi, J.R.R. Uijlings, "Exploiting Language Models to Recognize Unseen Actions", In ICMR, 2013. [bibtex] [pdf] [doi] |
[1] | J. Stöttinger, J.R.R. Uijlings, A.K. Pandey, N. Sebe, F. Giunchiglia, "(Unseen) Event Recognition via Semantic Compositionality", In CVPR, 2012. [bibtex] [pdf] [doi] |
Builds upon
[1] | J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, A.W.M. Smeulders, "Selective Search for Object Recognition", In International Journal of Computer Vision, 2013. [bibtex] [pdf] [doi] |