Selective Search for Object Localisation

Multiple strategies are needed to find all objects in the images above. In (a) the spoon is in the salad bowl, which is on the table; an image is therefore intrinsically hierarchical, and we need to consider all scales to find these objects. In (b) the cats can be separated by colour but not by texture, while in (c) the reverse holds for the chameleon. In (d) the wheels are part of the car because they are enclosed by its body, even though they differ wildly from it in both colour and texture.

Segmentation traditionally tries to find a single partitioning of the image into its unique objects before any recognition. As this is extremely hard, if not impossible (see figure below), researchers resorted to localising objects through recognition by performing an exhaustive search within the image (i.e. the sliding-window method). But this ignores all useful information in low-level cues. Therefore we propose to combine the best of both worlds into a data-driven Selective Search: we exploit the structure of the image, as in segmentation, and we aim to generate all possible object locations, as in exhaustive search.

We propose to diversify the sampling techniques to account for as many image conditions as possible:

  • We use a hierarchical grouping to deal with all possible object scales
  • We use a diverse set of grouping strategies and vary:
    • The colour space of the image to deal with different invariance properties
    • Region-based similarity functions to deal with the diverse nature of objects. In particular we use colour, texture, size, and/or a measure of insideness as similarities (a sketch of the resulting grouping loop is shown below).
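
The sketch below illustrates this hierarchical grouping loop in Python, assuming toy regions that carry only a colour histogram, a size, and a bounding box. It is not our released Matlab code: the real method starts from an over-segmentation, merges only adjacent regions, and combines colour, texture, size, and fill similarities, whereas this sketch uses colour plus size over all region pairs.

    import itertools
    import numpy as np

    def similarity(a, b, image_size):
        colour = np.minimum(a["hist"], b["hist"]).sum()      # histogram intersection
        size = 1.0 - (a["size"] + b["size"]) / image_size    # favour merging small regions early
        return colour + size

    def merge(a, b):
        size = a["size"] + b["size"]
        return {
            "size": size,
            "hist": (a["size"] * a["hist"] + b["size"] * b["hist"]) / size,
            "box": (min(a["box"][0], b["box"][0]), min(a["box"][1], b["box"][1]),
                    max(a["box"][2], b["box"][2]), max(a["box"][3], b["box"][3])),
        }

    def hierarchical_grouping(regions, image_size):
        """Greedily merge the most similar pair; every region ever created yields a box."""
        proposals = [r["box"] for r in regions]
        while len(regions) > 1:
            i, j = max(itertools.combinations(range(len(regions)), 2),
                       key=lambda p: similarity(regions[p[0]], regions[p[1]], image_size))
            merged = merge(regions[i], regions[j])
            regions = [r for k, r in enumerate(regions) if k not in (i, j)]
            regions.append(merged)
            proposals.append(merged["box"])
        return proposals

    # Toy example: three 'regions' with 4-bin colour histograms and (x1, y1, x2, y2) boxes.
    regions = [
        {"size": 100, "hist": np.array([0.7, 0.1, 0.1, 0.1]), "box": (0, 0, 10, 10)},
        {"size": 150, "hist": np.array([0.6, 0.2, 0.1, 0.1]), "box": (10, 0, 25, 10)},
        {"size": 400, "hist": np.array([0.1, 0.1, 0.2, 0.6]), "box": (0, 10, 25, 30)},
    ]
    print(hierarchical_grouping(regions, image_size=750))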

The final algorithm is fast and accurate: within 4 seconds it generates 2,134 boxes with an Average Best Pascal Overlap score of 0.804. This small set of good-quality boxes allows us to do object localisation using Bag-of-Words. With this system we won the ImageNet Large Scale Detection challenge 2011 and the Pascal VOC Detection Challenge 2012.

Trade-off between the number of locations and Average Best Overlap.
Examples of locations found by our Selective Search algorithm (in red, green is ground truth).

Software

  • Download Matlab (p)code for Selective Search. This is the updated version from our journal paper and yields slightly better results than the old version. It can also generate more or fewer boxes than the previous version. Read and run demo.m and demoPascal2007.m for instructions.
  • The deprecated selective search code can be found here. Sample boxes using this code are available for Pascal VOC 2007 trainval and 2007 test.
  • The code can also be downloaded from the homepage of Koen van de Sande.

Papers

Main

[2]J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, A.W.M. Smeulders, "Selective Search for Object Recognition", In International Journal of Computer Vision, 2013. [bibtex] [pdf] [doi]
[1]K.E.A. van de Sande, J.R.R. Uijlings, T. Gevers, A.W.M. Smeulders, "Segmentation as Selective Search for Object Recognition", In ICCV, 2011. [bibtex] [pdf] [doi]

Used in

[5]V. Yanulevskaya, J.R.R. Uijlings, J.M. Geusebroek, "Salient object detection: From pixels to segments", In Image and Vision Computing, 2013. [bibtex] [pdf] [doi]
[4]V. Yanulevskaya, J.R.R. Uijlings, J.M. Geusebroek, N. Sebe, "A proto-object-based computational model for visual saliency", In Journal of Vision, 2013. (to appear) [bibtex]
[3]D.T. Le, J.R.R. Uijlings, R. Bernardi, "Exploiting Language Models for Visual Recognition", In EMNLP, 2013. [bibtex] [pdf]
[2]D.T. Le, R. Bernardi, J.R.R. Uijlings, "Exploiting Language Models to Recognize Unseen Actions", In ICMR, 2013. [bibtex] [pdf] [doi]
[1]J. Stöttinger, J.R.R. Uijlings, A.K. Pandey, N. Sebe, F. Giunchiglia, "(Unseen) Event Recognition via Semantic Compositionality", In CVPR, 2012. [bibtex] [pdf] [doi]

Builds upon

[1]J.R.R. Uijlings, A.W.M. Smeulders, R.J.H. Scha, "Real-Time Visual Concept Classification", In IEEE Transactions on Multimedia, 2010. [bibtex] [pdf] [doi]

The Visual Extent of an Object

Visualising where the classification evidence resides for 'cat'. Yellow means strong positive evidence, blue means strong negative evidence, and grey is neutral.

While Bag-of-Words is widely used, its exact workings are less well understood. In this project we investigate the visual extent of an object and the role of context. For this analysis, we develop a technique to project the classification evidence of the Bag-of-Words method back into the image, to measure and visualise how this method classifies images. Additionally, we create a confusion matrix for Average Precision. Using these tools, we perform our investigation from two angles: (a) not knowing the object location, we determine where in the image the support for object classification resides; (b) assuming an ideal box around the object, we evaluate the relative contribution of the object interior, object border, and surround.
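
As a rough illustration of the backprojection idea (not the exact implementation from the paper), the Python sketch below assumes a linear SVM on an L1-normalised Bag-of-Words histogram: the image-level score then decomposes into per-descriptor contributions, which can be accumulated at the descriptor locations to form an evidence map.

    import numpy as np

    def backproject(points, words, svm_w, svm_b, image_shape):
        """points: (N, 2) pixel coords; words: (N,) visual-word index per descriptor."""
        n = len(words)
        hist = np.bincount(words, minlength=len(svm_w)) / n       # BoW histogram
        score = float(svm_w @ hist + svm_b)                       # image-level classifier score
        evidence = np.zeros(image_shape)
        for (x, y), w in zip(points, words):
            evidence[y, x] += svm_w[w] / n                         # per-descriptor share of the score
        return score, evidence                                     # sum(evidence) + b == score

    # Toy example: 1000 densely sampled descriptors, a 50-word vocabulary (all values made up).
    rng = np.random.default_rng(0)
    pts = rng.integers(0, 100, size=(1000, 2))
    wrd = rng.integers(0, 50, size=1000)
    w, b = rng.normal(size=50), -0.1
    s, ev = backproject(pts, wrd, w, b, (100, 100))
    print(s, ev.sum() + b)   # the evidence map sums back to the classifier score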

In (a) we find that the surroundings contribute significantly to object classification, whereas for boat the object area itself contributes negatively. In (b) we find that the surroundings no longer contribute, confirming a long-standing finding in psychology. Unsurprisingly, comparing (a) and (b) shows that good object localisation offers a considerable gain in accuracy.

Additionally, we varied the amount of context around each object to measure its visual extent. We found that the visual extent of an object is determined by its category: well-defined rigid objects have the object itself as the preferred spatial extent; non-rigid objects have an unbounded spatial extent, where all spatial extents produce equally good results; and objects primarily categorised by their function have the whole image as their spatial extent.

Visualising where the classification evidence resides for 'boat'. Yellow means strong positive evidence, blue means strong negative evidence, and grey is neutral. As can be seen, the water is the strongest indicator for boat while sometimes large parts of the boat itself are negative.
Visualising per class the average precision when taking only information from the object or from its surround.

Papers

Main

[2]J.R.R. Uijlings, A.W.M. Smeulders, R.J.H. Scha, "The Visual Extent of an Object", In International Journal of Computer Vision, 2012. [bibtex] [pdf] [doi]
[1]J.R.R. Uijlings, A.W.M. Smeulders, R.J.H. Scha, "What is the Spatial Extent of an Object?", In CVPR, 2009. [bibtex] [pdf] [doi]

Used in

[1]J.R.R. Uijlings, A.W.M. Smeulders, "Visualising Bag-of-Words", In demo at ICCV, 2011. [bibtex] [pdf]

Builds upon

[1]J.R.R. Uijlings, A.W.M. Smeulders, R.J.H. Scha, "Real-Time Visual Concept Classification", In IEEE Transactions on Multimedia, 2010. [bibtex] [pdf] [doi]

Action/Event Recognition using Language Models

Our framework for human action recognition, which combines a visual model with a language model.

In human action recognition and event recognition, one problem is that the number of actions and events is staggering. Each object can be manipulated using many verbs, resulting in a high number of possible human actions. There are already many words describing events, and adjectives can modify them further: an Indian wedding, for example, is (visually) different from a European wedding. So both the number of actions and the number of events are enormous.

Most visual recognition systems need visual training examples for all classes, which requires a prohibitive human annotation effort. In this project we instead propose to perform visual recognition on the individual components of the actions/events, and to use other sources to learn how actions and events are recognised through their components.

In our ICMR paper we aim to recognise human actions by visual recognition and localisation of an object, and learn from language the most plausible action for each object. We created a new dataset annotating human actions for Pascal VOC 2007, resulting in an action dataset which is restricted to the 20 object categories, but which is unbiased in terms of the frequency of actions that occur with a single object (unlike most action recognition datasets, which try to collect an equal number of examples per category).

In this framework we compared the part-based visual recognition model of Felzenszwalb et al. with our own Selective Search based Bag-of-Words recognition model and found that ours worked better. Additionally, we compared two language models, LDA-R and TypeDM, and found that TypeDM gave the best results. Finally, we show that the combination of localised objects and the language model yields better results than a state-of-the-art Bag-of-Words implementation.
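
The fusion itself can be summarised with a small Python sketch, shown here with made-up numbers: the visual model supplies P(object | image) from the localised detections, the language model supplies P(verb | object) from text statistics, and marginalising over objects gives a plausibility per action. The object and verb lists below are purely illustrative.

    import numpy as np

    objects = ["horse", "bicycle", "sofa"]
    verbs = ["ride", "sit_on", "repair"]

    # Visual model: detection confidence per object in the image (hypothetical values).
    p_object = np.array([0.7, 0.2, 0.1])

    # Language model: rows = objects, cols = verbs, P(verb | object) from text statistics.
    p_verb_given_object = np.array([
        [0.80, 0.15, 0.05],   # horse
        [0.55, 0.10, 0.35],   # bicycle
        [0.05, 0.90, 0.05],   # sofa
    ])

    # Marginalise over objects: P(verb | image) = sum_o P(verb | o) P(o | image).
    p_verb = p_object @ p_verb_given_object
    for v, p in zip(verbs, p_verb):
        print(f"{v}: {p:.3f}")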

In our CVPR paper we annotated events for the Pascal VOC 2007 dataset using Faceted Analysis Synthesis Theory, developed in Library and Information Science to organise vast collections of knowledge. The resulting events are perpetually genuine and can be viewed as a subset of universal knowledge. We show the promise of a compositional approach and demonstrate that it gives reasonable results for unseen event recognition.

Selective Search plus Bag-of-Words is better than a part-based model in this framework.
For the language model, TypeDM yields better results than ROOTH-LDA in our framework.

Papers

Main

[3]D.T. Le, J.R.R. Uijlings, R. Bernardi, "Exploiting Language Models for Visual Recognition", In EMNLP, 2013. [bibtex] [pdf]
[2]D.T. Le, R. Bernardi, J.R.R. Uijlings, "Exploiting Language Models to Recognize Unseen Actions", In ICMR, 2013. [bibtex] [pdf] [doi]
[1]J. Stöttinger, J.R.R. Uijlings, A.K. Pandey, N. Sebe, F. Giunchiglia, "(Unseen) Event Recognition via Semantic Compositionality", In CVPR, 2012. [bibtex] [pdf] [doi]

Builds upon

[1]J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, A.W.M. Smeulders, "Selective Search for Object Recognition", In International Journal of Computer Vision, 2013. [bibtex] [pdf] [doi]

Proto-object based Saliency

Our framework for proto-object based saliency.

In this project we propose a novel approach to the task of salient object detection. In contrast to previous salient object detectors that are based on a spotlight attention theory, we follow an object-based attention theory and incorporate the notion of an object directly into our saliency measurements. In particular, we consider proto-objects as the units of analysis, where a proto-object is a connected image region that can be converted into a plausible object or object part once the focus of attention reaches it. As the object-based attention theory suggests, we start by segmenting a complex image into proto-objects using the Selective Search methodology and then assess the saliency of each proto-object. The most salient proto-object is taken to be the salient object.

We distinguish two types of object saliency. Firstly, an object is salient if it differs from its surroundings, which we call center-surround saliency. Secondly, an object is salient if it contains rare or outstanding details, which we measure with integrated saliency. We demonstrate that these two types of object saliency have complementary characteristics; moreover, their combination performs at the level of the state of the art in salient object detection.
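
A minimal Python sketch of the two terms is given below, assuming proto-objects come as sets of quantised local features; the chi-square distance, the log-rarity measure, and the simple additive combination used here are simplified stand-ins for the measurements in the papers.

    import numpy as np

    def histogram(words, n_bins):
        h = np.bincount(words, minlength=n_bins).astype(float)
        return h / max(h.sum(), 1.0)

    def centre_surround(region_words, surround_words, n_bins):
        # Salient if the proto-object differs from its surround (chi-square distance).
        p, q = histogram(region_words, n_bins), histogram(surround_words, n_bins)
        return 0.5 * np.sum((p - q) ** 2 / (p + q + 1e-12))

    def integrated(region_words, image_words, n_bins):
        # Salient if the proto-object contains features that are rare in the whole image.
        image_hist = histogram(image_words, n_bins)
        rarity = -np.log(image_hist[region_words] + 1e-12)
        return rarity.mean()

    # Toy example with a 10-word codebook; the proto-object mostly contains 'rare' words.
    rng = np.random.default_rng(1)
    image_words = rng.integers(0, 10, size=5000)
    region_words = rng.integers(8, 10, size=200)
    surround_words = rng.integers(0, 8, size=800)
    cs = centre_surround(region_words, surround_words, n_bins=10)
    ig = integrated(region_words, image_words, n_bins=10)
    print("combined saliency:", cs + ig)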

Examples of our algorithm. On the left is the input image, in the middle the saliency map, and on the right the most salient object determined by our method.

Papers

Main

[2]V. Yanulevskaya, J.R.R. Uijlings, J.M. Geusebroek, "Salient object detection: From pixels to segments", In Image and Vision Computing, 2013. [bibtex] [pdf] [doi]
[1]V. Yanulevskaya, J.R.R. Uijlings, J.M. Geusebroek, N. Sebe, "A proto-object-based computational model for visual saliency", In Journal of Vision, 2013. (to appear) [bibtex]

Builds upon

[1]J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, A.W.M. Smeulders, "Selective Search for Object Recognition", In International Journal of Computer Vision, 2013. [bibtex] [pdf] [doi]

Real-time Bag-of-Words

Recommended pipeline for accurate Bag-of-Words.
Pipeline for real-time Bag-of-Words.

In this project we review techniques to accelerate concept classification and show the trade-off between computational efficiency and accuracy. As a basis we use the Bag-of-Words algorithm, which led to the best performance scores in the 2008 TRECVID and PASCAL benchmarks. We divide the evaluation into three steps: (1) Descriptor Extraction, where we evaluate SIFT, SURF, DAISY, and Semantic Textons. (2) Visual Word Assignment, where we compare a k-means visual vocabulary with a Random Forest and evaluate subsampling, dimension reduction with PCA, and division strategies of the Spatial Pyramid. (3) Classification, where we evaluate the chi-square, RBF, and Fast Histogram Intersection kernels for the SVM.

Apart from the evaluation, we accelerate the calculation of densely sampled SIFT and SURF, accelerate nearest-neighbour assignment, and improve the accuracy of the Histogram Intersection kernel. We also show that vertical divisions in the Spatial Pyramid influence performance negatively. We conclude by discussing whether further acceleration of the Bag-of-Words pipeline is possible.
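
To make the three stages concrete, the Python sketch below runs a toy version of the pipeline with random "descriptors", a k-means vocabulary, and a histogram-intersection kernel SVM passed as a precomputed kernel. It uses scikit-learn for brevity and is not the released Matlab code; in particular, real descriptors would come from dense SURF/SIFT sampling rather than random numbers.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    def bow_histogram(descriptors, vocabulary):
        words = vocabulary.predict(descriptors)                        # (2) visual word assignment
        h = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
        return h / h.sum()

    # (1) Descriptor extraction: random 64-d 'SURF-like' descriptors stand in for dense sampling.
    rng = np.random.default_rng(0)
    train_desc = [rng.normal(loc=y, size=(300, 64)) for y in (0, 1) for _ in range(10)]
    labels = np.repeat([0, 1], 10)

    vocabulary = KMeans(n_clusters=32, n_init=3, random_state=0).fit(np.vstack(train_desc))
    X = np.array([bow_histogram(d, vocabulary) for d in train_desc])

    # (3) Classification with a histogram-intersection kernel, given to the SVM as precomputed.
    def hist_intersection(A, B):
        return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

    clf = SVC(kernel="precomputed").fit(hist_intersection(X, X), labels)
    print(clf.predict(hist_intersection(X[:3], X)))                    # classify the first 3 images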

Our results lead to a 7-fold speed increase without accuracy loss, and a 70-fold speed increase with 3% accuracy loss. The latter system does classification in real time, which opens up new applications for automatic concept classification. For example, it permits 5 standard desktop PCs to automatically tag all images currently being uploaded to Flickr with 20 classes.

Visual Word Assignment speed: Random Forest vs. k-means.
Experiment with several spatial divisions of the image. Vertical divisions do not work, which makes sense when using complete image representations, as mirroring the image over the vertical axis does not change its content.

Software

Matlab pcode is available for the real-time dense SURF and fast dense SIFT code described in our journal paper (see below). Average computational performance on the 300×500 images of the Pascal VOC 2007 dataset, on a single core of a 3.16 GHz Intel Core 2 Duo E8500 processor, is as follows:

  • 14 milliseconds per image for SURF
  • 77 milliseconds per image for SIFT

The software can be downloaded here.

Papers

Main

[2]J.R.R. Uijlings, A.W.M. Smeulders, R.J.H. Scha, "Real-Time Visual Concept Classification", In IEEE Transactions on Multimedia, 2010. [bibtex] [pdf] [doi]
[1]J.R.R. Uijlings, A.W.M. Smeulders, R.J.H. Scha, "Real-time Bag-of-Words, Approximately", In CIVR, 2009. (best paper award) [bibtex] [pdf] [doi]

Used in

[7]J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, A.W.M. Smeulders, "Selective Search for Object Recognition", In International Journal of Computer Vision, 2013. [bibtex] [pdf] [doi]
[6]V. Yanulevskaya, J.R.R. Uijlings, E. Bruni, A. Sartori, F. Bacci, N. Sebe, E. Zamboni, D. Melcher, "In the eye of the beholder: Employing statistical analysis and eye tracking for analyzing abstract paintings", In ACM Multimedia (long paper), 2012. [bibtex] [pdf] [doi]
[5]J.R.R. Uijlings, A.W.M. Smeulders, R.J.H. Scha, "The Visual Extent of an Object", In International Journal of Computer Vision, 2012. [bibtex] [pdf] [doi]
[4]A.J. Anderson, E. Bruni, J.R.R. Uijlings, U. Bordignon, M. Baroni, M. Poesio, "Representational Similarity Between Brain Activity Elicited by Concrete Nouns and Image Based Semantic Models", In Vision and Language Workshop, 2012. [bibtex] [pdf]
[3]J.R.R. Uijlings, A.W.M. Smeulders, "Visualising Bag-of-Words", In demo at ICCV, 2011. [bibtex] [pdf]
[2]J.R.R. Uijlings, O. de Rooij, D. Odijk, A. Smeulders, M. Worring, "Instant Bag-of-Words Served on a Laptop", In demo at ICMR, 2011. [bibtex] [pdf] [doi]
[1]R. Mattivi, J.R.R. Uijlings, F. de Natale, N. Sebe, "Exploitation of Time Constraints for (Sub-) Event Recognition", In ACM Workshop on Modeling and Representing Events (J-MRE’11), 2011. [bibtex] [pdf] [doi]

Analysing Abstract Art

Visualisation of which parts of the abstract painting evoke positive emotions (yellow) and negative emotions (blue) according to the Bag-of-Words algorithm.

Most artworks are explicitly created to evoke a strong emotional response. Over the centuries, art movements have employed different techniques to achieve the emotional expression conveyed by artworks. Yet people have always been able to read the emotional message, even of the most abstract paintings. Can a machine learn what makes an artwork emotional?

In this project we consider a set of 500 abstract paintings from the Museum of Modern and Contemporary Art of Trento and Rovereto (MART), where each painting was scored for the positive or negative response it evokes on a 1-7 Likert scale. We employ a state-of-the-art recognition system to learn which statistical patterns are associated with positive and negative emotions. Additionally, we dissect the classification machinery to determine which parts of an image evoke which emotions. This opens new opportunities to research why a specific painting is perceived as emotional. In this project we confirmed long-known observations in art: bright colours evoke positive emotions while dark colours tend to evoke negative emotions, and smooth lines are generally perceived as positive while chaotic texture is generally perceived as negative.

Additionally, with the help of an eye-tracking experiment we show that the positive parts of a painting attract the most attention: even in paintings with negative emotional content, people still prefer to look at the positive parts.

Another visualisation of which parts of the abstract painting evoke positive emotions (yellow) and negative emotions (blue) according to the Bag-of-Words algorithm.
Human eye-fixations plotted as red dots over the visualisation which shows the positive and negative emotional parts of the image. For many negative images (such as this one) people tend to look at the positive aspects more often.

Papers

Main

[1]V. Yanulevskaya, J.R.R. Uijlings, E. Bruni, A. Sartori, F. Bacci, N. Sebe, E. Zamboni, D. Melcher, "In the eye of the beholder: Employing statistical analysis and eye tracking for analyzing abstract paintings", In ACM Multimedia (long paper), 2012. [bibtex] [pdf] [doi]

Builds upon

[1]J.R.R. Uijlings, A.W.M. Smeulders, R.J.H. Scha, "The Visual Extent of an Object", In International Journal of Computer Vision, 2012. [bibtex] [pdf] [doi]

Relevance Feedback using Fisher Kernel

This project proposes a novel approach to relevance feedback based on the Fisher Kernel representation, in the context of multimodal video retrieval. The Fisher Kernel representation describes a set of features by the derivative of the log-likelihood of the generative probability distribution that models the feature distribution, taken with respect to its parameters. In the context of relevance feedback, instead of learning the generative probability distribution over all features of the data, we learn it only over the top retrieved results. Hence during relevance feedback we create a new Fisher Kernel representation based on the most relevant examples. In addition, we propose to use the Fisher Kernel to capture temporal information by cutting a video into smaller segments, extracting a feature vector from each segment, and representing the resulting feature set using the Fisher Kernel representation. We evaluate our method on the MediaEval 2012 Video Genre Tagging Task, a large dataset containing 26 categories in 15,000 videos totalling 2,000 hours of footage. Results show that our method significantly improves over existing state-of-the-art relevance feedback techniques. Furthermore, we show significant improvements by using the Fisher Kernel to capture temporal information, demonstrating that Fisher Kernels are well suited for this task.
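
The Python sketch below illustrates the core idea under the simplest possible generative model, a single diagonal Gaussian fitted only on the features of the top-retrieved items; the papers use richer models and a proper retrieval pipeline, so the names and numbers here are purely illustrative.

    import numpy as np

    def fit_gaussian(features):
        mu = features.mean(axis=0)
        var = features.var(axis=0) + 1e-6
        return mu, var

    def fisher_vector(feature_set, mu, var):
        # Gradient of the average log-likelihood w.r.t. the Gaussian's mean and variance.
        diff = (feature_set - mu) / var
        d_mu = diff.mean(axis=0)
        d_var = 0.5 * (((feature_set - mu) ** 2 / var - 1.0) / var).mean(axis=0)
        fv = np.concatenate([d_mu, d_var])
        return fv / (np.linalg.norm(fv) + 1e-12)

    # Toy relevance-feedback loop: re-represent every video w.r.t. a Gaussian fitted
    # only on the features of the items the user marked as relevant.
    rng = np.random.default_rng(2)
    videos = [rng.normal(loc=rng.uniform(-1, 1), size=(40, 16)) for _ in range(50)]
    top_relevant = videos[:5]
    mu, var = fit_gaussian(np.vstack(top_relevant))
    fvs = np.array([fisher_vector(v, mu, var) for v in videos])
    query = fvs[:5].mean(axis=0)
    ranking = np.argsort(-(fvs @ query))      # re-rank by similarity to the relevant items
    print(ranking[:10])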

Papers

[3]N. Rostamzadeh, G. Zen, I. Mironica, J.R.R. Uijlings, N. Sebe, "Daily Living Activities Recognition via Efficient High and Low Level Cues Combination and Fisher Kernel Representation", In ICIAP, 2013. [bibtex] [pdf]
[2]I. Mironica, J.R.R. Uijlings, N. Rostamzadeh, B. Ionescu, N. Sebe, "Time Matters! Capturing Variation in Time in Video using Fisher Kernels", In ACM Multimedia, 2013. [bibtex] [pdf]
[1]I. Mironica, B. Ionescu, J.R.R. Uijlings, N. Sebe, "Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval", In ICMR, 2013. [bibtex] [pdf] [doi]

Distributional Semantics

The current trend in image analysis and multimedia is to use information extracted from text and text processing techniques to help vision-related tasks, such as automated image annotation and generating semantically rich descriptions of images. In this work, we claim that image analysis techniques can "return the favour" to the text processing community and be successfully used for a general-purpose representation of word meaning. We provide evidence that simple low-level visual features can enrich the semantic representation of word meaning with information that cannot be extracted from text alone, leading to improvement in the core task of estimating degrees of semantic relatedness between words, as well as providing a new, perceptually-enhanced angle on word semantics. Additionally, we show how distinguishing between a concept and its context in images can improve the quality of the word meaning representations extracted from images.
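
A minimal Python sketch of the fusion is given below, assuming we already have, per word, a text-based distributional vector and an image-based visual-word histogram aggregated over images tagged with that word; the vectors and the equal mixing weight are illustrative rather than the settings from the paper.

    import numpy as np

    def normalise(v):
        return v / (np.linalg.norm(v) + 1e-12)

    def relatedness(w1, w2, text_vecs, visual_vecs, alpha=0.5):
        t = normalise(text_vecs[w1]) @ normalise(text_vecs[w2])      # text evidence
        v = normalise(visual_vecs[w1]) @ normalise(visual_vecs[w2])  # visual evidence
        return alpha * t + (1 - alpha) * v                            # fused relatedness score

    rng = np.random.default_rng(3)
    words = ["moon", "sun", "car"]
    text_vecs = {w: rng.random(300) for w in words}     # e.g. word co-occurrence counts
    visual_vecs = {w: rng.random(500) for w in words}   # e.g. visual-word histograms
    print(relatedness("moon", "sun", text_vecs, visual_vecs))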

Papers

Main

[1]E. Bruni, J.R.R. Uijlings, M. Baroni, N. Sebe, "Distributional Semantics with Eyes: Using Image Analysis to Improve Computational Representations of Word Meaning", In ACM Multimedia (long paper), 2012. [bibtex] [pdf] [doi]

Builds upon

[1]J.R.R. Uijlings, A.W.M. Smeulders, R.J.H. Scha, "Real-Time Visual Concept Classification", In IEEE Transactions on Multimedia, 2010. [bibtex] [pdf] [doi]

Sparse Learning

The explosive growth of digital images requires effective methods to manage them. Among various existing methods, automatic image annotation has proved to be an important technique for image management tasks, e.g., image retrieval over large-scale image databases. Automatic image annotation has been widely studied in recent years and a considerable number of approaches have been proposed. However, the performance of these methods is not yet satisfactory, demanding further research on image annotation. In this project, we propose a novel semi-supervised framework built upon feature selection for automatic image annotation. Our method aims to jointly select the most relevant features from all the data points by using a sparsity-based model, and exploits both labeled and unlabeled data to learn the manifold structure. Our framework simultaneously learns a robust classifier for image annotation by selecting the discriminating features related to the semantic concepts. To solve the objective function of our framework, we propose an efficient iterative algorithm. Extensive experiments are performed on different real-world image datasets, with the results demonstrating the promising performance of our framework for automatic image annotation.
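
As a rough illustration of sparsity-based joint feature selection (not the semi-supervised objective above, which additionally exploits unlabeled data through a manifold term), the Python sketch below uses an l2,1-regularised multi-label regression: entire rows of the weight matrix are driven to zero, and the surviving rows index the selected features. scikit-learn's MultiTaskLasso is used here purely as a stand-in solver.

    import numpy as np
    from sklearn.linear_model import MultiTaskLasso

    rng = np.random.default_rng(4)
    n_samples, n_features, n_labels = 200, 50, 5
    X = rng.normal(size=(n_samples, n_features))
    true_W = np.zeros((n_features, n_labels))
    true_W[:8] = rng.normal(size=(8, n_labels))          # only the first 8 features matter
    Y = X @ true_W + 0.1 * rng.normal(size=(n_samples, n_labels))

    # l2,1-regularised fit: rows of the weight matrix are selected or discarded jointly.
    model = MultiTaskLasso(alpha=0.1).fit(X, Y)
    row_norms = np.linalg.norm(model.coef_.T, axis=1)     # coef_ has shape (n_labels, n_features)
    selected = np.argsort(-row_norms)[:8]
    print("selected features:", np.sort(selected))        # ideally recovers indices 0..7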

Papers

[3]Z. Ma, F. Nie, Y. Yang, J.R.R. Uijlings, N. Sebe, "Web Image Annotation via Subspace-Sparsity Collaborated Feature Selection", In IEEE Transactions on Multimedia, 2012. [bibtex] [pdf] [doi]
[2]Z. Ma, F. Nie, Y. Yang, J.R.R. Uijlings, N. Sebe, A.G. Hauptmann, "Discriminating Joint Feature Analysis for Multimedia Data Understanding", In IEEE Transactions on Multimedia, 2012. [bibtex] [pdf] [doi]
[1]Z. Ma, Y. Yang, F. Nie, J.R.R. Uijlings, N. Sebe, "Exploiting the Entire Feature Space with Sparsity for Automatic Image Annotation", In ACM Multimedia (long paper), 2011. [bibtex] [pdf] [doi]

List of publications

Below is my complete list of publications. Alternatively, view My Google Scholar Profile.

Refereed Articles
[7]V. Yanulevskaya, J.R.R. Uijlings, J.M. Geusebroek, "Salient object detection: From pixels to segments", In Image and Vision Computing, 2013. [bibtex] [pdf] [doi]
[6]V. Yanulevskaya, J.R.R. Uijlings, J.M. Geusebroek, N. Sebe, "A proto-object-based computational model for visual saliency", In Journal of Vision, 2013. (to appear) [bibtex]
[5]J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, A.W.M. Smeulders, "Selective Search for Object Recognition", In International Journal of Computer Vision, 2013. [bibtex] [pdf] [doi]
[4]J.R.R. Uijlings, A.W.M. Smeulders, R.J.H. Scha, "The Visual Extent of an Object", In International Journal of Computer Vision, 2012. [bibtex] [pdf] [doi]
[3]Z. Ma, F. Nie, Y. Yang, J.R.R. Uijlings, N. Sebe, "Web Image Annotation via Subspace-Sparsity Collaborated Feature Selection", In IEEE Transactions on Multimedia, 2012. [bibtex] [pdf] [doi]
[2]Z. Ma, F. Nie, Y. Yang, J.R.R. Uijlings, N. Sebe, A.G. Hauptmann, "Discriminating Joint Feature Analysis for Multimedia Data Understanding", In IEEE Transactions on Multimedia, 2012. [bibtex] [pdf] [doi]
[1]J.R.R. Uijlings, A.W.M. Smeulders, R.J.H. Scha, "Real-Time Visual Concept Classification", In IEEE Transactions on Multimedia, 2010. [bibtex] [pdf] [doi]
Refereed Conference Papers
[13]D.T. Le, J.R.R. Uijlings, R. Bernardi, "Exploiting Language Models for Visual Recognition", In EMNLP, 2013. [bibtex] [pdf]
[12]D.T. Le, R. Bernardi, J.R.R. Uijlings, "Exploiting Language Models to Recognize Unseen Actions", In ICMR, 2013. [bibtex] [pdf] [doi]
[11]N. Rostamzadeh, G. Zen, I. Mironica, J.R.R. Uijlings, N. Sebe, "Daily Living Activities Recognition via Efficient High and Low Level Cues Combination and Fisher Kernel Representation", In ICIAP, 2013. [bibtex] [pdf]
[10]I. Mironica, J.R.R. Uijlings, N. Rostamzadeh, B. Ionescu, N. Sebe, "Time Matters! Capturing Variation in Time in Video using Fisher Kernels", In ACM Multimedia, 2013. [bibtex] [pdf]
[9]I. Mironica, B. Ionescu, J.R.R. Uijlings, N. Sebe, "Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval", In ICMR, 2013. [bibtex] [pdf] [doi]
[8]V. Yanulevskaya, J.R.R. Uijlings, E. Bruni, A. Sartori, F. Bacci, N. Sebe, E. Zamboni, D. Melcher, "In the eye of the beholder: Employing statistical analysis and eye tracking for analyzing abstract paintings", In ACM Multimedia (long paper), 2012. [bibtex] [pdf] [doi]
[7]J. Stöttinger, J.R.R. Uijlings, A.K. Pandey, N. Sebe, F. Giunchiglia, "(Unseen) Event Recognition via Semantic Compositionality", In CVPR, 2012. [bibtex] [pdf] [doi]
[6]E. Bruni, J.R.R. Uijlings, M. Baroni, N. Sebe, "Distributional Semantics with Eyes: Using Image Analysis to Improve Computational Representations of Word Meaning", In ACM Multimedia (long paper), 2012. [bibtex] [pdf] [doi]
[5]K.E.A. van de Sande, J.R.R. Uijlings, T. Gevers, A.W.M. Smeulders, "Segmentation as Selective Search for Object Recognition", In ICCV, 2011. [bibtex] [pdf] [doi]
[4]Z. Ma, Y. Yang, F. Nie, J.R.R. Uijlings, N. Sebe, "Exploiting the Entire Feature Space with Sparsity for Automatic Image Annotation", In ACM Multimedia (long paper), 2011. [bibtex] [pdf] [doi]
[3]J.R.R. Uijlings, A.W.M. Smeulders, R.J.H. Scha, "Real-time Bag-of-Words, Approximately", In CIVR, 2009. (best paper award) [bibtex] [pdf] [doi]
[2]J.R.R. Uijlings, A.W.M. Smeulders, R.J.H. Scha, "What is the Spatial Extent of an Object?", In CVPR, 2009. [bibtex] [pdf] [doi]
[1]K. Oinonen, M. Theune, A. Nijholt, J.R.R. Uijlings, "Designing a Story Database for Use in Automatic Story Generation", In International Conference on Entertainment Computing, 2006. [bibtex] [pdf] [doi]
Workshop Papers
[5]A.J. Anderson, E. Bruni, J.R.R. Uijlings, U. Bordignon, M. Baroni, M. Poesio, "Representational Similarity Between Brain Activity Elicited by Concrete Nouns and Image Based Semantic Models", In Vision and Language Workshop, 2012. [bibtex] [pdf]
[4]R. Mattivi, J.R.R. Uijlings, F. de Natale, N. Sebe, "Exploitation of Time Constraints for (Sub-) Event Recognition", In ACM Workshop on Modeling and Representing Events (J-MRE’11), 2011. [bibtex] [pdf] [doi]
[3]C.G.M. Snoek, K.E.A. van de Sande, O. de Rooij, B. Huurnink, J.R.R. Uijlings, M. van Liempt, M. Bugalho, I. Trancoso, F. Yan, M.A. Tahir, K. Mikolajczyk, J. Kittler, M. de Rijke, J-M Geusebroek, T. Gevers, M. Worring, D.C. Koelma, A.W.M. Smeulders, "The MediaMill TRECVID 2009 Semantic Video Search Engine", In Proceedings of the 7th TRECVID Workshop, Gaithersburg, USA, 2009. [bibtex] [pdf]
[2]C.G.M. Snoek, K.E.A. van de Sande, O. de Rooij, B. Huurnink, J.C. van Gemert, J.R.R. Uijlings, J. He, X. Li, I. Everts, V. Nedovic, M. van Liempt, R. van Balen, F. Yan, M.A. Tahir, K. Mikolajczyk, J. Kittler, M. de Rijke, J-M Geusebroek, T. Gevers, M. Worring, A.W.M. Smeulders, D.C. Koelma, "The MediaMill TRECVID 2008 Semantic Video Search Engine", In Proceedings of the 6th TRECVID Workshop, Gaithersburg, USA, 2008. [bibtex] [pdf]
[1]C.G.M. Snoek, I. Everts, J.C. van Gemert, J-M Geusebroek, B. Huurnink, D.C. Koelma, M. van Liempt, O. de Rooij, K.E.A. van de Sande, A.W.M. Smeulders, J.R.R. Uijlings, M. Worring, "The MediaMill TRECVID 2007 Semantic Video Search Engine", In Proceedings of the 5th TRECVID Workshop, Gaithersburg, USA, 2007. [bibtex] [pdf]
Demo Papers
[3]R. Mattivi, J.R.R. Uijlings, F. de Natale, N. Sebe, "Categorization of a collection of pictures into structured events", In demo at ICMR, 2012. [bibtex] [doi]
[2]J.R.R. Uijlings, A.W.M. Smeulders, "Visualising Bag-of-Words", In demo at ICCV, 2011. [bibtex] [pdf]
[1]J.R.R. Uijlings, O. de Rooij, D. Odijk, A. Smeulders, M. Worring, "Instant Bag-of-Words Served on a Laptop", In demo at ICMR, 2011. [bibtex] [pdf] [doi]
Other Publications
[2]J.R.R. Uijlings, "The What and Where in Visual Object Recognition", PhD thesis, University of Amsterdam, 2011. [bibtex] [pdf]
[1]J.R.R. Uijlings, "Designing a Virtual Environment for Story Generation", Master's thesis, University of Amsterdam, 2006. [bibtex] [pdf]

About Me

I am working as a post-doctoral researcher in the MHUG group of Nicu Sebe at the University of Trento, Italy.

My main research focus is object recognition. In this context I have increased the computational efficiency of the Bag-of-Words system (best paper award CIVR 2009) and investigated closely which parts of the image are used for classification (IJCV 2012). More recently I have been working on improving image classification by using Bag-of-Words features on local image windows (ICCV 2011), and on using localised objects combined with language or knowledge databases for human action recognition (ICMR 2013) and event recognition (CVPR 2012).

Other projects in which I am involved include saliency estimation (JoV 2013), Visual Distributional Semantics: improving the meaning of words through computer vision (ACM MM 2012), analysing emotion in art (ACM MM 2012), object instance recognition, temporal representations of video, and using motion in video for object recognition.

Koen van de Sande and I were the driving force behind the team that won the Pascal VOC 2012 Detection challenge, the ImageNet Large Scale 2011 Detection challenge, and the Pascal VOC 2008 Classification challenge. We received honourable mentions for the Pascal Classification challenge in 2009, 2011, and 2012, and an honourable mention for the Pascal Detection challenge in 2010.

I received my master's degree in Artificial Intelligence from the University of Amsterdam in 2006, on designing a virtual world for automatic story generation. I obtained my PhD degree in computer vision in December 2011 at the Intelligent Systems Lab Amsterdam, University of Amsterdam, under the supervision of Remko Scha (Remko Scha on last.fm) and Arnold Smeulders. My dissertation was on the what and where in visual object recognition.