Finding views with good photo composition is a challenging task for machine learning methods. A key difficulty is the lack of well-annotated large-scale datasets. Most existing datasets provide only a limited number of annotations for good views, while ignoring the comparative nature of view selection. In this work, we present the first large-scale Comparative Photo Composition dataset, which contains over one million comparative view pairs annotated through a cost-effective crowdsourcing workflow. We show that these comparative view annotations are essential for training a robust neural network model for composition. In addition, we propose a novel knowledge-transfer framework to train a fast view proposal network, which runs at over 75 FPS and achieves state-of-the-art performance on image cropping and thumbnail generation tasks across three benchmark datasets. The superiority of our method is also demonstrated in a user study of a challenging task, where it significantly outperforms the baseline methods in producing diverse, well-composed views.
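The comparative view pairs naturally pair with a ranking objective: the better-composed view of a pair should outscore the worse one. The sketch below illustrates this idea with a hinge ranking loss; the margin value and function names are our assumptions, not the paper's training details.

```python
import numpy as np

def pairwise_rank_loss(score_better, score_worse, margin=1.0):
    """Hinge ranking loss for one comparative view pair: the
    better-composed view should outscore the worse one by `margin`."""
    return np.maximum(0.0, margin - (score_better - score_worse))

# A clear win incurs no loss; a near-tie is penalized.
pairwise_rank_loss(2.0, 0.5)   # 0.0
pairwise_rank_loss(0.5, 0.4)   # 0.9
```

In training, such a loss would be summed over all annotated pairs, pushing the scoring network to respect the crowdsourced comparisons.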
Learned region sparsity has achieved state-of-the-art performance in classification tasks by exploiting and integrating a sparse set of local information into global decisions. The underlying mechanism resembles how people sample information from an image with their eye movements when making similar decisions. In this paper we enhance the learned region-sparsity model with the biologically plausible mechanism of Inhibition of Return, which imposes diversity on the selected regions. We investigate whether these mechanisms of sparsity and diversity correspond to visual attention by testing our model on three different types of visual search tasks. We report state-of-the-art results in predicting the location of human visual attention, even though we train only on image-level labels without object location annotations. Notably, the enhanced model's classification performance remains the same as the original's. This work sheds light on possible visual attention mechanisms in the brain and argues for the inclusion of attention-based mechanisms in computer vision techniques.
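One simple way to impose diversity on selected regions, in the spirit of Inhibition of Return, is greedy selection with overlap suppression: once a region is chosen, strongly overlapping regions are inhibited. The sketch below is illustrative only; the IoU threshold and greedy rule are our assumptions, not the paper's exact mechanism.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def select_diverse(boxes, scores, k=3, max_overlap=0.5):
    """Greedy selection: take the best-scoring region, then inhibit
    (skip) any region that overlaps a previous pick too strongly."""
    picked = []
    for i in np.argsort(scores)[::-1]:
        if all(iou(boxes[i], boxes[j]) <= max_overlap for j in picked):
            picked.append(int(i))
        if len(picked) == k:
            break
    return picked
```

For example, with two heavily overlapping high-scoring boxes, only the higher-scoring one survives and the next pick comes from elsewhere in the image.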
The success of an image classification algorithm largely depends on how it incorporates local information into the global decision. Popular approaches such as average-pooling and max-pooling are suboptimal in many situations. In this paper we propose Region Ranking SVM (RRSVM), a novel method for pooling local information from multiple regions. RRSVM exploits the correlation of local regions in an image, and it jointly learns a region evaluation function and a scheme for integrating multiple regions. Experiments on the PASCAL VOC 2007, VOC 2012, and ILSVRC2014 datasets show that RRSVM outperforms methods that use the same feature type and extract features from the same set of local regions. RRSVM achieves performance similar to or better than the state of the art on all datasets.
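The pooling idea can be sketched in a few lines: score each region with a linear function, sort the scores, and combine them with learned non-negative weights, so that max- and average-pooling fall out as special cases. This is a minimal NumPy illustration under our own naming, not the paper's exact joint-learning formulation.

```python
import numpy as np

def rrsvm_pool(region_feats, w, s):
    """Pool region scores: score each region with w, sort descending,
    then combine the ranked scores with pooling weights s."""
    scores = region_feats @ w          # one score per region
    ranked = np.sort(scores)[::-1]     # best regions first
    return ranked @ s                  # weighted combination

rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 8))    # 5 regions, 8-dim features each
w = rng.standard_normal(8)             # region evaluation function
s = np.array([0.5, 0.25, 0.15, 0.1, 0.0])  # emphasize the top regions
pooled = rrsvm_pool(feats, w, s)
```

Note that `s = [1, 0, 0, 0, 0]` recovers max-pooling and a uniform `s` recovers average-pooling; learning `s` interpolates between these regimes.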
Estimating the precise pose of a 3D model in an image is challenging: explicitly identifying correspondences is difficult, particularly at smaller scales and in the presence of occlusion. Exemplar classifiers have demonstrated the potential of detection-based approaches to problems where precision is required. In particular, correlation filters explicitly suppress classifier response caused by slight shifts of the bounding box. This property makes them ideal exemplar classifiers for viewpoint discrimination, as small translational shifts can often be confounded with small rotational shifts. However, exemplar-based pose-by-detection does not scale because, as the desired precision of viewpoint estimation increases, so does the number of exemplars needed. We present a training framework that reduces an ensemble of exemplar correlation filters for viewpoint estimation by directly optimizing a discriminative objective. We show that the discriminatively reduced ensemble outperforms the state of the art on three publicly available datasets, and we introduce a new dataset for continuous car pose estimation in street-scene images.
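A single-exemplar correlation filter has a well-known closed form in the frequency domain (MOSSE-style). The sketch below is a generic illustration of that construction, not the paper's ensemble or its reduction: the Gaussian target response and regularizer are our choices.

```python
import numpy as np

def train_filter(img, sigma=2.0, lam=1e-2):
    """Closed-form correlation filter for one exemplar image.
    The desired response g is a Gaussian peaked at the image center;
    lam regularizes near-zero frequencies."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))
    G, F = np.fft.fft2(g), np.fft.fft2(img)
    return G * np.conj(F) / (F * np.conj(F) + lam)  # conjugate filter H*

def response(h_conj, img):
    """Correlate in the frequency domain; the peak marks the alignment."""
    return np.real(np.fft.ifft2(np.fft.fft2(img) * h_conj))
```

On its own training image the filter's response peaks sharply at the center, and the same machinery suppresses responses to slight bounding-box shifts, which is the property the abstract exploits for viewpoint discrimination.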
Eye movements are a widely used measure of overt shifts of attention, but this measure is often limited by poor agreement in people's gaze, which can vary significantly in the context of free viewing. In this work we ask whether the level of scanpath agreement among participants during scene viewing, quantified using a modified version of the MultiMatch method (Dewhurst et al., 2012), can be predicted using a Deep Neural Network (DNN). Specifically, using image features extracted from the last convolutional layer of a DNN trained for object recognition, we found a linear weighting such that positive regressor weights indicated the presence of image features resulting in greater gaze agreement among viewers. Image regions corresponding to these features were then found by back-propagating the features to the image space using the probabilistic Selective Tuning attention model (Zhang et al., 2016, ECCV). Combining these regions from all positively weighted features yielded an activation map reflecting the image features important for predicting scanpath consistency among people freely viewing scenes. The model was trained on a randomly selected 80% of the MIT1003 dataset (Judd et al., 2009) and tested on the remaining 20%, repeated 10 times. We found that this linear regressor model was able to predict for each image the level of agreement in the viewers' scanpaths (r = 0.3, p < .01). Consistent with previous findings, qualitative analyses also showed that the features of text, faces, and bodies were especially important in predicting gaze agreement. This work introduces a novel method for predicting scanpath agreement, and for identifying the underlying image features important for creating agreement in collective viewing behavior. Future work will extend this approach to identify those features of a target goal that are important for producing uniformly strong attentional guidance in the context of visual search tasks.
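The regression step described above can be sketched with synthetic stand-ins for the DNN features and agreement scores. Everything here is illustrative: the feature dimensions, ridge penalty, and synthetic data are our assumptions, not the study's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_feats = 200, 50
X = rng.standard_normal((n_images, n_feats))   # stand-in for DNN features
true_w = rng.standard_normal(n_feats)          # hypothetical ground truth
y = X @ true_w + 0.5 * rng.standard_normal(n_images)  # agreement scores

# 80/20 split, ridge closed form: w = (X'X + a*I)^{-1} X'y
idx = rng.permutation(n_images)
tr, te = idx[:160], idx[160:]
w = np.linalg.solve(X[tr].T @ X[tr] + 1.0 * np.eye(n_feats),
                    X[tr].T @ y[tr])
r = np.corrcoef(X[te] @ w, y[te])[0, 1]        # held-out correlation
```

In the study, the positively weighted entries of `w` are the features whose image regions are then recovered by back-propagation with the Selective Tuning model.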