Contact Info

  •    LAB: 631-632-2469
  •    zijwei 'at'
  •    Room 138, New Computer Science Building
  •    Computer Science, Stony Brook University, Stony Brook, NY 11794-2424

Education & Experience

  • 2017 Spring
    Computer Vision Intern @ Adobe Research
    Advisor(s): Jianming Zhang, Xiaohui Shen, Zhe Lin, Radomír Měch
    Topics: Guided Composition on Mobile Devices
  • 2014 - Now
    Graduate Student & Research Assistant @ Stony Brook University
    Advisor(s): Dimitris Samaras & Minh Hoai
    Topics: Computer Vision with Human Data
  • 2013 - 2014
    Research Associate @ Carnegie Mellon University
    • Real Time Sign Detection for the Visually Impaired
    • Object Recognition and Localization Using CAD Models
    • Children Autism Detection Using Multiple Visual Cues
  • 2012 - 2013
    Master of Robotics @ Carnegie Mellon University
    Advisor: Prof. Mel Siegel
  • Summer'12
    Intern @ ABB, Shanghai R&D Department
    Topic: Multi-View Guided Grasping System
  • 2011 - 2012
    Graduate Student @ Nanjing University of Science and Technology
    Advisors: Prof. Chunxia Zhao and Prof. Mel Siegel
  • 2007 - 2011
    Bachelor of Computer Science @ Nanjing University of Science and Technology
    GPA: 3.53/4.0, Ranking: 6/82



Skills

  • General Languages:

    C++, Java

    Python, Matlab

  • Libraries & SDKs:

    MS-MFC, OpenCV, OpenGL

    Android, OpenRAVE

  • Deep Learning Frameworks:

    Torch, TensorFlow, MatConvNet


  • Operating Systems:

    Linux, macOS

  • Editing:

    LaTeX, Adobe Photoshop

  • Others:

    Adobe Premiere

TA Experience

  • CSE525 S2016

    Introduction to Robotics

  • CSE525 S2015

    Introduction to Robotics

  • CSE214 S2015

    Computer Science II

  • CSE110 F2014

    Intro. to Computer Science


Collaborators

  • Kiwon Yun, Ph.D. Candidate in CS @ SBU
  • Le Hou, Ph.D. Candidate in CS @ SBU
  • Yang Wang, Ph.D. Candidate in CS @ SBU
  • Hossein Adeli, Ph.D. Candidate in Psychology @ SBU

Research Interests

  • Object Recognition
  • Gaze-Enabled Object Recognition
  • Human Visual Perception Modeling Using Deep Neural Networks
  • Visual Interestingness Detection in Videos


Publications

[1] Good View Hunting: Learning Photo Composition from View Pairs.

Zijun Wei, Jianming Zhang, Xiaohui Shen, Zhe Lin, Radomír Měch, Minh Hoai and Dimitris Samaras
Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018
Paper will be out soon


Finding views with good photo composition is a challenging task for machine learning methods. A key difficulty is the lack of well-annotated, large-scale datasets. Most existing datasets only provide a limited number of annotations for good views, while ignoring the comparative nature of view selection. In this work, we present the first large-scale Comparative Photo Composition dataset, which contains over 1 million comparative view pairs annotated using a cost-effective crowdsourcing workflow. We show that these comparative view annotations are essential for training a robust neural network model for composition. In addition, we propose a novel knowledge transfer framework to train a fast view proposal network, which runs at 75+ FPS and achieves state-of-the-art performance in image cropping and thumbnail generation tasks on three benchmark datasets. The superiority of our method is also demonstrated in a user study on a challenging task, where our method significantly outperforms the baseline methods in producing diversified well-composed views.
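
In sketch form, training from comparative pairs works like this: the model scores both views of a pair, and a ranking loss pushes the preferred view's score above the other's by a margin. Below is an illustrative stand-in in NumPy, with a hypothetical linear scorer and random features; it is not the view proposal network from the paper:

    import numpy as np

    def pairwise_hinge_loss(score_better, score_worse, margin=1.0):
        """Hinge ranking loss: zero once the better-composed view
        outscores the worse one by at least `margin`."""
        return max(0.0, margin - (score_better - score_worse))

    # Toy example: a hypothetical linear scorer over view features.
    rng = np.random.default_rng(0)
    w = rng.normal(size=128)            # scorer weights (learned in practice)
    feat_better = rng.normal(size=128)  # features of the preferred view
    feat_worse = rng.normal(size=128)   # features of the rejected view

    loss = pairwise_hinge_loss(w @ feat_better, w @ feat_worse)
    print(f"ranking loss on this pair: {loss:.3f}")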

[2] Learned Region Sparsity and Diversity Also Predict Visual Attention.

Zijun Wei*, Hossein Adeli*, Greg Zelinsky, Minh Hoai, Dimitris Samaras
Advances in Neural Information Processing Systems (NIPS) 2016


Learned region sparsity has achieved state-of-the-art performance in classification tasks by exploiting and integrating a sparse set of local information into global decisions. The underlying mechanisms resemble how people sample information from the image with their eye movements when making similar decisions. In this paper we enhance the learned region sparsity model with the biologically plausible mechanism of Inhibition of Return to impose diversity on the selected regions. We investigated whether these mechanisms of sparsity and diversity correspond to visual attention by testing our model on three different types of visual search tasks. We report state-of-the-art results in predicting the location of human visual attention, even though we only trained on image-level labels without object location annotation. Notably, the enhanced model's classification performance remains the same as the original. This work sheds some light on the possible visual attention mechanisms in the brain and argues for the inclusion of attention-based mechanisms for improving computer vision techniques.
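
For intuition, Inhibition of Return can be read as a greedy selection rule: keep the highest-scoring region, suppress any region that overlaps it too much, then repeat. The sketch below captures that rule only, with hypothetical boxes, scores, and an arbitrary overlap threshold; it is not the full sparsity model:

    import numpy as np

    def iou(a, b):
        """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    def select_sparse_diverse(boxes, scores, k=3, iou_thresh=0.3):
        """Greedily keep top-scoring regions, inhibiting (skipping)
        any region that overlaps an already-selected one."""
        order = np.argsort(scores)[::-1]
        kept = []
        for i in order:
            if all(iou(boxes[i], boxes[j]) < iou_thresh for j in kept):
                kept.append(i)
            if len(kept) == k:
                break
        return kept

    boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]])
    scores = np.array([0.9, 0.8, 0.6])
    print(select_sparse_diverse(boxes, scores))  # [0, 2]; box 1 is inhibited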

[3] Region Ranking SVM for Image Classification.

Zijun Wei, Minh Hoai
Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2016
Go to project page for more information


The success of an image classification algorithm largely depends on how it incorporates local information in the global decision. Popular approaches such as average-pooling and max-pooling are suboptimal in many situations. In this paper we propose Region Ranking SVM (RRSVM), a novel method for pooling local information from multiple regions. RRSVM exploits the correlation of local regions in an image, and it jointly learns a region evaluation function and a scheme for integrating multiple regions. Experiments on PASCAL VOC 2007, VOC 2012, and ILSVRC2014 datasets show that RRSVM outperforms the methods that use the same feature type and extract features from the same set of local regions. RRSVM achieves performance similar to or better than the state of the art on all datasets.
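
The pooling scheme at the heart of RRSVM fits in a few lines: score every region with a linear function, sort the responses, and combine them with a weight vector. In RRSVM the scorer and the pooling weights are learned jointly; the weights below are hand-picked purely for illustration:

    import numpy as np

    def region_rank_pool(region_feats, w, pool_weights):
        """Score each region with w, sort responses in descending order,
        and return their weighted sum. pool_weights = [1, 0, ...] reduces
        to max-pooling; uniform weights reduce to average-pooling."""
        responses = region_feats @ w
        ranked = np.sort(responses)[::-1]
        return ranked @ pool_weights

    rng = np.random.default_rng(1)
    region_feats = rng.normal(size=(6, 32))  # 6 candidate regions, toy features
    w = rng.normal(size=32)                  # region evaluation function
    pool_weights = np.array([0.5, 0.3, 0.2, 0.0, 0.0, 0.0])  # illustrative only

    print(f"image-level score: {region_rank_pool(region_feats, w, pool_weights):.3f}")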

[4] 3D Pose-by-Detection of Vehicles via Discriminatively Reduced Ensembles of Correlation Filters

Yair Movshovitz-Attias, Vishnu Naresh Boddeti, Zijun Wei, and Yaser Sheikh
Proceedings of British Machine Vision Conference (BMVC) 2014


Estimating the precise pose of a 3D model in an image is challenging; explicitly identifying correspondences is difficult, particularly at smaller scales and in the presence of occlusion. Exemplar classifiers have demonstrated the potential of detection-based approaches to problems where precision is required. In particular, correlation filters explicitly suppress classifier response caused by slight shifts in the bounding box. This property makes them ideal exemplar classifiers for viewpoint discrimination, as small translational shifts can often be confounded with small rotational shifts. However, exemplar based pose-by-detection is not scalable because, as the desired precision of viewpoint estimation increases, the number of exemplars needed increases as well. We present a training framework to reduce an ensemble of exemplar correlation filters for viewpoint estimation by directly optimizing a discriminative objective. We show that the discriminatively reduced ensemble outperforms the state-of-the-art on three publicly available datasets and we introduce a new dataset for continuous car pose estimation in street scene images.
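
For background on the exemplar classifiers involved: evaluating a correlation filter is a single frequency-domain multiplication that yields a dense response map whose peak marks the best alignment of filter and patch. The following generic sketch with synthetic data shows the evaluation step only, not the paper's discriminative ensemble reduction:

    import numpy as np

    def correlate(patch, filt):
        """Cross-correlate a patch with a filter via the FFT; the
        response map peaks where the filter aligns with the patch."""
        F_patch = np.fft.fft2(patch)
        F_filt = np.fft.fft2(filt, s=patch.shape)  # zero-pad to patch size
        return np.real(np.fft.ifft2(F_patch * np.conj(F_filt)))

    rng = np.random.default_rng(2)
    patch = rng.normal(size=(64, 64))
    filt = patch[10:26, 20:36].copy()  # plant the template inside the patch
    resp = correlate(patch, filt)
    peak = np.unravel_index(np.argmax(resp), resp.shape)
    print(f"peak response at {peak}")  # expect roughly (10, 20)

An exemplar ensemble for pose would evaluate one such filter per viewpoint and take the strongest peak; the paper's contribution is learning a much smaller set of filters that preserves this discrimination.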

[*] Predicting Scanpath Agreement during Scene Viewing using Deep Neural Networks

Zijun Wei, Hossein Adeli, Greg Zelinsky, Minh Hoai, Dimitris Samaras
Vision Sciences Society (VSS), 2017


Eye movements are a widely used measure of overt shifts of attention, but this measure is often limited by poor agreement in people's gaze, which can vary significantly in the context of free viewing. In this work we ask whether the level of scanpath agreement among participants during scene viewing, quantified using a modified version of the MultiMatch method (Dewhurst et al., 2012), can be predicted using a Deep Neural Network (DNN). Specifically, using image features extracted from the last convolutional layer of a DNN trained for object recognition, we found a linear weighting such that positive regressor weights indicated the presence of image features resulting in greater gaze agreement among viewers. Image regions corresponding to these features were then found by back-propagating the features to the image space using the probabilistic Selective Tuning Attention model (Zhang et al., 2016, ECCV). Combining these regions from all positively weighted features yielded an activation map reflecting the image features important for predicting scanpath consistency among people freely viewing scenes. The model was trained on a randomly selected 80% of the MIT1003 dataset (Judd et al., 2009) and tested on the remaining 20%, repeated 10 times. We found that this linear regressor model was able to predict for each image the level of agreement in the viewers' scanpaths (r = 0.3, p < .01). Consistent with previous findings, in qualitative analyses we also found that the features of text, faces, and bodies were especially important in predicting gaze agreement. This work introduces a novel method for predicting scanpath agreement, and for identifying the underlying image features important for creating agreement in collective viewing behavior. Future work will extend this approach to identify those features of a target goal that are important for producing uniformly strong attentional guidance in the context of visual search tasks.
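
The core prediction step is ordinary linear regression: pool each image's last-convolutional-layer features into a vector, fit a regressor to the per-image agreement scores, and report Pearson's r on held-out images. The sketch below uses synthetic stand-ins for both the DNN features and the MultiMatch scores (which will correlate far better than real data) and adds a ridge penalty for stability, an assumption not stated in the abstract:

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(1003, 512))  # stand-in for pooled conv features
    y = X @ rng.normal(size=512) + rng.normal(scale=5.0, size=1003)  # stand-in scores

    idx = rng.permutation(len(X))     # 80/20 split, as in the abstract
    tr, te = idx[:802], idx[802:]

    # Ridge regression in closed form: w = (X'X + lam*I)^-1 X'y
    lam = 1.0
    A = X[tr].T @ X[tr] + lam * np.eye(X.shape[1])
    w = np.linalg.solve(A, X[tr].T @ y[tr])

    r = np.corrcoef(X[te] @ w, y[te])[0, 1]
    print(f"Pearson r on held-out images: {r:.2f}")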

Recent Projects @ SBU

  • image

    Predicting Visual Attention Using Learned Region Sparsity and Diversity

    Joint work with:

    Hossein Adeli, Dimitris Samaras, Gregory Zelinsky, Minh Hoai

    When performing visual search tasks, people sample information from the image with their eye movements: a sparse set of regions, rather than the whole image, drives the decision. Learned region sparsity has also achieved state-of-the-art performance in classification tasks in the computer vision community. In this project we draw connections between the two sparsity models and investigate whether these mechanisms of sparsity and diversity in computer vision correspond to visual attention by testing our model on different types of visual tasks. We expect to shed some light on possible visual attention mechanisms in the brain and to argue for the inclusion of attention-based mechanisms to improve computer vision techniques.

  • image

    Incorporating Gaze Information into Image Detection

    Joint work with:

    Kiwon Yun, Dimitris Samaras, Gregory Zelinsky, Tamara Berg

    We believe that users' visual behaviors during natural viewing of images contain rich information about the content of those images. This project aims to boost object detection performance by incorporating information from observers' eye movements. While preliminary results have been promising, we are exploring more appropriate ways of encoding this behavioral information from multiple observers to improve the performance of modern object detection algorithms such as R-CNN.
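
    One standard encoding of gaze from multiple observers, shown here purely as a reference point, is a fixation density map: accumulate every observer's fixations on an image-sized grid, blur with a Gaussian, and hand the map to the detector as an extra channel or region prior. A minimal sketch with made-up fixations (the bandwidth sigma is an arbitrary choice):

        import numpy as np
        from scipy.ndimage import gaussian_filter

        def fixation_density_map(fixations, height, width, sigma=20.0):
            """Accumulate (x, y) fixations from all observers into a grid,
            then Gaussian-blur to get a smooth attention prior in [0, 1]."""
            grid = np.zeros((height, width))
            for x, y in fixations:
                grid[int(y), int(x)] += 1.0
            grid = gaussian_filter(grid, sigma=sigma)
            return grid / (grid.max() + 1e-9)

        # Toy fixations pooled from several observers.
        fixations = [(120, 80), (125, 84), (118, 79), (300, 200)]
        prior = fixation_density_map(fixations, height=240, width=320)
        print(prior.shape, prior.max())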

Misc. Projects @ SBU

  • image

    Painterly Rendering

    CSE 528 Course Project

    Customizing Painterly Rendering Styles Using Stroke Process