Multi-Channel Attention Selection GAN with Cascaded Semantic Guidance for Cross-View Image Translation

Hao Tang1,2*    Dan Xu3*    Nicu Sebe1,4    Yanzhi Wang5    Jason J. Corso6    Yan Yan2   

1University of Trento    2Texas State University    3University of Oxford    4Huawei    5Northeastern University    6University of Michigan   

in CVPR 2019 (Oral)

Paper | Code | Presentation | Slides | Video | Poster

Overview of the proposed SelectionGAN. Stage I presents a cycled semantic-guided generation sub-network that takes an image from one view together with a conditional semantic map and simultaneously synthesizes the image and semantic map in the other view. Stage II takes the coarse predictions and the deep semantic features learned in Stage I and performs fine-grained generation with the proposed multi-channel attention selection module.
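Below is a minimal PyTorch-style sketch of this cascaded two-stage data flow. The module name CoarseGenerator, the layer sizes, and the simplified single-direction forward pass are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of the cascaded pipeline: Stage I produces coarse predictions and
# deep features; Stage II refines them with attention selection.
# All names and layer sizes below are illustrative placeholders.
import torch
import torch.nn as nn

class CoarseGenerator(nn.Module):
    """Stage I: condition image + target semantic map -> coarse target image,
    coarse target semantic map, and deep semantic features."""
    def __init__(self, img_ch=3, sem_ch=3, feat_ch=64):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(img_ch + sem_ch, feat_ch, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.to_img = nn.Conv2d(feat_ch, img_ch, 3, padding=1)
        self.to_sem = nn.Conv2d(feat_ch, sem_ch, 3, padding=1)

    def forward(self, cond_img, target_sem):
        feat = self.encode(torch.cat([cond_img, target_sem], dim=1))
        return torch.tanh(self.to_img(feat)), self.to_sem(feat), feat

# Stage II (the multi-channel attention selection module, sketched in a
# later section) then consumes the learned features:
#   coarse_img, coarse_sem, feat = coarse_gen(cond_img, target_sem)
#   refined_img, attention = selection_module(feat)
```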

Abstract

Cross-view image translation is challenging because it involves images with drastically different views and severe deformation. In this paper, we propose a novel approach named Multi-Channel Attention SelectionGAN (SelectionGAN) that makes it possible to generate images of natural scenes in arbitrary viewpoints, based on an image of the scene and a novel semantic map. The proposed SelectionGAN explicitly utilizes the semantic information and consists of two stages. In the first stage, the condition image and the target semantic map are fed into a cycled semantic-guided generation network to produce initial coarse results. In the second stage, we refine the initial results by using a multi-channel attention selection mechanism. Moreover, uncertainty maps automatically learned from attentions are used to guide the pixel loss for better network optimization. Extensive experiments on Dayton, CVUSA and Ego2Top datasets show that our model is able to generate significantly better results than the state-of-the-art methods.


Paper

arXiv, 2019.

Citation

Hao Tang, Dan Xu, Nicu Sebe, Yanzhi Wang, Jason J. Corso and Yan Yan.
"Multi-Channel Attention Selection GAN with Cascaded Semantic Guidancefor Cross-View Image Translation". In CVPR, 2019. Bibtex

Multi-Channel Attention Selection Module


The multi-scale spatial pooling pools features over different receptive fields to better generate scene details; the multi-channel attention selection automatically selects from a set of diverse intermediate generations in a larger generation space to improve the final generation quality.
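The sketch below illustrates this idea in PyTorch: features are pooled at several scales and re-fused, N diverse intermediate generations and N attention maps are predicted, and a pixel-wise softmax over the attention channels combines the candidates. The module and parameter names (MultiChannelAttentionSelection, n_candidates, pool_sizes) are hypothetical, chosen only for illustration.

```python
# Illustrative sketch of multi-scale pooling + multi-channel attention
# selection; not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiChannelAttentionSelection(nn.Module):
    def __init__(self, feat_ch=64, n_candidates=10, pool_sizes=(2, 4, 8)):
        super().__init__()
        self.pool_sizes = pool_sizes
        self.n = n_candidates
        # Features pooled at several scales are upsampled and concatenated,
        # enlarging the effective receptive field.
        fused_ch = feat_ch * (1 + len(pool_sizes))
        # N diverse intermediate RGB generations (3 channels each).
        self.to_candidates = nn.Conv2d(fused_ch, 3 * n_candidates, 3, padding=1)
        # One attention map per candidate generation.
        self.to_attention = nn.Conv2d(fused_ch, n_candidates, 1)

    def forward(self, feat):
        h, w = feat.shape[2:]
        pooled = [feat]
        for s in self.pool_sizes:
            p = F.avg_pool2d(feat, kernel_size=s, stride=s)
            pooled.append(F.interpolate(p, size=(h, w), mode='bilinear',
                                        align_corners=False))
        fused = torch.cat(pooled, dim=1)
        candidates = torch.tanh(self.to_candidates(fused))       # B x 3N x H x W
        candidates = candidates.view(-1, self.n, 3, h, w)        # B x N x 3 x H x W
        attention = F.softmax(self.to_attention(fused), dim=1)   # B x N x H x W
        # Pixel-wise weighted combination of the candidate generations.
        refined = (attention.unsqueeze(2) * candidates).sum(dim=1)
        return refined, attention
```

The pixel-wise softmax lets each output pixel be drawn mostly from whichever intermediate generation renders that region best, which is how selecting from a larger generation space can improve quality.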


Coarse-to-Fine Generation

Results generated by our SelectionGAN at 256×256 resolution in the a2g (aerial-to-ground) direction on the Dayton dataset. These samples were randomly selected for visualization purposes. The coarse-to-fine generation model clearly produces sharper results with more details than the one-stage model.



Visualization of Uncertainty Map

Results generated by our SelectionGAN at 256×256 resolution in the a2g direction on the CVUSA dataset. These samples were randomly selected for visualization purposes. Most textured regions in our generated images are reproduced consistently, while the junctions/edges between different regions are uncertain, so the model learns to highlight these parts in the uncertainty maps.
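For reference, the sketch below shows one common way such an uncertainty map can guide a pixel reconstruction loss: the per-pixel error is down-weighted where the learned uncertainty is high, and a log term penalizes trivially large uncertainty. This follows the standard learned per-pixel uncertainty weighting; the exact loss used by SelectionGAN is given in the paper.

```python
# Hedged sketch of an uncertainty-guided pixel loss (standard learned
# per-pixel uncertainty weighting; see the paper for the exact formulation).
import torch

def uncertainty_guided_l1(pred, target, uncertainty, eps=1e-6):
    """pred, target: B x 3 x H x W; uncertainty: B x 1 x H x W, positive."""
    u = uncertainty.clamp(min=eps)
    # High-uncertainty pixels (e.g. junctions/edges between regions) contribute
    # less to the L1 term; log(u) keeps the uncertainty from growing unboundedly.
    per_pixel = torch.abs(pred - target) / u + torch.log(u)
    return per_pixel.mean()
```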



Arbitrary Cross-View Image Translation

Arbitrary cross-view image translation on the Ego2Top dataset. Given an input image and several novel semantic maps, SelectionGAN generates the same scene from different viewpoints.



State-of-the-art Comparisons on Ego2Top Dataset

Results generated by different methods at 256×256 resolution on the Ego2Top dataset. These samples were randomly selected for visualization purposes.



State-of-the-art Comparisons on CVUSA Dataset

Results generated by different methods at 256×256 resolution on the CVUSA dataset. These samples were randomly selected for visualization purposes.



State-of-the-art Comparisons on Dayton Dataset in a2g Direction

Results generated by different methods at 256×256 resolution in the a2g direction on the Dayton dataset. These samples were randomly selected for visualization purposes.



State-of-the-art Comparisons on Dayton Dataset in g2a Direction

Results generated by different methods at 256×256 resolution in the g2a (ground-to-aerial) direction on the Dayton dataset. These samples were randomly selected for visualization purposes.


Code, Data and Trained Models

Please visit our GitHub repo.


Acknowledgement

This research was partially supported by National Institute of Standards and Technology Grant 60NANB17D191 (YY, JC), Army Research Office Grant W911NF-15-1-0354 (JC), and a gift donation from Cisco Inc. (YY).


Related Work