Cascading Modular Network

A Unified Architecture for Multimodal Image Synthesis

Shichong Peng

Simon Fraser University

Alireza Moazeni

Simon Fraser University

Ke Li

Simon Fraser University






Multimodal vs Unimodal Prediction

Below we show the advantages of multimodal prediction compared to unimodal prediction.


We propose a modular architecture that captures different levels of detail in a coarse-to-fine manner.
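As a rough illustration of what a coarse-to-fine cascade can look like, the sketch below chains toy modules that each double the resolution and add detail driven by their own latent code. The doubling factor, the per-module latent code, and the names `upsample_2x`, `toy_module` and `run_cascade` are assumptions made for illustration only, not the architecture's actual layers.

```python
import numpy as np

def upsample_2x(image):
    """Nearest-neighbour upsampling that doubles height and width."""
    return image.repeat(2, axis=0).repeat(2, axis=1)

def toy_module(image, latent, detail_scale):
    """Hypothetical module: upsample the coarse input, then add latent-driven detail."""
    return upsample_2x(image) + detail_scale * latent

def run_cascade(coarse, n_modules, rng):
    """Apply n_modules in sequence, each refining the previous module's output."""
    image = coarse
    for level in range(n_modules):
        height, width = image.shape
        # Each module receives its own latent code at its output resolution.
        latent = rng.standard_normal((2 * height, 2 * width))
        # Later modules add finer, lower-amplitude detail (an arbitrary choice).
        image = toy_module(image, latent, detail_scale=0.5 ** (level + 1))
    return image

rng = np.random.default_rng(0)
coarse = rng.random((4, 4))
out_a = run_cascade(coarse, n_modules=4, rng=rng)
out_b = run_cascade(coarse, n_modules=4, rng=rng)
print(out_a.shape, float(np.abs(out_a - out_b).mean()))
```

Under these assumptions, four doubling stages give a 16x increase in width and height, and rerunning the cascade with fresh latent codes yields a different output for the same input.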

Hierarchical Sampling

We propose hierarchical sampling, a more efficient sampling strategy for Implicit Maximum Likelihood Estimation (IMLE).

16x Super-Resolution

We use our method to increase the width and height of input images by a factor of 16. Toggle for our results (CAM-Net), RFB-ESRGAN and conditional IMLE (cIMLE).

Image Colourization

We use our method to colourize a grayscale image. Toggle for our results (CAM-Net) and those of Colorful Image Colorization, Let there be Color, Learning Representations for Automatic Colorization and cIMLE.

Image Synthesis From Scene Layouts

We use our method to generate diverse images from scene layouts. Toggle for our results (CAM-Net) and cIMLE.

Image Decompression

We use our method to recover plausible images from a heavily compressed image. Toggle for our results (CAM-Net), DnCNN and cIMLE.

Modelling Joint vs Marginal Distributions

The marginal distribution captures the variability in one variable, whereas the joint distribution captures variability across multiple variables. The marginal distribution alone does not capture correlations between variables. Below we show a case where modelling just the marginal distributions leads to spurious samples.

The joint distribution is visualized at the centre, whereas the marginal distributions are visualized around the boundary. Red points represent samples from the joint distribution and pink points are sampled from independent marginal distributions. As shown above, pink points may fall outside the probable regions of the joint distribution.
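The same effect can be reproduced numerically. The sketch below is a toy example under assumed settings (a two-mode Gaussian mixture, a fixed radius, and the helper names are all hypothetical): it draws points from a joint distribution, then builds independent-marginal samples by shuffling each coordinate separately, which preserves both marginals but destroys the correlation.

```python
import numpy as np

CENTRES = np.array([[-2.0, -2.0], [2.0, 2.0]])  # assumed modes of the toy joint distribution

def sample_joint(n, rng):
    """Draw n points from a two-mode mixture in which x and y are strongly correlated."""
    modes = rng.integers(0, 2, size=n)
    return CENTRES[modes] + 0.3 * rng.standard_normal((n, 2))

def sample_independent_marginals(joint_samples, rng):
    """Keep each coordinate's marginal but break the dependence between coordinates."""
    x = rng.permutation(joint_samples[:, 0])
    y = rng.permutation(joint_samples[:, 1])
    return np.stack([x, y], axis=1)

def fraction_near_modes(points, radius=1.5):
    """Fraction of points lying within `radius` of the nearest mode."""
    dists = np.linalg.norm(points[:, None, :] - CENTRES[None, :, :], axis=2)
    return float(np.mean(dists.min(axis=1) < radius))

rng = np.random.default_rng(0)
joint = sample_joint(5000, rng)
marginal = sample_independent_marginals(joint, rng)
frac_joint = fraction_near_modes(joint)
frac_marginal = fraction_near_modes(marginal)
print(f"near a mode: joint={frac_joint:.2f}, independent marginals={frac_marginal:.2f}")
```

In this toy setting, roughly half of the independent-marginal points pair an x value from one mode with a y value from the other, so they land in regions where the joint distribution has almost no mass.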

In the case of colourization, the colours of nearby pixels are highly correlated. Zhang et al. proposed a method that models only the marginal distributions. Below we compare samples from Zhang et al. with samples from CAM-Net, which models the joint distribution. As shown, samples from the marginal distributions (Zhang et al.) are spatially inconsistent, whereas samples from the joint distribution (CAM-Net) are not.

Implicit Maximum Likelihood Estimation (IMLE)

Below we show a conceptual illustration of how IMLE works.
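The core idea of IMLE can also be sketched in a few lines: draw samples from the generator, match each data example to its nearest sample, and update the generator to pull each matched sample toward its example. The code below is a minimal sketch on assumed toy settings (2-D Gaussian data, an affine generator, plain gradient descent with hand-picked hyperparameters); it is not the training procedure used for CAM-Net.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a 2-D Gaussian blob (a hypothetical stand-in for real images).
data_mean = np.array([3.0, -1.0])
data = data_mean + 0.5 * rng.standard_normal((200, 2))

# Hypothetical generator: an affine map of the latent code, T(z) = z @ W + b.
W = np.eye(2)
b = np.zeros(2)

n_samples, lr, n_steps = 50, 0.02, 1000  # assumed hyperparameters
dist_before = float(np.linalg.norm(b - data_mean))
losses = []

for _ in range(n_steps):
    # 1. Draw latent codes and push them through the generator.
    z = rng.standard_normal((n_samples, 2))
    generated = z @ W + b

    # 2. For every data example, find its nearest generated sample.
    sq_dists = ((data[:, None, :] - generated[None, :, :]) ** 2).sum(axis=2)
    nearest = sq_dists.argmin(axis=1)

    # 3. Loss: mean squared distance between each example and its nearest sample.
    residual = generated[nearest] - data
    losses.append(float((residual ** 2).sum(axis=1).mean()))

    # 4. Gradient step that pulls the matched samples toward their examples.
    grad_W = 2.0 * z[nearest].T @ residual / len(data)
    grad_b = 2.0 * residual.mean(axis=0)
    W -= lr * grad_W
    b -= lr * grad_b

dist_after = float(np.linalg.norm(b - data_mean))
print(f"first loss={losses[0]:.3f}, last loss={losses[-1]:.3f}")
print(f"offset from data mean: before={dist_before:.2f}, after={dist_after:.2f}")
```

Because every data example is matched to some sample, no example is ignored, and the samples are drawn toward the region occupied by the data.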