Turn a Horse into a Zebra and vice versa with the Magic of Self-Supervised Learning

11 min read · May 3, 2022


This article explains how CycleGAN (short for Cycle-consistent GAN) works. CycleGAN is well known for its demo that translates a horse into a zebra and vice versa.

In the previous article, we discussed pix2pix, which performs similar image translations. In fact, the same people who worked on pix2pix developed CycleGAN to overcome problems in pix2pix. So, let’s first look at those problems, since knowing them makes CycleGAN easier to understand.

The Inconvenient Truth about Pix2Pix

It Needs Pairs of Images for Training

Pix2pix converts the contents of an image into a different style, a task called “image-to-image translation”. For example, you can generate a photo-like image from a sketch. However, since pix2pix uses supervised learning, we need a large number of image pairs for training.

In a paired image set, x1 is paired with y1, x2 is paired with y2, and so on. For every input (condition) image xi, there must be a corresponding target (label) image yi. We need many such pairs to train a model that can robustly handle unseen input images. However, few image-to-image translation datasets exist because preparing them takes time and effort. Collecting a large amount of labeled data is a common problem in all supervised learning, but it is especially troublesome for image-to-image translation because every input needs its own matching target image.
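To make the pairing requirement concrete, here is a minimal Python sketch. It is not pix2pix code: images are toy lists of pixel values, and the names `make_paired_dataset` and `mean_abs_error` are hypothetical helpers introduced only for illustration. It shows that a supervised setup cannot even build its training set unless every xi has a matching yi, and that the error is measured against that specific target.

```python
from typing import List, Sequence, Tuple

# Assumption: a toy stand-in for an image, a flat list of pixel values in [0, 1].
Image = List[float]


def make_paired_dataset(
    inputs: Sequence[Image], targets: Sequence[Image]
) -> List[Tuple[Image, Image]]:
    """Pair inputs[i] with targets[i]; fail if any input lacks a target."""
    if len(inputs) != len(targets):
        raise ValueError("every input image x_i needs a corresponding target y_i")
    return list(zip(inputs, targets))


def mean_abs_error(generated: Image, target: Image) -> float:
    """Mean absolute pixel difference between an output and its paired target."""
    if len(generated) != len(target):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(g - t) for g, t in zip(generated, target)) / len(target)


# Hypothetical toy data: two "sketches" (x) and their matching "photos" (y).
sketches = [[0.0, 1.0], [1.0, 1.0]]
photos = [[0.2, 0.8], [0.9, 0.7]]

pairs = make_paired_dataset(sketches, photos)
for x_i, y_i in pairs:
    # Placeholder "generator" that returns its input unchanged; a real model
    # would be trained to make this error small for each pair.
    output = x_i
    print(mean_abs_error(output, y_i))
```

If one list is longer than the other, `make_paired_dataset` raises an error, which mirrors the practical problem: an input image without its matching target cannot be used for this kind of supervised training.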

One-way Image Generation Training

In pix2pix, we train one generator network in one-way image generation. For example, let’s suppose that a generator translates from a black-and-white sketch into a colored image. If we want to perform a reverse image-to-image translation (from a colored image to a black-and-white image), we need to separately…