Emily Sandvik, Cynthia Richey


Our final project involved training two autoencoders: one for Pokémon images and one for My Little Pony original characters (OCs). The end goal was to contort the learning process in such a way that the autoencoder could learn to convert Pokémon images into MLP OCs. It worked, kind of!


(Figure: an example conversion of the Pokémon Zebstrika into a pony.)

(Admittedly, this example is very cherry-picked.)


We got the ponies from a zip file of DeviantArt images that someone posted. They needed to be preprocessed, which is what the green bars represent. The Pokémon dataset is more comprehensive and uniform because it's standardized, from here.
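The writeup doesn't spell out the preprocessing steps, but a typical pipeline for a scraped image set like this filters out images that are too small and scales pixel values into $[0, 1]$. A minimal pure-Python sketch of that idea (the function names and the 64-pixel target are our assumptions, not from the project):

```python
def normalize(pixels, max_value=255):
    # Scale 8-bit pixel values to floats in [0, 1].
    return [p / max_value for p in pixels]

def keep_image(width, height, target=64):
    # Keep only images large enough to be cropped/resized
    # to a square of the target resolution.
    return min(width, height) >= target
```

In practice this would run over image arrays (e.g. with NumPy or PIL) rather than flat lists, but the logic is the same.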

The architecture of our model is as follows:

(Figure: architecture diagram of the model.)

The two decoders were trained separately. The model was trained using Adam with a learning rate of $10^{-4}$. Additionally, noise was applied to the images during training to encourage the model to generalize better to unseen input: each value in the image was multiplied by $(1 + 0.12 \times \varepsilon)$, where $\varepsilon$ is an independent sample from the standard normal distribution, drawn separately for each value. The performance of the model is fairly sensitive to the $0.12$ hyperparameter.
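The noising step can be sketched in a few lines of pure Python (the real training code would operate on image tensors in whatever framework the model uses; `noisy` is our name for the helper, not the project's):

```python
import random

def noisy(pixels, strength=0.12, gauss=random.gauss):
    # Multiply each pixel value by (1 + strength * eps),
    # where eps ~ N(0, 1) is drawn independently per value.
    return [p * (1.0 + strength * gauss(0.0, 1.0)) for p in pixels]
```

Note that this is multiplicative noise, so black (zero-valued) pixels are left unchanged; only nonzero values get perturbed.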

Since the purpose of the model was to generate images that look good, we did not keep a holdout set for testing purposes. However, choosing some Pokémon at random from the training set gives this result:

(Figure: reconstructions of Pokémon chosen at random from the training set.)

Admittedly, this leaves something to be desired—our model does not do a very good job for most Pokémon.


Technically, calling the split part of the network a “decoder” is a misnomer, since it compresses a $15\times15\times16$ feature map down to a $1024$-dimensional vector. The reason for this is that the overwhelming majority of the parameters in this network are in the fully-connected layers between the convolutional and dense layers, and the best performance was achieved by making the split parts of the model as powerful as possible. The shared convolutional elements of the network are essential for forcing the decoders to learn at least some things in common with each other; otherwise they learn completely separate ways of representing the image, even with very noisy input.
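To see why the fully-connected layers dominate the parameter count, consider just the weight matrix connecting the flattened $15\times15\times16$ feature map to a $1024$-unit dense layer (biases ignored; this is back-of-the-envelope arithmetic, not the project's exact layer list):

```python
# Weights in one fully-connected layer mapping the flattened
# 15x15x16 conv output to a 1024-unit dense layer.
conv_features = 15 * 15 * 16   # 3600 values after flattening
dense_units = 1024
fc_params = conv_features * dense_units
print(fc_params)  # 3686400
```

A single such layer holds roughly 3.7 million weights, far more than a typical small conv layer (a $3\times3$ conv from 16 to 16 channels has only $3 \times 3 \times 16 \times 16 = 2304$).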

At first, the encoders had a fully connected layer at the end, making them “true” encoders. There was also no noising of the input image. Here is an example of what the images looked like:

(Figure: top row: input image; second row: output from the Pokémon model; third row: error between the top two rows; bottom row: output from the pony model.)
