Text to Image Generation (Reverse Image Captioning): This task is the reverse of image captioning. Here, we try to generate the image that best fits a given textual description.
Here is a sample output of the network.
In addition, please `pip install` the following packages:

- tensorboard
- tensorboardX
- tensorboard-pytorch
- python-dateutil
- easydict
- pandas
- torchfile
- wget
- Tkinter
You will first need to download the model files and word embeddings. The embedding files (utable and btable) are quite large (>2GB) so make sure there is enough space available. The encoder vocabulary can be found in dictionary.txt.
```shell
wget http://www.cs.toronto.edu/~rkiros/models/dictionary.txt
wget http://www.cs.toronto.edu/~rkiros/models/utable.npy
wget http://www.cs.toronto.edu/~rkiros/models/btable.npy
wget http://www.cs.toronto.edu/~rkiros/models/uni_skip.npz
wget http://www.cs.toronto.edu/~rkiros/models/uni_skip.npz.pkl
wget http://www.cs.toronto.edu/~rkiros/models/bi_skip.npz
wget http://www.cs.toronto.edu/~rkiros/models/bi_skip.npz.pkl
```
The following shapes/symbols are available: [percentage, omega, copyright, beta, ampersand, sigma, right arrow, down arrow, octagon, minus, euro, cube, up arrow, thunder, ellipse, plus, circle, question mark, square braces, curly braces, left arrow, semicolon, heptagon, less than, infinity, rectangle, heart, root, set union, pie, hashtag, double horizontal arrow, square, cloud, pound, asterisk, dollar, pentagon, star, multiplication, double vertical arrow, phi, cent, hexagon, equality, alpha, images, lambda, triangle, set intersection, greater than, exclamation mark]
To run the GUI:

1. Clone the repository: `git clone https://github.com/aditya30394/Reverse-Image-Captioning.git`
2. Change into the `Reverse-Image-Captioning` directory.
3. Run `chmod 777` on the scripts if needed. This step may not be necessary.
4. Run `python download_skipthought.py` in the terminal to download the necessary skip-thought model files. Please note that these files are very large (approx. 4 GB in total), so make sure you have enough space. Alternatively, you can use the `wget` commands listed above to download the files.
5. Run `python gui_rev_img_cap_model.py` in the terminal to start the GUI.
6. Type a description and click the Generate Image button. See the list of available shapes above. You can use left-right and top-bottom combinations to describe your image; see the figure at the top of this README for examples. Do not use punctuation marks such as commas or periods in your description.
7. Click the Exit button to close the GUI.

To train the model:

1. The training images are in the `data/images` folder (already present in the repo); the associated data files go in the `data` folder.
2. Clone the repository: `git clone https://github.com/aditya30394/Reverse-Image-Captioning.git`
3. Change into the `Reverse-Image-Captioning` directory.
4. Run `chmod 777` on the scripts if needed. This step may not be necessary.
5. Make sure the downloaded skip-thought model files are in the `data` folder. This is extremely important.
6. Run `python main.py --log_step 80` in the terminal to start the training.

Pretrained Model: checkpoints are saved in the `checkpoints` folder. To resume training from a saved checkpoint, place it in `checkpoints` and run `python main.py --log_step 80 --resume_epoch 184`.

The base code for the DNN model can be found below:
The base model comes from an earlier paper, Generative Adversarial Text to Image Synthesis. The model described in that paper uses a pre-trained text embedding, learned with character-level processing (Char-CNN-RNN) on images and their corresponding descriptions together; the resulting embedding is a vector of size 1024. In my work, I have replaced this character-level encoding with the more robust Skip-Thought Vectors (code). These vectors encode the text description of an image into a 4800-dimensional vector. A reducer layer takes this large vector and returns a 1024-dimensional vector, which is then used in the GAN. The parameters of this layer are learned during training.
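The reducer described above amounts to a single learned projection from the 4800-d skip-thought space to the 1024-d embedding the GAN expects. A minimal PyTorch sketch follows; the module name, activation choice, and dimensions other than 4800 and 1024 are illustrative assumptions, not the repository's actual code:

```python
import torch
import torch.nn as nn

class EmbeddingReducer(nn.Module):
    """Projects a 4800-d skip-thought vector down to the 1024-d
    embedding expected by the GAN (hypothetical module name)."""
    def __init__(self, in_dim=4800, out_dim=1024):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)   # parameters learned during GAN training
        self.act = nn.LeakyReLU(0.2)

    def forward(self, skip_thought_vec):
        return self.act(self.fc(skip_thought_vec))

reducer = EmbeddingReducer()
batch = torch.randn(8, 4800)   # a batch of skip-thought vectors
out = reducer(batch)
print(out.shape)               # torch.Size([8, 1024])
```

Because the layer sits between the frozen sentence encoder and the GAN, its weights are updated by the same backward pass that trains the generator.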
The following diagram shows the high-level design of the neural network:
In my project work, I have replaced the said pre-trained embedding with the skip-thought sentence embedding, which is a vector of size 4800. I added a dense layer that takes this huge vector and reduces it to 1024 dimensions so that I can reuse the base model. The figure below shows an overview of the skip-thoughts model. It consists of three components:
The decoders are trained to minimize the reconstruction error of the previous sentence x(i-1) and the next sentence x(i+1), given the fixed-size embedding z(i) produced by the encoder.
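The three-component structure (one encoder, two decoders sharing the encoder's embedding) can be sketched as a toy PyTorch module. This is a structural illustration only, with made-up small dimensions; it is not the trained model from the skip-thoughts paper:

```python
import torch
import torch.nn as nn

class SkipThoughtSketch(nn.Module):
    """Toy skip-thoughts structure: a GRU encoder produces a fixed-size
    embedding z(i) of sentence x(i); two GRU decoders, conditioned on
    z(i), reconstruct the previous sentence x(i-1) and next sentence
    x(i+1). All dimensions here are illustrative."""
    def __init__(self, vocab=1000, emb=64, hid=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.dec_prev = nn.GRU(emb, hid, batch_first=True)
        self.dec_next = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)       # per-step word logits

    def forward(self, x_i, x_prev, x_next):
        _, z = self.encoder(self.embed(x_i))   # z: fixed-size embedding of x(i)
        # Both decoders start from z and predict the neighbouring sentences.
        h_prev, _ = self.dec_prev(self.embed(x_prev), z)
        h_next, _ = self.dec_next(self.embed(x_next), z)
        return self.out(h_prev), self.out(h_next), z.squeeze(0)
```

Training would minimize the cross-entropy of both decoder outputs against the tokens of x(i-1) and x(i+1); at inference time only the encoder is kept and z(i) serves as the sentence vector.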
Usefulness: Similar to Word2Vec, which generates close vectors for words with similar meanings, this model generates vector representations of sentences that are semantically similar. In this project, I describe the positions of objects/shapes in images. Take one example: "alpha on the left and star on the right". The same sentence can be written in other ways, such as "alpha to the left of star", "star to the right of alpha", or "star on the right and alpha on the left". All these sentences convey the same information, and a character-level encoding is not sufficient to capture this. The skip-thought vectors for all these sentences will be close to each other, thereby making the GAN model more robust.
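"Close to each other" is typically measured with cosine similarity between embedding vectors. The sketch below shows the metric itself; the three vectors are made-up stand-ins for illustration, not real 4800-d skip-thought embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in vectors: the two paraphrases map to nearby points,
# the unrelated description to a distant one.
v_a = np.array([0.9, 0.1, 0.3])    # "alpha on the left and star on the right"
v_b = np.array([0.8, 0.2, 0.3])    # "alpha to the left of star"
v_c = np.array([-0.5, 0.9, -0.2])  # unrelated description

print(cosine_similarity(v_a, v_b) > cosine_similarity(v_a, v_c))  # True
```

With real skip-thought vectors, the paraphrases listed above would score a high mutual similarity, which is what makes the downstream GAN conditioning robust to rewordings.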
In the skip-thought vectors paper, the authors evaluated the vectors with linear models on 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking, and 5 standard classification benchmarks. They showed empirically that skip-thoughts yield generic representations that perform robustly across all the tasks considered. This is another reason for choosing these vectors in this project.
References
This project is licensed under the MIT License - see the LICENSE file for details