This project combines CNN and RNN knowledge to build a network that automatically produces captions for an input image, using Python and PyTorch.
This implementation of the CVND-Image-Captioning-Project is built for Udacity's Computer Vision Nanodegree.
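At a high level, a CNN encoder maps the image to a fixed-size feature vector, and an RNN decoder generates the caption word by word conditioned on that vector. The sketch below illustrates this encoder-decoder pattern with a toy convolutional stack in place of the pretrained CNN the project uses; the class names, layer sizes, and shapes are illustrative, not the notebooks' exact code.

```python
import torch
import torch.nn as nn

class EncoderCNN(nn.Module):
    """Illustrative encoder: a tiny conv stack standing in for a
    pretrained CNN, producing a fixed-size image embedding."""
    def __init__(self, embed_size):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(8, embed_size)

    def forward(self, images):
        x = self.conv(images).flatten(1)   # (batch, 8)
        return self.fc(x)                  # (batch, embed_size)

class DecoderRNN(nn.Module):
    """Illustrative decoder: embeds caption tokens and runs an LSTM,
    with the image feature as the first input step."""
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Prepend the image feature to the embedded caption tokens.
        inputs = torch.cat([features.unsqueeze(1),
                            self.embed(captions[:, :-1])], dim=1)
        out, _ = self.lstm(inputs)
        return self.fc(out)                # (batch, seq_len, vocab_size)
```

During training, the decoder receives the ground-truth caption (teacher forcing); at inference time it instead feeds its own predictions back in, one word at a time.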
Libraries
Create (and activate) a new environment, named cv-nd, with Python 3.6. If prompted to proceed with the install (Proceed [y]/n), type y.
Linux or Mac:

conda create -n cv-nd python=3.6
source activate cv-nd

Windows:

conda create --name cv-nd python=3.6
activate cv-nd
At this point your command line should look something like: (cv-nd) <User>:P1_Facial_Keypoints <user>$. The (cv-nd) prefix indicates that your environment has been activated, and you can proceed with further package installations.
Install PyTorch and torchvision; this installs the latest version of PyTorch:

conda install pytorch torchvision -c pytorch

Alternatively, for a CPU-only install:

conda install pytorch-cpu -c pytorch
pip install torchvision
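After either install path, a quick sanity check confirms that PyTorch imports and reports whether a GPU is visible:

```python
import torch

# Print the installed version and whether CUDA is available.
print("PyTorch", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```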
Install a few required pip packages, which are specified in the requirements text file (including OpenCV).
pip install -r requirements.txt
Dataset
Clone and build the COCO API:
git clone https://github.com/cocodataset/cocoapi.git
cd cocoapi/PythonAPI
make
cd ..
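The downloaded annotation files are COCO-format JSON, in which each caption is a separate annotation record keyed by image_id. As a sketch of that structure, the snippet below groups captions by image using only the standard library; the field layout matches the COCO captions format, while the file names and sample values are illustrative:

```python
import json
from collections import defaultdict

def captions_by_image(annotation_dict):
    """Group caption strings by image id from a COCO-style captions dict."""
    grouped = defaultdict(list)
    for ann in annotation_dict["annotations"]:
        grouped[ann["image_id"]].append(ann["caption"])
    return dict(grouped)

# Minimal in-memory stand-in for a downloaded annotations file
# (in practice you would json.load() the file from the annotations folder):
sample = {
    "images": [{"id": 1, "file_name": "0001.jpg"}],
    "annotations": [
        {"id": 10, "image_id": 1, "caption": "a dog on a beach"},
        {"id": 11, "image_id": 1, "caption": "a dog runs on sand"},
    ],
}
print(captions_by_image(sample))
```

The COCO API built above (pycocotools) provides the same indexing and more through its COCO class.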
Under Annotations, download:
Under Images, download:
The project is structured as a series of Jupyter notebooks and .py files:
The final output is one sentence per image. In 3_Inference.ipynb, results are shown for a few randomly chosen images.
Overall, the model identifies objects and describes them correctly, but its descriptions of some images are only partly correct.
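Inference of this kind is typically done by greedy sampling: feed the image feature into the decoder, take the argmax word at each step, and feed it back in until an end token appears. A minimal sketch of that loop, with a tiny random-weight decoder and token ids that are purely illustrative:

```python
import torch
import torch.nn as nn

def greedy_sample(embed, lstm, fc, features, end_idx, max_len=20):
    """Greedily decode a caption: feed the image feature first, then feed
    back the argmax token's embedding until <end> or max_len."""
    tokens, states = [], None
    inputs = features.unsqueeze(1)             # (1, 1, embed_size)
    for _ in range(max_len):
        out, states = lstm(inputs, states)     # (1, 1, hidden_size)
        word = fc(out.squeeze(1)).argmax(1)    # (1,)
        tokens.append(word.item())
        if word.item() == end_idx:
            break
        inputs = embed(word).unsqueeze(1)      # (1, 1, embed_size)
    return tokens

# Tiny random-weight example (vocabulary of 12 tokens, <end> = 1):
torch.manual_seed(0)
embed = nn.Embedding(12, 8)
lstm = nn.LSTM(8, 16, batch_first=True)
fc = nn.Linear(16, 12)
caption_ids = greedy_sample(embed, lstm, fc, torch.randn(1, 8), end_idx=1)
print(caption_ids)
```

With trained weights, the resulting token ids would be mapped back to vocabulary words to form the output sentence.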
Correctly captioned images
Almost correctly captioned images