RNN text generation using Keras for word and character level models.
MIT License
Recurrent neural network (RNN) text generation using Keras. Generating text with neural networks is fun, and there are a ton of projects and standalone scripts to do it.
This project does not provide any groundbreaking features over what is already out there, but it attempts to be a good, well-documented place to start playing with text generation within the Keras framework. It handles the nitty-gritty details of loading a text corpus and feeding it into a Keras model.
Supports both a character-level model and a word-level model (with tokenization). Supports saving a model and model metadata to disk for later sampling. Supports using a validation set. Uses stateful RNNs within Keras for more efficient sampling.
pip install tensorflow-gpu # Or tensorflow or Theano
pip install keras colorama
# Train on the included Shakespeare corpus with default parameters
python train.py
# Sample the included Shakespeare corpus with default parameters
python sample.py
# Train with long samples, more layers, more epochs, and live sampling
python train.py --seq-length 100 --num-layers 4 --num-epochs 100 --live-sample
# Sample with a random seed for 500 characters and more random output
python sample.py --length 500 --temperature 2.0
# Train on a new dataset with a word level model and larger embedding
python train.py --data-dir ~/datasets/twain --word-tokens --embedding-size 128
# Sample new dataset with a custom seed
python sample.py --data-dir ~/datasets/twain --seed "History doesn't repeat itself, but"
There are two invokable scripts, train.py and sample.py, which should be run in succession. Each operates on a data directory whose contents are as follows:

input.txt: the training corpus, required by train.py
validate.txt: an optional validation corpus, used by train.py
model.h5: the model weights, created by train.py and required by sample.py
model.pkl: the model metadata, created by train.py and required by sample.py
The input.txt file should contain whatever texts you would like to train the RNN on, concatenated into a single file. The text processing is by default newline aware, so if your files contain hard-wrapped prose, you may want to remove the wrapping newlines. The validate.txt file should be formatted similarly to input.txt. It is totally optional, but useful for monitoring overfitting, etc.
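If your corpus is hard wrapped, a minimal preprocessing sketch (assuming blank lines separate paragraphs, as in typical Project Gutenberg texts) might look like:

```python
# Minimal sketch for unwrapping hard-wrapped prose, assuming blank
# lines separate paragraphs: single newlines inside a paragraph become
# spaces, while paragraph breaks are preserved.
def unwrap(text):
    paragraphs = text.split("\n\n")
    return "\n\n".join(" ".join(p.splitlines()) for p in paragraphs)

wrapped = "To be, or not\nto be.\n\nThat is\nthe question."
print(unwrap(wrapped))
# To be, or not to be.
#
# That is the question.
```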
There are two main modes to process the input: a character-level model and a word-level model. Under the character-level model, we simply lowercase the input text and feed it into the RNN character by character. Under the word-level model, the input text is split into individual word tokens, and each token is given a separate value before being fed into the RNN. Words are tokenized roughly following the Penn Treebank approach. By default we heuristically attempt to "detokenize" the text after sampling, but this can be disabled with --pristine-output.
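For a feel of what Penn-Treebank-style tokenization does, here is a rough illustration; this is a simplified stand-in for demonstration, not the project's actual tokenizer:

```python
import re

# Rough Penn-Treebank-flavored tokenization: lowercase, split
# punctuation off words, and split contractions at the apostrophe.
def rough_tokenize(text):
    text = text.lower()
    text = re.sub(r"([.,!?;:])", r" \1 ", text)  # split off punctuation
    text = re.sub(r"'", " '", text)              # doesn't -> doesn 't
    return text.split()

tokens = rough_tokenize("History doesn't repeat itself, but it rhymes.")
# -> ['history', 'doesn', "'t", 'repeat', 'itself', ',',
#     'but', 'it', 'rhymes', '.']
```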
Training examples are pulled from the input.txt file in overlapping windows of --seq-length characters (or words) before being fed into the model, with a new window starting every --seq-step characters (or words). For example, a --seq-length of 50 and a --seq-step of 25 would pull each character in the input into two separate training sequences. sample.py reads the model.h5 and model.pkl files generated by train.py.
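The overlapping windowing controlled by --seq-length and --seq-step can be sketched as follows (a simplified illustration, not the project's actual data loader):

```python
# Pull overlapping training sequences from a token stream: one window
# of seq_length tokens starting every seq_step tokens.
def make_sequences(tokens, seq_length=50, seq_step=25):
    return [tokens[i:i + seq_length]
            for i in range(0, len(tokens) - seq_length + 1, seq_step)]

text = "abcdefghij" * 10          # 100 characters
seqs = make_sequences(list(text), seq_length=50, seq_step=25)
# Windows start at 0, 25, and 50, so each character away from the
# edges of the corpus lands in two separate training sequences.
```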
Why not just use char-rnn-tensorflow or word-rnn-tensorflow?
If your goal is just computational speed or low memory footprint, go with those projects! Pretty much the appeal here is using Keras. If you want an easy declarative framework to try new approaches, this is a good place to start.
There are also a few additional features here, such as fancier word tokenization and support for a hold out validation set, that may be of use depending on your application.
Can we add a command line flag for a different optimizer, RNN cell, etc.?
Most of the command line flags exposed are for working with different datasets of varying sizes. If you want to change the structure of the RNN, just change the code. That's where Keras excels.
Can I use a different tokenization scheme for my word level model?
Yep! Pass the --pristine-input flag and use a fancier tokenizer as a preprocessing step. Tokens will be formed by calling text.split() on the input.
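A sketch of that preprocessing step, using a hypothetical stand-in tokenizer:

```python
# Hedged sketch of preparing input for --pristine-input: run whatever
# tokenizer you like ahead of time, then write the tokens
# space-separated so that text.split() recovers exactly your tokens.
def my_tokenizer(text):           # hypothetical stand-in tokenizer
    return text.lower().split()

raw_text = "To be, or not to be."
prepared = " ".join(my_tokenizer(raw_text))
# text.split() on the prepared input yields the original token stream:
assert prepared.split() == my_tokenizer(raw_text)
```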