A simple, extensible Markov chain generator.
MIT License
Markovify is a simple, extensible Markov chain generator. Right now, its primary use is for building Markov models of large corpora of text and generating random sentences from them. However, in theory, it could be used for other applications.
Some reasons to use Markovify:
Simplicity. "Batteries included," but it is easy to override key methods.
Models can be stored as JSON, allowing you to cache your results and save them for later.
Text parsing and sentence generation methods are highly extensible, allowing you to set your own rules.
Relies only on pure-Python libraries, and very few of them.
Tested on Python 3.7, 3.8, 3.9, and 3.10.
pip install markovify
import markovify

# Get raw text as string.
with open("/path/to/my/corpus.txt") as f:
    text = f.read()

# Build the model.
text_model = markovify.Text(text)

# Print five randomly-generated sentences
for i in range(5):
    print(text_model.make_sentence())

# Print three randomly-generated sentences of no more than 280 characters
for i in range(3):
    print(text_model.make_short_sentence(280))
Notes:
The usage examples here assume you are trying to markovify text. If you would like to use the underlying markovify.Chain class, which is not text-specific, check out the (annotated) source code.
Markovify works best with large, well-punctuated texts. If your text does not use periods (.) to delineate sentences, put each sentence on its own line and use the markovify.NewlineText class instead of the markovify.Text class.
If you have accidentally read the input text as one long sentence, markovify will be unable to generate new sentences from it due to a lack of beginning and ending delimiters. This can happen if you read a newline-delimited file using markovify.Text instead of markovify.NewlineText. To check for it, evaluate [key for key in txt.chain.model.keys() if "___BEGIN__" in key] (where txt is your model). The expression returns all of the possible sentence-starting words and should return more than one result.
By default, the make_sentence method tries a maximum of 10 times per invocation to make a sentence that does not overlap too much with the original text. If it is successful, the method returns the sentence as a string. If not, it returns None. To increase or decrease the number of attempts, use the tries keyword argument, e.g., call .make_sentence(tries=100).
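Because make_sentence can return None, calling code usually has to handle that case. The following is a minimal sketch rather than part of markovify: collect_sentences is a hypothetical helper that accepts any zero-argument callable returning a string or None, and the demo uses a stand-in instead of a real model.

```python
from typing import Callable, List, Optional


def collect_sentences(make: Callable[[], Optional[str]],
                      wanted: int,
                      max_calls: int = 50) -> List[str]:
    """Call `make` until `wanted` non-None results exist or `max_calls` is hit."""
    results: List[str] = []
    for _ in range(max_calls):
        if len(results) >= wanted:
            break
        sentence = make()
        if sentence is not None:  # None means no acceptable sentence was found
            results.append(sentence)
    return results


# Stand-in for text_model.make_sentence: every other call fails.
_calls = iter([None, "First sentence.", None, "Second sentence."])
demo = collect_sentences(lambda: next(_calls), wanted=2)
```

With a real model, the call would look like collect_sentences(text_model.make_sentence, wanted=5).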
By default, markovify.Text tries to generate sentences that do not simply regurgitate chunks of the original text. The default rule is to suppress any generated sentence that exactly overlaps the original text by 15 words or by 70% of the sentence's word count. You can change this rule by passing max_overlap_ratio and/or max_overlap_total to the make_sentence method. Alternatively, you can disable this check entirely by passing test_output=False.
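To make the overlap rule concrete, here is a rough sketch of one way such a check could work. It is an illustration under stated assumptions, not the library's code: overlaps_too_much is a hypothetical function, it takes the smaller of the two limits as the allowed run length, and it uses plain substring matching on space-joined words.

```python
from typing import List


def overlaps_too_much(words: List[str],
                      original_text: str,
                      max_overlap_ratio: float = 0.7,
                      max_overlap_total: int = 15) -> bool:
    """Return True if a run of words longer than the allowed overlap
    appears verbatim in `original_text`."""
    allowed = min(max_overlap_total, round(max_overlap_ratio * len(words)))
    window = allowed + 1  # a run this long counts as too much overlap
    starts = max(len(words) - allowed, 1)
    for i in range(starts):
        # Simplification: substring matching on the space-joined window.
        if " ".join(words[i:i + window]) in original_text:
            return True
    return False


original = "the quick brown fox jumps over the lazy dog"
copied = "the quick brown fox".split()
novel = "the lazy fox jumps over the quick dog".split()
copied_rejected = overlaps_too_much(copied, original)
novel_rejected = overlaps_too_much(novel, original)
```

Here the four-word sentence copied straight from the original is flagged, while the rearranged sentence passes even though it shares shorter runs with the source.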
State size is the number of words on which the probability of the next word depends.
By default, markovify.Text uses a state size of 2, but you can instantiate a model with a different state size. For example:
text_model = markovify.Text(text, state_size=3)
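The effect of state size is easier to see on a toy chain. The sketch below is a simplified, standard-library illustration and not markovify's implementation: build_chain is a hypothetical function, and the "___END__" sentinel is an assumed counterpart to the "___BEGIN__" marker mentioned in the notes above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

BEGIN, END = "___BEGIN__", "___END__"  # END name is assumed for illustration

Chain = Dict[Tuple[str, ...], Dict[str, int]]


def build_chain(sentences: List[List[str]], state_size: int) -> Chain:
    """Count, for every run of `state_size` words, which word follows it."""
    model: Chain = defaultdict(lambda: defaultdict(int))
    for words in sentences:
        items = [BEGIN] * state_size + words + [END]
        for i in range(len(words) + 1):
            state = tuple(items[i:i + state_size])
            follower = items[i + state_size]
            model[state][follower] += 1
    return model


corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
chain_1 = build_chain(corpus, state_size=1)
chain_2 = build_chain(corpus, state_size=2)
```

With state_size=1, the state ("the",) can be followed by either "cat" or "dog". With state_size=2, the state ("the", "cat") can only be followed by "sat", so larger state sizes track the source text more closely and leave fewer choices at each step.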
With markovify.combine(...), you can combine two or more Markov chains. The function accepts two arguments:
models: A list of markovify objects to combine. These can be instances of markovify.Chain or markovify.Text (or their subclasses), but all must be of the same type.
weights: Optional. A list of ints or floats, exactly as long as models, indicating how much relative emphasis to place on each source. Default: [ 1, 1, ... ].
For instance:
model_a = markovify.Text(text_a)
model_b = markovify.Text(text_b)
model_combo = markovify.combine([ model_a, model_b ], [ 1.5, 1 ])
This code snippet would combine model_a and model_b, but it would also place 50% more weight on the connections from model_a.
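One way to picture weighting is as scaling each model's transition counts before adding them together. The sketch below assumes that interpretation; combine_counts is a hypothetical function operating on plain dictionaries (with single-word string states for brevity), not the library's combine.

```python
from collections import defaultdict
from typing import Dict, List, Optional

Counts = Dict[str, Dict[str, float]]  # state -> follower -> count


def combine_counts(models: List[Counts],
                   weights: Optional[List[float]] = None) -> Counts:
    """Merge transition counts, scaling each model's counts by its weight."""
    if weights is None:
        weights = [1] * len(models)
    if len(weights) != len(models):
        raise ValueError("weights must be the same length as models")
    combined: Counts = defaultdict(lambda: defaultdict(float))
    for model, weight in zip(models, weights):
        for state, followers in model.items():
            for follower, count in followers.items():
                combined[state][follower] += count * weight
    return combined


counts_a = {"the": {"cat": 2}}
counts_b = {"the": {"cat": 1, "dog": 4}}
combo = combine_counts([counts_a, counts_b], [1.5, 1])
```

In this toy example, "cat" ends up with a combined count of 2 * 1.5 + 1 = 4.0, which matches the count for "dog" even though model B saw "dog" four times as often as "cat".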
Once a model has been generated, it may also be compiled for improved text generation speed and reduced size.
text_model = markovify.Text(text)
text_model = text_model.compile()
Models may also be compiled in-place:
text_model = markovify.Text(text)
text_model.compile(inplace = True)
Currently, compiled models may not be combined with other models using markovify.combine(...).
If you wish to combine models, do that first and then compile the result.
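As an intuition for why compiling can speed up generation, the sketch below precomputes cumulative weights once so that each draw becomes a binary search. This is one plausible reading of the idea under stated assumptions; compile_followers and pick are hypothetical names, and the library's internals may differ.

```python
import bisect
import random
from itertools import accumulate
from typing import Dict, List, Tuple

Compiled = Tuple[List[str], List[int]]


def compile_followers(followers: Dict[str, int]) -> Compiled:
    """Precompute the choices and their cumulative weights once."""
    words = list(followers)
    cumulative = list(accumulate(followers.values()))
    return words, cumulative


def pick(compiled: Compiled, rng: random.Random) -> str:
    """Draw one follower in proportion to its count using binary search."""
    words, cumulative = compiled
    r = rng.random() * cumulative[-1]
    return words[bisect.bisect(cumulative, r)]


compiled = compile_followers({"cat": 3, "dog": 1})
sample = pick(compiled, random.Random(0))
```

Without the precomputed list, the cumulative sums would have to be rebuilt on every draw. The compiled form stores two flat lists per state instead of a dictionary.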
Starting with v0.7.2, markovify.Text accepts two additional parameters: well_formed and reject_reg.
Setting well_formed = False skips the step in which input sentences are rejected if they contain one of the 'bad characters' (i.e. ()[]'").
Setting reject_reg to a regular expression of your choice allows you to change the input-sentence rejection pattern. This only applies if well_formed is True, and if the expression is non-empty.
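The interaction between the two parameters can be sketched as a small filter. This is an illustration of the rule as described above, not the library's code: accept_sentence is a hypothetical function, and DEFAULT_REJECT is a simplified stand-in pattern (it would also reject apostrophes inside contractions).

```python
import re

# Simplified stand-in: rejects any sentence containing brackets or quotes.
DEFAULT_REJECT = re.compile(r"[()\[\]'\"]")


def accept_sentence(sentence: str,
                    well_formed: bool = True,
                    reject_reg: str = "") -> bool:
    """Filter only when well_formed is True, using reject_reg when it is
    non-empty and the default pattern otherwise."""
    if not sentence.strip():
        return False  # nothing to learn from an empty sentence
    if not well_formed:
        return True  # rejection step skipped entirely
    pattern = re.compile(reject_reg) if reject_reg else DEFAULT_REJECT
    return pattern.search(sentence) is None


bracketed = "He said (quietly) no."
default_result = accept_sentence(bracketed)
unfiltered_result = accept_sentence(bracketed, well_formed=False)
custom_result = accept_sentence(bracketed, reject_reg=r"\d")
```

With the defaults, the bracketed sentence is rejected. It is accepted when well_formed is False, and also when a custom pattern that only rejects digits replaces the default.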
markovify.Text
The markovify.Text class is highly extensible; most methods can be overridden. For example, the following POSifiedText class uses NLTK's part-of-speech tagger to generate a Markov model that obeys sentence structure better than a naive model. (It works; however, be warned: pos_tag is very slow.)
import markovify
import nltk
import re

class POSifiedText(markovify.Text):
    def word_split(self, sentence):
        words = re.split(self.word_split_pattern, sentence)
        words = [ "::".join(tag) for tag in nltk.pos_tag(words) ]
        return words

    def word_join(self, words):
        sentence = " ".join(word.split("::")[0] for word in words)
        return sentence
Alternatively, you can use spaCy, which is much faster:
import markovify
import re
import spacy

nlp = spacy.load("en_core_web_sm")

class POSifiedText(markovify.Text):
    def word_split(self, sentence):
        return ["::".join((word.orth_, word.pos_)) for word in nlp(sentence)]

    def word_join(self, words):
        sentence = " ".join(word.split("::")[0] for word in words)
        return sentence
The most useful markovify.Text methods you can override are:
sentence_split
sentence_join
word_split
word_join
test_sentence_input
test_sentence_output
For details on what they do, see the (annotated) source code.
It can take a while to generate a Markov model from a large corpus. Sometimes you'll want to generate once and reuse it later. To export a generated markovify.Text model, use my_text_model.to_json(). For example:
with open("sherlock.txt") as f:
    corpus = f.read()
text_model = markovify.Text(corpus, state_size=3)
model_json = text_model.to_json()
# In theory, here you'd save the JSON to disk, and then read it back later.
reconstituted_model = markovify.Text.from_json(model_json)
reconstituted_model.make_short_sentence(280)
>>> 'It cost me something in foolscap, and I had no idea that he was a man of evil reputation among women.'
You can also export the underlying Markov chain on its own (i.e., excluding the original corpus and the state_size metadata) via my_text_model.chain.to_json().
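The "save the JSON to disk" step mentioned in the example comment can be sketched with plain file I/O, since to_json() returns a string. The helper names save_text and load_text, the cache path, and the placeholder payload are all assumptions for illustration; a real model's JSON will look different.

```python
import json
import os
import tempfile


def save_text(path: str, payload: str) -> None:
    """Write an already-serialized JSON string to disk."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(payload)


def load_text(path: str) -> str:
    """Read the serialized JSON string back."""
    with open(path, encoding="utf-8") as f:
        return f.read()


# Placeholder standing in for text_model.to_json().
model_json = json.dumps({"placeholder": "model data"})

cache_path = os.path.join(tempfile.mkdtemp(), "model.json")
save_text(cache_path, model_json)
restored_json = load_text(cache_path)
# With markovify installed: markovify.Text.from_json(restored_json)
```

The string read back from disk is what you would pass to markovify.Text.from_json(...) in place of model_json.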
markovify.Text models from very large corpora
By default, the markovify.Text class loads, and retains, your textual corpus, so that it can compare generated sentences with the original (and only emit novel sentences). However, with very large corpora, loading the entire text at once (and retaining it) can be memory-intensive. To overcome this, you can (a) tell Markovify not to retain the original:
with open("path/to/my/huge/corpus.txt") as f:
    text_model = markovify.Text(f, retain_original=False)

print(text_model.make_sentence())
And (b) read in the corpus line-by-line or file-by-file and combine them into one model at each step:
import os
import markovify

combined_model = None
for (dirpath, _, filenames) in os.walk("path/to/my/huge/corpus"):
    for filename in filenames:
        with open(os.path.join(dirpath, filename)) as f:
            model = markovify.Text(f, retain_original=False)
            if combined_model:
                combined_model = markovify.combine(models=[combined_model, model])
            else:
                combined_model = model

print(combined_model.make_sentence())
markovify to generate random Reddit submissions and comments based on a subreddit's previous activity. [code]
markovify-powered quiz that challenges you to tell the difference between "two file titles relating to matters of [Australian] national security": one real and one fake. [code]
markovify-powered Twitter bot attached to a printer. Presented by Helen J Burgess at Babel Toronto 2015. [code]
Have other examples? Pull requests welcome.
Many thanks to the following GitHub users for contributing code and/or ideas:
Initially developed at BuzzFeed.