A basic generative text algorithm based on Markov Chains. Processes a given text file of words to predict further text based on the input.
This is a representation of a very simple text prediction algorithm which processes a text file and returns K words of predicted text based on a "seed" file provided in a command line argument. The implementation uses a directed graph to assign weights to each word based on the frequency of that word coming after the previous one, and repeats this process over and over to string a sentence together. This is known as a "Markov Chain" model, and is a very simple way to predict text. More info on Markov Chains can be found here.
The program requires either 3 or 4 arguments passed into the args[]
parameter when running the program.
The first three arguments are always the same. The inclusion of the 4th argument changes the program's behavior.
Note: The program will not run unless the three arguments are passed properly.
.txt
file. The text file should contain a list of words. All punctuation,all
or one
, or should not be present at all
all
is passed in, the output will be generated by randomly selecting words based on probabilityone
is passed in, the output will always select the most probable word for the one it's looking at.A valid command line call for this function would be the following:
javac comprehensive/TextGenerator.java comprehensive/GenerativeModel.java
java comprehensive.TextGenerator sample.txt hello 10 one
Or, without the 4th argument it would be:
javac comprehensive/TextGenerator.java comprehensive/GenerativeModel.java
java comprehensive.TextGenerator sample.txt hello 4
To properly show the requirements the assignment is under, I've included the broad strokes of the assignment description below which detail grading structure and code functionality requirements.
All text shown below is directly taken from the assignment page.
You have been asked to create a program that can take a text file as input and generate some text as output that has some similar characteristics as the input. It is not necessary that the output represent correct language or even be meaningful, but it needs to contain similar words in similar patterns. As a part of your program, you need to create at least two Java classes. One of these represents the model, which processes the input and provides text generation capability. The other represents the command-line application that allows the user to interface with the model. Specific requirements are detailed below.
To simplify the issues related to dealing with capitalization and punctuation (periods at the end of sentences, apostrophes for possessives/contractions, etc), we'll tweak the input when building our model to reduce the number of different words we consider. For the purposes of this assignment, words cannot contain whitespace or punctuation. If a sequence of characters without whitespace contains punctuation characters, the "word" is defined as the characters before the first punctuation character, if any.
As an example, in this string: "I'm a little Teapot short and stout. Here's my handle, 'ere's my spout" the words are: [i, a, little, teapot, short, and, stout, here, my, handle, my, spout]
All letters are lowercased, and anything after a punctuation character is dropped. Note that "'ere's" contributes no words to the output because there are no letters before the apostrophe.
This "word simplification" strategy should help us produce more interesting outputs from smaller training input files.
For this assignment, we are defining the probability of a word coming after another as the number of times this occurred in the input text divided by the total number of words that came after the first word. This is a number between 0.0 and 1.0 inclusive. This means that somewhere in your code you need to be keeping track of this probability for each possible two-word sequence. There are many ways to accomplish that. Consider the most efficient data structures and algorithms for this application while considering the required functionality.
Create a new class named TextGenerator in a new package called comprehensive. The user starts your program by running this class (i.e., by executing TextGenerator's main method). There are several command-line arguments required that must match these specifications exactly and in this order.
Note that your solution must handle both three and four argument cases.
We will not test your program with invalid command-line inputs (input files will always exist, seed words are always included in the input, K will always be a positive number, the 4th argument, if it exists will be "all" or "one") The output must be printed to System.out and follow these specifications. The only printed output is the generated text. Don't print anything else before or after the text. Word must be separated by whitespace (spaces, newlines, tabs are all OK) Output should only consist of lower-case English letters, numbers 0 to 9, and underscores. There should be no other punctuation.
You may create any other new classes as needed. You should avoid putting all of your logic into the TextGenerator class. That class should mainly serve as a user interface for your application. You are free to use any of the Java Collection classes. Make sure that all code files required by your program are in the comprehensive package. Because all grading occurs outside of Eclipse/IntelliJ, you must make certain that your program compiles and runs correctly from the command-line without Eclipse.
For example:
javac comprehensive/TextGenerator.java comprehensive/OtherNeededClass.java
java comprehensive.TextGenerator input.txt word 25 all
In this example, your working directory contains both the input.txt file and your comprehensive package (folder). The compile line will include all .java files in your package.
Command-line arguments can also be specified in Eclipse or IntelliJ, using the Run Configurations menu.
Take care to design your solution to be as efficient as possible. See the section below for details of how your text generator is evaluated for running-time efficiency.
NOTE: It is intentional that you are being given no guidance as to how to solve the problem. It is critical that you gain experience solving a problem "from scratch," designing the structure of your classes and methods, as well as choosing the best data structures and algorithms for the problem. Because the readers of your code have no assumptions about how it is organized, you must document it well.
This program was written in conjunction with my partner for the class, Jorden Dickerson. He is an incredible programmer, and he's been a joy to work with over the past months! I owe much of the impeccable programming done here to him and our teamwork. Thank you Jorden :) your inclusion has been much appreciated.