Pipeline for Analyzing Text Data: Acquire, Preprocess, Analyze
A general pipeline for text-as-data analysis: acquire the text, preprocess it, then process and analyze it.
You can get text data by scraping, through APIs, or from searchable PDFs, scanned images of paper documents, and other sources. Some examples:
Preprocess text for text-as-data analysis.
Depending on the need, remove stop words, punctuation, capitalization, and special characters, and stem the remaining words.
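A minimal Python sketch of these preprocessing steps, not the repository's own script. The names `preprocess`, `naive_stem`, and `STOP_WORDS` are hypothetical, the stop word list is deliberately tiny, and the suffix stripper is a crude stand-in for a real stemmer such as Porter's. Each step can be switched off, matching the "depending on the need" idea.

```python
import re
import string

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def naive_stem(word):
    """Strip a few common English suffixes.

    A crude stand-in for a real stemmer; it only shows where stemming
    fits in the pipeline.
    """
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters of the word after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, lowercase=True, strip_punct=True, strip_special=True,
               remove_stop_words=True, stem=True):
    """Return a list of cleaned tokens; every step is optional."""
    if lowercase:
        text = text.lower()
    if strip_punct:
        # Replace each ASCII punctuation character with a space.
        table = str.maketrans(string.punctuation, " " * len(string.punctuation))
        text = text.translate(table)
    if strip_special:
        # Assumption: anything other than ASCII letters, digits, and
        # whitespace counts as a special character.
        text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return tokens


print(preprocess("The cats jumped over the garden walls!"))
```

Turning steps off (for example `stem=False`) keeps the tokens closer to the original text, which helps when the later analysis depends on exact word forms.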
Takes a CSV file as input. Outputs a simple or stratified random sample containing only the columns you need, along with a summary of crucial aspects of the data.
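The sampling step could be sketched in Python roughly as follows, using only the standard library. This is an illustration under assumptions, not the repository's script: the function names, the `party` column, and the inline data are all made up, and "stratified" is taken here to mean drawing up to `n` rows from each stratum (a proportional design would be another reasonable reading).

```python
import csv
import io
import random
from collections import Counter, defaultdict


def sample_rows(rows, n, strata_col=None, seed=0):
    """Draw a simple random sample of n rows, or up to n rows per stratum."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    if strata_col is None:
        return rng.sample(rows, min(n, len(rows)))
    groups = defaultdict(list)
    for row in rows:
        groups[row[strata_col]].append(row)
    sampled = []
    for key in sorted(groups):  # sorted for a stable output order
        group = groups[key]
        sampled.extend(rng.sample(group, min(n, len(group))))
    return sampled


def keep_columns(rows, columns):
    """Keep only the requested columns in each row."""
    return [{c: row[c] for c in columns} for row in rows]


def summarize(rows, column):
    """Report the row count and the value counts of one column."""
    return {"n_rows": len(rows),
            "counts": dict(Counter(row[column] for row in rows))}


# Hypothetical data standing in for a CSV file on disk.
raw = io.StringIO(
    "id,party,text\n"
    "1,D,tax bill\n"
    "2,R,farm bill\n"
    "3,D,health care\n"
    "4,R,tax cuts\n"
)
rows = list(csv.DictReader(raw))

print(summarize(rows, "party"))
sample = keep_columns(sample_rows(rows, 1, strata_col="party"), ["id", "party"])
print(sample)
```

For a real file, `open(path, newline="")` can replace the `io.StringIO` object.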
Create a term-document matrix and get some information about the matrix, including frequent and infrequent terms. Options are available for removing sparse terms, among other things.
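To make the idea concrete, here is a small Python sketch of a term-document matrix with frequent-term and sparse-term handling. It is an assumption-laden illustration rather than the repository's implementation: the function names are hypothetical, tokenization is plain whitespace splitting, and "sparsity" is defined here as the share of documents in which a term does not appear.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (terms, matrix), where matrix[i][j] counts terms[i] in docs[j]."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


def term_totals(terms, matrix):
    """Total count of each term across all documents."""
    return {t: sum(row) for t, row in zip(terms, matrix)}


def remove_sparse_terms(terms, matrix, max_sparsity):
    """Drop terms that are absent from more than max_sparsity of the documents."""
    kept = [
        (t, row) for t, row in zip(terms, matrix)
        if sum(1 for x in row if x == 0) / len(row) <= max_sparsity
    ]
    return [t for t, _ in kept], [row for _, row in kept]


docs = ["tax bill passes", "farm bill fails", "tax cuts bill"]
terms, matrix = term_document_matrix(docs)
totals = term_totals(terms, matrix)

frequent = [t for t, n in totals.items() if n >= 2]
infrequent = [t for t, n in totals.items() if n <= 1]
print("frequent:", frequent)
print("infrequent:", infrequent)

dense_terms, dense_matrix = remove_sparse_terms(terms, matrix, max_sparsity=0.5)
print(dense_terms, dense_matrix)
```

Rows are terms and columns are documents; a document-term matrix is simply the transpose.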
Scripts are released under the MIT License.