Text as Data

A general pipeline for analyzing text data: acquire, preprocess, process, and analyze.

Get Text

You can get text data by scraping, through APIs, or from searchable PDFs, images of paper documents, etc. Some examples:

Get text from searchable PDFs, e.g., get data from Wisconsin Ads storyboards using Python
Get text from images of text using Tesseract from Python
Get text from images of text using Abbyy FineReader Cloud OCR from R
Get text from images of text using Captricity OCR from R
Get Congressional speech data using the Capitol Words API from the Sunlight Foundation

Preprocess Text

Preprocess text for text-as-data analysis.

Depending on the need, remove stop words, punctuation, and special characters, convert the text to lowercase, and stem the words.

preprocess_csv takes a CSV with 'raw' text and outputs a CSV with processed text.
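
The preprocessing steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code in preprocess_csv: the names `preprocess`, `crude_stem`, `preprocess_rows`, and `STOP_WORDS` are hypothetical, the stop-word list is deliberately tiny, and the suffix stripper is only a stand-in for a real stemmer such as Porter's.

```python
import re

# Tiny illustrative stop-word list (assumption); real analyses use a fuller list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "was"}


def crude_stem(word):
    """Strip a few common English suffixes. A stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, remove_stop_words=True, stem=True):
    """Lowercase, drop punctuation and special characters, then filter and stem."""
    text = text.lower()
    # Replace everything except lowercase letters, digits, and whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [crude_stem(t) for t in tokens]
    return " ".join(tokens)


def preprocess_rows(rows, text_column="text"):
    """Yield copies of dict rows (e.g. from csv.DictReader) with a clean_text field."""
    for row in rows:
        out = dict(row)
        out["clean_text"] = preprocess(row[text_column])
        yield out
```

Each step is behind a flag because, as noted above, which steps to apply depends on the analysis. Rows read with `csv.DictReader` can be passed to `preprocess_rows` and written back out with `csv.DictWriter`.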

Get Summary of the Data, Subset Data

Output a simple or stratified random sample of a CSV, keeping only the columns you need, and get a summary of key aspects of the data. Takes a CSV as input.

Summarize and Subset.
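
As a rough sketch of the sampling step, assuming the CSV has already been read into a list of dicts: the function name `sample_rows` and its parameters are hypothetical and do not describe the interface of the script linked above. With `strata` set, it draws up to `n` rows from each group rather than `n` rows overall.

```python
import csv
import io
import random
from collections import defaultdict


def sample_rows(rows, n, columns=None, strata=None, seed=None):
    """Return a random sample of rows (a list of dicts).

    Without `strata`, draw a simple random sample of up to n rows.
    With `strata` (a column name), draw up to n rows from each stratum.
    """
    rng = random.Random(seed)
    if strata is None:
        picked = rng.sample(rows, min(n, len(rows)))
    else:
        groups = defaultdict(list)
        for row in rows:
            groups[row[strata]].append(row)
        picked = []
        for key in sorted(groups):
            group = groups[key]
            picked.extend(rng.sample(group, min(n, len(group))))
    if columns is not None:
        # Keep only the requested columns.
        picked = [{c: row[c] for c in columns} for row in picked]
    return picked


# Made-up data standing in for a CSV file on disk.
data = io.StringIO("id,party,text\n1,D,a\n2,D,b\n3,R,c\n4,R,d\n5,R,e\n")
rows = list(csv.DictReader(data))
sample = sample_rows(rows, n=1, columns=["id", "party"], strata="party", seed=0)
```

Passing a `seed` makes the sample reproducible, which helps when the subset feeds later steps of the pipeline.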

Get TDM

Create a term-document matrix and get information about it, including frequent and infrequent terms. Options are available for removing sparse terms, among other things.

Get TDM, TF-IDF, Summary.
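
To make the term-document matrix concrete, here is a minimal standard-library sketch. It assumes documents are already preprocessed and whitespace-tokenized; the function names are hypothetical, and the TF-IDF weighting shown (raw count times log(N / document frequency)) is one common variant, not necessarily the one used by the script above.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts terms[i] in docs[j]."""
    counts = [Counter(doc.split()) for doc in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


def remove_sparse_terms(terms, matrix, min_docs=2):
    """Keep only terms that occur in at least min_docs documents."""
    kept = [
        (t, row)
        for t, row in zip(terms, matrix)
        if sum(1 for x in row if x > 0) >= min_docs
    ]
    return [t for t, _ in kept], [row for _, row in kept]


def tf_idf(matrix):
    """Weight each count by log(N / document frequency) of its term."""
    n_docs = len(matrix[0]) if matrix else 0
    weighted = []
    for row in matrix:
        df = sum(1 for x in row if x > 0)
        idf = math.log(n_docs / df) if df else 0.0
        weighted.append([x * idf for x in row])
    return weighted


docs = ["tax cut tax", "tax reform", "health reform"]
terms, matrix = term_document_matrix(docs)
# Total frequency of each term across all documents.
totals = {t: sum(row) for t, row in zip(terms, matrix)}
```

Sorting `totals` by value gives the most and least frequent terms; `remove_sparse_terms` drops terms that appear in too few documents before weighting.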

Sentiment Analysis in Python

Basic sentiment analysis using AFINN
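
AFINN is a word list that assigns each term an integer valence between -5 and 5; a basic score for a text is the sum of the valences of its words. The sketch below illustrates that idea under stated assumptions: `SAMPLE_LEXICON` is a small placeholder dictionary in the AFINN style, not the published list, and `load_lexicon` and `sentiment_score` are hypothetical names. `load_lexicon` assumes tab-separated `term<TAB>score` lines.

```python
import re

# Placeholder entries in the AFINN style (assumption for this sketch).
# Load the published AFINN list for real work.
SAMPLE_LEXICON = {
    "good": 3, "great": 3, "love": 3,
    "bad": -3, "terrible": -3, "hate": -3,
}


def load_lexicon(lines):
    """Parse tab-separated 'term<TAB>score' lines into a dict."""
    lexicon = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        term, score = line.rsplit("\t", 1)
        lexicon[term] = int(score)
    return lexicon


def sentiment_score(text, lexicon=SAMPLE_LEXICON):
    """Sum the valences of all tokens found in the lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(t, 0) for t in tokens)
```

A word-sum score like this ignores negation and multi-word terms, so it is best read as a coarse signal rather than a precise measure.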

Analyze Text in R

Classify text in R using SVM or Lasso. See Basic Text Classifier.
A worked example of how to model words as a function of ideology using Congressional speech data. See Speech Learn.