text-as-data

Pipeline for Analyzing Text Data: Acquire, Preprocess, Analyze

Text as Data

A general pipeline for acquiring, preprocessing, processing, and analyzing text data:

  1. Get Text
  2. Preprocess Text
  3. Subset, Take a Random Sample, Summarize
  4. Create a Term-Document Matrix (TDM) or tf-idf Matrix
  5. Analyze

Get Text

You can get text data by scraping web pages, calling APIs, extracting text from searchable PDFs, converting images of paper documents, etc.
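
As one illustration of the scraping route, the sketch below pulls the visible text out of an HTML page that has already been downloaded. It uses only the Python standard library; the names TextExtractor and html_to_text and the sample page are hypothetical and are not part of this repository's scripts.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a <script> or <style> element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample page, standing in for a downloaded document.
page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Floor Speech</h1><p>We must act.</p></body></html>"
)
text = html_to_text(page)
print(text)
```

Real pages usually need more care (navigation menus, boilerplate, encoding), so treat this as a starting point rather than a complete scraper.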

Preprocess Text

Preprocess text for text-as-data analysis.

Depending on the need, remove stop words, punctuation, and special characters, convert text to lowercase, and stem words.

  • preprocess_csv takes a CSV with 'raw' text and outputs a CSV with the processed text.
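
A minimal sketch of the cleaning steps described above, using only the Python standard library. It is not the code of preprocess_csv: the function name preprocess and the tiny stop-word list are assumptions made for illustration, and stemming is left out because it normally relies on a dedicated stemmer from a library.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}


def preprocess(text, remove_stop_words=True):
    """Lowercase, strip punctuation and special characters, drop stop words."""
    text = text.lower()
    # Replace punctuation with spaces so "end.start" does not fuse into one token.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # Drop anything that is not an ASCII letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", "", text)
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(tokens)


print(preprocess("The Senate voted, and the bill passed!"))
```

Each step is optional in practice. For example, keeping stop words matters for some stylistic analyses, which is why the sketch exposes remove_stop_words as a flag.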

Get Summary of the Data, Subset Data

Takes a CSV and outputs a simple or stratified random sample of its rows, restricted to the columns you need. Also produces a summary of key aspects of the data.
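
The sampling step can be sketched with the standard library as follows. This is an assumption-laden illustration rather than the repository's script: sample_rows, keep_columns, the column names, and the inline CSV are all hypothetical. In stratified mode, n is interpreted here as the number of rows drawn per stratum.

```python
import csv
import io
import random
from collections import defaultdict


def sample_rows(rows, n, strata_col=None, seed=0):
    """Simple random sample of n rows, or n rows per stratum if strata_col is set."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    if strata_col is None:
        return rng.sample(rows, min(n, len(rows)))
    groups = defaultdict(list)
    for row in rows:
        groups[row[strata_col]].append(row)
    sampled = []
    for key in sorted(groups):  # sorted for a stable output order
        members = groups[key]
        sampled.extend(rng.sample(members, min(n, len(members))))
    return sampled


def keep_columns(rows, columns):
    """Restrict each row (a dict) to the listed columns."""
    return [{c: row[c] for c in columns} for row in rows]


# Hypothetical data standing in for a CSV file on disk.
raw = "id,party,text\n1,D,health care\n2,R,tax cuts\n3,D,education\n4,R,defense\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# One row per party, keeping only the party and text columns.
subset = keep_columns(sample_rows(rows, 1, strata_col="party"), ["party", "text"])
print(subset)
```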

Get TDM

Create a term-document matrix and get information about it, including frequent and infrequent terms. Options are available for removing sparse terms, among other things.
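
To make the idea concrete, here is a small sketch that builds a term-document matrix from preprocessed documents, drops sparse terms, and applies a tf-idf weighting. The function names, the min_doc_freq parameter, and the sample documents are hypothetical. The tf-idf variant used (raw count times log(N / document frequency)) is one common choice and is an assumption, not necessarily what the repository's script computes.

```python
import math
from collections import Counter


def term_document_matrix(docs, min_doc_freq=1):
    """Return ({term: [count in each doc]}, document frequencies).

    Terms appearing in fewer than min_doc_freq documents are dropped
    from the matrix (sparse-term removal).
    """
    counts = [Counter(doc.split()) for doc in docs]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each term counted once per document
    vocab = sorted(t for t, df in doc_freq.items() if df >= min_doc_freq)
    tdm = {t: [c[t] for c in counts] for t in vocab}
    return tdm, doc_freq


def tf_idf(tdm, n_docs):
    """Weight raw counts by log(n_docs / document frequency)."""
    weighted = {}
    for term, row in tdm.items():
        df = sum(1 for x in row if x > 0)
        idf = math.log(n_docs / df)
        weighted[term] = [x * idf for x in row]
    return weighted


# Hypothetical, already-preprocessed documents.
docs = ["tax cuts tax", "health care", "tax health"]
tdm, doc_freq = term_document_matrix(docs, min_doc_freq=2)
weights = tf_idf(tdm, len(docs))

print(tdm)                       # counts for terms found in at least 2 documents
print(doc_freq.most_common(2))   # most widespread terms across documents
```

With this weighting, a term that appears in every document gets weight 0, which is the usual motivation for tf-idf: terms that do not discriminate between documents are down-weighted.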

Sentiment Analysis in Python

Analyze Text in R

  • Classify text in R using SVM or Lasso. See Basic Text Classifier.
  • A worked-out example of how to model words as a function of ideology using Congressional speech. See Speech Learn.

License

Scripts are released under the MIT License.
