autoextractor

A toolkit for clustering web pages based on various similarity measures.

APACHE-2.0 License

Moved to https://github.com/uscdataScience/autoextractor

Auto Extractor

An intelligent extractor library that learns the structure of the input web pages and then works out a strategy for scraping their structured content.

NOTE: The project is under active development; as a result, this README is out of sync with the codebase.

TODO: Update this file with descriptions of all new features.

Example Usage:

1. Structural Similarity Between HTML/XML documents
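
The idea can be illustrated with a minimal Python sketch that does not use this library's API. It parses two documents into tag trees with the standard-library `html.parser` and computes an ordered tree edit distance with unit costs. This is a memoised recursive formulation of the distance studied by Zhang and Shasha, not their optimised dynamic program. The names `parse_forest`, `edit_distance`, and `structural_similarity`, and the normalisation by the combined node count, are assumptions made for illustration.

```python
from functools import lru_cache
from html.parser import HTMLParser

# HTML elements that never take a closing tag.
VOID_TAGS = frozenset({"area", "base", "br", "col", "embed", "hr", "img",
                       "input", "link", "meta", "source", "track", "wbr"})


class TagTreeBuilder(HTMLParser):
    """Collects the element tree of a document, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.root = ["#document", []]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self.stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumes well-formed markup: only close the innermost open element.
        if len(self.stack) > 1 and self.stack[-1][0] == tag:
            self.stack.pop()


def _freeze(node):
    """Convert the mutable [label, children] tree into hashable nested tuples."""
    label, children = node
    return (label, tuple(_freeze(child) for child in children))


def parse_forest(markup):
    """Return the top-level elements of `markup` as a tuple of (tag, children)."""
    builder = TagTreeBuilder()
    builder.feed(markup)
    builder.close()
    return _freeze(builder.root)[1]


def forest_size(forest):
    """Count every node in an ordered forest."""
    return sum(1 + forest_size(children) for _, children in forest)


@lru_cache(maxsize=None)
def edit_distance(f1, f2):
    """Unit-cost ordered edit distance between two forests."""
    if not f1:
        return forest_size(f2)   # insert everything that is left
    if not f2:
        return forest_size(f1)   # delete everything that is left
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    return min(
        edit_distance(f1[:-1] + kids1, f2) + 1,    # delete rightmost root of f1
        edit_distance(f1, f2[:-1] + kids2) + 1,    # insert rightmost root of f2
        edit_distance(kids1, kids2)                # match the two rightmost roots
        + edit_distance(f1[:-1], f2[:-1])
        + int(label1 != label2),                   # relabel cost if tags differ
    )


def structural_similarity(markup_a, markup_b):
    """Map the distance to [0, 1]; 1.0 means identical tag structure."""
    fa, fb = parse_forest(markup_a), parse_forest(markup_b)
    total = forest_size(fa) + forest_size(fb)
    return 1.0 if total == 0 else 1.0 - edit_distance(fa, fb) / total


page_a = "<html><body><div><p>a</p><p>b</p></div></body></html>"
page_b = "<html><body><div><p>a</p><span>b</span></div></body></html>"
print(edit_distance(parse_forest(page_a), parse_forest(page_b)))  # 1 (one relabel)
print(structural_similarity(page_a, page_b))
```

The recursion is only suitable for small documents; large pages would need the iterative dynamic program from the paper listed under References.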

2. Clustering based on style and structure
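
As a rough illustration, independent of this library's API, the Python sketch below clusters pages from a precomputed pairwise similarity matrix using the shared-near-neighbour idea of Jarvis and Patrick. The matrix is assumed to already combine whatever style and structure measures are in use. The function name `jarvis_patrick`, its parameters `k` and `min_shared`, and the choice to count each page as its own neighbour are assumptions of this sketch.

```python
def jarvis_patrick(similarity, k, min_shared):
    """Cluster items from a symmetric similarity matrix (list of lists).

    Two items are merged when each appears in the other's k-nearest-neighbour
    list and the two lists share at least `min_shared` members. In this sketch
    every item is also counted as a member of its own list.
    Returns one integer cluster label per item.
    """
    n = len(similarity)

    # Build the neighbour set of every item: its k most similar items plus itself.
    neighbours = []
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: similarity[i][j], reverse=True)
        neighbours.append(set(ranked[:k]) | {i})

    # Union-find structure used to merge items into clusters.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            mutual = j in neighbours[i] and i in neighbours[j]
            if mutual and len(neighbours[i] & neighbours[j]) >= min_shared:
                parent[find(i)] = find(j)

    # Relabel cluster roots as 0, 1, 2, ... in order of first appearance.
    labels, seen = [], {}
    for i in range(n):
        labels.append(seen.setdefault(find(i), len(seen)))
    return labels


# Toy matrix: pages 0-2 resemble each other, as do pages 3-5.
HIGH, LOW = 0.9, 0.1
matrix = [[1.0 if i == j else (HIGH if i // 3 == j // 3 else LOW)
           for j in range(6)] for i in range(6)]
print(jarvis_patrick(matrix, k=2, min_shared=3))  # [0, 0, 0, 1, 1, 1]
```

Raising `min_shared` or lowering `k` makes the merge rule stricter and tends to produce more, smaller clusters.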

Developers:

References:

  • K. Zhang and D. Shasha, "Simple fast algorithms for the editing distance between trees and related problems," SIAM Journal on Computing, vol. 18, no. 6, pp. 1245-1262, Dec. 1989.
  • R. A. Jarvis and E. A. Patrick, "Clustering using a similarity measure based on shared near neighbors," IEEE Transactions on Computers, vol. C-22, no. 11, pp. 1025-1034, Nov. 1973.