autoextractor

A toolkit for clustering web pages based on various similarity measures.

APACHE-2.0 License

Ecosystems: Java

Moved to https://github.com/uscdataScience/autoextractor

Auto Extractor

An intelligent extractor library that learns the structure of the input web pages and then works out a strategy for scraping their structured content.

NOTE: The project is under active development; as a result, the README is out of sync with the codebase.

TODO: update this file with the description of all new features.

Example Usage:

1. Structural Similarity Between HTML/XML documents

2. Clustering based on style and structure
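
As a rough illustration of what item 1 might compute, the following is a minimal, self-contained sketch of an ordered tree edit distance in the style of Zhang and Shasha (see References), applied to hand-built tag trees. The class TreeEditDistanceSketch, its Node type, and the similarity normalization (1 - distance / total node count) are hypothetical and are not autoextractor's API. Unit costs for insert, delete, and rename are also an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of autoextractor's API.
class TreeEditDistanceSketch {

    /** Minimal ordered tree node: a tag name plus its child elements. */
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label, Node... kids) {
            this.label = label;
            for (Node k : kids) children.add(k);
        }
    }

    /** Post-order numbering (1-based) with labels and leftmost-leaf indices. */
    static final class Indexed {
        final List<String> labels = new ArrayList<>();
        final List<Integer> leftmost = new ArrayList<>();

        Indexed(Node root) {
            labels.add(null);   // index 0 is unused padding
            leftmost.add(0);
            visit(root);
        }

        /** Numbers the subtree in post-order and returns its leftmost leaf index. */
        private int visit(Node n) {
            int first = -1;
            for (Node c : n.children) {
                int l = visit(c);
                if (first < 0) first = l;
            }
            labels.add(n.label);
            if (first < 0) first = labels.size() - 1; // a leaf is its own leftmost leaf
            leftmost.add(first);
            return first;
        }

        int size() { return labels.size() - 1; }
        int l(int i) { return leftmost.get(i); }

        /** A keyroot is the highest-numbered node sharing a given leftmost leaf. */
        boolean isKeyroot(int i) {
            for (int k = i + 1; k <= size(); k++) if (l(k) == l(i)) return false;
            return true;
        }
    }

    /** Edit distance with unit insert/delete/rename costs (an assumption). */
    static int distance(Node a, Node b) {
        Indexed t1 = new Indexed(a), t2 = new Indexed(b);
        int[][] td = new int[t1.size() + 1][t2.size() + 1];
        // Keyroots are visited in increasing order so smaller subtrees are solved first.
        for (int i = 1; i <= t1.size(); i++) {
            if (!t1.isKeyroot(i)) continue;
            for (int j = 1; j <= t2.size(); j++) {
                if (t2.isKeyroot(j)) forestDist(t1, t2, i, j, td);
            }
        }
        return td[t1.size()][t2.size()];
    }

    /** Fills td for every subtree pair whose roots lie on the leftmost paths of i and j. */
    private static void forestDist(Indexed t1, Indexed t2, int i, int j, int[][] td) {
        int io = t1.l(i) - 1, jo = t2.l(j) - 1;   // offsets into the post-order numbering
        int[][] fd = new int[i - io + 1][j - jo + 1];
        for (int x = 1; x <= i - io; x++) fd[x][0] = fd[x - 1][0] + 1;
        for (int y = 1; y <= j - jo; y++) fd[0][y] = fd[0][y - 1] + 1;
        for (int x = 1; x <= i - io; x++) {
            for (int y = 1; y <= j - jo; y++) {
                int best = Math.min(fd[x - 1][y] + 1, fd[x][y - 1] + 1);
                if (t1.l(x + io) == t1.l(i) && t2.l(y + jo) == t2.l(j)) {
                    // Both prefixes are whole trees: allow a rename of the two roots.
                    int rename = t1.labels.get(x + io).equals(t2.labels.get(y + jo)) ? 0 : 1;
                    fd[x][y] = Math.min(best, fd[x - 1][y - 1] + rename);
                    td[x + io][y + jo] = fd[x][y];
                } else {
                    // Otherwise reuse the already computed distance of the two subtrees.
                    int p = t1.l(x + io) - 1 - io, q = t2.l(y + jo) - 1 - jo;
                    fd[x][y] = Math.min(best, fd[p][q] + td[x + io][y + jo]);
                }
            }
        }
    }

    /** Hypothetical normalization to [0, 1]; 1.0 means structurally identical. */
    static double similarity(Node a, Node b) {
        int total = new Indexed(a).size() + new Indexed(b).size();
        return 1.0 - distance(a, b) / (double) total;
    }

    public static void main(String[] args) {
        Node page1 = new Node("html", new Node("body",
                new Node("div", new Node("p")), new Node("div", new Node("p"))));
        Node page2 = new Node("html", new Node("body", new Node("div", new Node("p"))));
        System.out.println("distance   = " + distance(page1, page2));
        System.out.println("similarity = " + similarity(page1, page2));
    }
}
```

In practice the Node trees would be built from parsed HTML or XML rather than by hand. Dividing by the total node count keeps the score within [0, 1], because deleting every node of one tree and inserting every node of the other is always a valid edit script.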

Developers:

References:

K. Zhang and D. Shasha, "Simple fast algorithms for the editing distance between trees and related problems," SIAM J. Comput., vol. 18, no. 6, pp. 1245-1262, Dec. 1989.
R. A. Jarvis and E. A. Patrick, "Clustering Using a Similarity Measure Based on Shared Near Neighbors," IEEE Transactions on Computers, vol. C-22, no. 11, pp. 1025-1034, Nov. 1973.

Related Projects

HtmlExtractor

HtmlExtractor is a Java component for precise, template-based extraction of structured information from web pages.

07 Aug 2014 157

barefoot

Java map matching library for integrating the map into software and services with state-of-the-ar...

05 Jun 2014 666

PinkPony

Mining from git history with Pink Pony!

SimpleCommonCrawlExtractor

Simple wrapper around IIPC Web Commons to take a literal warc.gz and extract standalone binaries

java-string-similarity

Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-winkler, n...

17 Apr 2014 2,689

gecco

Easy-to-use lightweight web crawler

12 Dec 2015 2,501

IndexTextCollect

Indexing TREC corpora and Wikipedia using Lucene

commoncrawl-fetcher-lite

Simplified version of a common crawl fetcher

focused-clustering

clust4j

A suite of classification clustering algorithm implementations for Java. A number of partitional,...

11 Nov 2015 148

similarity

similarity: a text similarity calculation toolkit written in Java, usable out of the box for tasks such as text similarity computation and sentiment analysis.

09 Nov 2016 1,416

clustering-benchmark

09 Jun 2015 157

crawl-eval