autoextractor

A toolkit for clustering web pages based on various similarity measures.

Apache-2.0 License

Ecosystems: Java

Related Projects

commoncrawl-fetcher-lite

Simplified version of a Common Crawl fetcher

17 Mar 2023 9

focused-clustering

23 Aug 2014 22

gecco

Easy-to-use lightweight web crawler

12 Dec 2015 2,501

IndexTextCollect

Indexing TREC corpora and Wikipedia using Lucene

17 Dec 2013 11

SimpleCommonCrawlExtractor

Simple wrapper around IIPC Web Commons to take a literal warc.gz and extract standalone binaries

21 Jul 2015 4

similarity

A text similarity calculation toolkit written in Java. It works out of the box for tasks such as text similarity calculation and sentiment analysis.

09 Nov 2016 1,416

crawl-eval

03 May 2016 6

barefoot

Java map matching library for integrating the map into software and services with state-of-the-ar...

05 Jun 2014 666

clustering-benchmark

09 Jun 2015 157

clust4j

A suite of classification clustering algorithm implementations for Java. A number of partitional,...

11 Nov 2015 148

PinkPony

Mining from git history with Pink Pony!

11 Jul 2019 9

java-string-similarity

Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-winkler, n...

17 Apr 2014 2,689

HtmlExtractor

HtmlExtractor is a Java component for template-based, precise extraction of structured information from web pages.

07 Aug 2014 157