A toolkit for clustering web pages based on various similarity measures.
APACHE-2.0 License
Simplified version of a Common Crawl fetcher
Easy-to-use lightweight web crawler
Indexing TREC corpora and Wikipedia using Lucene
Simple wrapper around IIPC Web Commons to take a literal warc.gz and extract standalone binaries
similarity: Text similarity calculation toolkit for Java. Usable for text similarity calculation, sentiment analysis, and similar tasks; works out of the box.
Java map matching library for integrating the map into software and services with state-of-the-ar...
A suite of classification clustering algorithm implementations for Java. A number of partitional,...
Mining from git history with Pink Pony!
Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-Winkler, n...
HtmlExtractor is a template-based component, implemented in Java, for precise extraction of structured information from web pages.