SeimiCrawler | Java Ecosystem Directory


SeimiCrawler - V2.1.4 Latest Release

Published by zhegexiaohuozi over 1 year ago

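Supports plugging in a custom SeimiDownloader, which makes it easier to tailor data fetching to your own needs.
Requests go through the system downloader by default; for special requests you can specify a custom downloader, for example: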

public class MyCoustomDownloader implements SeimiDownloader {
    @Override
    public Response process(Request request) throws Exception {
        Response seimiResponse = new Response();
        seimiResponse.setSeimiHttpType(SeimiHttpType.OK_HTTP3);
        seimiResponse.setRealUrl(request.getUrl());
        seimiResponse.setUrl(request.getUrl());
        seimiResponse.setRequest(request);
        seimiResponse.setMeta(request.getMeta());
        seimiResponse.setBodyType(BodyType.TEXT);
        String content = webGetDo(request);
        seimiResponse.setContent(content);
        return seimiResponse;
    }

    @Override
    public Response metaRefresh(String s) throws Exception {
        // Depends on your use case; this may be left unimplemented
        return null;
    }

    @Override
    public int statusCode() {
        return 200;
    }

    @Override
    public void addCookies(String s, List<SeimiCookie> list) {
        // TODO
    }
}

Here webGetDo() stands for your own custom logic. It is not listed and serves only as an illustration; implement whatever logic you need.

Request next = Request.build(url, MyCrawler::parseDetail);
next.setDownloader(MyCoustomDownloader.class);
push(next);

The number of worker threads per Crawler can be set with the JVM parameter -Dseimi.crawler.thread-num=xx; the minimum value is 1.
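As a rough illustration of how such a JVM parameter can be read and held to a minimum of 1, here is a minimal sketch. The class name, the method name, and the caller-supplied default are hypothetical; this is not SeimiCrawler's own code, and the framework's actual default thread count is not stated in these notes.

```java
/** Hypothetical sketch: resolve a worker-thread count from a JVM system property. */
final class ThreadNumSketch {
    // Property name taken from the release note above
    static final String PROPERTY = "seimi.crawler.thread-num";

    /**
     * Returns the configured thread count, or defaultValue when the property
     * is missing or not a valid integer. The result is never below 1.
     */
    static int resolveThreadNum(int defaultValue) {
        // Integer.getInteger falls back to defaultValue for unset or unparsable properties
        int configured = Integer.getInteger(PROPERTY, defaultValue);
        return Math.max(1, configured);
    }

    public static void main(String[] args) {
        // Assumed default for this sketch only: one thread per available processor
        int fallback = Runtime.getRuntime().availableProcessors();
        System.out.println("worker threads: " + resolveThreadNum(fallback));
    }
}
```

Running it with `java -Dseimi.crawler.thread-num=8 ThreadNumSketch.java` would pick up the configured value, while a value of 0 or a negative number would be raised to 1.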

SeimiCrawler - v2.1.3

Published by zhegexiaohuozi over 1 year ago

Updated to the latest JsoupXpath release: https://github.com/zhegexiaohuozi/JsoupXpath/releases/tag/v2.5.2

SeimiCrawler -

Published by zhegexiaohuozi almost 5 years ago

Upgraded some dependency versions.
Added support for JSON request bodies: set jsonBody on the Request object to send a JSON request.

<dependency>
  <groupId>cn.wanghaomiao</groupId>
  <artifactId>SeimiCrawler</artifactId>
  <version>2.1.2</version>
</dependency>

The Spring Boot demo now disables the Redis queue demonstration by default, to avoid confusion.

SeimiCrawler -

Published by zhegexiaohuozi over 5 years ago

Upgraded dependency versions.
Improved redirect handling in Apache HttpClient.
Fixed known issues.

SeimiCrawler -

Published by zhegexiaohuozi over 6 years ago

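Full Spring Boot support: SeimiCrawler integrates freely with the existing Spring Boot ecosystem; see the demo for reference.
Callbacks now accept method references, which makes them more natural to set up: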

    push(Request.build(s.toString(),Basic::getTitle));

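In non-Spring Boot mode, global settings (Redis cluster information, SeimiAgent information, and so on) are configured through SeimiConfig; in Spring Boot mode they are configured the standard Spring Boot way.

Standard mode: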

SeimiConfig config = new SeimiConfig();
config.setSeimiAgentHost("127.0.0.1");
//config.redisSingleServer().setAddress("redis://127.0.0.1:6379");
Seimi s = new Seimi(config);
s.goRun("basic");

Spring Boot mode, configured in application.properties:

seimi.crawler.enabled=true
# Names of the crawlers that should issue their start requests
seimi.crawler.names=basic,test

seimi.crawler.seimi-agent-host=xx
seimi.crawler.seimi-agent-port=xx

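# Enable the distributed queue
seimi.crawler.enable-redisson-queue=true
# Custom expected number of insertions for the BloomFilter; if unset, the default is used
#seimi.crawler.bloom-filter-expected-insertions=
# Custom expected false-positive rate for the BloomFilter; 0.001 allows 1 wrong answer per 1000 checks. If unset, the default (0.001) is used
#seimi.crawler.bloom-filter-false-probability=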

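The distributed queue is now implemented with Redisson, still backed by Redis. Deduplication uses a BloomFilter for better space efficiency; an online simulator for tuning BloomFilter parameters is also available.
Requires JDK 1.8+.
JsoupXpath has been upgraded in step to version 2.0, rebuilt on Antlr4, which brings much stronger XPath syntax support.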
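To get a feel for how the two BloomFilter settings (expected insertions and false-positive probability) trade off against memory, the textbook sizing formulas can be sketched as below. This is a generic calculation, not code taken from SeimiCrawler or Redisson; the class name is hypothetical and the 1,000,000 insertions used in main is an assumed figure for illustration only.

```java
/** Hypothetical sketch: textbook Bloom filter sizing from n (insertions) and p (false-positive rate). */
final class BloomSizingSketch {

    /** Optimal number of bits: m = -n * ln(p) / (ln 2)^2, rounded up. */
    static long optimalBits(long expectedInsertions, double falseProbability) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-expectedInsertions * Math.log(falseProbability) / (ln2 * ln2));
    }

    /** Optimal number of hash functions: k = (m / n) * ln 2, rounded, at least 1. */
    static int optimalHashes(long expectedInsertions, long bits) {
        return Math.max(1, (int) Math.round((double) bits / expectedInsertions * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000L;   // assumed number of URLs to deduplicate
        double p = 0.001;      // the default false-positive rate named in the notes
        long bits = optimalBits(n, p);
        System.out.println("bits: " + bits + ", hash functions: " + optimalHashes(n, bits));
    }
}
```

With these assumed inputs the filter needs roughly 14.4 million bits, or about 1.7 MiB, which is why a Bloom filter is attractive for URL deduplication compared with storing every URL.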

SeimiCrawler -

Published by zhegexiaohuozi about 7 years ago

Fixed a backward-compatibility problem in the release that included @Dreamerdream's PR.

SeimiCrawler -

Published by zhegexiaohuozi about 7 years ago

Fixed a bug in the distributed queue DefaultRedisQueue where useSeimiAgent was always false after JSON deserialization (@Dreamerdream).

SeimiCrawler -

Published by zhegexiaohuozi over 7 years ago

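Fixed requests not reaching the exception handler after their failure count exceeded the maximum number of retries.
A failed request that has been submitted to the exception handler more than three times is no longer processed.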

SeimiCrawler -

Published by zhegexiaohuozi over 7 years ago

bug fix

SeimiCrawler -

Published by zhegexiaohuozi almost 8 years ago

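Chinese-language parameters are now always URL-encoded as UTF-8 at the framework level, which minimizes garbled requests.
Request deduplication now also takes the configured request parameters into account.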
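The two changes above can be pictured with a small sketch: parameters are URL-encoded as UTF-8, and the deduplication key covers the parameters as well as the URL. The class and method names are hypothetical, and sorting parameters by name is an assumption of this sketch; the release note does not describe how SeimiCrawler actually builds its key.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: UTF-8 URL encoding and a parameter-aware deduplication key. */
final class RequestKeySketch {

    /** URL-encodes a value as UTF-8 (Java 8 compatible overload). */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    /** Builds a key from the URL plus its parameters, sorted by name so insertion order does not matter. */
    static String dedupKey(String url, Map<String, String> params) {
        Map<String, String> sorted = new TreeMap<>(params);
        StringBuilder key = new StringBuilder(url);
        char separator = '?';
        for (Map.Entry<String, String> entry : sorted.entrySet()) {
            key.append(separator)
               .append(encode(entry.getKey()))
               .append('=')
               .append(encode(entry.getValue()));
            separator = '&';
        }
        return key.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new TreeMap<>();
        params.put("q", "中文");
        System.out.println(dedupKey("http://example.com/search", params));
    }
}
```

Two requests to the same URL with different parameter values then yield different keys, so neither is discarded as a duplicate of the other.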

SeimiCrawler -

Published by zhegexiaohuozi almost 8 years ago

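Request objects now support custom headers for a single request via header(map) and custom cookies via seimiCookies. Custom cookies go straight into the cookiesStore, so they remain in effect for later requests to the same domain.
Improved the default startup path: cn.wanghaomiao.seimi.boot.Run now supports CommandLineParser and accepts -c and -p. -c specifies the crawler names, separated by ',' when there are several; -p specifies a port, which optionally starts an embedded HTTP service and enables the embedded HTTP interface.
The maven-seimicrawler-plugin packaging plugin is upgraded to 1.3.0, with improved Linux scripts and a new startup configuration file; see the maven-seimicrawler-plugin home page for details.
The default downloader is now Apache HttpClient, with the OkHttp3 implementation as the alternative.
Optimized some code.
Demo logs now go entirely to the console by default.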

SeimiCrawler -

Published by zhegexiaohuozi about 8 years ago

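OkhttpDownloader now handles Chinese pages whose Content-Type header does not specify an encoding.
The HTTP request timeout can be customized with the httpTimeOut attribute of the @Crawler annotation; the default is 15000 ms.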
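One common way to cope with a page whose Content-Type header carries no encoding is to look for a charset declaration in the first bytes of the HTML and fall back to UTF-8. The sketch below illustrates that general approach; it is not OkhttpDownloader's actual implementation, and the class name, the 2048-byte sniffing window, and the UTF-8 fallback are all assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: pick a charset from the header, else from an HTML meta tag, else UTF-8. */
final class CharsetSniffSketch {
    // Matches charset=GBK, charset="utf-8", and the charset part of http-equiv meta tags
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    static Charset detect(String contentTypeHeader, byte[] body) {
        Charset fromHeader = find(contentTypeHeader);
        if (fromHeader != null) {
            return fromHeader;
        }
        // ISO-8859-1 maps every byte to one char, so ASCII meta tags stay readable
        int headLength = Math.min(body.length, 2048);
        String head = new String(body, 0, headLength, StandardCharsets.ISO_8859_1);
        Charset fromMeta = find(head);
        return fromMeta != null ? fromMeta : StandardCharsets.UTF_8;
    }

    /** Returns the first declared charset this JVM supports, or null if none is found. */
    private static Charset find(String text) {
        if (text == null) {
            return null;
        }
        Matcher matcher = CHARSET.matcher(text);
        if (matcher.find()) {
            try {
                String name = matcher.group(1);
                if (Charset.isSupported(name)) {
                    return Charset.forName(name);
                }
            } catch (IllegalArgumentException e) {
                // Illegal charset name in the page; treat it as not declared
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><meta charset=\"gbk\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect("text/html", page));
    }
}
```

The detected charset would then be used to decode the full response body into text.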

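The demo in the attachments is packaged with maven-seimicrawler-plugin. If you are not familiar with Maven, you can use its lib directory directly to set up dependencies, or simply run the example to see the result. See the maven-seimicrawler-plugin documentation for how to run it.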

SeimiCrawler -

Published by zhegexiaohuozi over 8 years ago

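More complex start requests can be produced by implementing SeimiCrawler's List<Request> startRequests(); method.
SeimiQueue is loaded on demand.
Fixed a problem caused by attempting to match meta refresh on responses that carry file-type data.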

SeimiCrawler - v1.0.0

Published by zhegexiaohuozi over 8 years ago

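Rebuilt the HTTP request handler; it is now implemented with okhttp3 by default and can be switched to Apache HttpClient through httpType in the @Crawler annotation.
Optimized some code.
Page snapshots (png/pdf) can be obtained through SeimiAgent.
Upgraded JsoupXpath to v0.3.1.

This is a fairly major update to SeimiCrawler, and it brings a more powerful crawling experience.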
