A web spider framework built on puppeteer. It provides flexible task-queue management and task scheduling through decorators, convenient data storage with nedb / mongodb, and support for data visualization and user interaction.
MIT License
npm install -g typescript
Recommended IDE: IntelliJ IDEA (Ultimate version)
Nodejs And Javascript Configuration in IDEA
ppspider_example github address https://github.com/xiyuan-fengyu/ppspider_example
Warning: git is required and the executable file path of git should be set in IDEA
Click "Terminal" on the bottom side of IDEA to open a terminal and run the following command
npm install
Run 'tsc' in the terminal, or open the context menu on package.json -> Show npm Scripts -> double-click 'auto build'. tsc is the TypeScript compiler, which here recompiles the ts files to js files after any ts file changes.
Run lib/quickstart/App.js, then open http://localhost:9000 in the browser to check ppspider's status.
https://github.com/xiyuan-fengyu/ppspider_docker_deploy/blob/master/README.en.md
Declare like
export function TheDecoratorName(args) { ... }
Usage
@TheDecoratorName(args)
Decorators look similar to Java annotations, but they are more powerful: a decorator can provide metadata through its parameters and can modify the target or descriptor to change the behavior of a class or method. In ppspider, many abilities are provided by decorators.
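To make the two abilities mentioned above concrete (providing metadata and modifying the target), here is a minimal sketch of a decorator factory. It is not ppspider's implementation: registeredLabels, PlainTask and the describe method are hypothetical names used only for illustration, and the decorator is applied by hand so the snippet does not depend on compiler flags.

```typescript
// Hypothetical names throughout; this is not ppspider's implementation.
type Constructor = new (...args: any[]) => object;

// Metadata collected from the decorator arguments.
const registeredLabels: string[] = [];

// Declared like `export function TheDecoratorName(args) { ... }`:
// a factory that takes arguments and returns the actual decorator.
function TheDecoratorName(label: string) {
    return function <T extends Constructor>(target: T): T {
        // 1. Provide metadata through the factory parameters.
        registeredLabels.push(label);
        // 2. Modify the target: add a method to every instance of the class.
        target.prototype.describe = function (): string {
            return "decorated with " + label;
        };
        return target;
    };
}

class PlainTask {}

// Applying the decorator by hand; with legacy decorators enabled in
// tsconfig.json this would be written as @TheDecoratorName("demo") above the class.
const DecoratedTask = TheDecoratorName("demo")(PlainTask);
const description: string = (new DecoratedTask() as any).describe();
```

Calling the factory and then the returned function by hand is what the @ syntax does for a class decorator, which is why a decorator can both record information and change the class it is attached to.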
export function Launcher(theAppInfo: AppInfo) { ... }
Launcher of a ppspider app. Params type:
export type AppConfig = {
// all cache files and the db file will be saved to the workplace folder. You can save the data files to this folder too.
workplace: string;
// file path to save the running status, default is workplace + "/queueCache.json"
queueCache?: string;
// database url; nedb and mongodb are supported. Use nedb when the app generates only a small amount of data, url format: nedb://nedbDirPath; otherwise mongodb is recommended, url format: mongodb://username:password@host:port/dbName. The default value is "nedb://" + appInfo.workplace + "/nedb"
dbUrl?: string;
// import all task class
tasks: any[];
// import all DataUi class
dataUis?: any[];
// all workerFactory instances; only PuppeteerWorkerFactory and NoneWorkerFactory are provided at present
workerFactorys: WorkerFactory<any>[];
// the port for web ui, default 9000
webUiPort?: number | 9000;
// logger setting
logger?: LoggerSetting;
}
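The defaults described in the comments above can be sketched as a small helper. This is only an illustration of the documented default values, not the framework's code: AppConfigSketch is a trimmed local copy of the type, and withDefaults is a hypothetical function name.

```typescript
// Trimmed local copy of the fields documented above (assumption: only the
// fields that have a documented default are included).
type AppConfigSketch = {
    workplace: string;
    queueCache?: string;
    dbUrl?: string;
    webUiPort?: number;
};

// Hypothetical helper that fills in the documented defaults.
function withDefaults(config: AppConfigSketch): Required<AppConfigSketch> {
    return {
        workplace: config.workplace,
        // default: workplace + "/queueCache.json"
        queueCache: config.queueCache ?? config.workplace + "/queueCache.json",
        // default: "nedb://" + workplace + "/nedb"
        dbUrl: config.dbUrl ?? "nedb://" + config.workplace + "/nedb",
        // default: 9000
        webUiPort: config.webUiPort ?? 9000,
    };
}

const resolvedConfig = withDefaults({ workplace: "/tmp/spider" });
```

With only workplace set, everything else (queue cache file, nedb directory) ends up inside the workplace folder, which matches the comment that all cache and db files are saved there.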
export function OnStart(config: OnStartConfig)
A job executed once after the app starts. You can execute it again by pressing the button in the web UI; the button appears after the queue name in the web UI's Queue panel.
Params type
export type OnStartConfig = {
// urls to crawl
urls: string | string[];
// whether this queue runs after startup; if set to false, this queue will not run after startup
running?: boolean;
// config of the max parallel number; can be a number or an object with cron keys and number values
parallel?: ParallelConfig;
// the execution interval between jobs; all parallel jobs share the same exeInterval
exeInterval?: number;
// add a random delta to exeInterval
exeIntervalJitter?: number;
// Task timeout, in milliseconds, default: 300000, negative number means never timeout
timeout?: number;
maxTry?: number;
// description of this sub task type
description?: string;
// BloonFilter by default, so a job won't execute again after the running status is saved and the app restarts. If you want to re-execute it, use NoFilter
filterType?: Class_Filter;
// default content of job.datas
defaultDatas?: any;
}
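As an illustration of how exeInterval and exeIntervalJitter could combine, the sketch below computes the delay before the next job. The exact formula used by ppspider is not stated here, so this assumes the delta is drawn uniformly from [-exeIntervalJitter, +exeIntervalJitter) and that the delay is never negative; nextDelay is a hypothetical helper name.

```typescript
// Hypothetical helper; the symmetric-jitter formula is an assumption.
// `random` is injectable so the result can be checked deterministically.
function nextDelay(
    exeInterval: number,
    exeIntervalJitter: number,
    random: () => number = Math.random
): number {
    // Map random() from [0, 1) to a delta in [-jitter, +jitter).
    const delta = (random() * 2 - 1) * exeIntervalJitter;
    // A delay can never be negative.
    return Math.max(0, exeInterval + delta);
}

// With exeInterval = 1000 and exeIntervalJitter = 200 the delay
// falls somewhere between 800 and 1200 milliseconds.
const sampleDelay = nextDelay(1000, 200);
```

Jitter of this kind spreads requests out so that they do not hit the target site at a perfectly regular rhythm.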
export function OnTime(config: OnTimeConfig) { ... }
A job executed at specific times determined by a cron expression. Params type:
export type OnTimeConfig = {
urls: string | string[];
cron: string; // cron expression
running?: boolean;
parallel?: ParallelConfig;
exeInterval?: number;
exeIntervalJitter?: number;
timeout?: number;
maxTry?: number;
description?: string;
defaultDatas?: any;
}
@AddToQueue and @FromQueue should be used together: @AddToQueue adds the decorated function's result to a job queue, and @FromQueue fetches jobs from the queue to execute.
@AddToQueue
export function AddToQueue(queueConfigs: AddToQueueConfig | AddToQueueConfig[]) { ... }
@AddToQueue accepts one or multiple configs. Config type:
export type AddToQueueConfig = {
// queue name
name: string;
// queue provided: DefaultQueue(FIFO), DefaultPriorityQueue
queueType?: QueueClass;
// filter provided: NoFilter(no check), BloonFilter(check by job's key)
filterType?: FilterClass;
}
You can use @AddToQueue to add jobs to the same queue in multiple places. The queue type is fixed by the first place, but each place can use a different filterType.
The method decorated by @AddToQueue should return AddToQueueData-like data.
export type CanCastToJob = string | string[] | Job | Job[];
export type AddToQueueData = Promise<CanCastToJob | {
[queueName: string]: CanCastToJob
}>
If @AddToQueue has multiple configs, the returned data must look like
Promise<{
[queueName: string]: CanCastToJob
}>
PuppeteerUtil.links is a convenient method to get all expected urls, and its return data is AddToQueueData-like.
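The two return shapes described above (a bare CanCastToJob for a single config, or an object keyed by queue name for multiple configs) can be told apart roughly as follows. This is a sketch under simplifying assumptions, not the framework's logic: JobLike stands in for ppspider's Job and only carries a url, and normalizeQueueData is a hypothetical helper.

```typescript
// Simplified stand-ins for the types shown above (assumption: a job only needs a url).
type JobLike = { url: string };
type CanCastToJobLike = string | string[] | JobLike | JobLike[];
type QueueMapLike = { [queueName: string]: CanCastToJobLike };

// Turn any CanCastToJob-like value into a flat list of urls.
function toUrls(value: CanCastToJobLike): string[] {
    const items: Array<string | JobLike> = Array.isArray(value) ? value : [value];
    return items.map(item => (typeof item === "string" ? item : item.url));
}

// Hypothetical helper: group the returned data by queue name.
// Limitation of this sketch: a queue literally named "url" would be mistaken for a job.
function normalizeQueueData(
    result: CanCastToJobLike | QueueMapLike,
    defaultQueue: string
): { [queueName: string]: string[] } {
    const byQueue: { [queueName: string]: string[] } = {};
    const isQueueMap =
        typeof result === "object" && !Array.isArray(result) && !("url" in result);
    if (isQueueMap) {
        const map = result as QueueMapLike;
        for (const queueName of Object.keys(map)) {
            byQueue[queueName] = toUrls(map[queueName]);
        }
    } else {
        byQueue[defaultQueue] = toUrls(result as CanCastToJobLike);
    }
    return byQueue;
}
```

With a single config the queue name is known from the decorator, so a bare url list is enough; with several configs the keyed object is the only way to say which urls go where.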
@FromQueue
export function FromQueue(config: FromQueueConfig) { ... }
export type FromQueueConfig = {
// queue name
name: string;
running?: boolean;
parallel?: ParallelConfig;
exeInterval?: number;
exeIntervalJitter?: number;
// Task timeout, in milliseconds, default: 300000, negative number means never timeout
timeout?: number;
maxTry?: number;
description?: string;
defaultDatas?: any;
}
@AddToQueue @FromQueue example
export function JobOverride(queueName: string) { ... }
Modify job info before it is inserted into the queue. You can set a JobOverride only once for a queue.
A usage scenario: when urls with different suffixes or parameters navigate to the same page, you can use a JobOverride to set the job key to a unique id taken from the url. After that, jobs with duplicate keys will be filtered out.
The sub task types OnStart/OnTime are also managed by queues, named like OnStart_ClassName_MethodName or OnTime_ClassName_MethodName, so you can set a JobOverride on them too. JobOverride example
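The usage scenario above can be pictured with a small stand-alone sketch: two urls that point at the same item receive the same key, so the second one is rejected by a key-based filter. Everything here is illustrative and assumed: JobSketch, overrideKey and accept are hypothetical names, the key is assumed to default to the url, and a plain object stands in for the real filter.

```typescript
// Simplified job: just a url and the key used for duplicate checks.
type JobSketch = { url: string; key: string };

// Hypothetical override: use the "id" query parameter as the job key when present.
function overrideKey(job: JobSketch): void {
    const match = /[?&]id=([^&#]+)/.exec(job.url);
    if (match) {
        job.key = "item_" + match[1];
    }
}

// Stand-in for a key-based filter: remembers every key it has accepted.
const seenKeys: { [key: string]: true } = {};

// Returns true when the job would be queued, false when it is filtered out.
function accept(url: string): boolean {
    const job: JobSketch = { url, key: url }; // assumption: the key defaults to the url
    overrideKey(job);
    if (seenKeys[job.key]) {
        return false;
    }
    seenKeys[job.key] = true;
    return true;
}

const firstVisit = accept("http://example.com/item?id=42&from=home");
const sameItem = accept("http://example.com/item?from=list&id=42");
const otherPage = accept("http://example.com/about");
```

Without the override the two item urls would have different keys and both would be crawled.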
export function Serializable(config?: SerializableConfig) { ... }
export function Transient() { ... }
@Serializable marks a class so that its class info is kept during serializing and deserializing. Otherwise, the class info is lost when serializing.
@Transient marks a field that will be ignored when serializing and deserializing. Warning: static fields are not serialized. These two are mainly used to save the running status. You can use @Transient to ignore fields that are unrelated to the running status, which makes the output file smaller. example
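The effect of the two decorators can be approximated with a hand-written registry. This is a rough sketch of the idea only, not the framework's serializer: classRegistry and transientFields are hypothetical stand-ins for whatever @Serializable and @Transient record, and the "__class" marker is an assumed output format.

```typescript
// Hypothetical registries standing in for what the decorators would record.
const classRegistry: { [className: string]: new () => any } = {};
const transientFields: { [className: string]: string[] } = {};

class CrawlState {
    visited = 0;
    scratch = "temporary"; // would be marked with @Transient
}
classRegistry["CrawlState"] = CrawlState;    // what @Serializable would register
transientFields["CrawlState"] = ["scratch"]; // what @Transient would register

// Write the class name next to the data and skip transient fields.
function serialize(obj: any, className: string): string {
    const skip = transientFields[className] || [];
    const data: { [field: string]: unknown } = {};
    for (const field of Object.keys(obj)) {
        if (skip.indexOf(field) === -1) {
            data[field] = obj[field];
        }
    }
    return JSON.stringify({ __class: className, data });
}

// Rebuild a real instance of the recorded class instead of a plain object.
function deserialize(json: string): any {
    const parsed = JSON.parse(json);
    const Ctor = classRegistry[parsed.__class];
    const instance = new Ctor();
    for (const field of Object.keys(parsed.data)) {
        instance[field] = parsed.data[field];
    }
    return instance;
}

const state = new CrawlState();
state.visited = 7;
const saved = serialize(state, "CrawlState");
const restored = deserialize(saved);
```

Plain JSON.parse would return an object without the CrawlState prototype, which is the "class info is lost" case mentioned above.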
export function RequestMapping(url: string, method: "" | "GET" | "POST" = "") {}
@RequestMapping is used to declare the HTTP rest interface, providing the ability to dynamically add tasks remotely. Returning the crawl results requires self-implementation (such as asynchronous url callbacks). RequestMapping example
Modeled on Java Spring's @Bean and @Autowired, these provide dependency injection of instances. example
You can define your own tab page in the UI (http://localhost:webPort) with a DataUi, which adds support for data visualization and user interaction. You should import the DataUi class in the @Launcher appConfig.dataUis.
There is a built-in DataUi, DbHelperUi, which supports db search.
example 2 Add Dynamic Job On UI
example 3 Web Page Screenshot Long web page screenshot is also supported
set page's view port to 1920 * 1080
inject jquery into the page; jquery becomes invalid after the page refreshes or navigates, so call this after the page loads.
parse json in jsonp string
enable/disable image load
listen response with special url, max listen num is supported
listen response with special url just once
download image with special css selector
get all expected urls
count doms with special css selector
Find dom nodes by jQuery(selector), assign a random id to each node that does not have one, and return the id array. A usage scenario: in puppeteer, all Page methods that find doms by css selector end up calling document.querySelector / document.querySelectorAll, which do not support some special css selectors that jQuery supports, such as "#someId a:eq(0)" and "#someId a:contains('next')". So we can call specifyIdByJquery to assign ids to the dom nodes, keep the returned ids, and then call Page's methods with those ids.
scroll to bottom
Parse a cookie string into a SetCookie array, then set the cookies through page.setCookie(...cookieArr). How to get the cookie string? Open the url of interest in Chrome -> press F12 to open devtools -> Application panel -> Storage: Cookies -> cookie detail in the left panel -> select all with the mouse and press Ctrl+C to copy. You will get something similar to the following
PHPSESSID ifmn12345678 sm.ms / N/A 35
cid sasdasdada .sm.ms / 2037-12-31T23:55:55.900Z 27
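Rows like the two above could be turned into cookie objects along these lines. This is a sketch, not the PuppeteerUtil implementation: parseCookieRows and CookieSketch are hypothetical names, and it assumes the columns are name, value, domain, path and expiry in that order, separated by whitespace, with "N/A" marking a session cookie.

```typescript
// Shape loosely modeled on what page.setCookie accepts (expires in seconds).
type CookieSketch = {
    name: string;
    value: string;
    domain: string;
    path: string;
    expires?: number;
};

// Hypothetical parser for rows copied from the devtools cookie table.
// Assumption: no column contains whitespace.
function parseCookieRows(text: string): CookieSketch[] {
    const cookies: CookieSketch[] = [];
    for (const line of text.split("\n")) {
        const cols = line.trim().split(/\s+/);
        if (cols.length < 5) {
            continue; // skip blank or incomplete rows
        }
        const [name, value, domain, path, expires] = cols;
        const cookie: CookieSketch = { name, value, domain, path };
        if (expires !== "N/A") {
            const time = Date.parse(expires);
            if (!isNaN(time)) {
                cookie.expires = Math.floor(time / 1000); // milliseconds -> seconds
            }
        }
        cookies.push(cookie);
    }
    return cookies;
}

const cookieRows =
    "PHPSESSID ifmn12345678 sm.ms / N/A 35\n" +
    "cid sasdasdada .sm.ms / 2037-12-31T23:55:55.900Z 27";
const cookieArr = parseCookieRows(cookieRows);
```

The resulting array is in the shape that would be spread into page.setCookie(...cookieArr); the trailing size column is ignored because it is not needed to set a cookie.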
Set dynamic proxy for a single page
PuppeteerUtil example Single Page Proxy
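Of the utilities listed above, parsing the json inside a jsonp string is the easiest to illustrate without a browser. The sketch below shows one simple way to do it, assuming the payload is everything between the first "(" and the last ")"; parseJsonp is a hypothetical name and this is not necessarily how PuppeteerUtil does it.

```typescript
// Hypothetical helper: extract and parse the JSON payload of a JSONP response.
function parseJsonp(jsonp: string): unknown {
    const start = jsonp.indexOf("(");
    const end = jsonp.lastIndexOf(")");
    if (start === -1 || end <= start) {
        throw new Error("not a JSONP string");
    }
    // Everything between the outermost parentheses is treated as JSON.
    return JSON.parse(jsonp.substring(start + 1, end));
}

const jsonpPayload = parseJsonp('callback({"items": [1, 2], "next": null});');
```

Parsing the payload with JSON.parse avoids evaluating the response as a script, which is what a browser would do with a real jsonp request.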
Record all requests while a page is opening. NetworkTracing example
Nedb is supported by NedbDao, and mongodb by MongodbDao. Set "dbUrl" in the @Launcher parameters, then use appInfo.db to access the db. All methods are defined in src/common/db/DbDao. Nedb is a server-less database: there is no server to install, and the data is persisted to a local file. Url format: nedb://nedbDirectoryPath. With a large amount of data, queries are slow and loading the data takes a long time every time the application restarts, so use it only when the data is small. Mongodb requires installing the mongo server. Url format: mongodb://username:password@host:port/dbName. It is recommended for saving a large amount of data.
After the application is started, a job collection is automatically created to save the job info during execution.
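The choice between the two backends is driven entirely by the scheme of dbUrl. A minimal sketch of that dispatch, assuming only the two url formats given above, might look like this; resolveDb and DbTarget are hypothetical names and the real selection code may differ.

```typescript
// Which backend a dbUrl points at, plus what that backend needs.
type DbTarget =
    | { kind: "nedb"; dir: string }
    | { kind: "mongodb"; url: string };

// Hypothetical helper: pick the backend from the url scheme.
function resolveDb(dbUrl: string): DbTarget {
    const nedbPrefix = "nedb://";
    if (dbUrl.indexOf(nedbPrefix) === 0) {
        // nedb://nedbDirectoryPath -> the rest of the url is a local directory
        return { kind: "nedb", dir: dbUrl.substring(nedbPrefix.length) };
    }
    if (dbUrl.indexOf("mongodb://") === 0) {
        // mongodb://username:password@host:port/dbName is passed on unchanged
        return { kind: "mongodb", url: dbUrl };
    }
    throw new Error("unsupported dbUrl: " + dbUrl);
}

const localDb = resolveDb("nedb:///data/spider/nedb");
const remoteDb = resolveDb("mongodb://user:secret@localhost:27017/spider");
```

Because only dbUrl changes, switching a project from nedb to mongodb once the data grows should not require touching the task code that goes through appInfo.db.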
Use logger.debug, logger.info, logger.warn or logger.error to print log. Those functions are defined in src/common/util/logger.ts. The output logs contain extra info: timestamp, log level, source file position.
logger.debugValid && logger.debug("test debug");
logger.info("test info");
logger.warn("test warn");
logger.error("test error");
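The `logger.debugValid && logger.debug(...)` form above skips building the message when debug output is disabled. The sketch below imitates that behavior with a tiny stand-in logger; LoggerSketch is hypothetical, it only adds a timestamp and level (no source file position), and it collects lines in memory instead of printing them.

```typescript
type Level = "debug" | "info" | "warn" | "error";
const levelOrder: Level[] = ["debug", "info", "warn", "error"];

// Hypothetical stand-in for the real logger.
class LoggerSketch {
    lines: string[] = [];

    constructor(
        private minLevel: Level,
        private now: () => Date = () => new Date()
    ) {}

    // True only when debug messages would actually be written.
    get debugValid(): boolean {
        return levelOrder.indexOf("debug") >= levelOrder.indexOf(this.minLevel);
    }

    log(level: Level, message: string): void {
        if (levelOrder.indexOf(level) < levelOrder.indexOf(this.minLevel)) {
            return; // below the configured level
        }
        this.lines.push(this.now().toISOString() + " [" + level + "] " + message);
    }
}

let expensiveCalls = 0;
function expensiveDump(): string {
    expensiveCalls++;
    return "large debug dump";
}

const sketchLogger = new LoggerSketch("info", () => new Date(0));
// The right-hand side is never evaluated because debugValid is false.
sketchLogger.debugValid && sketchLogger.log("debug", expensiveDump());
sketchLogger.log("info", "test info");
```

The guard matters when the debug message is costly to build, for example when it serializes a large object.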
simple typescript/js code can be debugged in IDEA
Injected js code can be debugged in Chromium. When building the PuppeteerWorkerFactory instance, set headless = false and devtools = true to open the Chromium devtools panel. Inject js debug example
import {Launcher, PuppeteerWorkerFactory} from "ppspider";
import {TestTask} from "./tasks/TestTask";
@Launcher({
workplace: __dirname + "/workplace",
tasks: [
TestTask
],
workerFactorys: [
new PuppeteerWorkerFactory({
headless: false,
devtools: true
})
]
})
class App {
}
// since v2.2.0: Page is imported from "ppspider" and @OnStart no longer takes a workerFactory property
import {Job, OnStart, Page} from "ppspider";
export class TestTask {
@OnStart({
urls: "http://www.baidu.com"
})
async index(page: Page, job: Job) {
await page.goto(job.url());
const title = await page.evaluate(() => {
debugger;
const title = document.title;
console.log(title);
return title;
});
console.log(title);
}
}
In addition, when developing DataUi, you can debug it in browser
http://www.runoob.com/jquery/jquery-syntax.html
https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md
https://github.com/louischatriot/nedb
https://docs.mongodb.com/manual/reference/method/js-collection/
https://antv.alipay.com/zh-cn/g2/3.x/demo/index.html G2 integrated in web ui for data visualization in DataUi
https://v3.bootcss.com/css/ The web ui integrates bootstrap and jquery which make it easy to use the bootstrap and jquery to write the ui interface directly in DataUi.
open http://localhost:9000 in browser
Queue panel: view and control app status
Job panel: search jobs and view details
When an app has been running in IDEA in debug mode for a long time, it may get stuck on a line of code as if stopped at a breakpoint. This is due to low memory; add the node parameter "--max-old-space-size=8192" to solve it. This situation appeared often in previous versions, mainly due to the nedb write/read process or the serialization/deserialization process of QueueManager. In the new version (v2.1.2), it has been optimized.
In v2.2.0 and later, the workerFactory property is removed from @OnStart, @OnTime and @FromQueue. If you want to use the puppeteer page to crawl a website, you can import the Page class via
import {Page} from "ppspider";
and then declare a page: Page parameter in the parameter list of the callback function. If you use import {Page} from "puppeteer" instead, the imported Page is just an interface, which cannot be determined at runtime by reflect-metadata, so the page instance will not be injected. This error is checked during startup.
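Why an interface cannot be used for injection can be seen in a small stand-alone sketch. It assumes the injector receives the declared parameter types as constructors, which is what TypeScript's emitted decorator metadata contains, and that an interface-typed parameter is reported as plain Object. resolveArgs, JobRuntime and PageRuntime are hypothetical names, not ppspider APIs.

```typescript
// Classes exist at runtime, so instances can be matched with instanceof.
class JobRuntime { constructor(public url: string) {} }
class PageRuntime { constructor(public title: string) {} }

// Hypothetical injector: for each declared parameter type, find a matching instance.
function resolveArgs(paramTypes: Function[], available: object[]): unknown[] {
    return paramTypes.map(type => {
        if (type === Object) {
            // An interface leaves no runtime type behind, so nothing can be matched.
            return undefined;
        }
        for (const candidate of available) {
            if (candidate instanceof type) {
                return candidate;
            }
        }
        return undefined;
    });
}

const pageInstance = new PageRuntime("home");
const jobInstance = new JobRuntime("http://example.com");

// Parameter order is free: instances are matched by type, not by position.
const matchedArgs = resolveArgs([JobRuntime, PageRuntime], [pageInstance, jobInstance]);
// A parameter typed with an interface shows up as Object and gets nothing.
const unmatchedArgs = resolveArgs([Object], [pageInstance]);
```

This also shows why at most one parameter per type makes sense: with matching done by type alone, two parameters of the same type could not be told apart.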
2020-12-07 v2.2.4-preview.1607350101966
2020-04-07 v2.2.3
2019-09-04 v2.2.3-preview.1578363288631
2019-09-04 v2.2.3-preview.1577332807380
2019-09-04 v2.2.3-preview.1574909694087
2019-09-04 v2.2.3-preview.1569208986875
2019-07-31 v2.2.2-preview
2019-06-22 v2.2.1
Rewrote the way the worker instance is injected when calling a method decorated by @OnStart, @OnTime or @FromQueue, using the reflection mechanism provided by typescript and reflect-metadata. The workerFactory property of @OnStart, @OnTime, @FromQueue is removed. The framework checks the parameter types of the decorated method to determine whether the job parameter needs to be passed and whether a worker instance needs to be passed (and if so, which worker type is the correct one). The order and number of parameters are no longer fixed.
There are restrictions, though: the parameters may include at most one of the Job type and at most one of a worker type that has a corresponding WorkerFactory definition (only Page is currently provided). Note that this is the class Page provided by the ppspider package, not the interface Page defined in @types/puppeteer.
Because of this change, some code needs to be upgraded. You need to remove the workerFactory property from @OnStart, @OnTime, @FromQueue. If you want to use page: Page in a method decorated by @OnStart, @OnTime, @FromQueue, you need to import {Page} from "ppspider" instead of "puppeteer", and parameters other than job: Job should be removed. The order and names of the parameters can be chosen freely. If job: Job is not used in the method, you can remove that parameter as well.
Fixed a bug: @AddToQueue does not work without @OnStart / @OnTime / @FromQueue.
Add deployment scheme based on docker
2019-06-13 v2.1.11
2019-06-06 v2.1.10
2019-06-03 v2.1.9
2019-06-02 v2.1.8
2019-05-28 v2.1.6
2019-05-24 v2.1.3
2019-05-21 v2.1.2
2019-05-09 v2.0.5
2019-05-08 v2.0.4
2019-04-29 v2.0.3
2019-04-29 v2.0.2
2019-04-22 v2.0.1
2019-04-04 v2.0.0
appInfo.eventBus.emit(Events.QueueManager_InterruptJob, JOB_ID, "your interrupt reason");
2019-01-28 v0.1.22
2018-12-24 v0.1.21
2018-12-10 v0.1.20
2018-11-19 v0.1.19
2018-09-19 v0.1.18
mainMessager.emit(MainMessagerEvent.QueueManager_QueueToggle_queueName_running, queueNameRegex: string, running: boolean)
2018-08-24 v0.1.17
2018-07-31 v0.1.16
2018-07-30 v0.1.15
2018-07-27 v0.1.14
2018-07-24 v0.1.13
2018-07-23 v0.1.12
2018-07-19 v0.1.11
2018-07-16 v0.1.8