crawlab

Distributed web crawler admin platform for managing spiders regardless of language or framework.

BSD-3-CLAUSE License

Stars
11.3K


crawlab - v0.6.3-dev Latest Release

Published by tikazyq 4 months ago

What's Changed

New Contributors

Full Changelog: https://github.com/crawlab-team/crawlab/compare/v0.6.2...v0.6.3-dev

crawlab - v0.6.3

Published by tikazyq about 1 year ago

crawlab - v0.6.2

Published by tikazyq over 1 year ago

Web Crawler Management Platform Crawlab v0.6.2 Official Release

Overview

Crawlab v0.6.2 is the latest iteration in the v0.6.x series. It brings bug fixes, feature enhancements, and improved support for environment variables.

Changelog

Bug Fixes

Feature Enhancements

Community

If you find Crawlab helpful for your daily development or your company, please consider starring it on GitHub. If you encounter any problems, feel free to raise them as issues on GitHub. You are also welcome to contribute to the development of Crawlab. You can join the Crawlab technical discussion group by adding the WeChat account tikazyq1, where you can discuss development and deployment with other developers.

References

crawlab - v0.6.1

Published by tikazyq over 1 year ago

What's Changed

New Contributors

Full Changelog: https://github.com/crawlab-team/crawlab/compare/v0.6.0...v0.6.1

crawlab - v0.6.0-1

Published by tikazyq almost 2 years ago

What's Changed

New Contributors

Full Changelog: https://github.com/crawlab-team/crawlab/compare/v0.6.0...v0.6.0-1

crawlab - v0.6.0

Published by tikazyq over 2 years ago

Change Log (v0.6.0)

Overview

As a major release, v0.6.0 consists of a number of large changes to enhance the performance, scalability, robustness and usability of Crawlab. This beta version is theoretically more robust than older versions, mainly in task execution, file synchronization and node management, yet we still recommend that users run thorough tests with various samples.

Enhancements

Backend

  • File Synchronization. Migrated file sync from MongoDB GridFS to SeaweedFS for better stability and robustness.
  • Node Communication. Migrated node communication from Redis-based RPC to gRPC. Worker nodes indirectly interact with MongoDB by making gRPC calls to the master node.
  • Task Queue. Migrated task queue from Redis list to MongoDB collection to allow more flexibility (e.g. priority queue).
  • Logging. Migrated the log storage system to SeaweedFS to resolve performance issues in MongoDB.
  • SDK Integration. Migrated results data ingestion from native SDK to task handler side.
  • Task Related. Abstracted task-related logic into Task Scheduler, Task Handler and Task Runners to increase decoupling and improve scalability and maintainability.
  • Componentization. Introduced DI (dependency injection) framework and componentized modules, services and sub-systems.
  • Plugin Framework. Crawlab Plugin Framework (CPF) has been released. See more info [here](https://docs.crawlab.cn/en/guide/plugin/).
  • Git Integration. Git integration is implemented as a built-in feature.
  • Scrapy Integration. Scrapy integration is implemented as a plugin [spider-assistant](https://docs.crawlab.cn/en/guide/plugin/plugin-spider-assistant).
  • Dependency Integration. Dependency integration is implemented as a plugin [dependency](https://docs.crawlab.cn/en/guide/plugin/plugin-dependency).
  • Notifications. Notifications feature is implemented as a plugin [notification](https://docs.crawlab.cn/en/guide/plugin/plugin-notification).
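The Task Queue item above mentions a priority queue as an example of the added flexibility: a plain list is first-in, first-out, whereas a queryable store can be ordered by a priority field. The following is a minimal in-memory sketch of that idea only. It uses Python's `heapq` rather than MongoDB, and the `Task` fields, the default priority of 5, and the convention that lower numbers run first are all assumptions made for illustration, not Crawlab's actual schema.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    # Hypothetical fields: a lower priority value is dequeued first.
    priority: int
    seq: int  # tie-breaker that preserves insertion order
    spider: str = field(compare=False)


class PriorityTaskQueue:
    """In-memory stand-in for a task queue ordered by priority."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def push(self, spider, priority=5):
        heapq.heappush(self._heap, Task(priority, next(self._counter), spider))

    def pop(self):
        return heapq.heappop(self._heap).spider

    def __len__(self):
        return len(self._heap)


queue = PriorityTaskQueue()
queue.push("news", priority=5)
queue.push("prices", priority=1)  # urgent: jumps ahead of "news"
queue.push("blogs", priority=5)
order = [queue.pop() for _ in range(len(queue))]
print(order)  # ['prices', 'news', 'blogs']
```

Tasks with equal priority come out in the order they were pushed, which is the behaviour a simple FIFO list would give for every task.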

Frontend

  • Vue 3. Migrated to Vue 3, the latest version of the frontend framework, to support more advanced features such as the composition API and TypeScript.
  • UI Framework. Rebuilt with the Vue 3-based UI framework Element-Plus, replacing Vue-Element-Admin, for more flexibility and functionality.
  • Advanced File Editor. Support more advanced file editor features including drag-and-drop copying/moving files, renaming, deleting, file editing, code highlight, nav tabs, etc.
  • Customizable Table. Support more advanced built-in operations such as columns adjustment, batch operation, searching, filtering, sorting, etc.
  • Nav Tabs. Support multiple nav tabs for viewing different pages.
  • Batch Creation. Support batch creating objects including spiders, projects, schedules, etc.
  • Detail Navigation. Sidebar navigation in detail pages.
  • Enhanced Dashboard. More stats charts in home page dashboard.

Miscellaneous

crawlab - v0.6.0-beta.20211224

Published by tikazyq almost 3 years ago

Change Log (v0.6.0-beta.20211224)

Overview

This is the third beta release for the next major version v0.6.0. With more features and optimizations coming in, the official v0.6.0 release is approaching.

Enhancement

  • Internationalization. Support Chinese.
  • CLI Upload Spider. #1020
  • Official Plugins. Allow users to install official plugins on Crawlab web UI.
  • More Documentation. Added documentation for plugins and CLI.

Bug Fixes

TODOs

  • Associated Tasks. There will be main tasks and their sub-tasks if task mode is "all nodes" or "selected nodes".
  • Crontab Editor. Frontend component that visualizes crontab editing.
  • Results Deduplication.
  • Environment Variables.
  • Frontend Utility Enhancement. Advanced features such as saved table customization.
  • Log Auto Cleanup.
  • More Documentation.
  • E2E Tests.
  • Frontend Output File Size Optimization.

What Next

The next version could be the official release of v0.6.0, but this is not yet determined. More tests will be run against the current beta version to ensure robustness and production-ready deployment.

crawlab - v0.6.0-beta.20211120

Published by tikazyq almost 3 years ago

Change Log (v0.6.0-beta.20211120)

Overview

This is the second beta release for the next major version v0.6.0. With more features and optimizations coming in, the official v0.6.0 release is approaching.

Enhancement

Backend

  • Plugin Framework. Crawlab Plugin Framework (CPF) has been released. See more info here.
  • Git Integration. Git integration is implemented as a built-in feature.
  • Scrapy Integration. Scrapy integration is implemented as a plugin spider-assistant.
  • Dependency Integration. Dependency integration is implemented as a plugin dependency.
  • Notifications. Notifications feature is implemented as a plugin notification.
  • Documentation Site. Set up documentation site.

Frontend

  • Bug Fixing.

TODOs

  • Associated Tasks. There will be main tasks and their sub-tasks if task mode is "all nodes" or "selected nodes".
  • Crontab Editor. Frontend component that visualizes crontab editing.
  • Results Deduplication.
  • Environment Variables.
  • Internationalization. Support Chinese.
  • Frontend Utility Enhancement. Advanced features such as saved table customization.
  • Log Auto Cleanup.
  • More Documentation.

What Next

The next version could be the official release of v0.6.0, but this is not yet determined. More tests will be run against the current beta version to ensure robustness and production-ready deployment.

crawlab - v0.6.0-beta.20210803

Published by tikazyq about 3 years ago

Change Log (v0.6.0-beta.20210803)

Overview

This is the beta release for the next major version v0.6.0. It is recommended NOT to use it in production, as it is not fully tested and thus not stable enough. Furthermore, more features, including those not ready in the beta release (e.g. Git, Scrapy, Notification), are planned to be integrated into the live version in the form of plugins.

Enhancement

As a major release, v0.6 (including beta versions) consists of a number of large changes to enhance the performance, scalability, robustness and usability of Crawlab. This beta version is theoretically more robust than older versions, mainly in task execution, file synchronization and node management, yet we still recommend that users run thorough tests with various samples.

Backend

  • File Synchronization. Migrated file sync from MongoDB GridFS to SeaweedFS for better stability and robustness.
  • Node Communication. Migrated node communication from Redis-based RPC to gRPC. Worker nodes indirectly interact with MongoDB by making gRPC calls to the master node.
  • Task Queue. Migrated task queue from Redis list to MongoDB collection to allow more flexibility (e.g. priority queue).
  • Logging. Migrated the log storage system to SeaweedFS to resolve performance issues in MongoDB.
  • SDK Integration. Migrated results data ingestion from native SDK to task handler side.
  • Task Related. Abstracted task-related logic into Task Scheduler, Task Handler and Task Runners to increase decoupling and improve scalability and maintainability.
  • Componentization. Introduced DI (dependency injection) framework and componentized modules, services and sub-systems.

Frontend

  • Vue 3. Migrated to Vue 3, the latest version of the frontend framework, to support more advanced features such as the composition API and TypeScript.
  • UI Framework. Rebuilt with the Vue 3-based UI framework Element-Plus, replacing Vue-Element-Admin, for more flexibility and functionality.
  • Advanced File Editor. Support more advanced file editor features including drag-and-drop copying/moving files, renaming, deleting, file editing, code highlight, nav tabs, etc.
  • Customizable Table. Support more advanced built-in operations such as columns adjustment, batch operation, searching, filtering, sorting, etc.
  • Nav Tabs. Support multiple nav tabs for viewing different pages.
  • Batch Creation. Support batch creating objects including spiders, projects, schedules, etc.
  • Detail Navigation. Sidebar navigation in detail pages.
  • Enhanced Dashboard. More stats charts in home page dashboard.

TODOs

Because this is a beta release, some existing useful features such as Git and Scrapy integration may not be available. However, we are trying to include them in the official v0.6.0 release, as some of their core functionalities are already in the code base, and we will add them to the stable version only once they are fully tested.

  • Plugin Framework. Advanced features will exist in the form of plugins, or pluggable modules.
  • Git Integration. To be included as a plugin.
  • Scrapy Integration. To be included as a plugin.
  • Notifications. To be included as a plugin.
  • Associated Tasks. There will be main tasks and their sub-tasks if task mode is "all nodes" or "selected nodes".
  • Crontab Editor. Frontend component that visualizes crontab editing.
  • Results Deduplication.
  • Environment Variables.
  • Internationalization. Support Chinese.
  • Frontend Utility Enhancement. Advanced features such as saved table customization.
  • Log Auto Cleanup.
  • Documentation.

What Next

This beta release is only a preview and a test ground for the core functionalities in Crawlab v0.6. Therefore, we invite you to download it and run more tests. The official release is expected to be ready after major issues from the beta version are sorted out and the Plugin Framework and other key features are developed and fully tested. With that in mind, a second beta version before the main release is also possible.

crawlab - v0.5.1

Published by tikazyq about 4 years ago

Features / Enhancement

  • Added error message details.
  • Added Golang programming language support.
  • Added web driver installation scripts for Chrome Driver and Firefox.
  • Support system tasks. A "system task" is similar to a normal spider task; it allows users to view logs of general tasks such as installing languages.
  • Changed the method of installing languages from RPC to system tasks.

Bug Fixes

  • Fixed 500 error on first repo download in Spider Market page. #808
  • Fixed some translation issues.
  • Fixed 500 error in task detail page. #810
  • Fixed password reset issue. #811
  • Fixed issue where CSV could not be downloaded. #812
  • Fixed issue where Node.js could not be installed. #813
  • Fixed disabled status for batch adding schedules. #814

crawlab - v0.5.0

Published by tikazyq over 4 years ago

Features / Enhancement

  • Spider Market. Allow users to download open-source spiders into Crawlab.
  • Batch actions. Allow users to interact with Crawlab in batches, e.g. batch run tasks, batch delete spiders, etc.
  • Migrate MongoDB driver to MongoDriver.
  • Refactor and optimize node-related logic.
  • Change default task.workers to 16.
  • Change default nginx client_max_body_size to 200m.
  • Support writing logs to ElasticSearch.
  • Display error details in Scrapy page.
  • Removed Challenge page.
  • Moved Feedback and Disclaimer pages to navbar.

Bug Fixes

  • Fixed logs not expiring because the TTL index failed to be created.
  • Set default log expiry duration to 1 day.
  • task_id index not created.
  • docker-compose.yml fix.
  • Fixed 404 page.
  • Fixed issue where a worker node could not be created before the master node.

crawlab - v0.4.10

Published by tikazyq over 4 years ago

Features / Enhancement

  • Enhanced Log Management. Centralized log storage in MongoDB, reduced the dependency on PubSub, and enabled log error detection.
  • API Token. Allow users to generate API tokens and use them to integrate into their own systems.
  • Web Hook. Trigger a Web Hook HTTP request to a pre-defined URL when a task starts or finishes.
  • Auto Install Dependencies. Allow installing dependencies automatically from requirements.txt or package.json.
  • Auto Results Collection. Set results collection to results_<spider_name> if it is not set.
  • Optimized Project List. No longer display the "No Project" item in the project list.
  • Upgrade Node.js. Upgrade Node.js version from v8.12 to v10.19.
  • Add Run Button in Schedule Page. Allow users to manually run task in Schedule Page.
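The Web Hook item above describes an HTTP request sent to a pre-defined URL when a task starts or finishes. The sketch below shows what building such a request could look like with Python's standard library. The payload fields (`task_id`, `event`), the event names, and the example URL are assumptions for illustration; these release notes do not document the payload Crawlab actually sends.

```python
import json
import urllib.request


def build_webhook_request(url, task_id, event):
    """Build (but do not send) a JSON POST describing a task event.

    The payload fields are hypothetical; Crawlab's real payload may differ.
    """
    if event not in ("start", "finish"):
        raise ValueError(f"unsupported event: {event}")
    body = json.dumps({"task_id": task_id, "event": event}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_webhook_request("http://localhost:9000/hook", "task-123", "finish")
payload = json.loads(req.data)
# To actually deliver it: urllib.request.urlopen(req, timeout=5)
```

Separating request construction from delivery keeps the example runnable without a listening server and makes the payload easy to inspect.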

Bug Fixes

  • Cannot register. #670
  • Spider schedule tab cron expression shows seconds. #678
  • Missing daily stats in spider. #684
  • Results count not updated in time. #689

crawlab - v0.4.9

Published by tikazyq over 4 years ago

Features / Enhancement

  • Challenges. Users can achieve different challenges based on their actions.
  • More Advanced Access Control. More granular access control, e.g. normal users can only view/manage their own spiders/projects and admin users can view/manage all spiders/projects.
  • Feedback. Allow users to send feedback and ratings to the Crawlab team.
  • Better Home Page Metrics. Optimized metrics display on home page.
  • Configurable Spiders Converted to Customized Spiders. Allow users to convert their configurable spiders into customized spiders which are also Scrapy spiders.
  • View Tasks Triggered by Schedule. Allow users to view tasks triggered by a schedule. #648
  • Support Results De-Duplication. Allow users to configure de-duplication of results. #579
  • Support Task Restart. Allow users to re-run historical tasks.
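As a rough illustration of what results de-duplication can mean in practice, the sketch below removes repeated results by a chosen key field, either keeping the first record seen or letting later records overwrite earlier ones. The function name, the `overwrite` flag, and the use of `url` as the key are hypothetical; the release notes do not describe how Crawlab configures or implements this feature.

```python
def deduplicate(items, key, overwrite=False):
    """Return items with unique values of `key`.

    overwrite=False keeps the first item seen for each key value;
    overwrite=True keeps the last one, in the first-seen position.
    """
    seen = {}
    for item in items:
        k = item[key]
        if k not in seen or overwrite:
            seen[k] = item
    return list(seen.values())


results = [
    {"url": "https://example.com/a", "title": "A"},
    {"url": "https://example.com/b", "title": "B"},
    {"url": "https://example.com/a", "title": "A (updated)"},
]
kept_first = deduplicate(results, key="url")
kept_last = deduplicate(results, key="url", overwrite=True)
```

Here `kept_first` retains the original "A" record, while `kept_last` replaces it with "A (updated)"; both contain two records.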

Bug Fixes

  • CLI unusable on Windows. #580
  • Re-upload error. #643 #640
  • Upload missing folders. #646
  • Unable to add schedules in Spider Page.

crawlab - v0.4.8

Published by tikazyq over 4 years ago

Features / Enhancement

  • Support Installations of More Programming Languages. Users can now install or pre-install more programming languages including Java, .NET Core and PHP.
  • Installation UI Optimization. Users can better view and manage installations on Node List page.
  • More Git Support. Allow users to view the Git commit history and check out a specific commit.
  • Support Hostname Node Registration Type. Users can set the hostname as the node key, which serves as the unique identifier.
  • RPC Support. Added RPC support to better manage node communication.
  • Run On Master Switch. Users can determine whether to run tasks on master. If not, all tasks will be run only on worker nodes.
  • Disabled Tutorial by Default.
  • Added Related Documentation Sidebar.
  • Loading Page Optimization.

Bug Fixes

  • Duplicated Nodes. #391
  • Duplicated Spider Upload. #603
  • A failed dependency installation makes the dependency installation feature unusable. #609
  • Create Tasks for Offline Nodes. #622

crawlab - v0.4.7

Published by tikazyq over 4 years ago

Features / Enhancement

  • Better Support for Scrapy. Spiders identification, settings.py configuration, log level selection, spider selection. #435
  • Git Sync. Allow users to sync git projects to Crawlab.
  • Long Task Support. Users can add long-task spiders, which are meant to run without finishing. #425
  • Spider List Optimization. Tasks count by status, tasks detail popup, legend. #425
  • Upgrade Check. Check for the latest version and notify users to upgrade.
  • Spiders Batch Operation. Allow users to run/stop spider tasks and delete spiders in batches.
  • Copy Spiders. Allow users to copy an existing spider to create a new one.
  • WeChat Group QR Code.

Bug Fixes

  • Schedule Spider Selection Issue. Fields not responding to spider change.
  • Cron Jobs Conflict. Possible bug when two spiders have cron jobs set to the same time. #515 #565
  • Task Log Issue. Different tasks write to the same log file if triggered at the same time. #577
  • Task List Filter Options Incomplete.

crawlab - v0.4.6

Published by tikazyq over 4 years ago

Features / Enhancement

  • SDK for Node.js. Users can apply SDK in their Node.js spiders.
  • Log Management Optimization. Log search, error highlight, auto-scrolling.
  • Task Execution Process Optimization. Allow users to be redirected to task detail page after triggering a task.
  • Task Display Optimization. Added "Param" in the Latest Tasks table in the spider detail page. #295
  • Spider List Optimization. Added "Update Time" and "Create Time" in spider list page.
  • Page Loading Placeholder.

Bug Fixes

  • Lost Focus in Schedule Configuration. #519
  • Unable to Upload Spider using CLI. #524

crawlab - v0.4.5

Published by tikazyq over 4 years ago

Features / Enhancement

  • Interactive Tutorial. Guide users through the main functionalities of Crawlab.
  • Global Environment Variables. Allow users to set global environment variables, which will be passed into all spider programs. #177
  • Project. Allow users to link spiders to projects. #316
  • Demo Spiders. Added demo spiders when Crawlab is initialized. #379
  • User Admin Optimization. Restrict privileges of admin users. #456
  • Setting Page Optimization.
  • Task Results Optimization.
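The Global Environment Variables item above says that globally defined variables are passed into all spider programs. One way to picture this is a runner that merges the global variables into the environment of each spider process it starts. The sketch below does that with `subprocess`; the variable name `PROXY_URL`, its value, and the `run_spider` helper are hypothetical and are not taken from Crawlab's code.

```python
import os
import subprocess
import sys


def run_spider(cmd, global_env):
    """Run a spider command with global variables merged into its environment."""
    env = {**os.environ, **global_env}  # global values override inherited ones
    result = subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Hypothetical global variable, set once on the platform.
global_env = {"PROXY_URL": "http://proxy.example:8080"}
# Stand-in for a spider program: it simply prints the variable it received.
spider_cmd = [sys.executable, "-c", "import os; print(os.environ['PROXY_URL'])"]
output = run_spider(spider_cmd, global_env)
print(output)
```

Because the variables travel through the process environment, the spider can be written in any language that can read environment variables.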

Bug Fixes

  • Unable to find spider file error. #485
  • Clicking delete button results in redirect. #480
  • Unable to create files in an empty spider. #479
  • Download results error. #465
  • crawlab-sdk CLI error. #458
  • Page refresh issue. #441
  • Results do not support JSON. #202
  • Getting all spiders after deleting a spider.
  • i18n warning.

crawlab - v0.4.4

Published by tikazyq almost 5 years ago

Features / Enhancement

  • Email Notification. Allow users to send email notifications.
  • DingTalk Robot Notification. Allow users to send DingTalk Robot notifications.
  • WeChat Robot Notification. Allow users to send WeChat Robot notifications.
  • API Address Optimization. Added relative URL path in frontend so that users don't have to specify CRAWLAB_API_ADDRESS explicitly.
  • SDK Compatibility. Allow users to integrate Scrapy or general spiders with Crawlab SDK.
  • Enhanced File Management. Added tree-like file sidebar to allow users to edit files much more easily.
  • Advanced Schedule Cron. Allow users to edit schedule cron with visualized cron editor.

Bug Fixes

  • nil returned error.
  • Error when using HTTPS.

crawlab - v0.4.3

Published by tikazyq almost 5 years ago

Features / Enhancement

  • Dependency Installation. Allow users to install/uninstall dependencies and add programming languages (Node.js only for now) on the platform web interface.
  • Pre-install Programming Languages in Docker. Allow Docker users to set CRAWLAB_SERVER_LANG_NODE to Y to pre-install Node.js environments.
  • Add Schedule List in Spider Detail Page. Allow users to view / add / edit schedule cron jobs in the spider detail page. #360
  • Align Cron Expression with Linux. Changed the cron expression from 6 elements to 5 elements, in line with Linux.
  • Enable/Disable Schedule Cron. Allow users to enable/disable the schedule jobs. #297
  • Better Task Management. Allow users to batch delete tasks. #341
  • Better Spider Management. Allow users to sort and filter spiders in the spider list page.
  • Added Chinese CHANGELOG.
  • Added GitHub Star Button at Nav Bar.
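For users with schedules written in the older 6-element form, the cron alignment described in the list above amounts to dropping one field. The sketch below assumes the extra element was a leading seconds field, which is a common convention for 6-field cron expressions but is not stated explicitly in these notes. The `to_linux_cron` helper is hypothetical.

```python
def to_linux_cron(expr):
    """Convert a 6-field cron expression (leading seconds) to the 5-field form.

    Assumes the extra field is a leading seconds field; 5-field input is
    returned unchanged apart from whitespace normalization.
    """
    fields = expr.split()
    if len(fields) == 5:
        return " ".join(fields)
    if len(fields) == 6:
        return " ".join(fields[1:])  # drop the seconds field
    raise ValueError(f"expected 5 or 6 fields, got {len(fields)}")


converted = to_linux_cron("0 */5 * * * *")
print(converted)  # */5 * * * *
```

Note that any non-zero seconds value is lost in the conversion, since the 5-field Linux format has minute-level resolution.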

Bug Fixes

  • Schedule Cron Task Issue. #423
  • Upload Spider Zip File Issue. #403 #407
  • Exit due to Network Failure. #340
  • Cron Jobs not Running Correctly
  • Schedule List Columns Mis-positioned
  • Clicking Refresh Button Redirected to 404 Page

crawlab - v0.4.2

Published by tikazyq almost 5 years ago

Features / Enhancement

  • Disclaimer. Added page for Disclaimer.
  • Call API to fetch version. #371
  • Configure to allow user registration. #346
  • Allow adding new users.
  • More Advanced File Management. Allow users to add / edit / rename / delete files. #286
  • Optimized Spider Creation Process. Allow users to create an empty customized spider before uploading the zip file.
  • Better Task Management. Allow users to filter tasks by selected criteria. #341

Bug Fixes

  • Duplicated nodes. #391
  • "mongodb no reachable" error. #373