
An image crawler written in Python.


Comic Crawler

Comic Crawler is a Python script for scraping images. It comes with a simple download manager, a library feature, and easy extensibility.


Comic Crawler is on PyPI. After installing Python, you can install it directly with the pip command.

Install Python

You need Python 3.11 or later. The installer can be downloaded from its
official website.

During installation, remember to select "Add python.exe to path" so that the pip command is available.

Install Deno

Comic Crawler uses Deno to analyze websites that require executing JavaScript.

On Windows 10 (1709) or later, you can install it by running the following command in cmd:


   winget install deno

Install Comic Crawler

Run the following command in cmd:


pip install comiccrawler



To upgrade an existing installation:

pip install comiccrawler --upgrade --upgrade-strategy eager

Finally, run the following command in cmd to start Comic Crawler:


comiccrawler gui

Supported domains




As a CLI tool:


   Usage:
     comiccrawler [--profile=] (
       domains |
       download [--dest=<save_path>] |
       gui
     )
     comiccrawler (--help | --version)

   Commands:
     domains     List the supported domains
     download    Download the specified url
     gui         Launch the main window

   Options:
     --profile   Folder where the settings file is stored (default: "~/comiccrawler")
     --dest      Download directory (default: ".")
     --help      Show the help message
     --version   Show the version

Or you can use it in your Python script:

.. code:: python

from comiccrawler.mission import Mission
from comiccrawler.analyzer import Analyzer
from comiccrawler.crawler import download

# create a mission
m = Mission(url="")
# analyze the mission so that m.episodes is populated
Analyzer(m).analyze()

# select the episodes you want
for ep in m.episodes:
    if ep.title != "chapter 123":
        ep.skip = True

# download to savepath
download(m, "path/to/save")
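
The episode selection loop above can be factored into a small helper. The following is only a sketch: `keep_only` is a hypothetical name, not part of comiccrawler, and it assumes nothing about episodes beyond the `title` and `skip` attributes used above. `SimpleNamespace` objects stand in for real episodes so the snippet runs without the library installed.

```python
from types import SimpleNamespace


def keep_only(episodes, wanted_titles):
    """Mark every episode as skipped unless its title is in wanted_titles.

    Hypothetical helper: relies only on the `title` and `skip` attributes.
    Returns the number of episodes that will be downloaded.
    """
    wanted = set(wanted_titles)
    kept = 0
    for ep in episodes:
        ep.skip = ep.title not in wanted
        if not ep.skip:
            kept += 1
    return kept


# Stand-in objects; with the real library you would pass m.episodes instead.
episodes = [SimpleNamespace(title=f"chapter {n}", skip=False) for n in (122, 123, 124)]
kept = keep_only(episodes, ["chapter 123"])
print(kept, [ep.skip for ep in episodes])
```

Returning the count makes it easy to notice a typo in a title before starting a download that would fetch nothing.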


.. figure:: :alt: Main window

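  • Paste a URL into the text field, then click "Add link" or press Enter.
  • If the clipboard contains a supported URL and the text field is empty, the program pastes it automatically.
  • Right-click a mission to add it to the library. Missions in the library are checked for updates every time the program starts.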



; Program to run after a download completes. {target} is replaced with the absolute path of the mission folder
runafterdownload = 7z a "{target}.zip" "{target}"

; Automatically check the library for updates on startup
libraryautocheck = true

; Interval between update checks (unit: hours)
autocheck_interval = 24

; Download destination folder. Relative paths are resolved against the settings folder.
savepath = download

; Enable grabber debugging
errorlog = false

; Autosave every 5 minutes
autosave = 5

; Save files under their original download filename instead of the page number
; Using this option is strongly discouraged
originalfilename = false

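; Automatically reformat numbers in episode titles, e.g. to zero-pad them
; Example: 第1集 -> 第001集
; For details on the format specification, see Python's format string syntax
; Note: this setting affects every number in the filename, including mixed alphanumeric IDs such as those from instagram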
titlenumberformat = {:03d}

; Use an http/https proxy for connections
proxy =

; Select all episodes by default when adding a new mission
selectall = true

; Do not create a subfolder for each episode; put all images directly in the mission folder
noepfolder = true

; Action to take when a duplicate mission is added
; update: check for updates
; reselect_episodes: reselect the episodes
mission_conflict_action = update

; Whether to verify encrypted (SSL) connections. Defaults to true
verify = false

; Read cookies from the browser, using yt-dlp's cookies-from-browser
browser = firefox

; Name of the browser profile
browser_profile = act3nn7e.default
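
To illustrate what a value such as `titlenumberformat = {:03d}` does, here is a minimal sketch that applies a Python format string to every run of digits in a title. How Comic Crawler implements this internally is not shown in this document; `format_title_numbers` is a hypothetical name, and the use of `re.sub` over each digit run is an assumption made for illustration.

```python
import re


def format_title_numbers(title, number_format="{:03d}"):
    """Apply number_format to every run of digits in title (illustrative only)."""
    return re.sub(
        r"\d+",
        lambda match: number_format.format(int(match.group())),
        title,
    )


padded = format_title_numbers("Chapter 1")
mixed = format_title_numbers("B0rk3n_12")
print(padded)  # Chapter 001
print(mixed)   # B000rk003n_012
```

The second call shows the caveat from the note above: digits inside a mixed alphanumeric ID are padded as well.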

  • The settings file is located at ~\comiccrawler\setting.ini. You can pass the --profile option at launch to change the default location. (On Windows, ~ is expanded to %HOME% or %USERPROFILE%.)

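  • Run comiccrawler gui once and close it; the settings file is generated automatically. If a Comic Crawler update introduces new settings, they are added to the settings file when the program closes.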

  • Individual sites can have their own settings, usually for login-related information.

  • Changes to the settings file take effect after a restart. If ComicCrawler is running, click "Reload settings" to load the new settings.

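    .. warning::

       If you edit and save the settings file while ComicCrawler is running and then quit ComicCrawler, your changes are lost, because ComicCrawler writes its settings back to the file before exiting.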

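  • Settings for individual sites do not affect each other. For example, if you set savepath = a in [DEFAULT] and savepath = b in [Pixiv], everything downloaded from pixiv is saved to folder b, while everything else uses the default and is saved to folder a.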
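
The fallback from a site section to [DEFAULT] is standard behavior of Python's `configparser`, which the module example refers to as the source of each module's config object. The sketch below reproduces the savepath example with an in-memory settings string. The `[Example]` section and the inline settings text are made up for illustration, and this is not Comic Crawler's actual loading code.

```python
import configparser

# Made-up settings text mirroring the savepath example.
SETTINGS = """
[DEFAULT]
savepath = a
libraryautocheck = true

[Pixiv]
savepath = b

[Example]
"""

parser = configparser.ConfigParser()
parser.read_string(SETTINGS)

pixiv_path = parser["Pixiv"]["savepath"]      # overridden in [Pixiv]
example_path = parser["Example"]["savepath"]  # falls back to [DEFAULT]
autocheck = parser["Example"].getboolean("libraryautocheck")

print(pixiv_path, example_path, autocheck)  # b a True
```

All values are stored as strings, so typed accessors such as `getboolean` are needed to read flags like `libraryautocheck`.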

Module example

Starting from version 2016.4.21, you can add your own module to ~/comiccrawler/mods/

.. code:: python

#! python3

"""
This is an example to show how to write a comiccrawler module.

"""


import re
from urllib.parse import urljoin, urlparse
from comiccrawler.episode import Episode

# The header used in grabber method. Optional.
header = {}

# The cookies. Optional.
cookie = {}

# Match domain. Support sub-domain, which means "" will match
# "*"
domain = ["", ""]

# Module name
name = "Example"

# With noepfolder = True, Comic Crawler won't generate subfolder for each
# episode. Optional, default to False.
noepfolder = False

# If False then setup the referer header automatically to mimic browser behavior.
# If True then disable this behavior.
# Default: False
no_referer = True

# Wait 5 seconds before downloading another image. Optional, default to 0.
rest = 5

# Wait 5 seconds before analyzing the next page in the analyzer. Optional,
# default to 0.
rest_analyze = 5

# User settings which could be modified from setting.ini. The keys are
# case-sensitive.
# After loading the module, the config dictionary would be converted into
# a ConfigParser section data object so you can e.g. call
# config.getboolean("use_largest_image") directly.
# Optional.
config = {
    # The config value can only be str
    "use_largest_image": "true",
    # These special config starting with `cookie_` will be automatically
    # used when grabbing html or image.
    "cookie_user": "user-default-value",
    "cookie_hash": "hash-default-value"
}

def load_config():
    """This function will be called each time the config reloads. Optional.
    """
    pass

def get_title(html, url):
    """Return mission title.

    The title would be used in saving filepath, so be sure to avoid
    duplicated title.
    """
    return re.search("<h1 id='title'>(.+?)</h1>", html).group(1)

def get_episodes(html, url):
    """Return episode list.

    The episode list should be sorted by date, oldest first.
    If it is a multi-page list, specify the URL of the next page in
    get_next_page. Comic Crawler would grab the next page and call this
    function again.

    The `Episode` object accepts an `image` property which can be a list of `Image`.
    However, unlike `get_images`, the `Episode` object is JSON-stringified and saved
    to the disk, therefore you must only use JSON-compatible types i.e. no `Image.get_url`.
    """
    match_list = re.findall("<a href='(.+?)'>(.+?)</a>", html)
    return [Episode(title, urljoin(url, ep_url))
            for ep_url, title in match_list]

def get_images(html, url):
    """Get the URL of all images.

    The return value could be:

    -  A list of image.
    -  A generator yielding image.
    -  An image, when there is only one image on the current page.

    Comic Crawler treats following types as an image:

    -  str - the URL of the image
    -  callable - return a URL when called
    -  comiccrawler.core.Image - use it to provide customized filename.

    While receiving the value, it is converted to an Image instance. See ``comiccrawler.core.Image.create()``.

    If the episode has multi-pages, use get_next_page to change page.

    Use generators with caution! If the generator raises any error between
    two images, next call to the generator will always result in
    StopIteration, which means that Comic Crawler will think it had crawled
    all images and navigate to next page. If you have to call grabhtml()
    for each image (i.e. it may raise HTTPError), use a list of
    callback instead!
    """
    return re.findall("<img src='(.+?)'>", html)

def get_next_page(html, url):
    """Return the URL of the next page."""
    match = re.search("<a id='nextpage' href='(.+?)'>next</a>", html)
    if match:
        return match.group(1)

def get_next_image_page(html, url):
    """Return the URL of the next page.

    If this method is defined, it will be used by the crawler and ``get_next_page`` would be ignored.
    Therefore ``get_next_page`` will only be used by the analyzer.
    """
    pass

def redirecthandler(response, crawler):
    """Downloader will call this hook if redirect happens during downloading
    an image. Sometimes services redirect users to an unexpected URL. You
    can check it here.
    """
    if response.url.endswith("404.jpg"):
        raise Exception("Something went wrong")

def errorhandler(error, crawler):
    """Downloader will call errorhandler if there is an error happened when
    downloading image. Normally you can just ignore this function.
    """
    pass

def imagehandler(ext, b):
    """If this function exists, Comic Crawler will call it before writing
    the image to disk. This allows the module to modify the image after
    the download.

    @ext  str, file extension, including ".". (e.g. ".jpg")
    @b    The bytes object of the image.

    It should return a (modified_ext, modified_b) tuple.
    """
    return (ext, b)

def grabhandler(grab_method, url, **kwargs):
    """Called when the crawler is going to make a web request. Use this hook
    to override the default grabber behavior.

    @grab_method  function, could be ``grabhtml`` or ``grabimg``.
    @url          str, request URL.
    @kwargs       other arguments that will be passed to grabber.

    By returning ``None``, the default grabber behavior is used.
    """
    if "/api/" in url:
        kwargs["headers"] = {"some-api-header": "some-value"}
        return grab_method(url, **kwargs)

def after_request(crawler, response):
    """Called after the request is made."""
    if response.url.endswith("404.jpg"):
        raise Exception("Something went wrong")

def session_key(url):
    """Return a key to identify the session. If the key is the same, the
    session would be shared. Otherwise, a new session would be created.

    For example, you may want to separate the session between the main site
    and the API endpoint.

    Return None to pass the URL to next key function.
    """
    r = urlparse(url)
    if r.path.startswith("/api/"):
        return (r.scheme, r.netloc, "api")
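
The generator caveat in the ``get_images`` docstring follows from how Python generators work in general: once a generator raises, it is finished, and every later call reports StopIteration. The standalone sketch below shows the difference between the two approaches. `image_urls`, `make_callback`, and the example.com URLs are hypothetical, and `None` stands in for a page whose request fails.

```python
def image_urls(pages):
    """Generator that fails on a bad page (None stands in for a failed request)."""
    for page in pages:
        if page is None:
            raise ValueError("failed to fetch page")
        yield f"https://example.com/{page}.jpg"


pages = ["1", None, "3"]

# Generator version: the error on the second page ends the generator for good.
gen = image_urls(pages)
collected = [next(gen)]
try:
    next(gen)
except ValueError:
    pass
remaining = list(gen)  # empty: the third page is silently lost


def make_callback(page):
    """Return a callable that resolves one image URL when called."""
    def get_url():
        if page is None:
            raise ValueError("failed to fetch page")
        return f"https://example.com/{page}.jpg"
    return get_url


# Callback version: each image succeeds or fails independently.
callbacks = [make_callback(page) for page in pages]
last_url = callbacks[2]()

print(collected, remaining, last_url)
```

With a list of callbacks, a failure on one image can be retried without losing the images that come after it.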


  • Need a better error log system.
  • Support pool in Sankaku.
  • Add module.get_episode_id to make the module decide how to compare episodes.
  • Use HEAD to grab final URL before requesting the image?


