ptt-crawler is a web crawler module designed to scrape data from Ptt.
MIT License
Ptt is the largest and best-known BBS (Bulletin Board System) in Taiwan, and a data source often used for big data analysis there. However, most Ptt crawlers are written in Python. To scrape Ptt data from Node.js, I built a simple crawler module in JavaScript and am sharing it for everyone to use.
npm install --save @waynechang65/ptt-crawler
const ptt_crawler = require('@waynechang65/ptt-crawler');

// await is only valid inside an async function in a CommonJS script
async function main() {
    // *** Initialize ***
    await ptt_crawler.initialize();

    // *** GetResult ***
    // Ptt PokemonGO board, 3 pages, skip pinned (fixed-bottom) posts, scrape post contents
    const ptt = await ptt_crawler.getResults({
        board: 'PokemonGO',
        pages: 3,
        skipPBs: true,
        getContents: true
    });
    console.log(ptt);

    // *** Close ***
    await ptt_crawler.close();
}

main().catch(console.error);
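The initialize / getResults / close sequence above leaves the crawler open if getResults() throws. A minimal sketch of one way to guard against that is shown here. `scrapeOnce` is a hypothetical helper and not part of ptt-crawler; it only assumes the three calls used in the example above.

```javascript
// Hypothetical helper, not part of ptt-crawler.
// Runs one scrape and always calls close(), even when getResults() rejects.
async function scrapeOnce(crawler, options) {
    await crawler.initialize();
    try {
        return await crawler.getResults(options);
    } finally {
        // Runs on success and on failure.
        await crawler.close();
    }
}

// Usage with the real module (needs network access, so it is left commented out):
// const ptt_crawler = require('@waynechang65/ptt-crawler');
// scrapeOnce(ptt_crawler, { board: 'PokemonGO', pages: 3, skipPBs: true, getContents: true })
//     .then(console.log)
//     .catch(console.error);
```

Passing the crawler in as an argument keeps the helper easy to try out with a stub object before pointing it at Ptt.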
{ titles[], urls[], rates[], authors[], dates[], marks[], contents[] }
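Since the result is a set of separate arrays, it can be handy to regroup them into one object per post. The sketch below assumes the arrays are index-aligned (entry i of each array describes the same post), which the shape above suggests but does not state. `toPosts` and the sample data are made up for illustration.

```javascript
// Hypothetical helper, not part of ptt-crawler.
// Assumes every array in the result is index-aligned (entry i describes post i).
function toPosts(result) {
    const fields = ['titles', 'urls', 'rates', 'authors', 'dates', 'marks', 'contents'];
    const names = ['title', 'url', 'rate', 'author', 'date', 'mark', 'content'];
    const count = (result.titles || []).length;
    const posts = [];
    for (let i = 0; i < count; i++) {
        const post = {};
        fields.forEach((field, k) => {
            const column = result[field];
            // A missing array (for example contents when it was not scraped) yields undefined.
            post[names[k]] = Array.isArray(column) ? column[i] : undefined;
        });
        posts.push(post);
    }
    return posts;
}

// Made-up sample data in the documented shape, without contents.
const sample = {
    titles: ['First post', 'Second post'],
    urls: ['https://example.com/1', 'https://example.com/2'],
    rates: ['10', '3'],
    authors: ['alice', 'bob'],
    dates: ['1/01', '1/02'],
    marks: ['', 'M']
};

console.log(toPosts(sample));
```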
git clone https://github.com/WayneChang65/ptt-crawler.git
npm install
npm start
options.board: name of the Ptt board to scrape
options.pages: number of pages to scrape
options.skipPBs: whether to skip pinned (fixed-bottom) posts
options.getContents: whether to scrape post contents (takes more time)
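The four options can be checked before a scrape starts. The sketch below is a hypothetical helper, not part of ptt-crawler. The expected types (string, positive integer, two booleans) are assumptions inferred from the usage example, and the module itself may accept omitted fields or other values.

```javascript
// Hypothetical helper, not part of ptt-crawler.
// Checks an options object against the four documented fields and
// returns a list of problems (an empty list means nothing was flagged).
// The expected types are assumptions inferred from the usage example.
function checkOptions(options) {
    const problems = [];
    if (typeof options.board !== 'string' || options.board.trim() === '') {
        problems.push('board must be a non-empty string');
    }
    if (!Number.isInteger(options.pages) || options.pages < 1) {
        problems.push('pages must be a positive integer');
    }
    for (const flag of ['skipPBs', 'getContents']) {
        if (typeof options[flag] !== 'boolean') {
            problems.push(flag + ' must be a boolean');
        }
    }
    return problems;
}

console.log(checkOptions({ board: 'PokemonGO', pages: 3, skipPBs: true, getContents: true }));
console.log(checkOptions({ board: '', pages: 0, skipPBs: true }));
```

Returning a list instead of throwing lets the caller decide whether to report every problem at once or stop at the first.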
Even though ptt-crawler is a small project, I hope it can keep improving. If you find a bug or any other problem, please open an Issue and describe it in detail. Contributions are welcome: feel free to fork the project and send a Pull Request. Thanks. :)