Google Groups raw email crawler and parser
Crawls and parses raw emails from Google Groups. Fast and reliable! The downloaded messages are in RFC 822 format, taken verbatim from the Google servers.
Docker is the simplest option. Go to
Prepend
docker run -it --rm vmarkovtsev/ggmbox
to all the commands in the "Usage" section.
Crawler requirements: Python 3 and Scrapy. Download the ggmbox.py file.
Parser requirements: Go.
go get -v github.com/vmarkovtsev/ggmbox
scrapy runspider -a name=golang-nuts -o result.json -t json ggmbox.py
Replace "golang-nuts" with the actual group name. By default, the raw emails are saved to the corresponding directory (golang-nuts in this example).
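Since the downloaded messages are in RFC 822 format, they can be inspected with Python's standard email package. The following is a minimal sketch, not part of ggmbox: the function name iter_messages is hypothetical, and it assumes that every regular file under the output directory is a single raw email (the exact directory layout is not described here).

```python
import email
from email import policy
from pathlib import Path


def iter_messages(root):
    """Yield (path, message) for every regular file under root.

    Assumption: each file holds exactly one raw RFC 822 message.
    """
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        with path.open("rb") as fin:
            # policy.default returns EmailMessage objects with decoded headers.
            yield path, email.message_from_binary_file(fin, policy=policy.default)


def summarize(root):
    """Return a list of (sender, subject) pairs for all messages under root."""
    return [(msg["From"], msg["Subject"]) for _, msg in iter_messages(root)]
```

For example, summarize("golang-nuts") would list the sender and subject of each fetched message, provided the layout assumption holds.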
scrapy runspider -a name=chromium-dev -a prefix=a/chromium.org -o result.json -t json ggmbox.py
Note the "prefix" argument: it sets the name of the parent (a/chromium.org in this example). Some groups require it.
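When crawling several groups, it can help to assemble the two command lines above programmatically. This is a sketch only: the helper scrapy_command is hypothetical and uses nothing beyond the arguments shown above (name, prefix, -o, -t).

```python
def scrapy_command(name, prefix=None, output="result.json", spider="ggmbox.py"):
    """Build the argument list for crawling one group.

    name   -- the group name, e.g. "golang-nuts"
    prefix -- optional parent name, e.g. "a/chromium.org"
    """
    cmd = ["scrapy", "runspider", "-a", f"name={name}"]
    if prefix:
        # Only some groups need the parent prefix.
        cmd += ["-a", f"prefix={prefix}"]
    cmd += ["-o", output, "-t", "json", spider]
    return cmd


# To actually run a crawl (requires Scrapy and ggmbox.py in the working directory):
#   import subprocess
#   subprocess.run(scrapy_command("chromium-dev", prefix="a/chromium.org"), check=True)
```

Returning a list rather than a single string avoids shell quoting issues when the result is passed to subprocess.run.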
./parse golang-nuts > dataset.csv
Replace "golang-nuts" with the name of the directory that contains the raw emails. The plain-text threads will be written to dataset.csv, one thread per line. Special characters are escaped.
The golang-nuts group was fully fetched on 24/02/2018, with 30043 topics and 192654 messages, in 3 hours at a 1 Gbps connection speed. The raw emails occupied 1.6 GB on disk.
For comparison, icy/google-group-crawler ran for 1 day, fetched only 63%, and then stopped without reporting any errors; henryk/gggd fetched only 3% within one hour and then stopped unexpectedly too.
It takes 7 seconds to parse 1.6 GB of raw emails on a 32-core machine.
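The figures above can be turned into rough rates with a little arithmetic. The sketch below assumes that "3 hours" and "7 seconds" are exact and that 1.6 GB means 1.6e9 bytes, so the results are estimates only.

```python
TOPICS = 30043
MESSAGES = 192654
CRAWL_SECONDS = 3 * 3600   # assumption: exactly 3 hours
RAW_BYTES = 1.6e9          # assumption: decimal gigabytes
PARSE_SECONDS = 7


def benchmark_rates():
    """Derive approximate rates from the reported benchmark numbers."""
    return {
        "messages_per_second": MESSAGES / CRAWL_SECONDS,        # about 17.8
        "topics_per_second": TOPICS / CRAWL_SECONDS,            # about 2.8
        "avg_message_bytes": RAW_BYTES / MESSAGES,              # about 8.3 kB
        "parse_mb_per_second": RAW_BYTES / 1e6 / PARSE_SECONDS, # about 229
    }
```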
Contributions are welcome! See CONTRIBUTING.md and CODE_OF_CONDUCT.md.
License: MIT.