scrape-directory-listing

📂 Download all the files from a directory listing.

Download all the files from a directory listing such as https://www.ndbc.noaa.gov/data/ocean/.

Install

npm install scrape-directory-listing

yarn add scrape-directory-listing

pnpm add scrape-directory-listing

Hello there! Follow me @linesofcode or visit linesofcode.dev for more cool projects like this one.

Getting started

This example will recursively download all the files from https://www.ndbc.noaa.gov/data/ocean/.

import { scrapeDirectoryListing } from 'scrape-directory-listing';

const res = await scrapeDirectoryListing({
  url: 'https://www.ndbc.noaa.gov/data/ocean',
});

The response will contain an array of objects with the following properties:

{
    item: {
        description: string;
        modifiedAt: number;
        name: string;
        path: string;
        size: number | null;
        type: 'file' | 'directory';
    },
    data: ArrayBuffer;
    headers: Headers;
}
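
Because each entry carries the item metadata shown above, you can filter the results before processing them. The example below keeps only entries whose type is 'file' and logs their path and size; it assumes directory entries can appear in the results alongside files, as the type field suggests.

import { scrapeDirectoryListing } from 'scrape-directory-listing';

const res = await scrapeDirectoryListing({
  url: 'https://www.ndbc.noaa.gov/data/ocean',
});

// Keep only file entries; skip anything reported as a directory
const files = res.filter((entry) => entry.item.type === 'file');

for (const { item } of files) {
  console.log(`${item.path} (${item.size ?? '?'} bytes)`);
}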

Writing to the file system

import { scrapeDirectoryListing } from 'scrape-directory-listing';
import { mkdir, writeFile } from 'fs/promises';

const res = await scrapeDirectoryListing({
  url: 'https://www.ndbc.noaa.gov/data/ocean',
});

const first = res[0];

// Make sure the output directory exists before writing the first file
await mkdir('output', { recursive: true });
await writeFile('output/' + first.item.name, Buffer.from(first.data));
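
To save every file rather than just the first one, loop over the results. The sketch below mirrors the listing's structure under output/ by deriving each target path from item.path; it assumes item.path is the entry's path within the listing.

import { scrapeDirectoryListing } from 'scrape-directory-listing';
import { mkdir, writeFile } from 'fs/promises';
import { dirname, join } from 'path';

const res = await scrapeDirectoryListing({
  url: 'https://www.ndbc.noaa.gov/data/ocean',
});

for (const entry of res) {
  // Recreate the listing's directory structure under output/
  const target = join('output', entry.item.path);
  await mkdir(dirname(target), { recursive: true });
  await writeFile(target, Buffer.from(entry.data));
}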

Custom fetch and concurrency

You can pass a custom fetch function and combine it with your own logic, for example to control the number of concurrent requests.

import { scrapeDirectoryListing } from 'scrape-directory-listing';
import pLimit from 'p-limit';

// Limit to 1 request at a time
const limit = pLimit(1);

const res = await scrapeDirectoryListing({
  url: 'https://www.ndbc.noaa.gov/data/ocean',
  fetchFileFn: async (item) => {
    return limit(() => fetch(item.url));
  },
});
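
The same hook can be used to customise each request in other ways. A minimal sketch that aborts any single download after 30 seconds, assuming fetchFileFn receives an item with a url property as in the example above:

import { scrapeDirectoryListing } from 'scrape-directory-listing';

const res = await scrapeDirectoryListing({
  url: 'https://www.ndbc.noaa.gov/data/ocean',
  fetchFileFn: async (item) => {
    // Abort any single request that takes longer than 30 seconds
    return fetch(item.url, { signal: AbortSignal.timeout(30_000) });
  },
});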