nrk-sapmi-crawler

Crawler for NRK Sapmi news bulletins that will be the basis for Sami stopword lists and an example search engine for content in Sami.

MIT License


Crawl news bulletins in Northern Sami, Lule Sami and South Sami.

The code is not the cleanest, but it works well enough and will hopefully keep working without much maintenance for the next couple of years. If you just want the datasets, install the stopword-sami module instead.

Getting a list of article IDs to crawl

import { getList, crawlHeader, readIfExists, calculateIdListAndWrite } from 'nrk-sapmi-crawler'

const southSami = {
  id: '1.13572943',
  languageName: 'Åarjelsaemien',
  url: 'https://www.nrk.no/serum/api/content/json/1.13572943?v=2&limit=1000&context=items',
  file: './lib/list.southSami.json'
}

// Bring it all together: fetch the URL and read the file; if there is new content, merge the arrays and write
Promise.all([getList(southSami.url, crawlHeader), readIfExists(southSami.file).catch(e => e)])
 .then((data) => {
    calculateIdListAndWrite(data, southSami.id, southSami.file, southSami.languageName)
 })
 .catch(function (err) {
   console.log('Error: ' + err)
 })
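The merge step described in the comment above can be sketched as follows. This is a minimal illustration of deduplicating a stored ID list against freshly fetched IDs; the helper name and array shapes are assumptions, not the package's actual `calculateIdListAndWrite` implementation (which also writes the result to disk):

```javascript
// Hypothetical sketch of the merge step: combine previously stored
// article IDs with newly fetched ones, keeping each ID exactly once
// and preserving first-seen order.
function mergeIdLists (existingIds, fetchedIds) {
  return [...new Set([...existingIds, ...fetchedIds])]
}
```

Only when the merged list differs from what is already on disk does a write become necessary, which keeps repeated crawls cheap.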

To change user-agent for the crawler

crawlHeader['user-agent'] = 'name of crawler/version - comment (e.g. contact info)'

Getting the content from a list of IDs

import { crawlContentAndWrite } from 'nrk-sapmi-crawler'
const appropriateTime = 2000 // milliseconds between requests

const southSami = {
  idFile: './datasets/list.southSami.json',
  contentFile: './datasets/content.southSami.json'
}
async function crawl () {
  await crawlContentAndWrite(southSami.idFile, southSami.contentFile, appropriateTime)
}

crawl()
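The `appropriateTime` value is the pause, in milliseconds, between requests so the crawler stays polite toward NRK's servers. A delay like that presumably boils down to a promisified `setTimeout`; a minimal sketch of such a helper (illustrative only, not exported by the package):

```javascript
// Hypothetical sleep helper: resolves after the given number of
// milliseconds, so a crawl loop can `await sleep(appropriateTime)`
// between requests.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
```

Raising the delay slows the crawl but reduces load on the API; 2000 ms is the value the README's example uses.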