ckan-import

Webapp to automate data import into CKAN and its DataStore.

Stars
5

A stand-alone webapp to automate importing data from various sources into CKAN and its DataStore.

Install and Deployment

Get it from github:

git clone ...

We use foreman (as provided by Heroku)

foreman start -f Procfile-dev

Implementation

Focus on import of Data Package stored in Github

Import of a Generic Data Package

  • Given URL to datapackage.json
    • optionally a specific file(s) that have changed
    • plus the CKAN API Key (our bearer token)
  • Identify target CKAN dataset
    • Use DataPackage.name attribute for Dataset name
    • Identify target resources (by name)
  • Start import process for each data file

Import of a Single Tabular Data Package resource

Require

  • CSV file URL
  • schema (optional?)
  • Target resource id (or dataset name + resource name?)
  • CKAN API Key

Steps:

  • Load CSV file (into memory or do we stream?)
  • (?) Check Schema is valid
  • Convert Schema to DataStore schema
  • Send data to DataStore

Github hook

  • Receive webhook payload
  • Determine if any action needed
  • Boot up Data package import process

Extras / questions

  • Handle data file renames
  • How do we deal with a schema change (e.g. a change in type)
    • Ans: if we drop data resource content and recreate we should be ok ...
  • Do we try to be smart with updates and only push changed rows (probably not)

User Stories

Persona:

  • Data User - less sophisticated (uses Excel but may not know what an API is)
  • Data Wrangler - more sophisticated (knows what an API is)

Import File and get Data API

As a Data Wrangler I want to provide my file and have it imported into CKAN so that I get a Data API

What kind of file?

  • CSV file
  • Excel file
  • GeoJSON file
  • ...

How do I provide

  • web interface
  • API (POST/GET url string or POST file content)

Questions:

  • Do we validate the file?
  • Do we have some process for e.g. tweaking the field types
  • What is the mapping between file and Dataset / Resource

Implementation

  • DataPusher already does most of this
    • What's missing is any kind of edit metadata step
    • No user interface

As a XXX I want to push my data file to github and have it automatically create/update the CKAN DataStore so that my Data API is up to date

  • This is very similar to import file - only difference is we get push notifications (github webhooks). so merge this with that example.

Github import

As a XXX I want to push my tabular data package to github and have it automatically create/update the CKAN DataStore so that my Data API is up to date

  • As it is already a data package importing should be very simple
  • If file is large we may need to worry about queues etc but probably keep it simple for present
  • How do we determine dataset to associate this with in CKAN?

One-Click Create a Dataset

As a XXX I want to provide my file and have it imported into CKAN so that I get a nice Dataset

  • what distinguishes from existing system? Ans: one-click nature

Automated regular import

As a Data Wrangler I want to have my data file automatically re-imported at regular intervals so that the DataStore (and Data API) stays up to date with my data.