Demo project showing how to create a simple web scraping service using AWS Lambda and API Gateway
ISC License
This is a demo project which implements a trivial REST service for queuing web scraping jobs.
It is completely "serverless", designed to use the following Amazon services:
The Lambda functions are written in ES6, with async/await, transpiled using Babel, and bundled using Webpack.
The AWS resources are provisioned with CloudFormation, plus an add-on custom resource handler that allocates the API Gateway resources (which CloudFormation doesn't support natively yet).
Additionally, we use Apex to simplify the uploading of the Lambda functions.
It should cost very little to run.
API: https://3m7171w3c9.execute-api.us-west-2.amazonaws.com/prod
Web Interface: Under construction
curl -X POST -d url=http://jimpick.com/ https://3m7171w3c9.execute-api.us-west-2.amazonaws.com/prod/jobs
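The same job submission can be sketched in JavaScript. This is only an illustration: `buildJobRequest` is a hypothetical helper, not part of the project, and it assumes the service reads a form-encoded `url` field, as the curl call suggests. Unlike `curl -d`, `URLSearchParams` percent-encodes the value, which a form decoder should reverse.

```javascript
// Hypothetical helper (not part of the project): builds the request that
// the curl command above sends to the /jobs endpoint.
function buildJobRequest(apiBase, targetUrl) {
  // URLSearchParams serializes to application/x-www-form-urlencoded,
  // the same content type curl uses for -d.
  const body = new URLSearchParams({ url: targetUrl }).toString();
  return {
    // Strip any trailing slashes so the path is joined exactly once.
    endpoint: `${apiBase.replace(/\/+$/, '')}/jobs`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
      body,
    },
  };
}

// Usage (assumes Node 18+ for the global fetch):
// const { endpoint, options } = buildJobRequest(
//   'https://3m7171w3c9.execute-api.us-west-2.amazonaws.com/prod',
//   'http://jimpick.com/'
// );
// fetch(endpoint, options).then((res) => console.log(res.status));
```

Keeping the request construction separate from the network call makes it easy to inspect what would be sent before touching the deployed API.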
Using HTTPS:
git clone https://github.com/jimpick/lambda-scraper-queue.git
Or using SSH (git):
git clone git@github.com:jimpick/lambda-scraper-queue.git
Then:
cd lambda-scraper-queue
npm install
Note: These instructions are copied from: https://github.com/carlnordenfelt/aws-api-gateway-for-cloudformation#setup-iam-permissions
To be able to install the Custom Resource library you require a set of permissions. Configure your IAM user with the following policy and make sure that you have configured your aws-cli with access and secret key.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudformation:CreateStack",
        "cloudformation:DescribeStacks",
        "iam:CreateRole",
        "iam:CreatePolicy",
        "iam:AttachRolePolicy",
        "iam:GetRole",
        "iam:PassRole",
        "lambda:CreateFunction",
        "lambda:UpdateFunctionCode",
        "lambda:GetFunctionConfiguration",
        "cloudformation:DeleteStack",
        "lambda:DeleteFunction",
        "iam:ListPolicyVersions",
        "iam:DetachRolePolicy",
        "iam:DeletePolicy",
        "iam:DeleteRole"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
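Before running the installer, it can help to confirm that a policy document really grants every action in the list. The sketch below is a rough, hypothetical check (`missingActions` is not part of this project or of any AWS tool). It only looks at `Allow` statements and their `Action` patterns, and it ignores `Deny` statements, `Condition` blocks, and `Resource` scoping, so treat a clean result as a hint rather than proof.

```javascript
// Hypothetical helpers for a rough policy check. Assumption: only Allow
// statements and their Action patterns matter; Deny, Condition, and
// Resource are ignored.
function toArray(value) {
  return Array.isArray(value) ? value : [value];
}

// Convert an IAM action pattern such as "iam:*" into a case-insensitive
// RegExp ("*" matches any run of characters, "?" matches one character).
function actionPatternToRegExp(pattern) {
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, '\\$&');
  const source = escaped.replace(/\*/g, '.*').replace(/\?/g, '.');
  return new RegExp(`^${source}$`, 'i');
}

// Return the required actions that no Allow statement in the policy covers.
function missingActions(policy, requiredActions) {
  const allowed = toArray(policy.Statement)
    .filter((statement) => statement.Effect === 'Allow' && statement.Action)
    .flatMap((statement) => toArray(statement.Action))
    .map(actionPatternToRegExp);
  return requiredActions.filter(
    (action) => !allowed.some((pattern) => pattern.test(action))
  );
}

// Usage: pass the parsed policy JSON and the actions you expect to need.
// missingActions(policy, ['cloudformation:CreateStack', 'iam:CreateRole']);
```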
This installs a special AWS Lambda function so that the CloudFormation recipe can provision the API Gateway using custom resources from Carl Nordenfelt's API Gateway for CloudFormation project.
npm run deploy-custom-resource
If successful, a 'service token' will be saved to deploy/state/SERVICE_TOKEN
Copy config.template.js to config.js and customize it.
cp config.template.js config.js
The default config.template.js is:
export default {
  cloudFormation: 'lambdaScraperQueue',
  region: 'us-west-2',
  stage: 'prod'
}
cloudFormation: The name of the CloudFormation stack
region: The AWS region
stage: The API Gateway stage to create
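To see how the `region` and `stage` settings surface in practice, the sketch below builds the public invoke URL that API Gateway assigns to a deployed stage. `apiBaseUrl` is a hypothetical helper, and the REST API id (`3m7171w3c9` here, taken from the demo URL above) is assigned by AWS rather than stored in config.js.

```javascript
// Hypothetical helper: builds an API Gateway invoke URL from a REST API id
// and the region/stage fields of the config object.
function apiBaseUrl(restApiId, config) {
  return `https://${restApiId}.execute-api.${config.region}.amazonaws.com/${config.stage}`;
}

// Same values as the default config.template.js.
const config = {
  cloudFormation: 'lambdaScraperQueue',
  region: 'us-west-2',
  stage: 'prod',
};

console.log(apiBaseUrl('3m7171w3c9', config));
// prints "https://3m7171w3c9.execute-api.us-west-2.amazonaws.com/prod"
```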
npm run create-cloudformation
The command returns immediately, but the stack will take a while to complete because it deploys a lot of resources. It's a good idea to watch the CloudFormation task in the AWS Web Console to ensure that it completes without errors.
Note: When working with the CloudFormation recipe, you can also use npm run update-cloudformation and npm run delete-cloudformation.
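Because the create command returns before the stack is ready, a script could poll the stack status instead of watching the console. The following is a minimal sketch, not something the project ships: `classifyStackStatus` and `waitForStack` are hypothetical names, and `getStatus` is an injected function you would implement yourself (for example, by wrapping a DescribeStacks call) that resolves to the stack's status string.

```javascript
// Hypothetical helper: maps a CloudFormation stack status string to a
// coarse outcome from the point of view of a create or update.
function classifyStackStatus(status) {
  // Anything still in progress (including rollbacks) is not settled yet.
  if (status.endsWith('_IN_PROGRESS')) return 'pending';
  if (status === 'CREATE_COMPLETE' || status === 'UPDATE_COMPLETE') return 'success';
  // CREATE_FAILED, ROLLBACK_COMPLETE, DELETE_COMPLETE, and so on.
  return 'failure';
}

// Hypothetical poller: getStatus is an async function supplied by the
// caller that resolves to the current stack status string.
async function waitForStack(getStatus, { intervalMs = 5000, maxAttempts = 120 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    const status = await getStatus();
    const outcome = classifyStackStatus(status);
    if (outcome !== 'pending') return { status, outcome };
    // Wait before asking again to avoid hammering the API.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error('Timed out waiting for the stack to settle');
}

// Usage, with a caller-provided status function:
// waitForStack(myGetStatus).then(({ status, outcome }) => console.log(status, outcome));
```

Injecting `getStatus` keeps the sketch independent of any particular AWS SDK version and makes the loop testable with a fake.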
When the CloudFormation stack in the previous step has been successfully provisioned (check the AWS Web Console), deploy the API to a stage.
The Custom Resource library currently doesn't support this from CloudFormation, so, for now, we need to do it manually.
Go to "API Gateway" in the Amazon web console, and select the desired API. Click the "Deploy API" button, and under "Deployment Stage", select "New Stage". Enter prod for the "Stage Name", and click the "Deploy" button.
npm run save-cloudformation
This will create a file in deploy/state/cloudFormation.json
npm run setup-apex
This generates build/apex/project.json
npm run compile-lambda
This will use webpack and babel to compile the source code in src/server/lambdaFunctions into build/apex/functions
The webpack configuration is in deploy/apex/webpack.config.es6.js
npm run deploy-lambda
This will run apex deploy in the build/apex directory to upload the compiled lambda functions.
Alternatively, if you want to execute the compile and deploy steps in one command, you can run: npm run deploy
npm run test
This will run both the local tests and the remote tests, which test the deployed API and lambda functions.
The local tests can be run as npm run test-local, and the remote tests can be run as npm run test-remote.
You can tail the CloudWatch logs:
npm run logs
This just executes apex logs -f in build/apex
npm run post-url
Submits a job to the API that scrapes http://jimpick.com/
You should be able to see lambda output in the logs (after a few seconds delay). Also, you should be able to see the files in S3 via the AWS Web Console.
I'm using Apex, but just for uploading the functions. I haven't investigated the other projects yet.