- what is lambda
- why is lambda (for data science)
- how is lambda (to actually use)
- when is lambda (the right choice)
---~
- serverless
- it runs on servers, you just don't deal with that
- scalable
- it only costs when running, you just pay more as it does more
- micro-service
- it only does one small thing, you just have lots of different ones
---~
- Start a new lambda
- Set up lambda with user code - "cold start"
- Accept event 1 and process
- Wait
- Accept event 2 and process - "warmed up"
- Timeout and kill lambda
There can be multiple concurrent machines too
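One way to see the "cold start" versus "warmed up" difference from inside a function: module-level code runs once per container, the handler runs once per event. A minimal sketch, assuming hypothetical names (`INVOCATIONS`, `CONTAINER_STARTED_AT`) that are not from the talk's own code:

```python
import time

# Module-level code runs once, while the container is set up ("cold start").
CONTAINER_STARTED_AT = time.time()
INVOCATIONS = 0  # kept between events for as long as the container stays warm


def lambda_handler(event, context):
    """Report whether this event landed on a freshly started container."""
    global INVOCATIONS
    INVOCATIONS += 1
    return {
        "cold_start": INVOCATIONS == 1,
        "invocation": INVOCATIONS,
        "container_age_s": round(time.time() - CONTAINER_STARTED_AT, 3),
    }
```

Calling the handler twice in the same process mimics event 1 and event 2 above: only the first call reports a cold start.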
---~
- Pay only for what you use
- Manage only what you have to
- Deal with extra/new/bursty traffic seamlessly
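"Pay only for what you use" can be sketched as arithmetic: Lambda bills for compute time (GB-seconds) plus a per-request charge. The prices below are hypothetical placeholders, not AWS's actual rates, and the free tier is ignored:

```python
def lambda_cost(invocations, avg_duration_s, memory_gb,
                price_per_gb_second, price_per_request):
    """Estimate a bill from compute time (GB-seconds) plus request count."""
    gb_seconds = invocations * avg_duration_s * memory_gb
    return gb_seconds * price_per_gb_second + invocations * price_per_request


# Hypothetical prices, for illustration only; check current AWS pricing.
PRICES = {"price_per_gb_second": 0.00002, "price_per_request": 0.0000002}

quiet_day = lambda_cost(100, avg_duration_s=0.5, memory_gb=0.5, **PRICES)
busy_day = lambda_cost(100_000, avg_duration_s=0.5, memory_gb=0.5, **PRICES)
idle_day = lambda_cost(0, avg_duration_s=0.5, memory_gb=0.5, **PRICES)
```

The point of the sketch is the shape, not the numbers: the bill grows in step with traffic, and a day with no events costs nothing.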
---~
- Invocation payload (request and response)
- 6 MB (synchronous)
- 256 KB (asynchronous)
- Deployment package size
- 50 MB (zipped, for direct upload)
- 250 MB (unzipped, including layers)
- 3 MB (console editor)
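A rough pre-flight check against the payload limits listed above. This is a sketch that assumes 1 MB = 1024 * 1024 bytes and that the event is sent as UTF-8 JSON; `payload_fits` is a hypothetical helper name:

```python
import json

# Payload limits from the list above, in bytes (binary units are an assumption).
SYNC_PAYLOAD_LIMIT = 6 * 1024 * 1024
ASYNC_PAYLOAD_LIMIT = 256 * 1024


def payload_fits(event, asynchronous=False):
    """Return True if the JSON-encoded event is within the invocation limit."""
    size = len(json.dumps(event).encode("utf-8"))
    limit = ASYNC_PAYLOAD_LIMIT if asynchronous else SYNC_PAYLOAD_LIMIT
    return size <= limit


small = {"x": [1, 2, 3]}
large = {"x": "a" * 300_000}  # about 300 KB once encoded
```

Here `large` would pass as a synchronous call but not as an asynchronous one.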
---~
- data-scientists != dev-ops professionals
- but our work needs to be 'released'
- all data projects != ensemble xg-boost Keras TPU shenanigans
- "No ML is easier to manage than no ML" © @julsimon
- data-projects != single-goal monolithic systems
- separate concerns, code bases and complication
---~
- write your python
- lambda your python
- ???
- profit
import numpy as np
from scipy import stats
np.random.seed(12345678)
x = np.random.random(10)
y = 1.6*x + np.random.random(10)
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
- event driven
- an event is passed to a handler function
- json formatted
- events are json
- handler functions return json
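The event-in, json-out contract can be tried locally with an ordinary dict. A minimal sketch using only the standard library; the `values` field is a hypothetical event key chosen for illustration:

```python
import json


def lambda_handler(event, context):
    """Take a JSON-style event, return a JSON-style response."""
    values = event.get("values", [])  # "values" is a hypothetical field name
    mean = sum(values) / len(values) if values else None
    return {"body": json.dumps({"n": len(values), "mean": mean})}


# Local call: the event is a plain dict, and context is unused here.
response = lambda_handler({"values": [1, 2, 3, 4]}, None)
```

Nothing here is Lambda specific, which is why the same function can be tested on a laptop before it is deployed.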
---~
import json
from scipy import stats
import numpy as np

def lambda_handler(event, context):
    np.random.seed(12345678)
    x = np.random.random(10)
    y = 1.6*x + np.random.random(10)
    slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
    return_body = {
        "m": slope, "c": intercept, "r2": r_value ** 2,
        "p": p_value, "se": std_err
    }
    return {"body": json.dumps(return_body)}
---~
- json is built in by default
- so is boto3
PROBLEM
- lambda doesn't pip install ....
SOLUTION
- use layers
- numpy, scipy are published by aws
---~
PROBLEM
- New requirement needs pandas
SOLUTION
- Create custom layer
- pre-compiled code on a specific path deployed as a .zip
- for 'any package' * using some shell and docker
- * YMMV
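The "specific path" matters: for a Python layer the packages are expected under a top-level `python/` folder inside the zip. A small check before uploading, as a sketch; `layer_zip_problems` is a hypothetical helper name:

```python
import zipfile


def layer_zip_problems(zip_path, required_prefix="python/"):
    """Return the archive entries that sit outside the expected folder."""
    with zipfile.ZipFile(zip_path) as archive:
        return [name for name in archive.namelist()
                if not name.startswith(required_prefix)]
```

An empty list means every file sits under `python/`; anything returned is probably a packaging mistake.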
---~
requirements.txt
pandas==0.23.4
pytz==2018.7
get_layer_packages.sh
#!/bin/bash
export PKG_DIR="python"
rm -rf "${PKG_DIR}" && mkdir -p "${PKG_DIR}"
docker run --rm -v "$(pwd)":/foo -w /foo lambci/lambda:build-python3.6 \
    pip install -r requirements.txt --no-deps -t "${PKG_DIR}"
---~
execute.sh
chmod +x get_layer_packages.sh
./get_layer_packages.sh
zip -r pandas.zip . -i "python/*"
Then upload + create as layer with aws-cli or manually with console
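The slide uses aws-cli or the console for the upload; the same step can also be scripted from Python with boto3's `publish_layer_version`. This is a sketch that swaps in boto3 for the CLI, with hypothetical defaults for the layer name and runtime, and it needs AWS credentials to actually run:

```python
def layer_request(zip_bytes, layer_name="pandas", runtime="python3.6"):
    """Build the arguments for boto3's publish_layer_version call."""
    return {
        "LayerName": layer_name,  # hypothetical name, pick your own
        "Description": "pandas and pytz, built by get_layer_packages.sh",
        "Content": {"ZipFile": zip_bytes},
        "CompatibleRuntimes": [runtime],
    }


def publish_layer(zip_path="pandas.zip"):
    """Upload the zip as a new layer version and return its ARN."""
    import boto3  # imported here so layer_request works without boto3 installed

    with open(zip_path, "rb") as f:
        request = layer_request(f.read())
    response = boto3.client("lambda").publish_layer_version(**request)
    return response["LayerVersionArn"]
```

Keeping the request in its own function means it can be inspected without touching AWS; only `publish_layer` makes a network call.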
---~
'any package' *
- pandas
- pymysql (the lambda needs to be inside the same VPC)
- statsmodels
PROBLEM
- How does your team use your work?
SOLUTION
- use api gateway
- AWS service that puts REST api in front of the lambda
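Behind API Gateway the handler sees the HTTP request as an event and answers with a status code, headers and a string body. A minimal sketch, assuming the Lambda proxy integration; the `name` query parameter is hypothetical:

```python
import json


def lambda_handler(event, context):
    """Answer a proxied HTTP request with a JSON body."""
    # queryStringParameters is null when the request has no query string.
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")  # "name" is a hypothetical parameter
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"hello {name}"}),
    }
```

The body has to be a string, which is why the result is passed through `json.dumps` rather than returned as a dict.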
---~
Get help from an adult (dev-ops professional)
---~
If you can't find an adult
- be careful about exposing the api
- not obvious how and where it can be accessed
- resource policies
- swagger is an api templating syntax
- cloud formation
- click the 'deploy api' button after every change
- use multiple stages
PROBLEM
- copy-pasta code into console is bad
SOLUTION
- use AWS SAM cli
- local development + testing with docker
- 'cloudy' deployment with cloudformation cli
---~
Usage: sam [OPTIONS] COMMAND [ARGS]...
Commands:
  local     Run your Serverless application locally
            for quick development & testing.
  logs      Fetch logs for a function
  deploy    Deploy an AWS SAM application. This is an alias
            for 'aws cloudformation deploy'.
  build     Build your Lambda function code
  publish   Publish a packaged AWS SAM template to the AWS
            Serverless Application Repository.
  init      Initialize a serverless application.
  validate  Validate an AWS SAM template.
  package   Package an AWS SAM application. This is an alias
            for 'aws cloudformation package'.
---~
Workflow
sam init
sam local generate-event apigateway aws-proxy > event.json
sam build
- ⇵
sam local invoke -e event.json
---~
alias playitsam='sam build && sam local invoke -e event.json'
alias playitagainsam='sam build && sam local invoke -e'
---~
sam validate
sam package
sam deploy
---~
Transform: 'AWS::Serverless-2016-10-31'
Resources:
  RegressionFunction:
    # This resource creates a Lambda function.
    Type: 'AWS::Serverless::Function'
    Properties:
      # This function uses the python 3.7 runtime.
      Runtime: python3.7
      # This is the Lambda function's handler.
      Handler: app.lambda_handler
      # The location of the Lambda function code.
      CodeUri: ./regression
      # Event sources to attach to this function. In this case, we are attaching
      # one API Gateway endpoint to the Lambda function. The function is
      # called when a HTTP request is made to the API Gateway endpoint.
      Events:
        RegressionApi:
          # Define an API Gateway endpoint that responds to HTTP GET at /regression
          Type: Api
          Properties:
            Path: /regression
            Method: GET
This enables CI/CD, which is a Good Thing ™
---~
Get help from an adult (dev-ops professional)
but if you can't, list 'em and flip 'em
aws lambda list-functions | cfn-flip
Surple have 3 lambda data services
---~
- user triggered event
- queries specific data based on user selection
- user facing visualisation
- vpc cold starts
---~
- scheduled for all meters as ETL to DB
- highlight 'out of character' energy use
- user facing visualisation and notifications
---~
- scheduled for all meters as ETL to DB
- highlight 'extreme' energy use
- user email and notifications
- this was extra fun/complicated
- Ask me how
- tweaking cpu load has made more of a difference than tweaking timeout
- taking the time to set up SAM correctly has paid for itself in saved browser console work alone
- A Cloud Guru is built on lambda (cheaply?)
- And has some great material on it
- Deployment from SageMaker is possible
- CI/CD from GitLab is possible
---~
Good case
- 'traditional' models
- regression, timeseries, hopefully more...
- per 'reasonable' data set
- for each
- 'now in a minute'
- (not actually a minute, more like seconds)
- 'bursty'
- some, or lots of people need it then no one does
---~
Bad case
- 'fancy' models
- RAM limits, CPU limits
- whole scale
- across all
- immediate response
- can't afford a cold start: 'lambda your lambda'
- 24/7 flat load
- need 100% load 100% of the time
---~
Slides available: https://github.com/DaveParr/snakes_and_lambdas Twitter: @DaveParr
---~
PROBLEM
- The thing I want to use isn't in Python
- or Go, NodeJS, C#, Java
SOLUTION
- use Runtime Layers
- any language compiled into a layer
- accessed via bash or similar that processes event and passes to runtime
- bakdata/aws-lambda-r-runtime
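What the bootstrap script does can be sketched as a loop against the Lambda Runtime API: fetch the next event, hand it to the language runtime, post the result back. The sketch below is written in Python only to stay in the talk's language (a real custom runtime would do the same from bash or similar), leaves out error reporting, and uses hypothetical function names:

```python
import json
import os
import urllib.request

API_VERSION = "2018-06-01"


def next_invocation(base_url):
    """Wait for the next event; return (request_id, event)."""
    with urllib.request.urlopen(f"{base_url}/runtime/invocation/next") as resp:
        request_id = resp.headers["Lambda-Runtime-Aws-Request-Id"]
        return request_id, json.loads(resp.read())


def send_response(base_url, request_id, result):
    """Post the handler's result back for the given request."""
    url = f"{base_url}/runtime/invocation/{request_id}/response"
    data = json.dumps(result).encode("utf-8")
    request = urllib.request.Request(url, data=data, method="POST")
    urllib.request.urlopen(request).close()


def process_one(handler, get_next, respond):
    """One turn of the loop: fetch an event, run the handler, report back."""
    request_id, event = get_next()
    respond(request_id, handler(event))


def main(handler):
    """Run forever inside Lambda, which sets AWS_LAMBDA_RUNTIME_API."""
    base_url = f"http://{os.environ['AWS_LAMBDA_RUNTIME_API']}/{API_VERSION}"
    while True:
        process_one(
            handler,
            lambda: next_invocation(base_url),
            lambda request_id, result: send_response(base_url, request_id, result),
        )
```

`process_one` takes the fetch and respond steps as arguments so the loop can be exercised locally with stand-ins, without a running Lambda environment.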