EMR deployment
EC2, EKS, Outposts, Serverless
open source frameworks
Spark, Hive
cost optimization
- transient clusters, Reserved Instances, Spot Instances, instance fleets (multi-AZ, high availability)
- AWS Graviton2 instances: better cost and performance, about 30% lower cost
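A rough sketch of how these cost levers combine, not a pricing model. Every number below (hourly price, node count, hours per day, Spot discount) is a hypothetical assumption for illustration; only the 30% Graviton2 figure comes from the notes above.

```python
def monthly_cost(nodes, price_per_hour, hours, discount=0.0):
    """Cost of running `nodes` instances for `hours`, with an optional fractional discount."""
    return round(nodes * price_per_hour * hours * (1.0 - discount), 2)


# Hypothetical inputs: 10 nodes at 0.20 USD per hour.
NODES = 10
PRICE = 0.20

# Long-running cluster: billed 24 hours a day for 30 days.
always_on = monthly_cost(NODES, PRICE, 24 * 30)

# Transient cluster: exists only while the daily 4-hour job runs.
transient = monthly_cost(NODES, PRICE, 4 * 30)

# Same transient cluster on Spot, assuming a 60% discount (assumption).
transient_spot = monthly_cost(NODES, PRICE, 4 * 30, discount=0.60)

# Same transient cluster on Graviton2, using the 30% figure from the notes.
transient_graviton = monthly_cost(NODES, PRICE, 4 * 30, discount=0.30)

if __name__ == "__main__":
    for label, cost in [
        ("always on", always_on),
        ("transient", transient),
        ("transient + spot", transient_spot),
        ("transient + graviton2", transient_graviton),
    ]:
        print(f"{label}: {cost} USD per month")
```

The point of the sketch is only the ordering: shutting the cluster down between jobs dominates, and Spot or Graviton2 discounts stack on top of that.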
scaling feature
scale up and down based on demand
EC2 enhancements
reduce start-up time and task node start times, lower cost with Spark shuffle, reduce cost and improve performance with EBS gp3 volumes
EMR on EKS
job templates let data engineers simplify jobs with common params, Spark SQL runner: run scripts directly through the API, DynamoDB connector
EMR 6.9.0: run time about 20% better than OSS Spark 3.3.0
zero rename
- S3 has no native rename: moving a file is a copy and replace, low performance
- transactional data lakes, record level
- atomic changes, read/write isolation, high-throughput ingestion, small-file compaction, row-level upserts and deletes
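To see why the copy-and-replace rename noted above is slow, here is a toy simulation that uses a plain dict as a stand-in for an object store. Nothing here calls S3; `rename_prefix` and the key names are hypothetical. It only counts operations to show that the work grows with the number of objects.

```python
def rename_prefix(store, src_prefix, dst_prefix):
    """Emulate a 'rename' on an object store: copy each object, then delete the original.

    Returns the number of store operations performed.
    """
    operations = 0
    # Snapshot the matching keys first so the dict is not mutated while iterating.
    for key in [k for k in store if k.startswith(src_prefix)]:
        new_key = dst_prefix + key[len(src_prefix):]
        store[new_key] = store[key]  # copy
        del store[key]               # delete the source
        operations += 2
    return operations


if __name__ == "__main__":
    store = {f"tmp/part-{i}": b"data" for i in range(1000)}
    print(rename_prefix(store, "tmp/", "final/"), "operations for 1000 objects")
```

A job that writes its output straight to the final location skips this whole pass, which is the idea behind "zero rename".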
transactional data lakes
- ACID, record level, SQL, Spark and Flink support
- query: PrestoDB, Trino, Flink, Hive support
- query across partitions and files
- Hudi, Iceberg: disaster recovery, concurrency with Glue, merge on read (MoR) support, time travel support in Spark SQL and Trino SQL
- Delta Lake
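As an illustration of record-level upserts on a merge-on-read table, the sketch below builds the option map usually passed to a Hudi write from Spark. The option keys are standard Hudi write configs; the helper name, table name and column names are hypothetical, and the Spark call itself appears only as a comment so the snippet runs without a cluster.

```python
ALLOWED_OPERATIONS = {"upsert", "insert", "bulk_insert", "delete"}
ALLOWED_TABLE_TYPES = {"COPY_ON_WRITE", "MERGE_ON_READ"}


def hudi_write_options(table, record_key, precombine, partition_field,
                       operation="upsert", table_type="MERGE_ON_READ"):
    """Build Hudi writer options for a record-level write (hypothetical helper)."""
    if operation not in ALLOWED_OPERATIONS:
        raise ValueError(f"unsupported operation: {operation}")
    if table_type not in ALLOWED_TABLE_TYPES:
        raise ValueError(f"unsupported table type: {table_type}")
    return {
        "hoodie.table.name": table,
        "hoodie.datasource.write.table.type": table_type,
        "hoodie.datasource.write.operation": operation,
        # Column that uniquely identifies a record.
        "hoodie.datasource.write.recordkey.field": record_key,
        # Column used to pick the latest version when keys collide.
        "hoodie.datasource.write.precombine.field": precombine,
        "hoodie.datasource.write.partitionpath.field": partition_field,
    }


# Hypothetical table and columns.
options = hudi_write_options("orders", "order_id", "updated_at", "order_date")

# With a Spark DataFrame `df`, the write would look roughly like:
# df.write.format("hudi").options(**options).mode("append").save("s3://<bucket>/orders/")
```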
EMR Serverless
Apache Airflow
Security
- isolation, private subnet
- authentication, LDAP
- encryption
- audit: using Apache Ranger, AWS Lake Formation
Workflow
Spark driver -> pending executor pods -> Cluster Autoscaler (CA), Auto Scaling group behind each node group -> API (node provisioning)
Karpenter
- replaces CA and the Auto Scaling group behind each node group
- automatically selects the right node type for the job, scales out faster, scales in when no jobs remain
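The node-selection idea can be sketched as a toy function: sum the requests of the pending pods and pick the cheapest node type that fits them. This is not Karpenter's actual algorithm, and the catalog (names, sizes, prices) is entirely hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    name: str
    vcpu: int
    mem_gib: int
    price_per_hour: float  # hypothetical price


# Hypothetical instance catalog.
CATALOG = [
    NodeType("small", 4, 16, 0.20),
    NodeType("medium", 8, 32, 0.40),
    NodeType("large", 16, 64, 0.80),
]


def pick_node(pending_pods, catalog=CATALOG):
    """Return the cheapest node type that fits all pending pods, or None.

    `pending_pods` is a list of (vcpu, mem_gib) resource requests.
    """
    need_cpu = sum(cpu for cpu, _ in pending_pods)
    need_mem = sum(mem for _, mem in pending_pods)
    candidates = [n for n in catalog
                  if n.vcpu >= need_cpu and n.mem_gib >= need_mem]
    if not candidates:
        return None  # a real provisioner would split the pods across several nodes
    return min(candidates, key=lambda n: n.price_per_hour)


if __name__ == "__main__":
    executors = [(2, 8), (2, 8), (2, 8)]  # three pending executor pods
    print(pick_node(executors))
```

With a fixed node group the same three pods would land on whatever instance type the group was created with, whether or not it fits well.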
EMR on EKS
- multiple versions on the same cluster, multi-AZ, jobs start quickly with no provisioning delay
- master, core (driver) and task (executor) instances go to one auto-scaled instance group (Spot, saves cost)
auto pod tuning
automatically resizes existing pods from small to bigger, based on real-time CPU and memory utilization, avoids manually tuning driver and executor resources
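A minimal sketch of the resizing logic, assuming a simple rule: take the peak observed usage, add headroom, round up to a step, and never shrink. The function name, the 20% headroom and the 256 MiB step are assumptions for illustration, not how the managed feature is implemented.

```python
import math


def recommend_memory_mib(current_mib, usage_samples_mib, headroom=1.2, step_mib=256):
    """Suggest a memory request from observed usage (hypothetical rule).

    Only grows the request, matching the 'small to bigger' behaviour in the notes.
    """
    if not usage_samples_mib:
        return current_mib  # no data, keep the current request
    target = max(usage_samples_mib) * headroom
    rounded = math.ceil(target / step_mib) * step_mib
    return max(current_mib, rounded)


if __name__ == "__main__":
    # Executor requested 1024 MiB but peaked at 1500 MiB.
    print(recommend_memory_mib(1024, [900, 1500, 1200]))
```

The same rule could be applied to CPU with a different step size.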
Managed Apache Flink
streaming analysis
challenge
scaling data workloads to 1000 nodes, network config
modernize data platform on eks
infra as code, performance benchmark report, data workloads: Spark, Kafka, Ray
Amazon EKS
- observability: Prometheus, Fluent Bit, OTel
- delivery: Argo CD, Flux, Crossplane
- reliability: Karpenter, Cluster Autoscaler, KEDA
- security: Cilium, Gatekeeper
Data on EKS adoption
cluster management, add-on management, team management, workload management
virtual cluster
handles k8s namespaces (namespace per project)
job run API: submits Spark applications, Spark JARs
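A sketch of what a job run request against a virtual cluster can look like, using the parameter names of the EMR on EKS StartJobRun API. The helper name is hypothetical, and the virtual cluster ID, role ARN and S3 path are placeholders. The snippet only builds the request; the boto3 call is left as a comment so it runs without AWS credentials.

```python
def build_start_job_run_request(virtual_cluster_id, execution_role_arn, entry_point,
                                entry_point_args=(), name="sample-spark-job",
                                release_label="emr-6.9.0-latest"):
    """Assemble keyword arguments for a StartJobRun call (hypothetical helper)."""
    return {
        "virtualClusterId": virtual_cluster_id,
        "name": name,
        "executionRoleArn": execution_role_arn,
        "releaseLabel": release_label,
        "jobDriver": {
            "sparkSubmitJobDriver": {
                "entryPoint": entry_point,
                "entryPointArguments": list(entry_point_args),
                "sparkSubmitParameters": "--conf spark.executor.instances=2",
            }
        },
    }


# Placeholder identifiers, not real resources.
request = build_start_job_run_request(
    virtual_cluster_id="<virtual-cluster-id>",
    execution_role_arn="arn:aws:iam::<account-id>:role/<job-role>",
    entry_point="s3://<bucket>/scripts/job.py",
    entry_point_args=["--date", "2023-01-01"],
)

# To submit for real (requires boto3 and AWS credentials):
# import boto3
# boto3.client("emr-containers").start_job_run(**request)
```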
app: hive
- jobs: job within app
- workers: drivers and executors for job
Workflow
application scheduler -> pending pods -> Karpenter -> EC2
can have different size of nodes
Benefit
cost optimization, consolidation: 1 big node is cheaper than 3 small nodes
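The consolidation benefit can be checked with simple arithmetic. In the sketch below, three partly used small nodes are replaced by one big node when their combined usage fits. Node sizes and prices are hypothetical, chosen only to make the comparison concrete.

```python
# Hypothetical node sizes: name -> (vCPU capacity, hourly price in USD).
SIZES = {"small": (4, 0.20), "big": (8, 0.40)}


def consolidation_saving(used_vcpu_per_node, current="small", target="big"):
    """Hourly saving from moving all workloads onto a single `target` node.

    Returns 0.0 when the combined usage does not fit on the target.
    """
    _, current_price = SIZES[current]
    target_capacity, target_price = SIZES[target]
    if sum(used_vcpu_per_node) > target_capacity:
        return 0.0
    cost_before = len(used_vcpu_per_node) * current_price
    return round(cost_before - target_price, 2)


if __name__ == "__main__":
    # Three small nodes, each running 2 vCPU of pods (half empty).
    print(consolidation_saving([2, 2, 2]))
```

The saving comes from removing idle capacity, so it disappears once the small nodes are already well packed.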
Link Workshop
https://catalog.workshops.aws