AWS Data Science Analysis and ML Pipeline Platform: Databricks Spark, SageMaker

Analysis & Data Science, Finance & Banking, ML Engineering

Objectives

A cohesive environment for time-efficient data science experiments on big data
Production-capable ML pipelines

Existing Challenges

MLOps challenges, e.g. IAM policies, resource configurations per model type, feature stores
Long experimentation cycle time
Data scientists lack independence in data procurement and library setup
No production-capable processing environment
Disparate data sets

Solutions

Data scientists do exploratory work and model development in SageMaker Studio
SageMaker Studio integration with SSO and MFA, together with isolated S3 paths, satisfies enterprise DevOps compliance requirements
Databricks handles big-data batch processing; S3 stores the training datasets
Production jobs are defined through the SageMaker Python API and deployed via the existing CI/CD pipeline
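
As a rough illustration of how the isolated S3 paths and per-model resource configurations above might fit together, the sketch below builds a training-job request as a plain dictionary. It is a minimal sketch under stated assumptions, not the platform's actual code: the `users/<name>` prefix layout, the preset names, the instance types, and the helper names `user_prefix` and `build_training_request` are all hypothetical. The dictionary keys follow the request shape of the boto3 SageMaker `create_training_job` call; the source does not say which SageMaker Python interface was used.

```python
# Hypothetical resource presets keyed by model type (values are examples only).
RESOURCE_PRESETS = {
    "tabular": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 1, "VolumeSizeInGB": 30},
    "deep_learning": {"InstanceType": "ml.p3.2xlarge", "InstanceCount": 1, "VolumeSizeInGB": 100},
}


def user_prefix(bucket: str, user: str) -> str:
    """Return the isolated S3 prefix for one data scientist (assumed layout)."""
    if not user or "/" in user:
        raise ValueError(f"invalid user name: {user!r}")
    return f"s3://{bucket}/users/{user}"


def build_training_request(job_name, model_type, image_uri, role_arn,
                           bucket, user, max_runtime_s=3600):
    """Build a request dict in the shape expected by create_training_job."""
    if model_type not in RESOURCE_PRESETS:
        raise ValueError(f"no resource preset for model type: {model_type!r}")
    base = user_prefix(bucket, user)
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        # Training data is read from the user's own prefix, e.g. written there
        # by an upstream batch job.
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": f"{base}/datasets/{job_name}/train",
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        # Model artifacts stay under the same isolated prefix.
        "OutputDataConfig": {"S3OutputPath": f"{base}/models/{job_name}"},
        # Copy the preset so callers cannot mutate the shared defaults.
        "ResourceConfig": dict(RESOURCE_PRESETS[model_type]),
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_s},
    }
```

A CI/CD step could then pass the result to `boto3.client("sagemaker").create_training_job(**request)`. Keeping the request a plain dictionary means it can be reviewed and unit-tested in the pipeline without AWS credentials.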

Benefits

Time-efficient data science work
Productionized models that the in-house data engineering (DE) and Ops teams can maintain