AWS Data Science Analysis and ML Pipeline Platform: Databricks Spark, SageMaker

Objectives

  • Cohesive environment for time-effective Data Science experiments on big data
  • ML pipelines capable of running in a production environment
Existing Challenges

  • MLOps challenges, e.g. IAM policies, resource configs per model type, feature stores
  • Long experimentation cycle time
  • Data Scientists lack independence in data procurement and library setup
  • No processing environment capable of production workloads
  • Disparate data sets
Solutions

  • Data Scientists do exploratory work and model development in SageMaker Studio
  • SageMaker Studio SSO and MFA integration, together with isolated S3 paths, satisfies enterprise DevOps compliance
  • Databricks for big-data batch processing and S3 for training dataset storage
  • Productionized jobs defined via the SageMaker Python API and deployable through existing CI/CD
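
As a rough illustration of the last two points, the sketch below builds a training job request that reads its training data from, and writes its model artifacts to, a per-user S3 prefix. It is a minimal sketch under stated assumptions, not the platform's actual job definition: the `TrainingJobSpec` class, the `users/<user>/...` path layout, the default instance settings, and the choice of boto3's `create_training_job` (rather than the higher-level `sagemaker` SDK) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingJobSpec:
    """Hypothetical inputs for one training job; all field names are illustrative."""
    job_name: str
    image_uri: str       # training container image URI
    role_arn: str        # IAM execution role that SageMaker assumes
    bucket: str          # S3 bucket holding Databricks output (assumed layout)
    user: str            # owner of the isolated S3 prefix
    instance_type: str = "ml.m5.xlarge"   # assumed default, tune per model type
    instance_count: int = 1
    volume_gb: int = 50
    max_runtime_s: int = 3600


def user_prefix(bucket: str, user: str) -> str:
    """Return the isolated S3 prefix for one user (assumed path convention)."""
    return f"s3://{bucket}/users/{user}"


def build_training_request(spec: TrainingJobSpec) -> dict:
    """Build keyword arguments for the boto3 SageMaker create_training_job call."""
    base = user_prefix(spec.bucket, spec.user)
    return {
        "TrainingJobName": spec.job_name,
        "AlgorithmSpecification": {
            "TrainingImage": spec.image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": spec.role_arn,
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": f"{base}/training-data/",
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": f"{base}/model-artifacts/"},
        "ResourceConfig": {
            "InstanceType": spec.instance_type,
            "InstanceCount": spec.instance_count,
            "VolumeSizeInGB": spec.volume_gb,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": spec.max_runtime_s},
    }


def submit(spec: TrainingJobSpec) -> dict:
    """Submit the job; needs boto3 and AWS credentials, so it is not run here."""
    import boto3  # imported lazily so the request builder has no AWS dependency

    client = boto3.client("sagemaker")
    return client.create_training_job(**build_training_request(spec))
```

Keeping the request builder free of AWS calls means a CI/CD stage can unit-test the job definition without credentials, and only the deploy stage needs to call `submit`.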
Benefits

  • Time-effective data science work
  • Productionized models maintainable by staff DE and Ops teams