Category: ML Engineering

AWS Data Science Analysis and ML Pipeline Platform: Databricks Spark, SageMaker
Objectives
- Cohesive environment for time-effective Data Science experiments on big data
- Production-grade ML pipelines
Existing Challenges
- MLOps overhead, e.g. IAM policies, resource configurations per model type, feature stores
- Long experimentation cycle time
- Data Scientists cannot procure data or set up libraries independently
- No production-grade processing capability
- Disparate data sets
Solutions
- Data Scientists do exploratory work and model development in SageMaker Studio
- SageMaker Studio SSO and MFA integration, plus isolated S3 paths, satisfy enterprise DevOps compliance requirements
- Databricks for big-data batch processing and S3 for training dataset storage
- Production jobs defined through the SageMaker Python API and deployed via the existing CI/CD pipeline
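A minimal sketch of two pieces of the solution, under stated assumptions: an IAM policy that confines one Data Scientist to an isolated S3 prefix, and a training job definition kept as plain data so a CI/CD step can review and submit it. The function names, the `users/<name>/` path layout, the instance type, and the limits are hypothetical and not taken from the project. The job definition follows the request shape of boto3's `create_training_job` rather than the higher-level SageMaker Python SDK, which keeps the sketch dependency-free.

```python
import json


def user_s3_policy(bucket: str, user: str) -> dict:
    """Build an IAM policy limiting one user to their own S3 prefix.

    The users/<name>/ layout is an assumption made for illustration.
    """
    prefix = f"users/{user}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket, restricted by key prefix.
                "Sid": "ListOwnPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Object access is granted only under the user's prefix.
                "Sid": "ReadWriteOwnObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


def training_job_request(job_name, image_uri, role_arn, train_s3_uri,
                         output_s3_uri, instance_type="ml.m5.xlarge"):
    """Build a request dict in the shape of boto3's create_training_job.

    Instance type, volume size, and runtime limit are placeholder values.
    """
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


if __name__ == "__main__":
    # Print the policy so it can be reviewed or committed alongside the job.
    print(json.dumps(user_s3_policy("example-ds-bucket", "alice"), indent=2))
```

With AWS credentials configured, a CI/CD step could submit the second dict with `boto3.client("sagemaker").create_training_job(**request)`. Keeping both definitions as data lets DE and Ops teams diff and review them like any other configuration.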
Benefits
- Time-effective data science work
- Production models maintainable by in-house Data Engineering (DE) and Ops teams
July 7, 2025