Category: ML Engineering

  • AWS Data Science Analysis and ML Pipeline Platform: Databricks Spark, SageMaker

    Objectives

    • Cohesive environment for time-effective Data Science experiments on big data
    • Production-capable ML pipelines
    Existing Challenges

    • MLOps challenges, e.g. IAM policies, resource configuration per model type, feature stores
    • Long experimentation cycle time
    • Data Scientists lack independence in data procurement and library setup
    • Lack of production-capable processing
    • Disparate data sets
    Solutions

    • Data Scientists do exploratory work and model development in SageMaker Studio
    • SageMaker Studio SSO and MFA integration, together with isolated S3 paths, satisfies enterprise DevOps compliance
    • Databricks for big-data batch processing and S3 for training dataset storage
    • Production jobs defined via the SageMaker Python API and deployable through the existing CI/CD process
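
    The last two points can be sketched as a plain request builder that a CI/CD step could call. This is a minimal illustration, not the project's actual code: `JobSpec`, `user_prefix`, `build_training_request`, the `users/<name>` prefix layout, and the default instance settings are all assumptions. The dictionary keys follow the SageMaker `CreateTrainingJob` request shape, and no AWS call is made here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical inputs for one production training job."""
    job_name: str
    user: str            # owner; used to derive the isolated S3 prefix
    bucket: str          # assumed shared bucket holding per-user prefixes
    image_uri: str       # training container image
    role_arn: str        # IAM execution role for the job
    instance_type: str = "ml.m5.xlarge"   # assumed default
    instance_count: int = 1
    max_runtime_s: int = 3600


def user_prefix(bucket: str, user: str) -> str:
    """Return the isolated S3 path for a user (assumed layout)."""
    return f"s3://{bucket}/users/{user}"


def build_training_request(spec: JobSpec) -> dict:
    """Build a CreateTrainingJob-style request confined to the user's prefix."""
    base = user_prefix(spec.bucket, spec.user)
    return {
        "TrainingJobName": spec.job_name,
        "AlgorithmSpecification": {
            "TrainingImage": spec.image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": spec.role_arn,
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        # Training data written by the batch processing step
                        "S3Uri": f"{base}/train/",
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": f"{base}/models/"},
        "ResourceConfig": {
            "InstanceType": spec.instance_type,
            "InstanceCount": spec.instance_count,
            "VolumeSizeInGB": 50,  # assumed volume size
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": spec.max_runtime_s},
    }
```

    In a deployment script, the resulting dictionary could be passed to `boto3.client("sagemaker").create_training_job(**request)`. Keeping the builder free of AWS calls lets the existing CI/CD process unit-test it without credentials.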
    Benefits

    • Time-effective data science work
    • Production models maintainable by in-house Data Engineering (DE) and Ops teams