Goal was to replace the Databricks vendor platform and Redshift-centric ecosystem with a native AWS ecosystem: EMR and Jupyter Notebooks, Hive-centric, with a Redshift datamart.
Improve data analysis access and ETL speed, reliability, and cost-effectiveness
Medallion (bronze, silver, gold) data triage for engagement, ad-sales and content data domains
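The medallion layering can be illustrated with a minimal sketch in plain Python. It is not the project's actual ETL (which ran on EMR with dbt and Spark); the sample rows, field names, and the functions `to_silver` and `to_gold` are hypothetical and only show the idea: bronze holds raw data as landed, silver holds cleaned and deduplicated data, gold holds reporting-ready aggregates.

```python
from collections import defaultdict

# Hypothetical raw engagement events, as they might land in the bronze layer.
BRONZE = [
    {"user_id": "u1", "content_id": "c1", "seconds_watched": "120"},
    {"user_id": "u1", "content_id": "c1", "seconds_watched": "120"},  # duplicate
    {"user_id": "u2", "content_id": "c1", "seconds_watched": "bad"},  # malformed
    {"user_id": "u3", "content_id": "c2", "seconds_watched": "45"},
]


def to_silver(bronze_rows):
    """Silver: cast types, drop malformed rows, remove exact duplicates."""
    seen = set()
    silver = []
    for row in bronze_rows:
        try:
            seconds = int(row["seconds_watched"])
        except ValueError:
            continue  # skip rows whose value cannot be parsed as an integer
        key = (row["user_id"], row["content_id"], seconds)
        if key in seen:
            continue  # skip rows already emitted
        seen.add(key)
        silver.append(
            {
                "user_id": row["user_id"],
                "content_id": row["content_id"],
                "seconds_watched": seconds,
            }
        )
    return silver


def to_gold(silver_rows):
    """Gold: aggregate total seconds watched per content item."""
    totals = defaultdict(int)
    for row in silver_rows:
        totals[row["content_id"]] += row["seconds_watched"]
    return dict(totals)


silver = to_silver(BRONZE)
gold = to_gold(silver)
print(gold)  # {'c1': 120, 'c2': 45}
```

Keeping each layer as a separate step means a bad row can be traced back to bronze without re-ingesting the source.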
Existing Challenges
Disparate Data Access: Cannot easily access and query data across domains
Database Issues: Slow queries. Queries lock up or time out. Competing database load. Usage limits
Solutions
AWS EMR with dbt Spark ETL. EMR Jupyter Notebooks for analysis. GitHub Actions with Terraform CI/CD.
S3 with Glue databases for lake storage. Redshift datamart for reporting storage
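One way such a lake could be organized is one Glue database per domain and medallion layer, each backed by its own S3 prefix. The sketch below is an assumption for illustration only: the bucket name `example-data-lake`, the `<domain>_<layer>` database naming, and the helper functions are hypothetical and are not taken from the project.

```python
# Hypothetical naming convention: one Glue database per (domain, layer) pair.
DOMAINS = ("engagement", "ad_sales", "content")
LAYERS = ("bronze", "silver", "gold")


def glue_database(domain, layer):
    """Return the assumed Glue database name for a domain and layer."""
    return f"{domain}_{layer}"


def s3_prefix(bucket, domain, layer):
    """Return the assumed S3 location backing that database."""
    return f"s3://{bucket}/{layer}/{domain}/"


def storage_layout(bucket):
    """Map every assumed Glue database name to its S3 prefix."""
    return {
        glue_database(domain, layer): s3_prefix(bucket, domain, layer)
        for domain in DOMAINS
        for layer in LAYERS
    }


layout = storage_layout("example-data-lake")
for database, location in sorted(layout.items()):
    print(database, "->", location)
```

With a predictable layout like this, analysts in notebooks can query any domain through the same catalog, which addresses the cross-domain access problem listed under Existing Challenges.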