Category: Data Platform

  • Big Data and Financials Analysis Platform

    Objectives

    • AWS-hosted platform for storing, processing, and presenting customer and product data
    • Databricks (Spark vendor) implementation and workflow
    Existing Challenges

    • Database storage costs
    • Long pipeline runtimes
    • Data accessibility and separation across multiple business units
    • Regulatory compliance (PII, SOX, etc.)
    • Public enterprise IT and infosec compliance
    Solutions

    • Databricks for Spark big data processing, Jupyter Notebooks for analysis, and Hive tables on S3 for SQL
    • S3 for data lake source-of-truth storage and for utility staging and processing storage
    • Redshift as the data warehouse serving dashboards and reports for different business units
    • ECS containers for bespoke native codebase applications
    • Terraform for IaC
    • Jenkins for code releases and environment separation
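
    One way to picture the S3 layout behind the lake and staging roles above is a key-prefix convention. This is only a minimal sketch: the zone names, the business-unit path segment, and the `dt=` date partition are assumptions for illustration, not the platform's actual layout.

```python
from datetime import date

# Hypothetical zones mirroring the two S3 roles named in the Solutions list.
ZONES = {"lake", "staging"}


def partition_prefix(zone: str, business_unit: str, table: str, day: date) -> str:
    """Build an S3 key prefix: zone / business unit / table / Hive-style date partition."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    # Only dt=... is a Hive-style partition segment; the business-unit segment
    # is a plain prefix that an access policy could be scoped to.
    return f"{zone}/{business_unit}/{table}/dt={day.isoformat()}/"


def prefixes_for_unit(prefixes: list[str], business_unit: str) -> list[str]:
    """Keep only the prefixes whose second path segment is the given business unit."""
    return [p for p in prefixes if p.split("/")[1] == business_unit]
```

    Keeping the business unit as its own path segment is one simple way to separate data per unit while all units share the same bucket and table naming.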
    Benefits

    • Accessible to analysts and data engineers
    • Scalable
    • Enterprise compliant
    • Cost manageable
    • Flexible

  • AWS Matillion, SQS, Lambda, Redshift, CDK

    Objectives

    • Low-code, rapid-prototype ETL ecosystem for a pilot data analyst effort
    Existing Challenges

    • Existing dev resources are tied up
    • Access and ops bureaucracy takes a while
    Solutions

    • Services and assets defined in CDK IaC, with IAM setup and workflow
    • Data mart solution for self-service BI work on isolated datasets
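
    The SQS and Lambda pieces of this stack can be sketched as a small handler. It assumes that Lambda consumes messages from an SQS queue and that each message body is JSON; the field names `order_id` and `amount` and the `load_rows` placeholder are hypothetical. The partial-batch response shape only takes effect if batch item failure reporting is enabled on the event source mapping.

```python
import json


def parse_record(record: dict) -> dict:
    """Turn one SQS record into a flat row; raises on malformed payloads."""
    body = json.loads(record["body"])
    # "order_id" and "amount" are hypothetical fields used for illustration.
    return {"order_id": str(body["order_id"]), "amount": float(body["amount"])}


def load_rows(rows: list) -> None:
    """Placeholder for the warehouse load step, which is not implemented here."""


def handler(event: dict, context: object) -> dict:
    """Parse every record in the batch and report the ones that failed."""
    rows, failures = [], []
    for record in event.get("Records", []):
        try:
            rows.append(parse_record(record))
        except (KeyError, ValueError, TypeError):
            # Report only the bad message so the rest of the batch is not retried.
            failures.append({"itemIdentifier": record["messageId"]})
    if rows:
        load_rows(rows)
    return {"batchItemFailures": failures}
```

    Reporting failures per message keeps one malformed payload from blocking the rest of the batch.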
    Benefits

    • Enclosed ETL ecosystem that analysts can track and use
    • Big data capable

  • AWS Media Billing Platform

    Objectives

    • Codebase for programmatic ad sales delivery and billing reporting
    • Fix and refactor the legacy codebase for speed, bug fixes, and feature enhancements
    Existing Challenges

    • Debugging BI logic on big data required live debugging of Python pandas DataFrames, so analysts couldn't view and debug reporting issues without a developer
    • Pipeline execution time was too long
    Solutions

    • GitHub Actions/Terraform CI/CD, EKS containers, and Python pandas ETL to and from incremental S3 Glue DB tables, with Airflow orchestration
    • Reworked the business dataset analysis workflow from Python pandas DataFrames to Glue/Athena tables
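
    The shift from DataFrame debugging to SQL on tables can be illustrated with a small stand-in. This sketch uses an in-memory SQLite database only so it runs anywhere; it is not Athena, and the `delivery` table, its columns, and the sample rows are invented for illustration.

```python
import sqlite3

# In-memory SQLite stands in for a queryable table; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE delivery (campaign TEXT, day TEXT, impressions INTEGER, billed INTEGER)"
)
conn.executemany(
    "INSERT INTO delivery VALUES (?, ?, ?, ?)",
    [
        ("spring_promo", "2024-03-01", 1000, 1000),
        ("spring_promo", "2024-03-02", 1200, 900),
        ("brand_q1", "2024-03-01", 500, 500),
    ],
)

# The kind of check an analyst could run directly, without stepping through code.
MISMATCH_SQL = """
SELECT campaign, day, impressions - billed AS unbilled
FROM delivery
WHERE impressions <> billed
ORDER BY campaign, day
"""


def find_mismatches(connection: sqlite3.Connection) -> list:
    """Return rows where delivered impressions differ from billed impressions."""
    return connection.execute(MISMATCH_SQL).fetchall()
```

    Once intermediate data sits in tables, a reporting discrepancy becomes a query an analyst can write and share rather than a debugger session.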
    Benefits

    • Analysts can now work directly with the data via SQL in Glue DB tables. Once a fix is found, a developer integrates the logic through the normal sprint workflow
    • Pipelines are reliable and fast

  • AWS Big Data Media Platform

    Objectives

    • Replace the Databricks vendor platform and Redshift-centric ecosystem with a native AWS ecosystem: EMR and Jupyter Notebooks centered on Hive, with Redshift as the data mart
    • Improve data analysis access and ETL speed, reliability, and cost effectiveness
    • Medallion (bronze, silver, gold) data triage for the engagement, ad sales, and content data domains
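
    The medallion triage named above can be sketched in plain Python: bronze holds raw records, silver holds cleaned and deduplicated records, and gold holds aggregates ready for reporting. The record fields (`event_id`, `title`, `minutes`) and the cleaning rules are assumptions for illustration; the platform itself runs this kind of step as Spark ETL rather than in-process Python.

```python
from collections import defaultdict


def to_silver(bronze: list) -> list:
    """Clean raw records: drop incomplete rows, dedupe by event_id, normalize types."""
    seen, silver = set(), []
    for rec in bronze:
        event_id = rec.get("event_id")
        if not event_id or event_id in seen or rec.get("minutes") is None:
            continue  # skip rows with no id, repeated ids, or no measurement
        seen.add(event_id)
        silver.append(
            {
                "event_id": event_id,
                "title": rec["title"].strip().lower(),
                "minutes": float(rec["minutes"]),
            }
        )
    return silver


def to_gold(silver: list) -> dict:
    """Aggregate cleaned records into total minutes per title."""
    totals = defaultdict(float)
    for rec in silver:
        totals[rec["title"]] += rec["minutes"]
    return dict(totals)
```

    Keeping each layer as a separate step means a bad gold number can be traced back through silver to the raw bronze records.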
    Existing Challenges

    • Disparate data access: can't easily gain access to and query data across domains
    • Database issues: slow queries, queries that lock up or time out, DB load competition, and usage limits
    Solutions

    • AWS EMR with dbt Spark ETL. EMR Jupyter Notebooks for analysis. GitHub Actions with Terraform CI/CD.
    • S3 Glue DBs for lake storage. Redshift data mart for reporting storage
    Benefits

    • Long-term flexibility, reliability, and robustness
    • Cost manageable and flexible