Sourcebae
Website: sourcebae.com
Job details:
DATA ENGINEER - AGENT DOMAIN EXPERT
Experience: 5+ Years
Job Location: Pune (Remote)
Position Overview
Serve as the data engineering domain expert for AI agent development. Define requirements for data pipeline patterns, validate the quality of agent-generated code, and ensure agents produce enterprise-grade solutions that follow best practices for the Bronze-Silver-Gold (Medallion) architecture.
Required Skills
Must Have:
* 5+ years of data engineering experience with PySpark and Apache Spark
* Deep expertise in SQL and database design (normalization, Star Schema)
* Production experience with Delta Lake and Databricks
* Strong understanding of Medallion architecture (Bronze-Silver-Gold)
* Experience with data pipeline orchestration using Delta Live Tables (DLT)
Nice to Have:
* Experience with Unity Catalog and data governance
* Knowledge of data quality frameworks (Great Expectations)
* Familiarity with cloud data platforms (AWS Glue, Azure Data Factory, GCP Dataflow)
* Understanding of data mesh and domain-driven data architecture
* Experience with prompt engineering or working with LLMs
Tools & Technologies
* Platforms: Databricks, Apache Spark, Delta Lake
* Languages: PySpark, SQL, Python
* Data Quality: Great Expectations, Soda, dbt tests
* Orchestration: Delta Live Tables, Apache Airflow
* Cloud: AWS (S3, Glue, EMR), Azure (ADF, Synapse), GCP (BigQuery, Dataflow)
* Governance: Unity Catalog, Apache Atlas, Collibra
* Version Control: Git
Key Responsibilities
Pattern Definition & Curation
* Define reusable data engineering patterns (SCD Type 2, deduplication, incremental loads)
* Create reference implementations for Bronze/Silver/Gold layers
* Document best practices for PySpark optimization (broadcast joins, partitioning, Z-ordering)
* Build pattern library with success criteria and performance benchmarks
Data Model Design
* Design Third Normal Form (3NF) normalization logic for the Silver layer
* Create Star Schema and Snowflake Schema patterns for the Gold layer
* Define Slowly Changing Dimension (SCD) Type 0/1/2/3 implementations
* Build fact table and dimension table classification algorithms
Code Quality & Validation
* Review AI-generated PySpark code for correctness and performance
* Define code quality gates (linting scores, test coverage thresholds)
* Create validation test suites for Bronze/Silver/Gold transformations
* Establish performance benchmarks (query time, storage efficiency)
Databricks & Delta Lake Optimization
* Design Delta Live Tables (DLT) pipeline configurations
* Implement Unity Catalog metadata and governance standards
* Optimize Delta Lake performance (ZORDER, OPTIMIZE, VACUUM)
* Configure auto-scaling clusters and job orchestration
Requirements Translation
* Work with business analysts to capture data pipeline requirements
* Translate business rules into technical specifications for agents
* Define entity relationships, metrics formulas, and transformation logic
* Create requirement templates that agents can consume