DATA ENGINEER

Sourcebae

Location: India
Job type: Part-time

Required skills

Python
Airflow
AWS
Apache
Apache Airflow
Apache Spark
Azure
BigQuery
data architecture
data engineer
data pipeline
database
Databricks
Dataflow
GCP
Git
Snowflake
SQL
Unity Catalog

About the role

Website: sourcebae.com
Job details:

DATA ENGINEER - AGENT DOMAIN EXPERT

Experience: 5+ years

Job Location: Pune (Remote)

Position Overview

Serve as the data engineering domain expert for AI agent development. Define requirements for data pipeline patterns, validate generated code quality, and ensure agents produce enterprise-grade solutions following best practices for Bronze-Silver-Gold architecture.

Required Skills

Must Have:

* 5+ years data engineering experience with PySpark and Apache Spark

* Deep expertise in SQL and database design (normalization, Star Schema)

* Production experience with Delta Lake and Databricks

* Strong understanding of Medallion architecture (Bronze-Silver-Gold)

* Experience with data pipeline orchestration using Delta Live Tables (DLT)

Nice to Have:

* Experience with Unity Catalog and data governance

* Knowledge of data quality frameworks (Great Expectations)

* Familiarity with cloud data platforms (AWS Glue, Azure Data Factory, GCP Dataflow)

* Understanding of data mesh and domain-driven data architecture

* Experience with prompt engineering or working with LLMs

Tools & Technologies

* Platforms: Databricks, Apache Spark, Delta Lake

* Languages: PySpark, SQL, Python

* Data Quality: Great Expectations, Soda, dbt tests

* Orchestration: Delta Live Tables, Apache Airflow

* Cloud: AWS (S3, Glue, EMR), Azure (ADF, Synapse), GCP (BigQuery, Dataflow)

* Governance: Unity Catalog, Apache Atlas, Collibra

* Version Control: Git

Key Responsibilities

Pattern Definition & Curation

* Define reusable data engineering patterns (SCD Type 2, deduplication, incremental loads)

* Create reference implementations for Bronze/Silver/Gold layers

* Document best practices for PySpark optimization (broadcast joins, partitioning, Z-ordering)

* Build pattern library with success criteria and performance benchmarks
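
For candidates unfamiliar with the patterns named above, here is a minimal sketch of deduplication followed by an SCD Type 2 merge. It uses plain Python rather than PySpark so that it runs anywhere. The `CustomerRow` class, its column names, and the `apply_scd2` helper are hypothetical illustrations, not part of any Sourcebae codebase or reference implementation.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class CustomerRow:
    # Hypothetical dimension row; the column names are illustrative only.
    customer_id: int
    city: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks an open-ended version
    is_current: bool = True


def deduplicate(updates):
    """Keep only the last update seen for each customer_id."""
    latest = {}
    for customer_id, city in updates:
        latest[customer_id] = city
    return latest


def apply_scd2(dimension, updates, load_date):
    """Return a new dimension list with SCD Type 2 history preserved."""
    latest = deduplicate(updates)
    current_ids = {row.customer_id for row in dimension if row.is_current}
    result = []
    for row in dimension:
        new_city = latest.get(row.customer_id)
        if row.is_current and new_city is not None and new_city != row.city:
            # Close the old version, then open a new current version.
            result.append(replace(row, valid_to=load_date, is_current=False))
            result.append(CustomerRow(row.customer_id, new_city, load_date))
        else:
            result.append(row)
    # Keys with no current row become brand-new current rows.
    for customer_id, city in latest.items():
        if customer_id not in current_ids:
            result.append(CustomerRow(customer_id, city, load_date))
    return result


dim = [CustomerRow(1, "Pune", date(2024, 1, 1))]
updates = [(1, "Mumbai"), (1, "Delhi"), (2, "Chennai")]
new_dim = apply_scd2(dim, updates, date(2024, 6, 1))
```

In this sketch the duplicate updates for customer 1 collapse to the last one, the original row is closed rather than overwritten, and customer 2 is added as a new current row. A production version on Databricks would more likely express the same logic as a Delta Lake merge.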

Data Model Design

* Design 3rd Normal Form (3NF) normalization logic for Silver layer

* Create Star Schema and Snowflake Schema patterns for Gold layer

* Define Slowly Changing Dimension (SCD) Type 0/1/2/3 implementations

* Build fact table and dimension table classification algorithms

Code Quality & Validation

* Review AI-generated PySpark code for correctness and performance

* Define code quality gates (linting scores, test coverage thresholds)

* Create validation test suites for Bronze/Silver/Gold transformations

* Establish performance benchmarks (query time, storage efficiency)
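
As a rough illustration of what a code quality gate could look like, the sketch below checks a lint score and a test coverage figure against thresholds. The `GateThresholds` class, the `evaluate_gates` function, and the threshold values are all assumptions made for this example; the actual gates would be defined as part of the role.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    # Hypothetical thresholds; real values would be agreed by the team.
    min_lint_score: float = 8.0       # assumes a 0-10 linter score
    min_test_coverage: float = 0.80   # fraction of lines covered by tests


def evaluate_gates(lint_score, test_coverage, thresholds=GateThresholds()):
    """Return a list of failure messages; an empty list means all gates pass."""
    failures = []
    if lint_score < thresholds.min_lint_score:
        failures.append(
            f"lint score {lint_score} is below {thresholds.min_lint_score}"
        )
    if test_coverage < thresholds.min_test_coverage:
        failures.append(
            f"coverage {test_coverage:.0%} is below "
            f"{thresholds.min_test_coverage:.0%}"
        )
    return failures


passing = evaluate_gates(lint_score=9.1, test_coverage=0.85)
failing = evaluate_gates(lint_score=7.0, test_coverage=0.50)
```

Returning a list of messages instead of a single boolean lets a reviewer, or an agent, see every gate that failed in one pass.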

Databricks & Delta Lake Optimization

* Design Delta Live Tables (DLT) pipeline configurations

* Implement Unity Catalog metadata and governance standards

* Optimize Delta Lake performance (ZORDER, OPTIMIZE, VACUUM)

* Configure auto-scaling clusters and job orchestration

Requirements Translation

* Work with business analysts to capture data pipeline requirements

* Translate business rules into technical specifications for agents

* Define entity relationships, metrics formulas, and transformation logic

* Create requirement templates that agents can consume
