Talentgigs
Website: talentgigs.in
Job details:
We are looking for a Senior Data Engineer with deep expertise in building end-to-end data platforms using open-source technologies. The ideal candidate must have hands-on experience designing and implementing scalable, distributed, and event-driven data architectures. This role is strictly focused on open-source data engineering ecosystems and is not suited for profiles primarily experienced in Databricks, Snowflake, or Azure Data Factory.
Key Responsibilities
Core Responsibilities
1. End-to-End Data Platform Engineering
• Architect and implement scalable, distributed data platforms.
• Design batch and real-time ingestion frameworks.
• Build reusable and modular data engineering frameworks.
• Ensure data reliability, scalability, and performance optimization.
2. Data Processing & Orchestration
• Develop pipelines using:
o Apache Spark (Core, SQL, PySpark)
o Apache Flink
o Apache Airflow
• Implement event-driven architectures using Kafka/Flink ecosystems.
• Build robust CDC pipelines.
• Implement PySpark Structured Streaming with exactly-once semantics.
3. Lakehouse & Storage Architecture
• Design and implement Data Lakehouse solutions using:
o Apache Iceberg
o Delta Lake (including Change Data Feed)
• Implement:
o Schema evolution strategies
o Partitioning & compaction
o Metadata management
• Optimize storage formats (Parquet, ORC, Avro).
4. Distributed Query & Analytics Layer
• Design and optimize distributed SQL query engines using:
o Trino
o PrestoDB
• Build high-performance analytical data stores using:
o StarRocks
• Enable interactive analytics and federated query capabilities.
• Tune performance for low-latency analytical workloads.
5. Infrastructure & Deployment
• Deploy Spark/Flink clusters on Kubernetes.
• Containerize workloads using Docker.
• Implement CI/CD for data pipelines.
• Set up observability, logging, and monitoring frameworks.
• Work in Linux-based environments.
Required Qualifications
• 7+ years of hands-on data engineering experience.
• Strong expertise in:
• Apache Spark (Core + SQL + PySpark)
• PySpark Structured Streaming
• Apache Flink
• Apache Airflow
• Apache Iceberg
• Delta Lake (Change Data Feed)
• Trino or PrestoDB
• StarRocks
• Strong understanding of:
• Event-driven architectures
• Data Lakehouse principles
• Distributed systems design
• Streaming state management
• Exactly-once processing
• Strong SQL and data modeling skills.
• Proficiency in Python.
Click on Apply to know more.