Infosys
Website: infosys.com
Job details:
Technology->Analytics - Packages->Python - Big Data, Technology->Big Data - Data Processing->PySpark, ETL
Data Pipeline Development
- Develop and maintain scalable batch ETL pipelines using Python and PySpark for data ingestion, transformation, and loading.
- Implement reusable transformation logic, ensuring pipelines are modular, testable, and easy to maintain.
- Optimize Spark jobs for performance (partitioning, caching, joins, shuffles) and cost efficiency.
Data Quality & Reliability
- Apply data validation checks, handle schema evolution, and ensure accuracy and completeness of processed datasets.
- Troubleshoot pipeline failures, analyze logs, and implement robust error handling and retry mechanisms.
- Monitor job runs and support operational stability through alerts, runbooks, and timely incident resolution.
Collaboration & Delivery
- Work with cross-functional teams to gather requirements, define data mappings, and deliver datasets aligned to business needs.
- Participate in code reviews, follow engineering best practices, and contribute to continuous improvement of standards and tooling.
- Document pipeline logic, dependencies, and operational procedures for smooth handovers and long-term maintainability.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, Information Systems, or a related field (or equivalent practical experience).
- 2–5 years of hands-on experience building data pipelines using Python and PySpark.
- Strong understanding of ETL concepts, data transformations, and handling large-scale datasets.
- Proficiency in writing clean, maintainable code and debugging production issues.
- Working knowledge of data structures, algorithms, and software development best practices.