What You’ll Be Doing
- Build highly scalable, available, fault-tolerant distributed data processing systems
(batch and streaming) that process hundreds of terabytes of data ingested every day,
backed by a petabyte-scale data warehouse and Elasticsearch cluster.
- Build quality data solutions and refine diverse existing datasets into simplified
models that encourage self-service
- Build data pipelines that optimize for data quality and are resilient to poor-quality
data sources (a brief PySpark sketch follows this list)
- Own the data mapping, business logic, transformations and data quality
- Perform low-level systems debugging, performance measurement, and optimization on
large production clusters
- Participate in architecture discussions, influence the product roadmap, and take
ownership of and responsibility for new projects
- Maintain and support existing platforms and evolve them to newer technology stacks
and architectures
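
As a rough illustration of the pipeline-resilience work described above, here is a minimal PySpark sketch of one common defensive pattern: reading semi-structured input in permissive mode, quarantining malformed records, and applying basic quality rules before writing. The bucket paths, schema fields, and rules are hypothetical placeholders, not a prescribed implementation.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("resilient_ingest").getOrCreate()

# Hypothetical event schema; a real pipeline would likely load this from a registry.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("amount", LongType()),
    StructField("ts", StringType()),
])

# PERMISSIVE mode keeps malformed rows instead of failing the whole job;
# they land in the _corrupt_record column for later inspection.
raw = (
    spark.read
    .schema(schema.add(StructField("_corrupt_record", StringType())))
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .json("s3://example-bucket/events/dt=2024-01-01/")  # hypothetical path
)
raw.cache()  # avoids Spark's restriction on queries touching only the corrupt-record column

# Route bad records to a quarantine location instead of dropping them silently.
bad = raw.where(F.col("_corrupt_record").isNotNull())
bad.write.mode("append").parquet("s3://example-bucket/quarantine/events/")

# Apply basic quality rules before the records reach downstream models.
clean = (
    raw.where(F.col("_corrupt_record").isNull())
    .drop("_corrupt_record")
    .where(F.col("event_id").isNotNull() & (F.col("amount") >= 0))
    .dropDuplicates(["event_id"])
    .withColumn("dt", F.to_date("ts"))
)
clean.write.mode("append").partitionBy("dt").parquet("s3://example-bucket/clean/events/")
```

Quarantining rather than dropping bad rows keeps the pipeline running against poor-quality sources while preserving the evidence needed to fix them upstream.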
We’re excited if you have
- Proficiency in Python and PySpark
- Deep understanding of Apache Spark, including Spark tuning, creating RDDs, and
building DataFrames; ability to create Java/Scala Spark jobs for data transformation
and aggregation (see the PySpark sketch after this list)
- Experience with big data technologies such as HDFS, YARN, MapReduce, Hive, Kafka,
Spark, Airflow, and Presto
- Experience building distributed environments using Kafka, Spark, Hive, Hadoop, or
similar technologies
- Good understanding of the architecture and operation of distributed database
systems
- Experience working with file formats such as Parquet and Avro for large volumes
of data
- Experience with one or more NoSQL databases
- Experience with AWS or GCP
- 5+ years of professional experience as a data or software engineer
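
To make the Spark expectations above concrete, here is a minimal PySpark sketch of a transformation-and-aggregation job, including building a DataFrame from an RDD. All dataset paths and column names are illustrative assumptions, not an actual pipeline from this team.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily_aggregation").getOrCreate()

# Hypothetical input: the cleaned events written by the ingestion sketch above.
events = spark.read.parquet("s3://example-bucket/clean/events/")

# DataFrames can also be built from an RDD of tuples when the source is unstructured
# (illustrative only; not used further in this sketch).
lookup = spark.sparkContext.parallelize([("US", 100), ("DE", 40)]).toDF(["country", "threshold"])

# Typical transformation + aggregation: daily totals and event counts per user.
daily = (
    events
    .groupBy("dt", "user_id")
    .agg(
        F.sum("amount").alias("total_amount"),
        F.countDistinct("event_id").alias("events"),
    )
)

# Repartition by the partition column before writing to keep output file counts
# reasonable -- one of the routine tuning knobs on large clusters.
daily.repartition("dt").write.mode("overwrite").partitionBy("dt").parquet(
    "s3://example-bucket/marts/daily_user_spend/"
)
```

An equivalent job could be written in Java or Scala; the overall structure (read, transform, aggregate, partitioned write) stays the same.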