We are looking for a Senior Site Reliability Engineer who understands the nuances of production systems. If you care about building and running reliable software systems in production, you'll like working at One2N.
You will primarily work with our startup and mid-size clients. We work on One-to-N kinds of problems (hence the name One2N): those where the proof of concept is done and the work revolves around scalability, maintainability, and reliability. In this role, you will be responsible for architecting and optimizing our observability and infrastructure to provide actionable insights into performance and reliability.
Responsibilities
- Conceptualise, design, and build platform engineering solutions with a self-serve model to enable product engineering teams.
- Provide technical guidance and mentorship to junior engineers.
- Engage in code reviews and help establish best practices for development and operations.
- Design and implement comprehensive monitoring, logging, and alerting solutions to collect, analyze, and visualize data (metrics, logs, traces) from diverse sources.
- Develop custom monitoring metrics, dashboards, and reports to track key performance indicators (KPIs), detect anomalies, and troubleshoot issues proactively.
- Work on Developer Experience (DX) to help engineers improve their productivity.
- Design and implement CI/CD solutions that optimize for velocity and shorten delivery time.
- Help SRE teams set up on-call rosters and coach them for effective on-call management.
- Automate repetitive manual tasks in CI/CD pipelines and operations, using infrastructure as code (IaC) practices.
- Stay up to date with emerging technologies and industry trends in the cloud-native, observability, and platform engineering space.
Requirements
- 6-9 years of professional experience in DevOps practices or software engineering roles, with a focus on Kubernetes on AWS.
- Expertise in observability and telemetry tools and practices, including hands-on experience with some of the following: Datadog, Honeycomb, ELK, Grafana, and Prometheus.
- Working knowledge of programming using Golang, Python, Java, or equivalent.
- Skilled in diagnosing and resolving Linux operating system issues.
- Strong proficiency in scripting and automation to build monitoring and analytics solutions.
- Solid understanding of microservices architecture, containerization (Docker, Kubernetes), and cloud-native technologies.
- Experience with infrastructure as code (IaC) tools such as Terraform.
- Excellent analytical and problem-solving skills, keen attention to detail, and a passion for continuous improvement.
- Strong written communication and collaboration skills, with the ability to work effectively in a fast-paced, agile environment.