Site Reliability Engineer

Min Experience

0 years

Location

San Diego, California, United States

Job Type

full-time

About the job

This job is sourced from a job board.

Overview

About the role

We are a leading-edge technology consulting firm committed to empowering organizations through cloud-native and enterprise DevSecOps transformations. Our team of dedicated experts is driven by a passion for harnessing cutting-edge technologies to deliver unparalleled value to our clients. We specialize in crafting innovative technical solutions grounded in cloud-native principles, containerization, and advanced automation-driven DevSecOps practices.

At the heart of our ethos lies a relentless pursuit of progress and the establishment of new industry benchmarks. Our unwavering commitment to excellence sets us apart and makes us the preferred choice for our clients. We recognize that delivering exceptional technical solutions requires the expertise of renowned professionals. If you share our zeal for constructing cloud-native systems, developing cloud-based applications, and designing automation solutions, and if you are seeking to join a company that stands as a dominant force in the Enterprise DevSecOps and Cloud Native domains, then you've discovered the ideal destination.

We cultivate a vibrant, inclusive, and collaborative environment that champions innovation and continuous learning. As a member of our team, you will have the opportunity to engage in exciting projects, tackle intricate challenges, and make a substantial contribution to the advancement of digital transformation for our clients. Come and be a part of a team that thrives on pushing the boundaries of what technology can achieve.

This position will primarily focus on providing design and implementation expertise in infrastructure provisioning, the management and lifecycle of cloud components and services, containers, and other core DevSecOps concepts.
Key Responsibilities

Observability & Monitoring: Design and manage monitoring solutions using Prometheus, Thanos, Grafana, and Mimir to ensure the health and performance of Kubernetes clusters and applications.

Logging & Tracing: Implement Loki, Promtail, and OpenTelemetry to collect, process, and analyze logs and traces for debugging and forensic analysis.

Kubernetes Operations: Deploy, maintain, and optimize Kubernetes clusters, ensuring observability tools are properly integrated and configured.

Incident Response & SLOs: Define SLIs, SLOs, and error budgets, develop alerting strategies using Alertmanager, and automate incident response processes.

High Availability & Scalability: Optimize the observability stack for high availability in limited-connectivity environments, leveraging solutions like Thanos for long-term storage and MinIO for object storage.

Security & Compliance: Implement observability best practices in compliance with security frameworks and Kubernetes security tools such as NeuVector.

Automation & Infrastructure as Code (IaC): Automate observability deployments using Terraform, Helm, and Kubernetes Operators.

Collaboration & Documentation: Work closely with DevOps, security, and platform teams to enhance system reliability and maintain comprehensive documentation.

Skills

kubernetes

prometheus

thanos

grafana

mimir

loki

promtail

opentelemetry

alertmanager

terraform

helm

kubernetes operators