Logile
Website: logile.com
Job details:
Company Overview
Logile is the leading retail labor planning, workforce management, inventory management and store execution provider deployed in thousands of retail locations across North America, Europe, Australia, and Oceania.
Our proven AI, machine-learning technology and industrial engineering accelerate ROI and enable operational excellence with improved performance and empowered employees. Retailers worldwide rely on Logile solutions to boost profitability and competitive advantage by delivering the best service and products at optimal cost.
From labor standards development and modeling to unified forecasting, storewide scheduling, and time and attendance, to inventory management, task management, food safety, and employee self-service, we transform retail operations with a unified store-level solution. Gain the Advantage with The Logic of Retail. One Platform for store planning, scheduling, and execution.
For more information, visit www.logile.com
Job Summary
We are seeking a motivated and experienced Site Reliability Engineer (SRE) to join our dynamic engineering team. The ideal candidate will have a strong background in ensuring the reliability, scalability, and performance of infrastructure and applications. The SRE will focus on building robust monitoring systems, automating operations, and bridging the gap between development and operations to achieve high service availability.
Key Responsibilities
Design, implement, and manage observability systems (Prometheus, Grafana, ELK/EFK, Jaeger, OpenTelemetry).
Define and maintain SLAs, SLOs, and SLIs for services, ensuring reliability goals are met.
Build automation for infrastructure, monitoring, scaling, and incident response using Terraform, Ansible, and scripting (Python/Bash).
Collaborate with developers to design resilient and scalable systems following SRE best practices.
Lead incident management: monitoring alerts, root cause analysis, postmortems, and continuous improvement.
Implement chaos engineering and fault-tolerance testing to validate system resilience.
Drive capacity planning, performance tuning, and cost optimization across environments.
Ensure security, compliance, and governance in infrastructure monitoring.
Job Location & Schedule:
This is an onsite role at the Logile Bhubaneswar office.
The selected candidate is expected to work hours that partially overlap with US working hours.
Required Skills & Experience
2-5 years of experience, with strong experience in monitoring, logging, and tracing tools (Prometheus, Grafana, ELK, EFK, Jaeger, OpenTelemetry, Loki).
Cloud expertise: AWS, Azure, or GCP monitoring and reliability practices (CloudWatch, Azure Monitor).
Proficiency in Linux system administration and networking fundamentals.
Solid skills in infrastructure automation (Terraform, Ansible, Helm).
Programming/scripting skills: Python, Go, Bash.
Experience with Kubernetes and containerized workloads.
Proven track record in CI/CD and DevOps practices.
Preferred Skills
Experience with chaos engineering tools (Gremlin, Litmus).
Strong collaboration skills to drive SRE culture across Dev & Ops teams.
Experience with Agile/Scrum environments.
Knowledge of security best practices (DevSecOps).
Compensation And Benefits
The compensation and benefits for this role are benchmarked against the best in the industry and the job location.
Standard shift: 1 PM to 10 PM (shift allowance applicable for non-standard shifts and as per role).
Shifts starting after 4 PM: Eligible for food allowance/subsidized meals and cab drop.
Shifts starting after 8 PM: Eligible for cab pickup as well.