Site Reliability Engineering Analyst - Senior I

Min Experience

4 years

Location

Hyderabad

Job Type

Full-Time

About the job

This job is sourced from a job board.

About the role

About FedEx: Located in Hyderabad, India, FedEx ACC India serves as a strategic technology division for FedEx that focuses on developing innovative solutions for our customers and team members across the globe. These solutions will enhance productivity, minimize expenses, and update our technology infrastructure to continue providing the outstanding experiences our customers expect.

A Site Reliability Engineer (SRE) is an advanced DevOps role that combines software engineering and cloud capabilities to ensure the scalability, performance, and reliability of large-scale, cloud-based applications. As applications and infrastructure have become more complex and cloud-based, a more proactive and software-centric approach is needed to ensure reliability at scale. By combining software engineering and cloud principles, SREs bring a mindset of automation and reliability to operations.

The preferred approach is to tackle operations challenges from a software engineering perspective, leveraging:

Coding
Automation
Engineering principles

By doing so, SREs build resilient, self-healing systems that can scale seamlessly.

So how do we do this? Here's what we expect an SRE to help the IT and Engineering teams mature:

Detect issues.
Automatically handle failures.
Prepare disaster recovery plans.
Keep the system up and reliable.
Mitigate broken systems and prevent them from causing future disruptions.

Responsibilities:

An SRE bridges the gap between traditional software engineering and operations to create highly scalable and fault-tolerant systems. As a result, the SRE ensures the reliable and efficient operation of an organization's systems and services. Here's an in-depth look into the core responsibilities of site reliability engineers:

Ensure system reliability and availability:

Efficient systems are the backbone of every secure and breach-free organization. Organizations continuously update their applications to provide advanced features to users.
But sometimes, their systems become unreliable, which results in unavailability. This is where site reliability engineers help. Here's how SREs ensure systems are reliable:

Monitor system issues.
Create strategies to detect issues.
Address those issues.
Design systems to troubleshoot automatically.
Write and review post-mortems.

Mitigate operational risks:

SREs identify and assess potential risks that could impact the performance of systems and services, and implement measures to eliminate them. Here is how SREs do it:

Collaborate with development teams and other stakeholders to identify potential risks.
Once risks are identified, analyze and evaluate their potential impact and likelihood of occurrence.
Based on the risk assessment, implement risk mitigation strategies.
Once done, continuously monitor and review the effectiveness of those strategies.

By doing so, SREs maintain system reliability and ensure a positive user experience.

Monitor system health:

Monitoring means measuring a system's health. An SRE uses alerts, tickets, logging mechanisms, and request times to monitor a system's health. This keeps the system stable and minimizes user disruption. If a bug occurs, the SRE responds immediately to resolve it. However, doing all of this manually is expensive and time-consuming, so SREs automate this process for systems that handle large amounts of data. Here is how they do it:

Study historical performance trends using visualizations such as charts and graphs.
Trace problems with system monitoring tools.
Monitor log files to manage infrastructure at scale.

Doing so eliminates manual collection, storage, and visualization of the data.

Minimize emergency response:

Emergency response is the time site reliability engineers take to respond to problems. This period is measured as the Mean Time to Respond (MTTR): the average time an SRE takes to fix an incident after it happens.
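
As a rough illustration of the MTTR metric described above, the following Python sketch averages the time between an incident occurring and being fixed. The function name `mean_time_to_respond` and the sample incident timestamps are hypothetical and chosen only for illustration; they are not taken from the posting or from any FedEx tooling.

```python
from datetime import datetime, timedelta


def mean_time_to_respond(incidents):
    """Return the average (fixed - occurred) duration.

    `incidents` is a list of (occurred, fixed) datetime pairs.
    This layout is an assumption made for the example.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    # Sum the per-incident durations, starting from a zero timedelta.
    total = sum((fixed - occurred for occurred, fixed in incidents), timedelta())
    return total / len(incidents)


# Hypothetical sample data: one 30-minute and one 90-minute incident.
sample_incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 15, 30)),
]

mttr = mean_time_to_respond(sample_incidents)
print(mttr)  # 1:00:00
```

In this sketch, fixing either incident faster lowers the average, which is the sense in which quicker resolution improves the metric.
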
Minimizing the MTTR is necessary to reduce downtime and keep systems reliable. As an SRE, you can improve this metric by resolving incidents quickly.

Maintain internal tooling:

Site reliability engineers maintain internal tools to run complex operations smoothly. These tools help them track severe bugs, maintain CI/CD pipelines, and communicate with other teams. Some of the most widely used internal tools are:

Azure directory / Azure DevOps (experience expected).
Communication platforms like MS Teams, ServiceNow – ePDSM.
Bug tracking platforms such as Jira, Digital Agility, or HP ALM.
Deployment tools such as GitHub Actions.
Monitoring solutions like Splunk, Grafana.
Error logging services such as Kibana, ELK Stack.
Documentation tools such as MS SharePoint.

Continuous improvement:

Site reliability engineers aim to make systems better every day. For this purpose, they collaborate with teams like QA, software engineers, and security engineers to ensure all teams are on the same page.

Qualifications:

Bachelor's degree in Computer Science, Engineering, or a related field.
3 to 5 years of experience as an SRE, DevOps engineer, or Ops engineer.

About the company

FedEx was founded with a vision of becoming a global leader. We've brought that vision to life by connecting people around the world to goods, services, ideas, and technologies that open up new opportunities and change lives.

Skills

Site Reliability Engineering or SRE
Kubernetes
Jira or ServiceNow
Azure or Azure infrastructure or Azure monitor
Splunk or Grafana or Kibana or AppDynamics
CI/CD or Jenkins