Wells Fargo
Website: wellsfargo.com
Job details:
About The Role
Wells Fargo is seeking a Senior Systems Operations Engineer. This is an individual contributor role within the CTO ITSO Application Support organization, with end‑to‑end responsibility for the operational health and reliability of enterprise monitoring, scheduling, and observability platforms. The role owns production support, incident response, platform health, upgrades, and operational readiness, partnering closely with engineering, SRE, and onshore teams to meet defined SLAs, SLOs, and availability targets. The position emphasizes strong ITIL execution combined with SRE principles, including reliability engineering, proactive issue prevention, automation, and reduction of operational toil.
This role is part of the CTO Application Support team within ITSO, responsible for run‑the‑business operations of critical enterprise scheduling, monitoring, and observability platforms. The team ensures availability, reliability, performance, and resilience of platforms such as Autosys, Grafana, Splunk, Cribl, ThousandEyes, and other observability and monitoring tools that support production environments across the enterprise. The function operates at the intersection of ITIL‑aligned service operations and an SRE mindset, with a strong focus on operational stability, automation, observability, and continuous service improvement.
In This Role, You Will
- Provide hands‑on production support for enterprise monitoring, scheduling, and observability platforms in line with ITIL service operations.
- Ensure platforms meet defined availability, reliability, and performance objectives, excluding approved maintenance windows.
- Drive incident management activities including triage, escalation, coordination, remediation, and post‑incident reviews.
- Contribute to problem management by identifying recurring issues and driving root‑cause fixes.
- Execute change management activities such as patching, upgrades, configuration changes, and platform enhancements following standard controls.
- Proactively monitor platform health and identify risks to stability, capacity, or performance.
- Apply an SRE mindset by identifying opportunities for automation, self‑service, and toil reduction.
- Strengthen observability practices through effective use of metrics, logs, dashboards, alerts, and synthetic monitoring.
- Create, maintain, and enhance runbooks and playbooks to improve mean time to detect (MTTD) and mean time to restore (MTTR).
- Participate in on‑call rotations and provide timely response and ownership during production incidents.
- Collaborate with onshore partners, engineering teams, and platform owners to ensure seamless operations and knowledge transfer.
Required Qualifications
- 4+ years of Systems Engineering or Technology Architecture experience, or equivalent, demonstrated through one or a combination of the following: work experience, training, military experience, or education.
Desired Qualifications
- Hands‑on experience supporting enterprise application platforms in the areas of monitoring, scheduling, or observability (e.g., Autosys, Grafana, Splunk, Cribl, ThousandEyes, or equivalent).
- Strong understanding of ITIL processes, including incident, problem, and change management.
- Experience working in a CTO, ITSO, or large‑scale enterprise application support environment.
- Proven ability to triage complex production issues, perform root cause analysis, and drive issues to closure.
- Solid understanding of SRE and reliability engineering concepts, including observability, error budgets, and automation.
- Ability to operate independently with a strong ownership and accountability mindset.
- Effective communication skills to work across global teams and time zones.
Reference Number
R-530328