Principal Engineer

Wells Fargo

Website: wellsfargo.com
Job details:
About This Role

We are seeking a Principal Engineer – Site Reliability Engineering (SRE) to serve as a technical authority and reliability architect across critical platforms and applications.
The Principal Engineer operates as a hands‑on expert, shaping strategy while directly influencing complex systems, mentoring senior engineers, and solving the hardest reliability and performance challenges.
This role drives reliability by design, sets enterprise SRE standards, and partners with engineering, architecture, and operations leadership to embed resilience, observability, and automation at scale.

In This Role, You Will

Act as an advisor to leadership to develop or influence applications, network, information security, database, operating systems, or web technologies for highly complex business and technical needs across multiple groups
Lead the strategy and resolution of highly complex and unique challenges requiring in-depth evaluation across multiple areas or the enterprise, delivering solutions that are long-term, large-scale and require vision, creativity, innovation, advanced analytical and inductive thinking
Translate advanced technology experience and in-depth knowledge of the organization's tactical and strategic business objectives, the enterprise technological environment, the organizational structure, and strategic technological opportunities and requirements into technical engineering solutions
Provide vision, direction and expertise to leadership on implementing innovative and significant business solutions
Maintain knowledge of industry best practices and new technologies, and recommend innovations that enhance operations or provide a competitive advantage to the organization
Strategically engage with all levels of professionals and managers across the enterprise and serve as an expert advisor to leadership

Required Qualifications

7+ years of Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education

Job Expectations

Define and evolve enterprise‑level SRE strategy, standards, and reference architectures.
Establish and govern SLIs, SLOs, and error budgets for Tier‑1 and Tier‑2 services; drive adoption across engineering and operations.
Architect highly resilient, scalable, and fault‑tolerant systems across cloud and hybrid platforms.
Lead deep‑dive resiliency, capacity, and performance reviews for critical services.

Design and mature end‑to‑end observability architectures (metrics, logs, traces) aligned to golden signals.
Drive OpenTelemetry‑based standardization and telemetry consistency across platforms.
Partner with performance engineering to execute load, stress, soak, failover, and chaos testing.
Identify systemic performance bottlenecks and lead remediation across applications, middleware, and infrastructure layers.

Lead large‑scale toil identification and elimination initiatives across platforms.
Design and implement automation‑first reliability solutions, including self‑healing patterns, auto‑remediation, and AI‑assisted operations.
Build reusable golden paths, reliability frameworks, and standardized automation patterns.
Champion shift‑left reliability, embedding SRE controls into CI/CD pipelines and design reviews.

Serve as senior technical authority during major incidents; provide deep technical triage and architectural guidance.
Lead blameless postmortems for high‑impact incidents; ensure systemic fixes over tactical remediation.
Drive problem management maturity through trend analysis, recurring issue elimination, and proactive risk reduction.
Influence change management practices to ensure safe, predictable, and observable releases.

Mentor and coach senior engineers, SREs, and platform teams on advanced reliability practices.
Define and maintain SRE maturity models, scorecards, and executive‑level reliability metrics.
Partner with architecture, security, and product leaders to align reliability with business outcomes.
Establish and review runbooks, readiness checklists, dashboards, and reliability reviews for consistency and effectiveness.

Additional Required Qualifications

Deep expertise in AWS, Azure, or GCP (multi‑cloud experience preferred).
Strong understanding of container platforms (Kubernetes/OpenShift) and cloud‑native architectures.

Strong proficiency in Python and Go for automation, tooling, and platform integrations.
Infrastructure as Code expertise: Terraform, Ansible/Chef; strong Git/GitOps practices.
CI/CD expertise: Azure DevOps, GitHub Actions, Jenkins, GitLab CI.

Advanced hands‑on experience with Prometheus, Grafana, OpenTelemetry, and APM tools (AppDynamics, Aternity, SPLOC, ThousandEyes).
Strong knowledge of capacity planning, DR strategies, chaos engineering, canary and blue‑green deployments.

Expert understanding of Incident, Problem, and Change Management.
Strong experience with on‑call models, runbook automation, and SRE operational best practices.
Excellent communication skills with the ability to influence senior engineering and executive stakeholders.

Reference Number

R-522973
