Principal Engineer
Wells Fargo
- Location: Bengaluru South, Karnataka, India
- Job type: Full-time
Required skills
- Python
- AWS
- Ansible
- Azure
- business objectives
- capacity planning
- change management
- communication skills
- database
- DevOps
- GCP
- Git
- information security
- Jenkins
- Kubernetes
- middleware
- SRE
- Terraform
Website: wellsfargo.com
About This Role
- We are seeking a Principal Engineer, Site Reliability Engineering (SRE), to serve as a technical authority and reliability architect across critical platforms and applications.
- The Principal Engineer operates as a hands‑on expert, shaping strategy while directly influencing complex systems, mentoring senior engineers, and solving the hardest reliability and performance challenges.
- This role drives reliability by design, sets enterprise SRE standards, and partners with engineering, architecture, and operations leadership to embed resilience, observability, and automation at scale.
In This Role, You Will
- Act as an advisor to leadership to develop or influence applications, network, information security, database, operating systems, or web technologies for highly complex business and technical needs across multiple groups
- Lead the strategy and resolution of highly complex and unique challenges requiring in-depth evaluation across multiple areas or the enterprise, delivering solutions that are long-term, large-scale and require vision, creativity, innovation, advanced analytical and inductive thinking
- Translate advanced technology experience, an in-depth knowledge of the organization's tactical and strategic business objectives, the enterprise technological environment, the organizational structure, and strategic technological opportunities and requirements into technical engineering solutions
- Provide vision, direction and expertise to leadership on implementing innovative and significant business solutions
- Maintain knowledge of industry best practices and new technologies, and recommend innovations that enhance operations or provide a competitive advantage to the organization
- Strategically engage with all levels of professionals and managers across the enterprise and serve as an expert advisor to leadership
Required Qualifications
- 7+ years of Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
Job Expectations
- Reliability Engineering & Architecture
  - Define and evolve enterprise‑level SRE strategy, standards, and reference architectures.
  - Establish and govern SLIs, SLOs, and error budgets for Tier‑1 and Tier‑2 services; drive adoption across engineering and operations.
  - Architect highly resilient, scalable, and fault‑tolerant systems across cloud and hybrid platforms.
  - Lead deep‑dive resiliency, capacity, and performance reviews for critical services.
- Observability & Performance
  - Design and mature end‑to‑end observability architectures (metrics, logs, traces) aligned to golden signals.
  - Drive OpenTelemetry‑based standardization and telemetry consistency across platforms.
  - Partner with performance engineering to execute load, stress, soak, failover, and chaos testing.
  - Identify systemic performance bottlenecks and lead remediation across applications, middleware, and infrastructure layers.
- Automation & Toil Reduction
  - Lead large‑scale toil identification and elimination initiatives across platforms.
  - Design and implement automation‑first reliability solutions, including self‑healing patterns, auto‑remediation, and AI‑assisted operations.
  - Build reusable golden paths, reliability frameworks, and standardized automation patterns.
  - Champion shift‑left reliability, embedding SRE controls into CI/CD pipelines and design reviews.
- Incident, Problem & Change Leadership
  - Serve as senior technical authority during major incidents; provide deep technical triage and architectural guidance.
  - Lead blameless postmortems for high‑impact incidents; ensure systemic fixes over tactical remediation.
  - Drive problem management maturity through trend analysis, recurring issue elimination, and proactive risk reduction.
  - Influence change management practices to ensure safe, predictable, and observable releases.
- Technical Leadership & Enablement
  - Mentor and coach senior engineers, SREs, and platform teams on advanced reliability practices.
  - Define and maintain SRE maturity models, scorecards, and executive‑level reliability metrics.
  - Partner with architecture, security, and product leaders to align reliability with business outcomes.
  - Establish and review runbooks, readiness checklists, dashboards, and reliability reviews for consistency and effectiveness.
Additional Required Qualifications
- 7+ years of experience designing and operating large‑scale distributed systems.
- 7+ years of hands‑on experience in SRE, Platform Engineering, or DevOps roles.
- Proven track record of driving enterprise‑scale reliability transformations.
- Cloud & Platforms
  - Deep expertise in AWS, Azure, or GCP (multi‑cloud experience preferred).
  - Strong understanding of container platforms (Kubernetes/OpenShift) and cloud‑native architectures.
- Automation & Engineering
  - Strong proficiency in Python and Go for automation, tooling, and platform integrations.
  - Infrastructure as Code expertise: Terraform, Ansible/Chef; strong Git/GitOps practices.
  - CI/CD expertise: Azure DevOps, GitHub Actions, Jenkins, GitLab CI.
- Observability & Reliability Tooling
  - Advanced hands‑on experience with Prometheus, Grafana, OpenTelemetry, and APM tools (AppDynamics, Aternity, SPLOC, ThousandEyes).
  - Strong knowledge of capacity planning, DR strategies, chaos engineering, and canary and blue‑green deployments.
- Operational Excellence
  - Expert understanding of Incident, Problem, and Change Management.
  - Strong experience with on‑call models, runbook automation, and SRE operational best practices.
  - Excellent communication skills with the ability to influence senior engineering and executive stakeholders.
What Success Looks Like
- Measurable reduction in incidents and operational toil
- Clear SLO adoption and error budget governance across critical services
- Improved MTTR, resiliency, and release confidence
- Scalable, repeatable reliability patterns embedded by default
Reference Number
R-522973