Deltek
Website: deltek.com
Job details:
Key Responsibilities:
Site Reliability & Platform Engineering
- Design, build, and maintain the infrastructure and tooling that underpins Deltek’s SaaS platforms at scale.
- Drive reliability improvements across the full stack, spanning application-level resilience patterns through to infrastructure-level fault tolerance.
- Uphold and extend our IaC-first engineering culture, where all infrastructure changes are made through code and shipped to production via fully automated CI/CD pipelines.
- Build and improve CI/CD pipelines to support safe, frequent deployments with automated rollback capabilities.
- Develop internal tooling and automation to reduce toil and increase engineering self-service.
Observability & Performance
- Design and maintain comprehensive observability solutions including logging, metrics, tracing, and alerting across our AWS-based infrastructure.
- Proactively identify performance bottlenecks and reliability risks before they impact customers.
- Conduct capacity planning and load testing to ensure systems can scale to meet demand.
Incident Management & On-Call Support
- Participate in and own the on-call rotation: ensure fair distribution and adequate coverage across the team, and act as a first responder for production incidents affecting our SaaS platforms.
- Lead incident response: triage, coordinate cross-team resolution, communicate clearly with stakeholders, and drive issues to resolution with a sense of urgency.
- Own post-incident reviews: facilitate blameless post-mortems, identify root causes, and ensure action items are tracked and completed.
- Take pride in leaving systems better than you found them, consistently reducing the frequency and impact of incidents over time.
Collaboration & Engineering Culture
- Partner with software engineering teams to review system designs and architectures with a reliability lens.
- Mentor and provide technical guidance to junior engineers on SRE practices, tooling, and operational excellence.
- Contribute to a strong team culture: supportive, curious, and focused on doing great work while having fun.
Technology Stack:
- JavaScript / Node.js
- C# / .NET
- Python
- Docker & Kubernetes
- PostgreSQL
- Amazon Web Services (AWS)
- Terraform
Qualifications:
Education
- Bachelor’s degree in Computer Science or a related field, or equivalent experience.
Experience
- 3-5 years of overall experience in software development, infrastructure engineering, or site reliability engineering.
- 3+ years of hands-on experience in an SRE, DevOps, or platform engineering role in a production SaaS environment.
- 3+ years applying an automation-first approach to problem-solving using configuration management tools and scripting.
- Strong experience with AWS; familiarity with services such as EC2, EKS, RDS, S3, CloudWatch, and IAM.
Technical Skills
- Infrastructure-as-Code expertise with Terraform.
- Proficiency in at least one scripting/programming language (Python, Node.js, or similar) for automation and tooling development.
- Strong understanding of networking fundamentals: DNS, load balancing, TLS, firewalls, and VPCs.
- Experience with CI/CD pipelines and deployment automation.
- Solid understanding of relational databases (PostgreSQL preferred) including query performance and operational concerns.
- Hands-on experience with observability tooling (e.g., Prometheus/Grafana, CloudWatch, or similar).
Soft Skills
- Strong communication skills: able to explain complex systems clearly, write crisp incident reports, and influence technical decisions across teams.
- Calm under pressure, able to lead effectively during high-severity incidents.
- Passion for reliability, operational excellence, and building systems that just work.
- Commitment to reducing toil through thoughtful automation and process improvement.
- Blameless, growth-oriented mindset with a focus on continuous improvement.