About this opportunity:
BNEW RCE Lab Operations provides lab infrastructure, networks, and hardware services that enable R&D teams to configure and deploy design and test environments globally.
Hardware Management (HWM) provides hardware asset management for R&D, covering the full asset lifecycle to support data-driven decisions and maximize value.
Join Ericsson as a DevSecOps Engineer and help build, secure, evolve, and operate COS (Central Observability System), the platform behind our global observability services, as part of our journey into secure automation and AI-enabled operations.
COS is used daily by thousands of developers and testers worldwide. You’ll own the DevSecOps setup end-to-end: secure CI/CD, infrastructure/configuration as code, operational readiness, and continuous improvement. You’ll also help introduce AI, from AIOps workflows to MLOps/LLMOps pipelines, keeping features safe, observable, and reliable.
The tech stack spans cloud-native microservices (e.g., Go, Svelte, Kubernetes), telemetry backends (e.g., Cortex), and data systems (e.g., Postgres, Cassandra, Kafka).
You like automation, clear guardrails, and measurable outcomes. You turn telemetry into action and help teams move fast without compromising security or reliability.
What you will do
- Build and run COS as a secure cloud-native platform (containers, Kubernetes, OpenStack), using infrastructure/configuration as code.
- Introduce AI-enabled operations (AIOps) and MLOps/LLMOps practices: automate detection/diagnosis, and set up safe, observable delivery and monitoring of AI components.
- Execute deployments, upgrades, and configuration changes; troubleshoot by reproducing issues, restoring service, and running performance/load tests as needed.
- Improve secure CI/CD for platform and services: testing, scanning, policy checks, controlled releases, and compliance (risk assessments, audits, configuration baselines).
- Apply SRE practices to improve availability, scalability, and performance, and drive proactive monitoring and reliability improvements.
- Take operational ownership with the team: runbooks, on-call readiness, incident/problem management, SLA follow-up, access management, and service performance reporting.