Embrace Software Inc
Website: embracesoftwareinc.com
Job details:
About Us
We're rebuilding vertical software with AI — from the inside.
Embrace owns the software running inside 16% of the Fortune 500, 45+ state agencies, and 450+ banks and credit unions. We acquire entrenched vertical software businesses and rebuild them around AI — products, operations, go-to-market, all of it. Our Venture Lab launches new AI-native products into those same markets, using the distribution and customer relationships our portfolio already owns.
You'll ship AI into production against real workflows, real customers, and a P&L you can see move within a quarter.
We hire people who want scope, speed, and ownership, and who are tired of working on AI that never reaches a customer. If you want to spend the next five years shipping into software that already runs the economy, talk to us.
Job Description
This is a remote position.
Embrace Technology Group is the unified engineering organization across the Embrace portfolio, encompassing our Venture AI Labs. We build and modernize software products across six regulated industry verticals, and we are reshaping how that work gets done — AI-first, forward-deployed, and outcome-driven. Our engineers ship real products to real customers, fast.
We are looking for a CloudOps Engineer to operate and continuously improve the reliability, security, scalability, observability, and cost efficiency of our Azure-hosted SaaS products. Our products run across dev, QA, staging, and production environments, with infrastructure managed in Terraform and CI/CD automated through GitHub Actions.
You will partner with engineering teams to ensure our SaaS platforms and AI-enabled solutions are deployed consistently, monitored effectively, secured properly, and operated reliably in production.
Environment and Technology Context
- Microsoft Azure-hosted SaaS products across dev, QA, staging, and production.
- Terraform for infrastructure as code and repeatable provisioning.
- GitHub Actions for application and infrastructure CI/CD.
- Azure services: Static Web Apps, Container Apps, PostgreSQL, Storage Accounts, SignalR, Service Bus, Azure AI Foundry, Speech-to-Text, and Azure Arc.
- AI-enabled capabilities: STT workloads, LLM integrations, AI service endpoints, quotas, usage and latency monitoring, and cost controls.
Key Responsibilities
Cloud Infrastructure Operations
- Manage and support Azure infrastructure across dev, QA, staging, and production.
- Maintain operational health of Static Web Apps, Container Apps, PostgreSQL, Storage Accounts, SignalR, Service Bus, Azure AI Foundry, and Azure Arc.
- Ensure resources are provisioned, monitored, maintained, and retired per company standards.
- Support environment setup for new products, customers, and integrations.
- Identify and resolve infrastructure issues affecting performance, reliability, availability, or security.
Terraform and Infrastructure as Code
- Build and maintain Terraform modules and environment configurations.
- Ensure infrastructure changes are version-controlled, peer-reviewed, tested, and approved.
- Manage Terraform state, workspaces, variables, secrets, and deployment workflows.
- Detect and resolve drift between Terraform and deployed Azure resources.
- Standardize naming, tagging, resource group structure, environment isolation, and module patterns.
- Support scalable provisioning of new SaaS environments using reusable templates.
GitHub Actions and CI/CD
- Build, maintain, and troubleshoot GitHub Actions workflows for application and infrastructure deployments.
- Support CI/CD pipelines across multiple SaaS products and environments.
- Implement promotion flows from dev to QA to staging to production.
- Add deployment safeguards: environment protection rules, approvals, rollback procedures, validation checks, release gates, and audit trails.
- Manage pipeline secrets, service principals, managed identities, and deployment credentials.
- Improve build and deployment reliability and traceability.
AI Service Operations
- Operate and monitor Azure AI services, including Azure AI Foundry and Speech-to-Text workloads.
- Support production operations for LLM integrations and AI-enabled product features.
- Monitor AI service availability, latency, quota usage, token consumption, API failures, throttling, and cost.
- Help define operational standards for AI workloads: access control, logging, alerting, failover, usage governance, and provider disruption handling.
- Partner with engineering to troubleshoot AI service issues, integration failures, degraded model responses, or provider-side disruptions.
- Support secure handling of AI secrets, endpoints, keys, managed identities, and private network access.
Monitoring, Alerting, and Observability
- Implement and maintain monitoring with Azure Monitor, Log Analytics, and Application Insights.
- Build dashboards for infrastructure, application, database, messaging, storage, AI service, and deployment health.
- Configure alerts for availability, latency, errors, resource saturation, queue depth, failed jobs, failed deployments, database health, quota exhaustion, and cost anomalies.
- Improve signal quality by reducing noise and ensuring alerts are actionable.
- Partner with engineering to define SLIs, SLOs, and production health metrics.
Incident Response and Production Support
- Participate in production incident response for infrastructure, deployments, integrations, and platform services.
- Triage and resolve issues across Azure services, CI/CD, Terraform, networking, databases, messaging, and AI integrations.
- Create and maintain runbooks for common operational issues.
- Support root cause analysis and post-incident reviews.
- Implement preventive actions after incidents to improve reliability.
- Help define severity levels, escalation paths, response expectations, on-call processes, and production support procedures.
Security, Identity, and Access Management
- Implement cloud security best practices across Azure environments.
- Manage Azure RBAC, managed identities, service principals, Key Vault access, and least-privilege permissions.
- Secure GitHub Actions workflows, deployment credentials, environment secrets, and production access.
- Support secret rotation, certificate management, and secure configuration management.
- Enforce network security via private endpoints, firewalls, IP restrictions, and environment-specific access rules.
- Support audit and compliance readiness for SOC 2, ISO 27001, or similar frameworks.
Database, Storage, and Messaging Operations
- Support Azure PostgreSQL operations: backups, restores, performance monitoring, connection limits, HA, and capacity planning.
- Monitor and maintain Azure Storage Accounts, lifecycle policies, access controls, backup strategy, and usage trends.
- Support Azure Service Bus operations: queue/topic monitoring, dead-letter handling, retry behavior, and throughput.
- Support SignalR operational health, connection metrics, and scaling behavior.
Cost Management and Optimization
- Monitor Azure spend across products, environments, services, and customers where applicable.
- Implement tagging standards to support cost allocation by product, environment, customer, or business unit.
- Build cost dashboards, budget alerts, anomaly detection, and recurring cost reviews.
- Identify underutilized resources and recommend right-sizing opportunities.
- Review AI service costs, LLM and token usage, STT usage, storage growth, database sizing, and environment costs.
- Recommend savings plans, reservations, scaling rules, lifecycle policies, or shutdown schedules.
Reliability, Backup, and Disaster Recovery
- Define and maintain backup and recovery procedures for critical cloud services.
- Test database restores and validate backup reliability.
- Help define RTOs and RPOs for production systems.
- Support disaster recovery planning for SaaS products and customer-facing services.
- Improve resilience through scaling rules, failover patterns, health checks, synthetic monitoring, and production readiness reviews.
Documentation and Operational Standards
- Create and maintain CloudOps documentation, runbooks, deployment guides, and environment standards.
- Define standards for naming, tagging, logging, alerting, access control, Terraform structure, GitHub Actions patterns, and production changes.
- Document procedures for cloud services, CI/CD workflows, AI services, and incident response.
- Enable engineering teams with reusable patterns, templates, and self-service guidance.
Requirements
Required Qualifications
- 5+ years of hands-on experience operating production workloads in Microsoft Azure.
- Strong experience with Terraform and infrastructure as code.
- Experience building and maintaining CI/CD pipelines using GitHub Actions.
- Experience with containerized workloads, preferably Azure Container Apps or similar.
- Experience with Azure Monitor, Log Analytics, and Application Insights.
- Experience with Azure PostgreSQL or similar managed relational databases.
- Strong understanding of Azure networking, DNS, identity, RBAC, managed identities, Key Vault, and security best practices.
- Experience troubleshooting production incidents across infrastructure, deployments, networking, and cloud services.
- Comfortable scripting in Bash, PowerShell, Python, or similar.
- Strong documentation, communication, and cross-functional collaboration skills.
Preferred Qualifications
Experience in any of the following is a plus:
- Operating AI-enabled applications or Azure AI services.
- Azure AI Foundry, Azure OpenAI, Speech-to-Text, or LLM-based integrations.
- Monitoring AI service usage, quotas, latency, throttling, token consumption, and cost.
- Azure Service Bus, SignalR, Storage Accounts, Static Web Apps, and Azure Arc.
- Multi-product or multi-tenant SaaS platforms.
- SOC 2, ISO 27001, or similar compliance frameworks.
- FinOps, cloud cost governance, or Azure cost optimization.
- Designing production support processes, incident response workflows, on-call rotations, and operational runbooks.
Benefits
- Competitive salary commensurate with experience.
- Opportunities for career advancement and professional development.
- Experience collaborating with a diverse, global team within a remote work setting.