Calsoft
Website: calsoftinc.com
Job details:
Role - HW Fleet Ops Engineer
As a Hardware Fleet Operations Engineer, you will own the operation, reliability, and lifecycle management of large-scale lab and test infrastructure fleets (arrays, servers, switches, and supporting hardware). A key part of this role is to design and build Python-based automation to rehab, reconfigure, and return hardware to service automatically, reducing manual toil and improving testbed availability.
You will collaborate with development, DevOps, SRE, and release teams to ensure that hardware fleets are always ready to run CI/CD pipelines and validation workloads, and that incidents are detected, triaged, and auto-remediated wherever possible. You will also be on call to receive alerts when fleet availability drops below the defined threshold.
Responsibilities
• Operate and own large hardware test fleets (FlashArray, FlashBlade, servers, switches, PDUs, etc.) across global labs.
• Design and implement Python automation to:
● Detect unhealthy hardware and testbeds.
● Run standardized rehab workflows (power cycle, re-cable checks, firmware checks, OS/image re-provisioning, config reset, etc.).
● Safely return capacity to the shared pool with minimal human intervention.
• Build and maintain tooling and services that integrate fleet state with:
● CI/CD systems (e.g., Jenkins).
● Jenkins Groovy pipelines (working knowledge of pipeline implementation is expected).
● OS and patch rollout tools such as Ansible or Dominator (hands-on experience is expected).
● Scheduling and reservation systems.
● Monitoring/observability platforms.
• Develop and maintain runbooks and automation APIs for common operational tasks (bring-up, decommission, reprovisioning, network reconfig, rack/slot moves).
• Implement and improve health checks and SLOs for hardware fleets (availability, utilization, rehab throughput, mean time to rehab).
• Analyze incident trends (SEVs, repeated failures, noisy devices) and drive continuous improvements in automation, process, and hardware standards.
• Collaborate with SRE, DevOps, and feature teams to ensure testbed readiness for releases, including support for special configurations and scale/topology tests.
• Participate in on-call rotation for lab/testbed operations following a follow-the-sun model, including incident response and root cause analysis (RCA).
• Work closely with the internal lab team on deploying new fleets and, occasionally, on migrations.
• Document systems, APIs, and workflows and contribute to building a comprehensive runbook and automation library.
Minimum Qualifications
• Strong Python programming skills, with experience building automation tools, services, or scripts for infrastructure or hardware operations.
• Solid understanding of Linux systems, shell scripting (e.g., Bash), and basic system administration.
• Experience operating one or more of the following at scale:
● Storage arrays, servers, or data center hardware.
● Lab/testbed or CI infrastructure for large engineering organizations.
• Hands-on experience with:
● REST/gRPC APIs, CLI tools, or SDKs to manage hardware platforms.
● Version control (Git) and basic CI/CD concepts.
• Demonstrated ability to:
● Systematically debug hardware or infrastructure issues.
● Design safe automation (idempotent, with guardrails and rollback).
● Communicate clearly with cross-functional teams and drive incidents to closure.
• Strong sense of ownership, urgency, and drive to reduce manual toil through automation.
• Willingness to participate in a daytime on-call rotation (e.g., 8am–8pm local time for one week every few months).
Mandatory skill set
• Python Automation Testing
• Storage Domain
• Hardware Infrastructure / Lab Infrastructure / Hardware Testing / FleetOps
• Hardware test fleets (FlashArray, FlashBlade, servers, switches, PDUs, etc.)
Location – Any Calsoft location
Work mode – Hybrid
Please share your updated profile at natasha.joshi@calsoftinc.com.