-
Design, develop, and maintain automation to onboard new hardware devices into Jump's HPC data centers, including servers, network switches, rack PDUs, CDUs, and environmental sensors.
-
Build end-to-end provisioning workflows that take hardware from racked-and-cabled through discovery, configuration, validation, and production-ready state with minimal manual intervention.
-
Extend and adapt onboarding automation as new hardware platforms and device types are introduced.
Data Center Tooling Development
-
Develop tools for power and cooling capacity planning—enabling the operations and planning teams to model current utilization, forecast growth, and identify constraints before they become problems.
-
Build outage simulation tooling to model the impact of power, cooling, or network failures across HPC facilities and validate redundancy/failover configurations.
-
Develop and maintain operational tooling that supports day-to-day data center workflows such as hardware lifecycle tracking, data center inventory/spares, change management, and diagnostics.
Monitoring & Metrics Integration
-
Build and maintain monitoring integrations for HPC data center infrastructure—pulling telemetry from servers, switches, PDUs, CDUs, environmental sensors, and facility systems into centralized observability platforms.
-
Integrate metrics feeds from colocation and data center providers into Jump's monitoring stack, normalizing data for alerting and capacity reporting.
-
Work with the Operations Lead to implement the monitoring and alerting strategy, translating requirements into deployed, production-grade instrumentation.
Cross-Team Collaboration
-
Work closely with the HPC Planning, Engineering, and Operations leads to understand their tooling and monitoring needs and deliver solutions that meet them.
-
Partner with HPC Engineering on integration points between data center automation and compute/storage/network provisioning systems.
-
Translate operational pain points and manual processes into automated, maintainable solutions.
Systems Maintenance & Reliability
-
Own the reliability and lifecycle of all systems and tools you develop—monitor for failures, respond to issues, and iterate based on operational feedback.
-
Maintain comprehensive documentation for all tooling, automation workflows, and integrations.
-
Participate in large, coordinated maintenance operations, including during evenings and weekends.
AI-Driven Development
-
Use AI tools daily across all aspects of the role: writing and reviewing code, analyzing data, debugging, generating documentation, and speeding up development.
-
Identify opportunities to apply AI to data center operations problems—anomaly detection, predictive capacity planning, intelligent alerting, and beyond.
Additional duties as assigned or needed.
Skills You'll Need:
-
5+ years of professional experience in production engineering, infrastructure automation, or site reliability engineering, preferably in HPC or large-scale data center environments.
-
Proven track record of building and shipping production automation and tooling—not just scripts, but maintained, reliable systems.
-
Experience automating hardware provisioning and lifecycle management (servers, network devices, power/cooling infrastructure).
-
Strong understanding of data center infrastructure: power distribution, cooling systems (air and liquid), environmental monitoring, and structured cabling.
-
Experience integrating with hardware management interfaces (IPMI/BMC/Redfish, SNMP, vendor APIs) for discovery, configuration, and telemetry collection.
-
High energy and a results-driven approach, with the ability to work under pressure and meet tight deadlines.
Technical Skills:
-
High proficiency in Golang and at least one additional language (e.g., Python). You will write a lot of code in this role.
-
Strong Linux systems knowledge; you should live in Linux. Proficiency in system administration, networking, storage, process management, log analysis, and troubleshooting at the OS level.
-
Experience with Grafana for building dashboards, alerting, and visualization of infrastructure metrics. Experience with Prometheus, InfluxDB, or similar observability platforms and building custom integrations/exporters.
-
Experience with configuration management and infrastructure-as-code tools (SaltStack, Ansible, Terraform, or similar).
-
Solid understanding of networking concepts: L2/L3 protocols, VLANs, BGP, SNMP, and switch/router configuration (Arista, Cisco).
-
Experience with APIs and data integration—consuming vendor APIs, normalizing heterogeneous data sources, building data pipelines for metrics and reporting.
-
Experience with ClickHouse and MySQL—writing queries, designing schemas, and building tooling that reads from and writes to these databases.
-
Experience with GitHub for version control, code review, CI/CD workflows, and collaborative development.
-
Demonstrated heavy use of AI tools (e.g., LLM-based coding assistants, AI-driven analytics) in a professional setting. You should already be using AI daily and be eager to push its application further.
-
A compulsion to perform root cause analysis.
-
Excellent written and verbal communication skills with the ability to work across a global engineering team.
-
Extremely high personal standards for work quality.
-
Reliable and predictable availability, including the ability to work evenings and weekends as required.
-
Bachelor's degree preferred.