Cloud Infrastructure Trainee

Salary

₹10 - 20 LPA

Min Experience

0 years

Location

Bangalore

Job Type

Full-time

About the role

Simplismart Infrastructure Trainee Engineer JD

About Simplismart

A bit about our product: Simplismart is an MLOps platform with three major suites:

  • Training suite: Assemble and train any model, including LLMs, vision, audio, tabular, and tree models.
  • Deployment suite: Most companies struggle to make models production-ready. Our proprietary model deployment suite is 6x faster than Hugging Face's enterprise suite and 12x faster than replicate.ai. Users can easily deploy (and auto-scale) models trained on Simplismart (further optimised), import any model from Hugging Face, or bring a raw framework artefact (TensorFlow, PyTorch, ONNX, JAX).
  • Observability suite: Monitor model health, including load, latency, uptime, data drift, and concept drift.

Position Overview:
As a Cloud Engineer, you will contribute to building a highly available, global, multi-cloud PaaS platform using open-source technologies to support Simplismart’s rapid growth. This system encompasses diverse environments (Kubernetes, VMs, bare metal compute) and provides a cohesive and reliable abstraction for running AI workloads. You will be able to work with cutting-edge technologies and solve complex problems.

To be successful in this role, you need to be deeply technical, possess strong communication and collaboration skills, and have experience with infrastructure-as-code. Proficiency with tools like Terraform and Ansible and strong software development fundamentals are essential. You should also have solid systems knowledge and troubleshooting abilities.

Requirements:

  1. 1-2 years of experience writing high-performance, well-tested, production-quality code and working on platform engineering.
  2. Proficiency in at least one backend programming language (Python desired; C++ is a plus).
  3. Demonstrated experience with high-performance or distributed cloud microservices architectures.
  4. Ideally, experience building and operating services globally across multiple cloud providers such as AWS, Azure, or GCP.
  5. A good understanding of low-level operating systems concepts, including multi-threading, memory management, networking and storage, performance, and scale.
  6. Pragmatic, methodical, well-organized, detail-oriented, and self-starting.
  7. Experience with Kubernetes, containerization, Terraform and Ansible.
  8. Experience with PyTorch or TensorFlow is a plus (not required).
  9. Knowledge of GPU programming, NCCL, and CUDA is a plus.

Responsibilities:

  1. Designing the high-level architecture of the MLOps platform from the ground up.
  2. Formalising and standardising diverse GPU-based workloads.
  3. Developing a robust internal system for continuous deployment of various services and modules in diverse environments.
  4. Creating frameworks for reliable, fault-tolerant systems that run mission-critical workloads.

Skills and Attributes:

  1. Deep technical expertise.
  2. Strong communication and collaboration skills.
  3. Experience in infrastructure-as-code (Terraform, Ansible).
  4. Strong software development fundamentals.
  5. Good systems knowledge and troubleshooting abilities.
  6. Ability to work independently and as part of a team.
  7. Proactive and self-motivated.

Why should you join Simplismart?

Well, let's break away from the conventional perks and instead focus on what you WON’T experience here:

  • Legacy System Headaches: You won't have to endlessly grapple with outdated legacy systems that hinder your productivity and creativity.
  • Bossy Culture: At Simplismart, we believe in collaboration and empowerment, not hierarchy. You won't have a boss breathing down your neck; instead, you'll have colleagues who support your growth.
  • Dark Circles: Late nights and overwork are not the norm here. We prioritize work-life balance, ensuring you won't be sporting those tired, dark circles under your eyes.
  • Stagnation: Say goodbye to redundant and stagnant tasks. We thrive on innovation and dynamic challenges that keep you engaged and motivated.

About the company

About us
Fastest inference for generative AI workloads. Simplified orchestration via a declarative language similar to Terraform's. Deploy any open-source model and take advantage of Simplismart’s optimised serving. With a growing volume and variety of workloads, one size does not fit all; use our building blocks to personalise an inference engine for your needs.

API vs In-house

Renting AI via third-party APIs has apparent downsides: data security, rate limits, unreliable performance, and inflated cost. Every company has different inferencing needs; one size does not fit all. Businesses need control to manage their cost-performance trade-offs. Hence the movement towards open source: businesses prefer small, niche models trained on relevant datasets over large generalist models that do not justify the ROI.

Need for MLOps platform

Deploying large models comes with its hurdles: access to compute, model optimisation, scaling infrastructure, CI/CD pipelines, and cost efficiency, all requiring highly skilled machine learning engineers. Just as tooling supported the transitions to cloud and mobile, we need tools to support this shift towards generative AI. MLOps platforms simplify orchestration workflows for in-house deployment cycles. Two off-the-shelf solutions are readily available:

  1. Orchestration platforms with a model-serving layer: these do not offer optimised performance for all models, limiting users' ability to squeeze out performance.
  2. GenAI cloud platforms: GPU brokers that offer no control over cost.

Enterprises need control. Simplismart’s MLOps platform provides them with the building blocks to assemble the inference they need. The fastest inference engine lets businesses unlock and run each model at high speed. The inference engine is optimised at three levels: the model-serving layer, the infrastructure layer, and the model-GPU-chip interaction layer, and is further enhanced with a known model compilation technique.

Skills

Cloud
AWS / Azure / GCP