Website: yoitconsulting.com
Job details:
AI Research Engineer - Audio & Vision AI
Location: South Asia (Remote)
Job Type: Full-Time | Permanent
Experience: 3-7 Years
Salary: Up to 35 LPA
Must-Haves
- Bachelor's or Master's degree; candidates from IITs are preferred
- 3+ years of hands-on experience in Machine Learning / Deep Learning (PyTorch, TensorFlow)
- Strong mathematical foundation in signal processing, time-series analysis, and statistics
- Proven experience with audio or visual data — music, speech, motion, or similar perceptual domains
- Familiarity with MIR (Music Information Retrieval) or Computer Vision tasks such as:
  - Pitch detection
  - Beat tracking
  - Timbre classification
  - Speech analysis
  - Pose estimation
  - Gesture recognition
  - Motion tracking
- Experience with model optimization and deployment (TorchScript, ONNX, TensorRT)
- Strong Python skills and familiarity with:
  - NumPy
  - pandas
  - Librosa
  - Essentia
  - OpenCV
  - MediaPipe
- Stable career history, with at least 3 years at a single organization
About The Role
As an AI Research Engineer, you’ll design and develop machine learning systems capable of understanding and evaluating human performance — starting with sound (music, speech) and later expanding into vision (movement, gestures).
You’ll collaborate with Audio/Vision Engineers, Backend Developers, and Subject Matter Experts (SMEs) such as musicians, dancers, and coaches to transform expert intuition into measurable, scalable AI-driven feedback systems.
This role sits at the intersection of AI, creativity, education, and real-time human interaction.
Key Responsibilities
- Research, design, and train AI models for real-time performance evaluation across domains
- Implement and optimize deep learning architectures including CNNs, RNNs, and Transformers
- Build scalable pipelines for clean, real-time feature extraction from audio and vision data
- Collaborate with SMEs to define “performance quality” metrics and label datasets
- Develop evaluation frameworks to compare AI predictions with expert feedback
- Experiment with cross-modal fusion (audio + vision) for synchronized analysis
- Optimize models for low-latency inference on web/mobile devices using ONNX, TensorRT, and TensorFlow Lite
- Document research findings, prototype outcomes, and contribute to internal knowledge sharing
Nice to Have
- Research or published work in Audio AI, Multimodal AI, or Performance Evaluation
- Experience building real-time ML inference systems
- Background in music, performing arts, or educational AI
- Familiarity with AWS, GCP, MLflow, or DVC
- Strong curiosity and creativity in experimenting with human-centered AI
What You’ll Achieve
- Shape the foundation of an AI Learning Intelligence Platform
- Transform expert artistic insights into measurable AI systems
- Build models that can listen, see, and guide learners globally
- Contribute to AI systems spanning music, dance, public speaking, and chess
Why Join Us
- Work at the intersection of AI, Art, and Education
- Collaborate with passionate technologists and subject matter experts
- Creative freedom to explore cutting-edge AI models
- Build products with meaningful global impact
- Competitive compensation, equity opportunities, and strong career visibility
Click Apply to learn more.