Research Scientist – VLM Generalist

Stability AI
Remote
Posted 27 February 2026

Job Description

<p><strong>Location:</strong> Remote</p>
<p><strong>About the Role</strong></p>
<p>We’re looking for a Research Scientist with deep expertise in <strong>training and fine-tuning large Vision-Language and Language Models (VLMs / LLMs)</strong> for downstream multimodal tasks. You’ll help push the next frontier of models that reason across <strong>vision, language, and 3D</strong>, bridging research breakthroughs with scalable engineering.</p>
<p><strong>What You’ll Do</strong></p>
<ul>
<li>Design and fine-tune large-scale VLMs / LLMs — and hybrid architectures — for tasks such as visual reasoning, retrieval, 3D understanding, and embodied interaction.</li>
<li>Build robust, efficient training and evaluation pipelines (data curation, distributed training, mixed precision, scalable fine-tuning).</li>
<li>Conduct in-depth analysis of model performance: ablations, bias / robustness checks, and generalisation studies.</li>
<li>Collaborate across research, engineering, and 3D / graphics teams to bring models from prototype to production.</li>
<li>Publish impactful research and help establish best practices for multimodal model adaptation.</li>
</ul>
<p><strong>What You Bring</strong></p>
<ul>
<li>PhD (or equivalent experience) in Machine Learning, Computer Vision, NLP, Robotics, or Computer Graphics.</li>
<li>Proven track record in <strong>fine-tuning or training large-scale VLMs / LLMs</strong> for real-world downstream tasks.</li>
<li>Strong <strong>engineering mindset</strong> — you can design, debug, and scale training systems end-to-end.</li>
<li>Deep understanding of <strong>multimodal alignment and representation learning</strong> (vision–language fusion, CLIP-style pre-training, retrieval-augmented generation).</li>
<li>Familiarity with recent trends, including <strong>video-language and long-context VLMs</strong>, <strong>spatio-temporal grounding</strong>, <strong>agentic multimodal reasoning</strong>, and <strong>Mixture-of-Experts (MoE)</strong> fine-tuning.</li>
<li>Awareness of <strong>3D-aware multimodal models</strong> — using NeRFs, Gaussian splatting, or differentiable renderers for grounded reasoning and 3D scene understanding.</li>
<li>Hands-on experience with PyTorch / DeepSpeed / Ray and distributed or mixed-precision training.</li>
<li>Excellent communication skills and a collaborative mindset.</li>
</ul>
<p><strong>Bonus / Preferred</strong></p>
<ul>
<li>Experience integrating <strong>3D and graphics pipelines</strong> into training workflows (e.g., mesh or point-cloud encoding, differentiable rendering, 3D VLMs).</li>
<li>Research or implementation experience with <strong>vision-language-action models</strong>, <strong>world-model-style architectures</strong>, or <strong>multimodal agents</strong> that perceive and act.</li>
<li>Familiarity with <strong>efficient adaptation methods</strong> — LoRA, adapters, QLoRA, parameter-efficient fine-tuning, and distillation for edge deployment.</li>
<li>Knowledge of <strong>video and 4D generation</strong> trends, <strong>latent diffusion / rectified flow</strong> methods, or <strong>multimodal retrieval and reasoning pipelines</strong>.</li>
<li>Background in <strong>GPU optimisation, quantisation, or model compression</strong> for real-time inference.</li>
<li>Open-source or publication track record in top-tier ML / CV / NLP venues.</li>
</ul>
<p><strong>Equal Employment Opportunity:</strong></p>
<p>We are an equal opportunity employer and do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, disability, or other legally protected statuses.</p>