GPU Optimization Engineer (Hiring Immediately)

techire ai • San Francisco, CA • Posted June 19, 2026

Position Overview

GPU Optimisation Engineer: Real-Time Inference

Want to push GPU performance to its limits, not in theory but in production systems handling real-time speech and multimodal workloads? This team is building low-latency AI systems where milliseconds actually matter. The target isn’t “faster than baseline.” It’s sub-50ms time-to-first-token at 100+ concurrent requests on a single H100, while maintaining model quality.

They’re hiring a GPU Optimisation Engineer who understands GPUs at an architectural level: someone who knows where performance is really lost, whether in the memory hierarchy, kernel launch overhead, occupancy limits, scheduling inefficiencies, KV cache behaviour, or attention paths.

The work sits close to the metal, inside inference execution. It is not general infra, and it is not model research. You’ll operate across the kernel and runtime layers, profiling large-scale speech and multimodal models end-to-end and removing bottlenecks wherever they appear.

What you’ll work on

Profiling GPU bottleneck...