🎨 About OpenArt
OpenArt is an AI Storytelling and Visual Creation Platform used by millions worldwide. We’re building the next generation of creative tools powered by cutting-edge AI, enabling anyone to create videos, visuals, characters, and stories with unprecedented speed and imagination.
We believe the future of creativity is AI-native, and we’re shaping that future.
🚀 Why Join OpenArt
- Small team, massive surface area: senior engineers own real systems, not slices.
- Ship at real scale: your work reaches millions of users, fast.
- Founder-led engineering culture: both founders are technical and deeply involved in product and architecture.
- AI-native product: you’ll design how cutting-edge AI models are exposed as real user experiences.
- High ownership, low process: we value judgment, clarity, and speed over bureaucracy.
- 7-10X revenue growth over the past 2 years: you’ll play a critical role in helping the company scale to the next stage.
🎯 About the Role
We’re looking for a Senior Platform & Reliability Engineer to help design, scale, and improve the reliability of our infrastructure, from architectural decisions to hands-on implementation, observability, and cost optimization.
This is not a traditional ops or DevOps role. You’ll work across cloud infrastructure, distributed systems, backend services, and developer tooling, making pragmatic decisions that balance product velocity, system reliability, and cost efficiency in a fast-moving, AI-native environment.
You’ll partner closely with product engineers to evolve the platform that powers OpenArt. That means contributing to key decisions on infrastructure architecture, improving multi-provider AI reliability, and helping us scale systems to millions of users, all while raising the overall engineering bar.
🛠 What You’ll Do
- Define and operationalize SLOs/SLIs across critical user journeys (generation, editing, payments/credits, uploads), and use them to guide prioritization and tradeoffs.
- Participate in an on-call rotation and improve incident response (alert quality, runbooks, escalation paths), including leading blameless postmortems and driving follow-through on action items.
- Improve system resilience at external boundaries (AI providers, storage, etc.), including timeouts, retries, circuit breakers, and fallback strategies.
- Build and maintain end-to-end observability (logs, metrics, traces, dashboards) so engineers can quickly understand “what broke” and “why.”
- Strengthen deploy safety through CI/CD improvements, automated rollbacks, canary releases, and feature flag patterns.