About
We are at the forefront of developing cutting-edge AI solutions that push the boundaries of machine learning, LLM applications, and agentic AI. Our team builds real-world AI systems and deploys scalable, production-ready solutions across Oracle's enterprise customers.
We are seeking a highly skilled AI/ML Infrastructure Engineer to design, build, and support the systems, tools, and cloud services that power large-scale machine learning and agent-based workflows. In this role, you will leverage your expertise in distributed systems, cloud-native platforms, and software engineering to enable applied scientists to train, evaluate, and deploy state-of-the-art AI models efficiently and reliably.
You will collaborate across infrastructure, platform, and applied science teams to deliver secure, high-performance, and scalable infrastructure that accelerates AI innovation.
Responsibilities
- Design, deploy, and operate infrastructure components—including cloud compute, distributed systems, and data storage—to support AI/ML model training, evaluation, and deployment.
- Build automation pipelines for provisioning, configuring, and monitoring AI/ML infrastructure using Terraform, Docker, Kubernetes, and related tools.
- Optimize resource utilization and performance through cluster tuning, caching, data preprocessing, and system-level performance enhancements.
- Troubleshoot and resolve complex issues in distributed computing environments, ensuring high availability, reliability, and scalability.
- Enforce strong security and compliance standards through access control, vulnerability management, and encryption best practices.
- Partner closely with applied scientists, platform engineers, and cloud infrastructure teams to gather requirements and deliver frictionless ML workflows.
- Produce clear and comprehensive documentation for infrastructure, APIs, designs, troubleshooting, and best practices.
- Stay current with emerging trends in AI infrastructure, agent frameworks, HPC systems, and cloud-native technologies; evaluate and integrate them where appropriate.
Required Qualifications
- 4–7 years of software engineering experience focused on backend systems, distributed systems, or AI/ML applications.
- Hands-on experience with Docker and deploying containerized applications in Kubernetes environments.
- Strong Linux administration skills, including shell scripting, package management, troubleshooting, and performance tuning.
- Proven experience designing or managing infrastructure for AI/ML or HPC workloads, including high availability, failover, and cross-region disaster recovery.
- Expertise with at least one public cloud (OCI, AWS, Azure, GCP) and willingness to specialize in OCI.
- Excellent communication skills, particularly in distributed and asynchronous team environments.
- Demonstrated ability to own problems end-to-end and collaborate effectively with internal teams and external customers.
- Knowledge of DevOps and CI/CD pipelines.
Additional Skills
- Strong research and evaluation capabilities for new technologies and approaches.
- Advanced understanding of applications, server technologies, network routing, and security.
Disclaimer:
Certain US customer or client-facing roles may be required to comply with applicable requirements, such as immunization and occupational health mandates.
Range and benefit information provided in this posting are specific to the stated locations only.
US: Hiring Range in USD from: $79,200 to $178,100 per annum. May be eligible for bonus and equity.
Languages
- English