JOB DESCRIPTION / ROLE
We are seeking an experienced AI/ML Engineer with strong expertise in deploying and maintaining machine-learning services at scale. The ideal candidate will be responsible for running and enhancing our GPU-powered AI infrastructure, ensuring high performance as user demand grows, and integrating new models and pipelines.
This is an urgent requirement — candidates must be ready to join within 15 days.
Responsibilities:
• Deploy, monitor and maintain production AI/ML services on GPU servers (RunPod).
• Scale GPU infrastructure and load balancing to meet growing user demand; implement autoscaling and cost controls.
• Optimize inference speed and reliability through batching, quantization, caching and other performance techniques (an illustrative batching sketch follows this list).
• Integrate AI models into existing services as required.
• Enhance our current AI model.
• Build and maintain asynchronous processing and queue systems (Celery/Redis or equivalent; see the queue sketch after this list).
• Implement safety filters and other guardrails around model outputs.
• Ensure zero-touch operations: automatic startup, model pre-loading, health checks, alerts and automated restarts.
• Expose clean API endpoints and work closely with the backend developer for seamless integration with the main application.
• Maintain dashboards for GPU usage, latency and error rates, and configure alerting.
• Document all infrastructure, deployment processes and enhancements clearly for handover and future scaling.
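As context for the batching responsibility above, the following is a minimal sketch of micro-batching queued requests into a single forward pass. It assumes a PyTorch model; the nn.Linear stand-in, the tensor sizes and the predict_batch helper are illustrative placeholders, not the production model or API.

# Minimal batching sketch: model, shapes and helper names are placeholders.
import torch
import torch.nn as nn

model = nn.Linear(128, 10).eval()            # stand-in for the real model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

@torch.inference_mode()
def predict_batch(requests: list[torch.Tensor]) -> list[torch.Tensor]:
    """Serve several queued requests in one forward pass instead of one-by-one."""
    batch = torch.stack(requests).to(device)  # combine pending inputs
    outputs = model(batch)                    # a single GPU call for the whole batch
    return [out.cpu() for out in outputs]     # split results back per request

# Example: ten pending requests answered with a single GPU call.
results = predict_batch([torch.randn(128) for _ in range(10)])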
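Similarly, the asynchronous queue work above could look roughly like the sketch below. This is an assumption-laden illustration only: the Redis URLs, task name and run_model stub are placeholders rather than our actual configuration; the retry settings simply illustrate automated recovery.

# Celery/Redis sketch: broker URLs, task names and run_model are placeholders.
from celery import Celery

app = Celery(
    "inference",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

def run_model(prompt: str) -> str:
    """Placeholder for the real GPU inference call."""
    return prompt.upper()

@app.task(bind=True, max_retries=3, acks_late=True)
def generate(self, prompt: str) -> str:
    """Run one inference request on a GPU worker, retrying on failure."""
    try:
        return run_model(prompt)
    except Exception as exc:
        # Exponential backoff before the task is retried or finally fails.
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)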
Requirements:
• 2+ years of experience deploying machine-learning models into production environments.
• Strong Python skills and experience with PyTorch/TensorFlow or similar frameworks.
• Proven experience with GPU-based inference, performance tuning and cost optimisation.
• Familiarity with containerisation (Docker) and orchestration (Kubernetes or similar).
• Knowledge of queueing systems and asynchronous processing (Celery/Redis, RabbitMQ, etc.).
• Experience integrating multiple model types into APIs is a plus.
• Proficiency with cloud platforms or specialised GPU providers (RunPod, Lambda, AWS, GCP, Azure).
• Understanding of monitoring and observability tools (Prometheus, Grafana, ELK, etc.).
• Version control (Git/GitHub) and CI/CD familiarity.
• Ability to work independently and take full responsibility for the AI/ML infrastructure in production.
Salary:
OMR 400 to 500 per month, inclusive of fixed allowances.
ABOUT THE COMPANY
Hike Tech LLC is an Oman-based technology company focused on the creation and management of innovative digital platforms. We build, launch and manage high-quality digital products for users worldwide.