HPC Observability Engineer
EIT Professionals Corp
- New York, New York, United States
About
Location: Remote Contract
Description:
The client runs Grafana and InfluxDB on Kubernetes (K8s) on-premises. Telegraf ingests data from a GPU HPC cluster into InfluxDB. This engineer will help collect and visualize data for the “Terra” platform.
The HPC Observability Engineer should have experience in:
- Setting up and maintaining Grafana dashboards for HPC environments
- Creating drill-down dashboards for servers, including metrics like memory, network, and CPU utilization
- Exploring and utilizing out-of-the-box metrics from InfluxDB
- Writing Python scripts, with examples, for data ingestion into InfluxDB
- Developing a proof of concept with a simple Python script to monitor load
- Ingesting InfiniBand packet data
- Monitoring LSF jobs in various states
- Visualizing server-specific and cluster-wide metrics in Grafana
- Optional: Integrating third-party plugins like DDN’s Lustre, Mellanox fabric, etc.
Qualifications and Skills:
- B.Tech, MS, or PhD in Computer Science or a related field
- 5-8 years of experience with Grafana, InfluxDB, and Telegraf
- Experience in Python and Bash scripting is a plus
- Knowledge of Docker and Google Cloud Platform is advantageous
- HPC operations experience is beneficial
- Strong communication skills and ability to work independently
- Proficiency in requirements analysis and automated testing
- Ability to write efficient, secure, and well-documented Python code
- Experience with Git and pipeline development
- Awareness of modern security and development practices
Responsibilities:
- Develop and leverage Grafana dashboards and Telegraf configurations
- Create dashboards for server and cluster metrics
- Develop Python scripts for data ingestion and documentation
- Visualize non-native resources in Grafana
- Optional: Integrate third-party plugins
- Maintain high-quality code and documentation
- Collaborate with teams to troubleshoot and optimize pipelines
Desired Skills:
- Python (good to have)
- Bash scripting (good to have)
- Docker (must)
- HPC operations and LSF (good to have)
- Experience with DDN Lustre, Mellanox fabric (good to have)
- Google Cloud Platform (good to have)
- Knowledge of Git (must)
Seniority level:
Mid-Senior level
Employment type:
Contract
Job function:
Engineering and Information Technology
Industries:
IT Services and IT Consulting
This job is active and accepting applications.
Language skills
- English