About
Join our innovative team at Trellix, where you'll be instrumental in building the evaluation and benchmarking infrastructure for our cutting-edge agentic AI platform. This role sits at the intersection of data science and AI engineering: you'll own the science of how we know our AI works, designing evaluation frameworks, curating test datasets, and measuring the performance of AI agents, knowledge graphs, and foundation models across the Trellix security portfolio.
About the Role:
Evaluation Framework Design: Architect and implement rigorous evaluation pipelines for agentic AI systems, including multi-step reasoning agents, retrieval-augmented pipelines, and autonomous SOC workflows.
Model & Agent Benchmarking: Design and execute model evaluations to assess accuracy, reliability, latency, and safety across LLMs and agentic systems, including custom benchmarks tailored to cybersecurity use cases.
Knowledge Graph Evaluation: Develop methods to validate knowledge graph quality, coverage, and correctness, including entity resolution, relationship accuracy, and graph completeness metrics.
Dataset Engineering: Build, curate, and maintain high-quality synthetic and real-world datasets for training, fine-tuning, and testing models and agents, including adversarial and edge-case datasets.
Agent Testing: Design structured test harnesses for agentic systems covering tool use, multi-agent coordination, hallucination rates, decision quality, and task completion fidelity.
Metrics & Observability: Define and instrument evaluation metrics, surface results through dashboards, and translate findings into actionable insights for engineering and product teams.
Research & Innovation: Stay current with the latest evaluation methodologies (e.g., LLM-as-judge, RAGAS, MT-Bench, custom evals) and adapt them to Trellix's security domain.
Cross-Functional Collaboration: Partner closely with AI engineers, product managers, and security researchers to align evaluation standards with real-world performance requirements.
About You:
Experience:
5+ years of professional experience in data science, ML engineering, or AI research, with hands-on work in evaluation or benchmarking of AI/ML systems
Strong proficiency in Python (pandas, NumPy, scikit-learn)
Statistical analysis and experimental design
Experience building and managing datasets for ML training and evaluation
Familiarity with annotation workflows and data quality frameworks
AI/LLM Evaluation:
Hands-on experience evaluating Large Language Models (LLMs)
Familiarity with evaluation frameworks such as RAGAS, HELM, EleutherAI LM Eval, or equivalent
Experience designing LLM-as-judge pipelines or preference evaluation workflows
Understanding of hallucination detection, groundedness, and faithfulness metrics
Agentic Systems:
Experience testing or evaluating agentic AI systems
Familiarity with tool-use, ReAct-style, Deep Agents, and multi-agent coordination patterns
Ability to define pass/fail criteria for complex, multi-step agent tasks
Knowledge Graphs:
Experience working with knowledge graphs (NebulaGraph, Neo4j, or equivalent)
Ability to evaluate graph quality, ontology coverage, and traversal correctness
Familiarity with embedding-based retrieval and vector databases (Qdrant preferred)
Data Engineering & Infrastructure:
Experience with synthetic data generation for model and agent testing
Proficiency with vector databases and embedding pipelines
Familiarity with MLflow, Weights & Biases, Langfuse, or similar experiment tracking tools
AWS experience preferred
Domain Knowledge:
Familiarity with the cybersecurity domain strongly preferred
Understanding of SOC workflows, threat detection, and incident response a plus
Experience evaluating AI systems in high-stakes or regulated environments a plus
Soft Skills:
Strong analytical thinking and ability to translate ambiguous quality questions into measurable metrics
Excellent written communication — able to document evaluation methodologies and present findings to technical and non-technical stakeholders
Collaborative mindset with a bias toward rigor and reproducibility
Company Benefits and Perks:
We believe that the best solutions are developed by teams who embrace each other's unique experiences, skills, and abilities. We work hard to create a dynamic workforce where we encourage everyone to bring their authentic selves to work every day. We offer a variety of social programs, flexible work hours and family-friendly benefits to all of our employees.
Retirement Plans
Medical, Dental and Vision Coverage
Paid Time Off
Paid Parental Leave
Support for Community Involvement
We're serious about our commitment to a workplace where everyone can thrive and contribute to our industry-leading products and customer support, which is why we prohibit discrimination and harassment based on race, color, religion, gender, national origin, age, disability, veteran status, marital status, pregnancy, gender expression or identity, sexual orientation or any other legally protected status.
Our Commitment to You:
At Trellix, we are committed to creating a safe and trustworthy experience for our customers, employees, and candidates. Please be aware that fraudulent recruiting activity can occur through fake job postings or impersonated communications. Trellix conducts interviews through professional channels only and does not use text messages, instant messaging, or group chats for interviews. We will never request sensitive personal information, such as your date of birth, Social Security number, or national ID number, during the interview process. Trellix also does not require candidates to pay fees, purchase products or services, or process payments of any kind as part of the recruiting or hiring process. And Trellix will never keep any original work authorization documents that we may be required to review during the hiring process.
Language skills
- English
Notice to users
This job posting comes from a TieTalent partner platform. Click "Apply now" to submit your application directly on their site.