About

We are seeking a Data Engineer to build and scale the data infrastructure powering our Agentic AI products. You will be responsible for the "Ingestion-to-Insight" pipeline that allows autonomous agents to access, search, and reason over vast amounts of proprietary and public data.

Your role is critical: you will design the RAG (Retrieval-Augmented Generation) architectures and data pipelines that ensure our agents have the right context at the right time to make accurate decisions.
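For candidates unfamiliar with the pattern, here is a minimal sketch of the retrieval step at the heart of a RAG pipeline. Everything in it is an illustrative assumption, not our production code: the bag-of-words embed() function is a toy stand-in for a real embedding model, and the prompt template is invented for the example.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy stand-in: hash words into a fixed-size bag-of-words vector.
    # A real pipeline would call an actual embedding model here.
    vec = np.zeros(256)
    for word in text.lower().split():
        vec[hash(word) % 256] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Vectors from embed() are unit length, so the dot product suffices.
    return float(np.dot(a, b))

def retrieve_context(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    # Rank stored chunks by similarity to the query; keep the best top_k.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(q, embed(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, chunks: list[str]) -> str:
    # Stitch the retrieved chunks into the LLM's context window.
    context = "\n---\n".join(retrieve_context(query, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

In production the ranking would be delegated to a vector database rather than computed in a loop; the point of the sketch is the shape of the flow, not the scoring.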
Key Responsibilities

- AI-Ready Data Pipelines: Design and implement scalable ETL/ELT pipelines that process both structured (SQL, logs) and unstructured (PDFs, emails, docs) data specifically for LLM consumption.
- Vector Database Management: Architect and optimize Vector Databases (e.g., Pinecone, Weaviate, Milvus, or Qdrant) to ensure high-speed, relevant similarity searches for agentic retrieval.
- Chunking & Embedding Strategies: Collaborate with AI Engineers to optimize data chunking strategies and embedding models to improve the "recall" and "precision" of the agent's knowledge retrieval (see the sketch after this list).
- Data Quality for AI: Develop automated data-cleaning workflows to remove noise, PII (Personally Identifiable Information), and toxicity from training and context datasets.
- Metadata Engineering: Enrich raw data with advanced metadata tagging to help agents filter and prioritize information during multi-step reasoning tasks.
- Real-time Data Streaming: Build low-latency data streams (using Kafka or Flink) to provide agents with "fresh" data, enabling them to act on real-time market or operational changes.
- Evaluation Frameworks: Construct "Gold Datasets" and versioned data snapshots to help the team benchmark agent performance over time.
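As referenced above, the sketch below shows the kind of chunk/clean/tag step these responsibilities describe. The chunk size, overlap, regex patterns, and record schema are illustrative assumptions only; a real pipeline would use vetted PII detection, not two regexes.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def scrub_pii(text: str) -> str:
    # Redact obvious PII before the text can reach an LLM context.
    text = EMAIL_RE.sub("[EMAIL]", text)
    return SSN_RE.sub("[SSN]", text)

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Fixed-size character chunks with overlap, so an idea that spans a
    # boundary still appears intact in at least one chunk.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def to_records(doc_id: str, source: str, text: str) -> list[dict]:
    # Attach metadata so agents can filter by source during retrieval;
    # the schema here is a hypothetical example, not a fixed contract.
    clean = scrub_pii(text)
    return [
        {"id": f"{doc_id}-{n}", "text": c, "source": source, "chunk_no": n}
        for n, c in enumerate(chunk(clean))
    ]
```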
Construct "Gold Datasets" and versioned data snapshots to help the team benchmark agent performance over time. Required Skills & Qualifications
- Experience: 4+ years in Data Engineering, with at least 1 year focused on data for LLMs or AI/ML applications.
- Python Mastery: Deep expertise in Python (Pandas, Pydantic, FastAPI) for data manipulation and API integration.
- Data Tooling: Strong experience with modern data stack tools (e.g., dbt, Airflow, Dagster, Snowflake, or Databricks).
- Vector Expertise: Hands-on experience with at least one major Vector Database and knowledge of similarity search techniques (HNSW indexing, cosine similarity).
- Search Knowledge: Familiarity with hybrid search techniques combining semantic search with traditional keyword search (Elasticsearch/BM25); see the sketch after this list.
- Cloud Infrastructure: Proficiency in managing data workloads on AWS, Azure, or GCP.
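A compact illustration of the hybrid-search idea named above: a keyword ranking (a trivial term-overlap score standing in for Elasticsearch/BM25) fused with a semantic ranking via Reciprocal Rank Fusion. The scoring is deliberately simplified, and the documents and rankings are invented for the example.

```python
def keyword_rank(query: str, docs: list[str]) -> list[int]:
    # Rank document indices by terms shared with the query (BM25 stand-in).
    terms = set(query.lower().split())
    scores = [len(terms & set(d.lower().split())) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    # Reciprocal Rank Fusion: a document ranked near the top of either
    # list gets a large score; k damps the influence of lower ranks.
    scores: dict[int, float] = {}
    for ranking in rankings:
        for pos, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + pos + 1)
    return sorted(scores, key=scores.get, reverse=True)

docs = ["kafka consumer lag is rising", "tuning the vector index", "quarterly revenue report"]
keyword = keyword_rank("kafka lag", docs)  # keyword view of relevance
semantic = [1, 0, 2]                       # stand-in for a vector-DB ranking
print(rrf([keyword, semantic]))            # fused order, best first
```

RRF is attractive here because it fuses rank positions rather than raw scores, so the keyword and semantic scorers never need to be calibrated against each other.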
Preferred Qualifications
- Experience with LlamaIndex or LangChain for data ingestion.
- Knowledge of Graph Databases (e.g., Neo4j) to help agents understand complex relationships between data points; a toy graph-traversal sketch follows this list.
- Familiarity with "Data-Centric AI" principles: prioritizing data quality over model size.
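To make the graph point concrete, here is a toy in-memory triple store with a breadth-first walk, the kind of multi-hop lookup an agent needs for relationship questions. It stands in for what a real graph database such as Neo4j provides; the entities and relations are invented for the example.

```python
from collections import deque

# (subject, relation, object) triples an ingestion pipeline might extract
TRIPLES = [
    ("Acme Corp", "supplies", "Widget Co"),
    ("Widget Co", "owned_by", "Holdings LLC"),
    ("Holdings LLC", "audited_by", "Ledger LLP"),
]

GRAPH: dict[str, list[tuple[str, str]]] = {}
for subj, rel, obj in TRIPLES:
    GRAPH.setdefault(subj, []).append((rel, obj))

def related(start: str, max_hops: int = 2) -> list[tuple[str, str, int]]:
    # Breadth-first walk up to max_hops: surfaces indirect relationships
    # (e.g., who ultimately owns a supplier) that flat retrieval misses.
    seen, out = {start}, []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for rel, neighbor in GRAPH.get(node, []):
            out.append((rel, neighbor, depth + 1))
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return out

print(related("Acme Corp"))  # [('supplies', 'Widget Co', 1), ('owned_by', 'Holdings LLC', 2)]
```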
Compensation, Benefits and Duration
Minimum Compensation: USD 38,000
Maximum Compensation: USD 133,000

Compensation is based on the actual experience and qualifications of the candidate; the above is a reasonable, good-faith estimate for the role. Medical, vision, and dental benefits, a 401k retirement plan, variable pay/incentives, paid time off, and paid holidays are available for full-time employees. This position is not available to independent contractors. No applications will be considered if received more than 120 days after the date of this post.
Languages
- English