Data Engineer (OCR & Data Pipelines, Contract)
Intelance
- London, England, United Kingdom
About
We are looking for a Data Engineer (OCR & Pipelines) who can turn messy PDFs and documents into clean, reliable, auditable data flows for ML and downstream systems. This is a contract / freelance role (2–3 days/week) working closely with our AI Solution Architect, Lead ML Engineer and Integration Engineer.
Tasks
Design and implement the end-to-end data pipeline for the project:
Ingest PDF/Word reports from secure storage.
Run OCR / text extraction and layout parsing.
Normalise structure and validate the data.
Store outputs in a form ready for ML and integration.
Evaluate and configure OCR / document AI services (e.g. Azure Form Recognizer or similar) and wrap them in robust retry‑safe cost‑aware scripts/services.
Define and implement data contracts and schemas between ingestion, ML and integration components (JSON/Parquet/relational as appropriate).
Build quality checks and validation rules (field presence, format range checks, duplicate detection, basic anomaly checks).
Implement logging, monitoring and lineage so every processed document can be traced from source → OCR → structured output → model input.
Work with the ML Engineer to ensure the pipeline exposes exactly the features and metadata needed for training, evaluation and explainability.
Collaborate with the Integration Engineer to deliver clean batch or streaming feeds into the client’s assessment system (API, CSV exports or SFTP drop‑zone).
Follow good security and privacy practices in all pipelines: encryption, access control, least privilege and redaction where needed.
Contribute to infrastructure decisions (storage layout, job orchestration, simple CI/CD for data jobs).
Document the pipeline clearly: architecture diagrams, table/field definitions, data dictionaries, operational runbooks.
Requirements
Must-have
3–5 years of hands‑on Data Engineering experience.
Strong Python skills including building and packaging data processing scripts or services.
Practical experience with OCR / document processing (e.g. Tesseract, Azure Form Recognizer, AWS Textract, Google Document AI or equivalent).
Solid experience building ETL / ELT pipelines on a major cloud platform (ideally Azure but AWS/GCP is fine if you’re comfortable switching).
Good knowledge of data modelling and file formats (JSON, CSV, Parquet, relational schemas).
Experience implementing data quality checks, logging and monitoring for pipelines.
Understanding of security and privacy basics: encryption at rest/in transit, access control, secure handling of potentially sensitive data.
Comfortable working in a small, senior, remote team; able to take a loosely defined problem and design a clean, maintainable solution.
Available for 2–3 days per week on a contract basis, working largely remotely in UK or nearby European time zones.
Nice-to-have
Experience in healthcare, life sciences, diagnostics or other regulated environments.
Familiarity with Azure Data Factory, Azure Functions, Databricks or similar orchestration/compute tools.
Knowledge of basic MLOps concepts (feature stores, model input/output formats).
Experience with SFTP‑based exchanges and batch integrations with legacy systems.
Benefits
Core impact role: you own the pipeline that makes the entire AI solution possible; without you, nothing moves.
Meaningful domain: your work supports external quality assessment in human genetic testing for labs worldwide.
Lean senior team: work alongside experienced architects and ML engineers; minimal bureaucracy and direct access to decision‑makers.
Remote‑first and flexible: work from anywhere compatible with UK hours, 2–3 days/week.
Contract / freelance: competitive day rate with potential extension into further phases and additional schemes if the pilot is successful.
Opportunity to build reusable data pipeline components that Intelance will deploy across future AI engagements.
We review every application personally. If there’s a good match we will invite you to a short call to walk through the project expectations and next steps.
Key Skills: Apache Hive, S3, Hadoop, Redshift, Spark, AWS, Apache Pig, NoSQL, Big Data, Data Warehouse, Kafka, Scala
Employment Type: Freelance
Experience: years
Vacancy: 1
Daily Salary: 450 - 750
Language Skills
- English
Note for Users
This job posting comes from a partner platform of TieTalent. Click “Apply Now” to submit your application directly on their website.