Senior Data Engineer - Cloud Migration & Platform Architecture (GCP/AWS)
Logic Hire Solutions LTD
- United States
About
Location: Remote / Hybrid / Specific Location
Experience Level: 5+ Years (Mid-Senior to Senior)
Position Overview
We are undergoing a fundamental shift in our data infrastructure, moving away from legacy on-premises Cloudera (CDH/HDP) environments toward a modern, hybrid-cloud data mesh architecture spanning GCP and AWS. We need a Senior Data Engineer who has built these platforms from the ground up and led large-scale migration projects re-platforming Hive/Impala workloads and HDFS datasets onto cloud-native storage and compute (Snowflake/Databricks). Responsibilities include writing high-performance Python code, optimizing Spark jobs that process petabytes of data, and keeping our real-time streaming infrastructure (Kafka/Pub/Sub) rock-solid.
Detailed Tech Stack & Environment
Languages
Python 3.9+ (Advanced: Decorators, Generators, Multiprocessing, Pydantic, Poetry)
PySpark
SQL (ANSI & BigQuery Dialect)
Scala (Maintenance only)
Compute & Processing
Apache Spark 3.x (DataFrames, Structured Streaming)
Databricks (Delta Live Tables, Photon, Unity Catalog)
GCP Dataproc (Serverless & Cluster Mode)
AWS EMR (on EC2 & EKS)
Streaming & Messaging
Apache Kafka (Schema Registry, Avro)
GCP Pub/Sub
AWS Kinesis Data Streams
Debezium (CDC)
Storage & Warehouse
Snowflake (Snowpipe Streaming, Streams & Tasks, Time Travel)
GCP BigQuery (BI Engine, Materialized Views)
AWS S3
GCP Cloud Storage
Delta Lake / Apache Iceberg
Orchestration & Ops
Apache Airflow 2.x (GCP Cloud Composer, AWS MWAA)
dbt Core/Cloud
Terraform (IaC)
Docker
GitHub Actions / Jenkins
Legacy (Migration Source)
Cloudera CDH/HDP
Apache Hive
Apache Impala
Oozie
HDFS
Detailed Must-Have Responsibilities & Technical Expectations
Core Software Engineering in Python
5+ years of professional software engineering experience.
Code quality focus: enforce unit testing (PyTest), integration testing, and CI/CD for all Spark jobs.
Debug JVM GC issues in Spark UDFs and refactor them to native Spark SQL or vectorized Pandas UDFs for 10× performance gains (see the sketch after this list).
Design reusable Python packages for data ingestion, validation (Great Expectations), and logging across GCP and AWS environments.
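To make the UDF refactor above concrete, here is a minimal sketch of the kind of change we mean; the session, column names, and tax figure are illustrative, not from a real pipeline:

```python
# Minimal sketch (illustrative only): replacing a row-at-a-time Python UDF
# with a vectorized Pandas UDF. All names here are hypothetical.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("udf-refactor-sketch").getOrCreate()
df = spark.createDataFrame([(1, 100.0), (2, 250.0)], ["id", "amount_usd"])

# Before: a plain Python UDF serializes every row across the JVM <-> Python
# boundary one value at a time, which is where GC pressure tends to show up.
@udf(returnType=DoubleType())
def add_tax_slow(amount):
    return amount * 1.08

# After: a Pandas UDF moves whole Arrow batches across the boundary and
# applies the computation as a single vectorized pandas operation.
@pandas_udf(DoubleType())
def add_tax_fast(amount: pd.Series) -> pd.Series:
    return amount * 1.08

# Better still, when the logic allows it: a native column expression that
# never leaves the JVM, e.g. F.col("amount_usd") * 1.08.
df.withColumn("amount_with_tax", add_tax_fast("amount_usd")).show()
```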
Large‑Scale Migration Expertise
Proven, hands‑on experience migrating from Cloudera (CDH/HDP) to Snowflake or Databricks.
Legacy decommissioning: analyze Hive Metastore schemas and Impala query patterns to design a migration strategy to BigQuery or Snowflake.
Data transfer: architect and execute transfer of hundreds of TBs from HDFS to GCS/S3 using DistCp, incremental syncs, and converting Hive table formats to optimal cloud layouts (partitioned Parquet/Delta).
Workflow refactoring: reverse-engineer Oozie workflows and rebuild them as idempotent DAGs in Apache Airflow (see the sketch after this list).
Cloud‑native feature adoption: replace batch INSERT OVERWRITE jobs with Snowpipe Streaming or Databricks Auto Loader to reduce latency from hours to seconds.
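A minimal sketch of what "idempotent" means in the Airflow item above, assuming Airflow 2.4+ and its TaskFlow API; the DAG, task, and table names are hypothetical:

```python
# Sketch (assumptions: Airflow 2.4+, TaskFlow API; names are illustrative).
from datetime import datetime

from airflow.decorators import dag, task

@dag(
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=["cloudera-migration"],
)
def daily_sales_rollup():
    @task
    def rebuild_partition(ds=None):
        # Airflow injects the logical date (`ds`); keying every write to it
        # makes the task idempotent: a retry or backfill overwrites exactly
        # one partition instead of appending duplicates.
        print(f"Rebuilding partition dt={ds} in warehouse.sales_rollup")
        return ds

    @task
    def validate(partition):
        print(f"Row-count and null checks on dt={partition}")

    validate(rebuild_partition())

daily_sales_rollup()
```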
Greenfield Platform Architecture & Integration
Build secure reverse‑ETL pipelines from Snowflake/BigQuery to operational systems (Salesforce, HubSpot, Postgres) using Apache Beam (Dataflow) or AWS Lambda with custom retry logic.
Design a Change Data Capture pipeline using Kafka Connect (Debezium) → Pub/Sub → Dataflow → BigQuery, ensuring exactly‑once semantics and schema evolution handling.
Build a serverless ingestion framework on GCP Cloud Functions (Python) that pulls data from 3rd‑party REST APIs, handles pagination/authentication, and lands raw JSON into Cloud Storage partitioned by date.
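A minimal sketch of the ingestion pattern in the last item, assuming an HTTP-triggered Cloud Function, a hypothetical page-token REST API, and the google-cloud-storage client; the endpoint, bucket, and auth handling are all illustrative:

```python
# Sketch (assumptions: hypothetical endpoint/bucket; token handling omitted).
import datetime
import json

import requests
from google.cloud import storage

API_URL = "https://api.example.com/v1/orders"  # hypothetical endpoint
BUCKET = "raw-landing-zone"                    # hypothetical bucket

def ingest(request):
    """HTTP-triggered entry point: pull every page, land raw JSON by date."""
    session = requests.Session()
    # Illustrative auth; in practice the token would come from Secret Manager.
    session.headers["Authorization"] = "Bearer <token>"

    records, page_token = [], None
    while True:
        params = {"page_token": page_token} if page_token else {}
        resp = session.get(API_URL, params=params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        records.extend(payload["results"])
        page_token = payload.get("next_page_token")
        if not page_token:
            break

    # Land newline-delimited raw JSON into a date-partitioned prefix.
    blob_path = f"orders/dt={datetime.date.today().isoformat()}/part-000.json"
    storage.Client().bucket(BUCKET).blob(blob_path).upload_from_string(
        "\n".join(json.dumps(r) for r in records),
        content_type="application/json",
    )
    return f"landed {len(records)} records at gs://{BUCKET}/{blob_path}"
```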
Data Modeling & Distributed Systems Expertise
Explain trade‑offs between Kimball Star Schema, Data Vault 2.0, and One Big Table; implement appropriate approaches in dbt.
Spark tuning: diagnose data skew (via key salting), optimize shuffle partitions, and manage broadcast joins to avoid executor OOMs on Dataproc and EMR (see the sketch after this list).
Streaming architecture: tune Kafka retention policies and Pub/Sub subscription backlogs for data durability during consumer downtime.
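For the skew item above, a minimal sketch of key salting; the hot key and bucket count are illustrative:

```python
# Sketch (illustrative names): salting a skewed join key so one hot key
# no longer lands on a single executor.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skew-salting-sketch").getOrCreate()
SALT_BUCKETS = 16

facts = spark.range(1_000_000).withColumn("customer_id", F.lit("HOT_KEY"))
dims = spark.createDataFrame([("HOT_KEY", "Acme Corp")], ["customer_id", "name"])

# Fact side: append a random salt so the hot key spreads over 16 partitions.
salted_facts = facts.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Dimension side: explode each row across every salt value so each salted
# fact partition still finds its match.
salted_dims = dims.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))
)

joined = salted_facts.join(salted_dims, ["customer_id", "salt"]).drop("salt")
joined.groupBy("name").count().show()
```

On Spark 3.x, adaptive query execution (`spark.sql.adaptive.skewJoin.enabled`) handles many skewed joins automatically; manual salting like this remains useful for the cases it misses.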
Cloud Platform Proficiency (GCP Focus)
BigQuery: enforce cost governance via table partitioning and clustering; use BigQuery Omni for cross‑cloud analytics on AWS data if required.
Dataflow: write Apache Beam pipelines in Python that handle late-arriving data with windowing and watermarks (see the sketch after this list).
Networking: understand VPC Service Controls and Private Service Connect to keep data off the public internet.
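A minimal sketch of the windowing-and-watermarks expectation above, assuming a hypothetical Pub/Sub topic of JSON events carrying an `event_ts` field; runner and project flags are omitted:

```python
# Sketch (assumptions: hypothetical topic and event schema; flags omitted).
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import trigger, window

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-proj/topics/events")
        | "Parse" >> beam.Map(json.loads)
        # Stamp each element with its event time so the watermark tracks
        # the data itself, not arrival time.
        | "EventTime" >> beam.Map(lambda e: window.TimestampedValue(e, e["event_ts"]))
        | "KeyByUser" >> beam.Map(lambda e: (e["user_id"], 1))
        | "Window" >> beam.WindowInto(
            window.FixedWindows(60),  # 1-minute event-time windows
            allowed_lateness=600,     # accept data up to 10 minutes late
            trigger=trigger.AfterWatermark(late=trigger.AfterCount(1)),
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
        )
        | "CountPerUser" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```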
Detailed Nice-to-Have Qualifications
Databricks & Unity Catalog: experience implementing fine‑grained access control and lineage using Unity Catalog in a multi‑workspace environment.
NoSQL & Graph:
Redis: experience implementing Redis as a distributed cache for lookup tables in Spark Streaming jobs to reduce latency on joins against BigQuery (see the sketch after this list).
Neo4j: knowledge of building identity resolution graphs or supply‑chain dependencies using Cypher queries.
Infrastructure as Code (Terraform): ability to write Terraform modules to provision GCP Service Accounts, BigQuery datasets, IAM bindings, and AWS Glue catalogs in a repeatable manner.
Machine Learning Integration: experience building feature stores on Databricks or using BigQuery ML for batch inference directly within the warehouse.
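For the Redis item above, a minimal sketch of the per-partition lookup pattern inside a Structured Streaming `foreachBatch`; the host, key layout, and the `rate` source standing in for Kafka are all assumptions:

```python
# Sketch (assumptions: redis-py installed, a reachable Redis host, and
# dimension rows pre-loaded as customer:<id> hashes).
import redis
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redis-lookup-sketch").getOrCreate()

def enrich_partition(rows):
    # One Redis connection per partition, opened on the executor itself,
    # so no unpicklable client object crosses from the driver.
    r = redis.Redis(host="redis.internal", port=6379, decode_responses=True)
    for row in rows:
        name = r.hget(f"customer:{row.customer_id}", "name")
        yield (row.customer_id, row.amount, name)

def enrich_batch(batch_df, _batch_id):
    # Inside foreachBatch the micro-batch is a plain DataFrame, so the
    # classic mapPartitions pattern applies.
    enriched = batch_df.rdd.mapPartitions(enrich_partition).toDF(
        ["customer_id", "amount", "customer_name"]
    )
    enriched.show(truncate=False)

stream = (
    spark.readStream.format("rate").load()  # stand-in for the Kafka source
    .selectExpr("value % 100 AS customer_id", "value AS amount")
)

stream.writeStream.foreachBatch(enrich_batch).start().awaitTermination()
```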
Skills: cloud, data, Snowflake, Cloudera, GCS, GCP, Python, storage, AWS, BigQuery, architecture, Dataflow, Docker
Language Skills
- English
Notice to Users
This offer comes from a TieTalent partner platform. Click "Apply now" to submit your application directly on their site.