This job offer is no longer available
About
to work on large-scale data processing and analytics initiatives. The ideal candidate will have hands-on experience working with large datasets, complex joins, and performance optimization, along with the ability to apply basic analytical thinking and deliver clear, stakeholder-ready outputs.

Key Responsibilities

Data Engineering & Development
- Design, develop, and maintain scalable data pipelines using PySpark.
- Write efficient and optimized PySpark code to process and transform large-scale datasets.
- Handle joins across multiple large databases, ensuring performance, accuracy, and scalability.
- Optimize Spark jobs to minimize runtime, memory usage, and compute cost.
- Work with structured and semi-structured data from multiple sources.

Data Preparation & Analysis Support
- Build and curate training and analytical datasets by joining and transforming multiple data sources.
- Apply basic analytical skills to understand data patterns, anomalies, and business relevance.
- Perform data validation and quality checks, including:
  - Record counts and reconciliation
  - Duplicate detection
  - Null and outlier checks
  - Schema and data-type validation
- Ensure datasets are analysis-ready and trustworthy.

Stakeholder Interaction & Reporting
- Understand business objectives and translate them into data requirements.
- Ask the right questions to determine:
  - Level of aggregation required
  - Metrics definitions
  - Data freshness and accuracy expectations
  - Preferred output and reporting formats
- Present results and insights clearly to stakeholders.
- Create reports and summaries using Excel for business users and leadership.

Expected Technical Approach (Problem-Solving Mindset)
Candidates are expected to demonstrate the ability to:
- Approach complex data projects methodically, starting with:
  - Understanding business objectives
  - Reviewing source data structure and volume
  - Designing efficient join strategies
- Choose the right join types, partitioning strategies, and caching techniques.
- Validate data at every stage of the pipeline.
- Balance technical accuracy with business usability when presenting results.

Core Skill Sets (Must-Have)
- Strong hands-on experience with PySpark
- Extensive experience working with large datasets
- Proven expertise in joining large databases efficiently
- Ability to write high-performance, optimized code
- Basic analytical skills to interpret and validate data
- Reporting skills using Excel

Good to Have Skills
- Experience in model development or supporting analytics/modeling teams
- SAS experience
- Exposure to Cloudera or similar big data platforms
- Understanding of data warehousing and analytics workflows

Soft Skills & Competencies
- Strong problem-solving and logical thinking

Languages
- English