Data Engineer
Code4lib
- Oklahoma City, Oklahoma, United States
About
HBS’s Baker Library is seeking a temporary Data Engineer to help launch a faculty citation data project aimed at better understanding how its collections support and influence scholarly research. This initiative involves identifying faculty publications, extracting their cited references, and analyzing the relationships within this data to generate meaningful insights into patterns of use and library collection impact. By analyzing citations, the project seeks to surface evidence of how Baker’s resources contribute to the research ecosystem at HBS.
Reporting to Baker Library’s User Needs and Assessment Librarian, this temporary Data Engineer role will focus on the final phase of the project, where a corpus of raw citation data has already been collected and aggregated from multiple sources. At this stage, the data requires careful cleaning, normalization, and transformation to ensure it is accurate, consistent, and suitable for analysis. The individual in this role will work with this messy dataset to standardize fields, resolve inconsistencies, and prepare the data for downstream analytical work. This phase is critical to ensuring the reliability and interpretability of the project’s findings and will directly shape the quality of insights generated about Baker’s impact.
This is a temporary, full-time, remote position. Employees in fully remote positions must work all scheduled hours in a Harvard-registered state in compliance with the University’s Policy on Employment Outside of Massachusetts. Specific hours and work days will be determined by business needs and are subject to change with appropriate advance notice.
Responsibilities
Clean and normalize raw citation data by resolving inconsistencies in author names, publication titles, journal names, and other variables
Co‑develop and apply standardized schemas for field names and data structures to ensure consistency across the dataset
Design and implement reproducible data cleaning workflows using scripts that can be reused
Co‑create or locate unique identifiers (e.g., for authors, works, journals) to enable accurate linking and deduplication across records
Perform record linkage and deduplication using techniques such as fuzzy matching and string comparison
Assess and improve data quality by identifying missing, inconsistent, or anomalous values and determining appropriate remediation strategies
Conduct exploratory analysis to evaluate the completeness and reliability of the dataset, including identifying patterns of data gaps
Collaborate with project stakeholders to align data cleaning decisions with project goals
Explore connection points for citation data with other HBS administrative datasets
Document data transformations, data dictionaries, and workflows to support transparency, reproducibility, and future project phases
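As a rough illustration of the normalization, fuzzy matching, and deduplication tasks listed above, the following is a minimal Python sketch using only the standard library. The function names (normalize, similarity, deduplicate), the sample journal names, and the 0.85 threshold are hypothetical choices made for this example; they are not taken from the project, which may use different tools and rules.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punct = re.sub(r"[^\w\s]", " ", no_accents.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between the normalized forms of two strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def deduplicate(values, threshold=0.85):
    """Greedy grouping: each value joins the first group whose first member
    is at least `threshold` similar, otherwise it starts a new group.
    The threshold is an assumed value and would need tuning on real data."""
    groups = []
    for value in values:
        for group in groups:
            if similarity(value, group[0]) >= threshold:
                group.append(value)
                break
        else:
            groups.append([value])
    return groups


# Hypothetical sample of journal names as they might appear in raw citations.
journals = [
    "Harvard Business Review",
    "harvard  business review.",
    "Journal of Finance",
    "Jounal of Finance",
]
groups = deduplicate(journals)
```

Greedy grouping against the first member of each group is simple but order-dependent. For a large corpus, a dedicated record-linkage approach with blocking would likely be more appropriate.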
Qualifications
Experience working with messy, real‑world datasets
Advanced proficiency in R (preferred), using tidyverse packages such as dplyr and tidyr, or in Python, using libraries such as pandas
Familiarity with regular expressions (regex), string comparison, and fuzzy matching
Proficient understanding of standardization principles and controlled vocabularies
Ability to balance precision and pragmatism when making decisions in the absence of perfect information
Comfort documenting processes and decisions for both technical and non‑technical audiences
Ability to work independently while also seeking input when project ambiguity or edge cases arise
Ability to envision how data cleaning and manipulation serve larger project goals
Basic understanding of academic publishing and citation formats
Proficiency in Microsoft Office tools (Outlook email, Teams sites, folder management, file retrieval)
Languages
- English