Data Engineer Intermediate/Senior

at HathiTrust
Published January 9, 2023
Location Ann Arbor, Michigan
Category Academic  
Job Type Full-time  
Apply Here https://careers.umich.edu/job_detail/228433/data-engineer-intermediatesenior
Cover Letter Requirements Required
Minimum Compensation in Local Currency $68,000
Maximum Compensation in Local Currency $100,000
Hourly or Salary? Salary
Twitter Handle @hathitrust

Description

Summary

Founded in 2008, HathiTrust is a not-for-profit collaborative of academic and research libraries. HathiTrust offers reading access to the fullest extent allowable by U.S. copyright law, computational access to the entire corpus for scholarly research, and other new services based on the combined collection. HathiTrust members steward the collection — the largest set of digitized books managed by academic and research libraries. For more information on HathiTrust, please visit www.hathitrust.org.

HathiTrust is looking for an experienced developer to help develop data workflows for large-scale digital library systems. HathiTrust has a repository of over a petabyte of data comprising about 17.6 million digitized books. There is an Apache Solr index comprising over 12 terabytes of full text from these books and a separate index with library catalog metadata (MARC records) for each item. We manage a variety of metadata in MariaDB and in MongoDB, including information about holdings from member libraries, copyright and licensing information, US federal government documents, and more. We use in-house software written in the Perl and Ruby programming languages, much of which is publicly available in HathiTrust's GitHub, to manage indexing and search processes for this data. Our infrastructure runs in multiple Linux environments, including virtual machines and containers managed with Docker and Kubernetes.

As the Data Engineer, you will work with developers, librarians, and other partners to modernize and improve indexing, search, and analysis for these kinds of data, enabling improvements to HathiTrust's websites and applications. You will report to the HathiTrust Enterprise Technology Team Lead.

HathiTrust is administratively based in the University of Michigan Library, and its staff are employees of the University. The Library is committed to recruiting and retaining a diverse workforce and encourages all employees to incorporate their diverse backgrounds, skills, and life experiences into their work. Our Diversity Plan is at https://www.lib.umich.edu/diversity-equity-inclusion-accessibility/diversity-strategic-plan.

Responsibilities

  • Improve workflows for loading, indexing, searching, and analyzing data, including bibliographic metadata and the full text for over 17 million scanned books.
  • Collaborate with other developers, staff, and researchers to equitably improve the search experience and to deliver more relevant catalog and full-text search results for a diverse user audience.
  • Be part of a team working to modernize technology used by the HathiTrust Digital Library applications to better support user needs.
  • Use modern development practices such as version control, dependency management, secure development practices, containerization, and automated testing and deployment.
  • Participate in needs assessment, requirements gathering, and development for systems that support the HathiTrust Digital Library, such as full-text and catalog search.
  • Continue improving development skills by learning about new technologies and best practices in search and databases and sharing them with the team.

Requirements

  • Demonstrated ability and 5+ years of experience developing systems to support data management, indexing, search, and analysis.
  • Experience with related technologies, including SQL databases, NoSQL databases, and systems for full-text search.
  • Experience working in a collaborative team to build complex applications.
  • An awareness of how data and search algorithms amplify or reduce inequity and bias.
  • Basic reading proficiency in at least one non-English language, or background in linguistics, natural language processing, multilingual information retrieval, or similar discipline.
  • Familiarity with issues of data and search in at least one non-English language, preferably including at least one language that does not use Latin characters, such as Arabic, Chinese, Hindi or other South Asian languages, Japanese, Korean, or others.
  • Understanding of the value of diversity and the importance of inclusion expressed through a commitment to apply and incorporate the differences, complexities, and opportunities that diversity brings to an organization.