HPC Systems Manager (req2237)

Posted: 11/16/2021
Location: Frederick, MD
Employee Type: exempt full-time
Job ID: req2237

PROGRAM DESCRIPTION

The mission of Enterprise Information Technology (EIT) is to develop an enterprise-level, consolidated information technology infrastructure that provides exceptional IT capabilities to the Frederick National Laboratory for Cancer Research (NCI-Frederick/FNLCR) in support of basic, translational, and clinical cancer and AIDS research. The IT Operations Group (ITOG) is part of EIT within Leidos Biomedical Research, Inc. ITOG is responsible for computational servers, storage servers, virtual machine infrastructure, and the FNLCR network. ITOG focuses on implementing enterprise IT best practices in the areas of computational services, storage, backup, and archiving; batch and application support; server consolidation and virtualization; network infrastructure; unification of voice, teleconferencing, and video communication technologies; and improved infrastructure for collocation of dedicated servers.

KEY ROLES/RESPONSIBILITIES

  • Work with scientific researchers to architect, implement, and deploy the HPC clusters, high-capacity and high-bandwidth storage, and scientific software applications necessary to support scientific research
  • Manage and grow a small and technically strong team of HPC engineers who develop, build, and deploy HPC systems that are part of our product
  • Partner with enterprise storage and networking teams to optimize workflows and workloads needed by scientific labs with large data generators
  • Model, characterize, and tune the performance of HPC systems to achieve the most efficient and cost-effective solution
  • Manage the HPC capacity plan, develop deployment schedules, and identify critical science deliverables
  • Identify and manage risks for the HPC systems and develop mitigation plans
  • Perform without considerable direction; mentor and supervise employees as needed

BASIC QUALIFICATIONS

To be considered for this position, you must minimally meet the knowledge, skills, and abilities listed below:

  • Possession of a Bachelor's degree from an accredited college/university according to the Council for Higher Education Accreditation (CHEA) or four (4) years of relevant experience in lieu of degree. Foreign degrees must be evaluated for U.S. equivalency
  • In addition to the education requirement, a minimum of four (4) years of progressively responsible experience, including two (2) years of experience in a managerial capacity
  • Experience in managing Linux and Windows systems in a high-throughput, data-intensive environment
  • Experience as a technical lead and/or managing a technical team
  • Solid knowledge of HPC systems, storage, high-speed interconnect, and GPU architecture
  • Experience with batch control software such as SLURM
  • Strong understanding of Linux internals
  • Broad experience with high-performance storage systems and with NFS, SMB, and POSIX
  • Familiarity with system performance analysis, monitoring, and tuning
  • Ability to obtain and maintain a clearance

PREFERRED QUALIFICATIONS

Candidates with these desired skills will be given preferential consideration:

  • 5 years of experience in managing Linux and Windows systems in a high-throughput, data-intensive environment, including 3+ years as a technical lead and/or managing a technical team
  • Experience with programming in a variety of languages, both traditional and nontraditional
  • Experience with container technologies and associated infrastructure
  • Experience with cloud and hybrid models
  • Knowledge of emerging computing technologies
  • Knowledge of various microarchitectures and of firmware development
  • Ability to rapidly evaluate scientific research on new and emerging technologies
  • Possession of excellent client-facing or consulting skills

EXPECTED COMPETENCIES

  • SLURM, GPU, HPC Architecture, Linux
  • Excellent written and verbal communication skills