Linux NVidia GPU Server Administrator
Company: Leidos Inc
Location: Bethesda
Posted on: January 21, 2023
Job Description:
ATTENTION!! Qualified candidates with critical skills may be
eligible for a one-time sign-on bonus of up to $15k, depending upon
the skill level and requirements per requisition. Please contact
Stephanie Lovett, Principal Recruiter, for more details at
stephanie.a.lovett@leidos.com.
At Leidos, we
deliver innovative solutions through the efforts of our diverse and
talented people who are dedicated to our customers' success. We
empower our teams, contribute to our communities, and operate
sustainably. Everything we do is built on a commitment to
do the right thing for our customers, our people, and our
community. Our Mission, Vision, and Values guide the way we do
business. Employees enjoy career enrichment opportunities available
through mobility and development and experience rewarding
relationships with supportive supervisors and talented colleagues
and customers. Your most important work is ahead.
If this sounds like the kind of environment where you can thrive,
keep reading!
Leidos is looking to fill a Linux Server/NVidia GPU
Administrator/Engineer position within the Analysis Solutions
Division (ASD) to support the National Media Exploitation Center
(NMEC). This role requires an individual who has technical
experience administering Nvidia DGX1 and A100 servers within a
physical and virtual environment. This individual should
be detail-oriented in order to capture customer inquiries
appropriately. This role is responsible for interacting with
administrators to handle service inquiries and problems. Duties
include examining customer problems and implementing appropriate
corrective action to initiate a repair or return to service. This
role analyzes recurring problems and initiates solutions to
prevent recurrence, and analyzes existing infrastructure for
tuning/performance enhancements. The individual will provide
systems and software operations and maintenance support in a large,
multi-enclave enterprise environment. This individual will work in
a team environment to ensure mission needs are met and that
customer capabilities remain functional. Individuals in this
role may be required to perform technical software configuration,
rebooting, and other remedial actions on customer servers. The
Customer utilizes an Agile Framework to plan and successfully
complete all initiatives. The work location is in Bethesda at the
Intelligence Community Campus.
Primary Responsibilities
- Review C&A documentation providing feedback on completeness
and compliance of its content
- Perform system installation, configuration maintenance, account
maintenance, signature maintenance, patch management, and
troubleshooting of operational IA and CND systems
- Operate with appreciable latitude in developing methodology
and presenting solutions to problems.
- Contribute to deliverables and performance metrics where
applicable.
- Implement, operate, and maintain physical and virtual server
hardware and systems software.
- Monitor resource management system (SLURM) to keep resource
allocation efficient and aligned with organizational
priorities
- Automate configuration management, software updates, and
maintenance of system availability using modern DevOps tools
(Ansible, Salt, GitLab, etc.)
- Plan and maintain new systems that support the NVIDIA Software
stack
- Work directly with developers and hardware architects to debug
issues, identify new requirements, and improve workflows
- Actively communicate with users and management regarding
resource planning and allocation
- Provide technical support, administration, and monitoring of
Linux systems, Nvidia DGX1 and A100 servers within a physical and
virtual environment.
- Provide support for the implementation, troubleshooting and
maintenance of IT systems. Rapidly distinguish isolated user
problems from enterprise-wide application/system problems.
- Maintain scripts, security updates, patches, and configurations
for the proper functioning of servers.
- Coordinate with customers and stakeholders to collect data,
conduct analysis, develop, and implement solutions associated with
incident tickets and requirements.
- Seek opportunities for continuous improvement to support
effective and efficient operations
- Develop solutions to complex technical issues.
- Provide documentation and follow-up reports (technical
findings, feedback, resolution steps taken) for Root Cause
analysis, engineering technical assessment and process improvement
initiatives.
- Support customer requirements in a 24/7/365 environment and be
able to provide on-call support during outages occurring after
hours; may involve shift work.
- Update operations and monitoring documentation for 24/7/365
Operations Watch personnel.
Basic Qualifications
- Requires a bachelor's degree and 10+ years of relevant
experience; additional years of experience may be considered in
lieu of a degree
- Experience supervising others
- 2 years of Unix administration experience, including Red
Hat/CentOS (or derivative) and Ubuntu administration
- System security engineering expertise in one or more of the
following: system security design process; engineering life cycle;
information domain; cross domain solutions; commercial
off-the-shelf and government off-the-shelf cryptography;
identification, authentication, and authorization; system
integration; risk management; intrusion detection; contingency
planning; incident handling; configuration control; change
management; auditing; certification and accreditation process;
principles of IA (confidentiality, integrity, non-repudiation,
availability, and access control); and security testing
- Possesses and applies expertise on multiple complex work
assignments. Assignments may be broad in nature requiring
originality and innovation in determining how to accomplish
tasks.
- Hands-on experience identifying server hardware failures,
including hard drives and memory
- Experience with cluster configuration management tools such as
Ansible, Salt
- Strong knowledge of DNS, NFS, LDAP, and DHCP services
- Experience with shell scripting and/or Python to automate
repetitive administration tasks
- Background in Linux server setup, deployment and
maintenance
- Experience with hardening Linux environments
- Experience with system administration of server operating
systems such as Linux (CentOS, RHEL, or Ubuntu)
- Experience troubleshooting issues in a growing environment
- Experience with log reviews, incident analysis, and
identification of issue trends
- Experience with server patch management methodologies
- Time management skills with the ability to work within an IT
Service Management/ticketing system independently
- Ability to triage and properly classify incidents and
prioritize work efforts accordingly
- Strong oral and written communications skills
- Experience establishing goals and plans that meet project
objectives
- Track record of working effectively within a team and supporting
peers toward improved processes and results
- Candidate must, at a minimum, meet DoD 8570.01-M IAT Level II
certification requirements (currently Security+ CE, CCNA-Security,
GSEC, or SSCP along with an appropriate computing environment (CE)
certification)
Clearance
- TS/SCI clearance with Polygraph required
- US Citizenship is required due to the nature of the government
contracts we support.
Preferred Qualifications
- Experience with container technologies (Docker,
Kubernetes)
- Experience with Prometheus/Grafana for monitoring
- Knowledge of distributed resource scheduling systems [Slurm
(preferred), LSF, etc.]
- Familiarity with CUDA and managing GPU-accelerated computing
systems
- Basic knowledge of deep learning frameworks and
algorithms
#NMECDTP
Pay Range: $84,500.00 - $130,000.00 - $175,500.00
The Leidos pay range for this job level is a general
guideline only and not a guarantee of compensation or salary.
Additional factors considered in extending an offer include (but
are not limited to) responsibilities of the job, education,
experience, knowledge, skills, and abilities, as well as internal
equity, alignment with market data, applicable bargaining agreement
(if any), or other law.
Keywords: Leidos Inc, Bethesda, Linux NVidia GPU Server Administrator, IT / Software / Systems, Bethesda, Maryland