Network/Infrastructure Engineer - Remote

Remote · USA · Full-time

Company Overview:
We are a pioneering Infrastructure-as-a-Service (IaaS) company, focusing on delivering High-Performance Computing (HPC) solutions. Our cutting-edge data centers form the core of our operations, empowering us to offer unmatched computational resources to our global clientele. In line with our growth and the expansion of our services, we are on the lookout for a skilled and innovative Network/Infrastructure Engineer to strengthen our team.

Position Summary:
The Network/Infrastructure Engineer is pivotal in designing, implementing, and optimizing the network and compute infrastructure that powers our high-performance computing environments. This role encompasses network architecture design, operational management of complex BGP environments, HPC cluster optimization, and performance benchmarking. The successful applicant will collaborate closely with NVIDIA, deployment teams, and cross-functional engineering groups to ensure our infrastructure delivers exceptional performance and reliability.

Occasional travel to data centers within the US may be required to support network deployments, troubleshooting, or performance optimization initiatives.

Key Responsibilities:

Network Design & Architecture:
- Design physical and logical network topologies for high-performance computing environments supporting large-scale workloads
- Maintain IP address management (IPAM) schemes ensuring efficient allocation and documentation
- Create comprehensive network diagrams and technical documentation for current and future infrastructure
- Collaborate with NVIDIA on Reference Architecture standards to ensure adherence to best practices and optimal configurations
- Evaluate and recommend network technologies and solutions to meet evolving business requirements

Network Operations:
- Configure and maintain BGP peering sessions with ISPs, partners, and internal autonomous systems
- Monitor network health using observability tools, identifying and resolving performance bottlenecks
- Respond to network incidents and perform advanced troubleshooting to minimize downtime
- Coordinate IP block procurement and assignment, working with RIRs and transit providers
- Maintain network security posture and implement changes following established protocols
- Participate in on-call rotation for critical network incidents

Network Projects:
- Develop detailed network BOMs (Bills of Materials) for new deployments in collaboration with deployment teams
- Test and validate network configurations in lab environments prior to production deployment
- Evaluate driver upgrades and perform compatibility testing across network hardware and software stacks
- Design and implement network enhancements to improve performance, reliability, and scalability
- Execute comprehensive network performance benchmarking using industry-standard tools and methodologies
- Document project outcomes and create knowledge base articles for operational teams

HPC Cluster Management:
- Optimize cluster performance and utilization through tuning of network fabric, storage, and compute resources
- Test and validate deployment profiles for various HPC workloads and use cases
- Configure and maintain high-speed interconnects (InfiniBand, RoCE) for low-latency communication
- Work with infrastructure teams to ensure proper integration of compute, storage, and network components

Performance & Optimization:
- Conduct rigorous benchmarking and performance analysis of HPC infrastructure using tools such as IOR, NCCL, and MLPerf
- Test driver and firmware upgrades in an HPC context, validating compatibility and performance impact
- Troubleshoot complex compute node and interconnect issues affecting application performance
- Document HPC-specific configurations and tuning parameters for various workload types
- Identify and implement optimizations for network throughput, latency, and job completion times

Collaboration and Documentation:
- Work closely with deployment engineers to ensure successful network implementation
- Collaborate with infrastructure operations teams on incident response and problem resolution
- Maintain comprehensive technical documentation including network diagrams, runbooks, and configuration standards
- Participate in architecture review sessions and contribute to infrastructure planning
- Mentor junior team members on networking concepts and HPC technologies

Safety and Compliance:
- Adhere to strict data center safety protocols and operational standards during all on-site activities
- Follow security best practices for network configuration and access control
- Participate in regular safety training and briefings

Qualifications:
- Bachelor's degree in Computer Science, Computer Engineering, Information Technology, or a related field preferred
- 3-5 years of experience in network engineering, with emphasis on large-scale data center or HPC environments
- Expert-level knowledge of networking protocols including BGP, OSPF, VLANs, and routing fundamentals
- Strong hands-on experience with enterprise network equipment from vendors such as Cisco, Arista, NVIDIA (Mellanox), or Juniper
- Proficiency with high-speed interconnect technologies including InfiniBand, Ethernet RDMA (RoCE), and related protocols
- Experience with network monitoring and observability tools (Prometheus, Grafana, Nagios, or similar)
- Deep understanding of IP addressing, subnetting, and IPAM
- Demonstrated experience with HPC cluster architectures and job scheduling systems (Slurm, PBS, or similar)
- Strong Linux system administration skills including shell scripting and automation
- Experience with network performance testing tools and benchmarking methodologies
- Familiarity with NVIDIA GPU computing architectures and networking solutions preferred
- Knowledge of software-defined networking (SDN) concepts and implementation
- Experience with configuration management tools (Ansible, Terraform, or similar) preferred
- Strong analytical and troubleshooting skills with a systematic problem-solving approach
- Excellent documentation skills with attention to detail
- Effective communication skills, both written and verbal, with the ability to explain complex technical concepts to diverse audiences
- Self-motivated, with the ability to work independently and manage multiple projects simultaneously
- Availability to participate in on-call rotation and travel occasionally to data center locations as required

Preferred Certifications:
- CCNP, CCIE, or equivalent networking certifications
- NVIDIA networking certifications
- Relevant cloud or data center certifications
