[Remote] Senior Cloud Engineer
Note: This is a remote position open to candidates in the USA. Onyx Visual Effects is a company specializing in visual effects and cloud infrastructure. They are seeking a Senior Cloud Engineer to manage AWS services, optimize cloud resources for VFX workloads, and ensure compliance with security standards while collaborating with global teams.
Responsibilities
- Proficiency in AWS core services, including EC2 for compute, EFS/S3/EBS for storage, VPC networking, Security Groups, NACLs, Route 53, and Direct Connect for low-latency remote access
  - Includes managing instance failures during long-running renders, handling multi-AZ outages with failover, optimizing for global teams, and integrating with on-premises legacy hardware
- Specialization in MPA compliance and security-first engineering, including AWS KMS encryption, access logging, Trusted Partner Network assessments, and zero-trust models
  - Includes adapting to evolving MPA guidelines, managing sensitive IP with external studios, handling data sovereignty requirements, and responding to vulnerabilities in media workflows
- Experience with AWS VFX solutions like Thinkbox Deadline, Deadline Cloud, Nimble Studio, and EC2 Spot/GPU instances for cost-effective rendering
  - Includes scaling farms for 8K+ projects, recovering from spot interruptions, troubleshooting custom VFX plugins, and optimizing hybrid CPU/GPU workloads
- Identity and Access Management with role-based controls, MFA, and integration with directory services
  - Includes onboarding/offboarding remote users, federated logins from third-party IdPs, managing privilege escalation risks, and auditing access logs for anomalous behavior
- Cost optimization using AWS Cost Explorer, Savings Plans, Reserved Instances, and auto-scaling groups for variable VFX workloads
  - Includes forecasting burst render costs, mitigating overspending from misconfigured scaling, and tracking costs across multiple projects
- Data transfer tools like AWS Snowball and DataSync for asset migrations, plus multi-tier storage strategies such as S3 Intelligent-Tiering
  - Includes large-scale transfers, partial sync recovery, encryption integrity, and cold storage retrieval planning
- AWS certifications such as Solutions Architect or SysOps Administrator
  - Includes applying certification knowledge to custom VFX scenarios, real-time collaboration setups, and edge deployments such as AWS Outposts, and keeping certifications current through renewals
- Expertise in command-line and general administration of Rocky Linux and other Red Hat-based distributions, Windows, and macOS, including cross-platform scripting with Bash and PowerShell
  - Includes troubleshooting Linux kernel issues, macOS driver conflicts, Windows updates, and mixed-OS fleets
- Infrastructure as Code with Terraform, AWS CloudFormation, or Ansible for provisioning and automation
  - Includes idempotent deployments, rolling back failed IaC changes during live productions, version control collaboration, and provider quirks
- Monitoring and logging with Amazon CloudWatch, AWS X-Ray, and integrations like ELK Stack for metrics, alarms, and proactive issue resolution
  - Includes custom alarms for GPU utilization, tracing distributed render jobs, filtering high-volume logs, and SIEM integration
- Backup and disaster recovery using AWS Backup, S3 versioning, and multi-region replication
  - Includes testing restores for corrupted VFX assets, managing RTO/RPO in outages, automating failover drills, and handling version conflicts
- Networking and security operations, including VPN, firewalls, Amazon GuardDuty, and high-performance network-attached storage
  - Includes mobile artist VPN access, detecting network attacks, optimizing NAS for 4K/8K streaming, and securing third-party integrations
- Virtual machine management and containerization with Docker, ECS, or Kubernetes for portable VFX applications
  - Includes bursty simulations, pod evictions during resource contention, GPU passthrough, and network policy debugging
- Proficiency with core VFX software like Nuke, ZBrush, Maya, V-Ray, Houdini, Redshift, Arnold, RenderMan, and Octane
  - Includes optimizing for non-standard hardware, troubleshooting batch-mode plugin crashes, integrating emerging AI tools, and handling license server failures
- Render farm management using AWS Deadline Cloud, PipelineFX Qube, or custom scripts for job distribution and optimization
  - Includes prioritizing jobs during overlapping deadlines, recovering orphaned tasks, scaling to thousands of nodes, and integrating hybrid cloud/off-cloud farms
- Pipeline tools including asset management systems such as ShotGrid or ftrack, version control with Perforce or Git, and CI/CD for artist workflows
  - Includes merging conflicting asset versions, handling large binary files, automating plugin testing, and securing pipelines against IP leaks
- Performance tuning for GPU/CPU workloads, memory management in simulations, and benchmarking to reduce render times
  - Includes managing OOM errors in Houdini sims, comparing instance types, and optimizing cost/performance trade-offs
- Troubleshooting application and OS issues, and providing deskside, phone, and ticket support to VFX artists and production teams
  - Includes remote debugging, VPN-disrupted sessions, vendor escalation, and documenting repeatable fixes
- Experience with HP Connect Anywhere, PCoIP desktop environments, NICE DCV, and AWS AppStream for low-latency streaming and multi-monitor support
  - Includes high-DPI displays, transcontinental latency, session security, and VR/AR review workflows
- NVIDIA CUDA drivers, GRID/AMDGPU management in EC2 instances, and virtual workstations for color-accurate VFX work
  - Includes driver updates, CUDA version mismatches, color calibration over compressed streams, and experimental AMD setups
- Secure file sharing via AWS Transfer Family and real-time collaboration tools such as Frame.io integrations
  - Includes enforcing upload quotas, recovering interrupted transfers, auditing shares, and custom encryption for sensitive dailies
- WEKA Storage Solutions integration with AWS for high-I/O VFX tasks such as 4K/8K footage
  - Includes scaling IOPS for parallel artist access, handling filesystem issues, optimizing mixed read/write patterns, and migrating from legacy storage
- Advanced storage strategies, including lifecycle policies for archiving and handling large media files
  - Includes tier transitions, retention policies, legal holds, accidental deletion recovery, snapshots, and cost optimization for growing project data
- Scripting and programming in Python, Bash, or similar for automation, system tasks, and DevOps practices
  - Includes resilient scripts for flaky APIs, exception handling in long-running automations, VFX-specific libraries, and secure handling of user input
- Configuration management, deployment tools, and CI/CD pipeline building
  - Includes managing config drift, zero-downtime deployments, troubleshooting branched pipeline failures, and securing secrets in CI/CD environments
- Strong problem-solving, critical thinking, and root cause analysis for render failures and remote issues
  - Includes diagnosing cascading failures, intermittent bugs, post-mortems with non-technical stakeholders, and adapting solutions to evolving tech stacks
- Excellent communication, teamwork, and ability to consult, train, and build relationships with remote artists, producers, and vendors
  - Includes bridging time zones, supporting high-stress deadlines, training via screen share, and negotiating SLAs during outages
- Self-motivated, proactive, and committed to continuous learning, including AWS trends and VFX innovations like AI-assisted rendering
  - Includes self-teaching during rapid tech shifts, identifying bottlenecks before escalation, and testing beta features in sandboxes
- Experience in vendor management, and flexibility to work shifts in support of global remote operations
  - Includes managing multi-vendor ecosystems, adapting to 24/7 on-call needs, negotiating custom integrations, and handling critical vendor escalations
Skills
- Proficiency in AWS core services, including EC2 for compute, EFS/S3/EBS for storage, VPC networking, Security Groups, NACLs, Route 53, and Direct Connect for low-latency remote access
- Specialization in MPA compliance and security-first engineering, including AWS KMS encryption, access logging, Trusted Partner Network assessments, and zero-trust models
- Experience with AWS VFX solutions like Thinkbox Deadline, Deadline Cloud, Nimble Studio, and EC2 Spot/GPU instances for cost-effective rendering
- Identity and Access Management with role-based controls, MFA, and integration with directory services
- Cost optimization using AWS Cost Explorer, Savings Plans, Reserved Instances, and auto-scaling groups for variable VFX workloads
- Data transfer tools like AWS Snowball and DataSync for asset migrations, plus multi-tier storage strategies such as S3 Intelligent-Tiering
- AWS certifications such as Solutions Architect or SysOps Administrator
- Expertise in command-line and general administration of Rocky Linux and other Red Hat-based distributions, Windows, and macOS
- Infrastructure as Code with Terraform, AWS CloudFormation, or Ansible for provisioning and automation
- Monitoring and logging with Amazon CloudWatch, AWS X-Ray, and integrations like ELK Stack for metrics, alarms, and proactive issue resolution
- Backup and disaster recovery using AWS Backup, S3 versioning, and multi-region replication
- Networking and security operations, including VPN, firewalls, Amazon GuardDuty, and high-performance network-attached storage
- Virtual machine management and containerization with Docker, ECS, or Kubernetes for portable VFX applications
- Proficiency with core VFX software like Nuke, ZBrush, Maya, V-Ray, Houdini, Redshift, Arnold, RenderMan, and Octane
- Render farm management using AWS Deadline Cloud, PipelineFX Qube, or custom scripts for job distribution and optimization
- Pipeline tools including asset management systems such as ShotGrid or ftrack, version control with Perforce or Git, and CI/CD for artist workflows
- Performance tuning for GPU/CPU workloads, memory management in simulations, and benchmarking to reduce render times
- Troubleshooting application and OS issues, and providing deskside, phone, and ticket support to VFX artists and production teams
- Experience with HP Connect Anywhere, PCoIP desktop environments, NICE DCV, and AWS AppStream for low-latency streaming and multi-monitor support
- NVIDIA CUDA drivers, GRID/AMDGPU management in EC2 instances, and virtual workstations for color-accurate VFX work
- Secure file sharing via AWS Transfer Family and real-time collaboration tools such as Frame.io integrations
- WEKA Storage Solutions integration with AWS for high-I/O VFX tasks such as 4K/8K footage
- Advanced storage strategies, including lifecycle policies for archiving and handling large media files
- Scripting and programming in Python, Bash, or similar for automation, system tasks, and DevOps practices
- Configuration management, deployment tools, and CI/CD pipeline building
- Strong problem-solving, critical thinking, and root cause analysis for render failures and remote issues
- Excellent communication, teamwork, and ability to consult, train, and build relationships with remote artists, producers, and vendors
- Self-motivated, proactive, and committed to continuous learning, including AWS trends and VFX innovations like AI-assisted rendering
- Experience in vendor management, and flexibility to work shifts in support of global remote operations
Company Overview