Member of Engineering – Pre-training, Data Engineering

Remote · USA · Full-time

Job Description:

Build and maintain high-performance pipelines for trillions of tokens.
Deliver diverse, high-quality datasets for pre-training foundation models.
Work closely with other teams, such as Pretraining, Posttraining, Evals, and Product, to ensure alignment on the quality of the models delivered.

Requirements:

Strong background in building production-grade, distributed data systems for machine learning, with experience in:
Orchestration: Slurm, Airflow, or Dagster
Observability & Reliability: CI/CD, Grafana, Prometheus, etc.
Infra: Git, Docker, k8s, cloud managed services
Batched inference (e.g., vLLM)
Performance obsession, especially with large-scale GPU clusters and distributed pipelines
Expert-level Python knowledge and the ability to write clean, maintainable code
Strong algorithmic foundations
Proficiency with libraries like Polars, Dask, or PySpark

Nice to have:

Experience in building trillion-scale SOTA pretraining datasets
Experience translating research to production at scale
Experience with OCR, web crawling, or evals
Prior experience pre-training LLMs

Benefits:

Fully remote work & flexible hours
37 days/year of vacation & holidays
Health insurance allowance for you and dependents
Company-provided equipment
Wellbeing, always-be-learning and home office allowances
Frequent team get-togethers
Diverse, inclusive, people-first culture

