// SIM + EVAL

The moat is the simulator and the harness.

Data alone is table stakes. Whoever owns the vertical-specific simulator that matches a plant's physics, layouts, and SOPs, plus the evaluation harness that measures policy performance against real operator benchmarks, owns the data flywheel for that vertical. We are building both, with design partners, in one vertical first.

// WHY SIM-ONLY DOES NOT TRANSFER

A general-purpose simulator is not an industrial simulator.

Three reasons sim-only training does not graduate a vision-language-action (VLA) policy into a real plant.

// PHYSICS

Physics that match the plant, not the demo.

Real plants have process fluids, vibration, heat, surface corrosion, and operator interaction that generic sim engines do not model out of the box. Isaac and MuJoCo are excellent for rigid-body kinematics and contact primitives. On their own, they are insufficient for the dynamics that determine whether a VLA policy actually transfers to a refinery floor.

// LAYOUT

Layout that matches the SOPs operators actually follow.

A pipe rack is not a corridor. A control room is not a kitchen. The simulator must encode the spatial reality of the workflow, not a generic factory. We build scene geometry from real capture in real facilities: depth maps, point clouds, calibrated camera rigs. Every scene maps to a real layout the policy will eventually deploy in.

// SOP

SOPs encoded as evaluable behavior.

Policy performance is not "did the robot move." It is "did the robot do the operator's job under the operator's SOP, within the time, safety, and quality envelope the human team holds itself to." The evaluation harness encodes the SOP as a graph of evaluable substeps and scores the policy against the same benchmark the human team uses.
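A minimal sketch of what "a graph of evaluable substeps" could look like in practice. Everything here is an illustrative assumption rather than the harness's actual schema: the `Substep` dataclass, the `score_run` function, the three substep names, and the choice to score only ordering and a per-substep time limit.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Substep:
    name: str
    max_seconds: float  # time envelope for this substep
    requires: list[str] = field(default_factory=list)  # substeps that must finish first


def score_run(sop: dict[str, Substep], log: list[tuple[str, float]]) -> dict:
    """Score one run. `log` is an ordered list of (substep name, seconds taken)."""
    # static_order() raises CycleError if the SOP graph is not a DAG.
    list(TopologicalSorter({s.name: s.requires for s in sop.values()}).static_order())

    done: set[str] = set()
    passed: dict[str, bool] = {name: False for name in sop}
    for name, seconds in log:
        step = sop.get(name)
        if step is None:
            continue  # action outside the SOP; ignored in this sketch
        in_order = all(req in done for req in step.requires)
        in_time = seconds <= step.max_seconds
        passed[name] = in_order and in_time
        done.add(name)
    return {
        "completed": done == set(sop),
        "sop_compliance": sum(passed.values()) / len(sop),
        "per_substep": passed,
    }


# Hypothetical three-step inspection SOP (names are placeholders).
SOP = {
    "don_ppe": Substep("don_ppe", 60),
    "check_gauge": Substep("check_gauge", 120, requires=["don_ppe"]),
    "log_reading": Substep("log_reading", 30, requires=["check_gauge"]),
}

# The reading is logged before the gauge is checked: task done, SOP violated.
result = score_run(SOP, [("don_ppe", 40), ("log_reading", 10), ("check_gauge", 90)])
```

In this run every substep is eventually performed, so the task counts as completed, yet `log_reading` fails its ordering check. That is the gap between "did the robot move" and "did the robot follow the SOP" that the two separate scores are meant to expose.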

// WHAT WE ARE BUILDING

Inside the sim and eval stack.

  • Plant-accurate scene meshes derived from real multimodal capture (RGB, depth, LiDAR where present)
  • Physics calibration from real operator trajectory data, not synthetic priors
  • SOP graphs encoded as machine-readable task specifications
  • Operator benchmark runs as the ground-truth comparison set for policy evaluation
  • Closed-loop A/B harness comparing candidate policies on the same task across the same plant
  • Continuous integration with the data pipeline that captured the original episodes
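
The closed-loop A/B harness in the list above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual harness: `ab_compare`, `make_policy`, and the `Policy` type are hypothetical names, and the "simulator" is reduced to a random draw seeded per scenario so that both candidates face identical conditions.

```python
import random
from typing import Callable

Policy = Callable[[random.Random], bool]  # returns True if the episode succeeded


def ab_compare(policy_a: Policy, policy_b: Policy, seeds: list[int]) -> dict:
    """Run both policies on identical scenario seeds and tally paired outcomes."""
    tally = {"both": 0, "only_a": 0, "only_b": 0, "neither": 0}
    for seed in seeds:
        # Each policy gets its own RNG built from the same seed, so both
        # face the same scenario draw.
        a_ok = policy_a(random.Random(seed))
        b_ok = policy_b(random.Random(seed))
        if a_ok and b_ok:
            tally["both"] += 1
        elif a_ok:
            tally["only_a"] += 1
        elif b_ok:
            tally["only_b"] += 1
        else:
            tally["neither"] += 1
    n = len(seeds)
    return {
        **tally,
        "rate_a": (tally["both"] + tally["only_a"]) / n,
        "rate_b": (tally["both"] + tally["only_b"]) / n,
    }


def make_policy(skill: float) -> Policy:
    """Stand-in policy: succeeds when the scenario difficulty draw is below `skill`."""
    return lambda rng: rng.random() < skill


report = ab_compare(make_policy(0.8), make_policy(0.6), seeds=list(range(200)))
```

Pairing on seeds is the design choice worth noting: the `only_a` and `only_b` counts isolate scenarios where the two candidates actually disagree, which is usually more informative than comparing two independently sampled success rates.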

// example evaluation report

{
  "evaluation_id": "trk-eval-construction-0001-example",
  "policy_id": "lab-x-vla-v0.3",
  "vertical": "construction_inspection",
  "task": "site_progress_walk",
  "plant_id": "trk-site-001",
  "operator_baseline_runs": 24,
  "policy_runs": 60,
  "metrics": {
    "task_completion_rate": 0.83,
    "sop_compliance_rate": 0.71,
    "time_to_complete_seconds": 412,
    "operator_baseline_time_seconds": 348,
    "intervention_rate": 0.18,
    "safety_event_count": 0
  },
  "verdict": "above_baseline_on_completion_below_on_compliance"
}

// example report. real reports include per-substep scores and side-by-side video clips.
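
One way a consumer might read the example report above, sketched under assumptions: the `time_overhead` helper is hypothetical, and the JSON is trimmed to the fields the sketch uses. It derives how much slower the policy is than the operator baseline from the two timing fields.

```python
import json

# The example report, trimmed to the fields this sketch reads.
REPORT_JSON = """
{
  "policy_runs": 60,
  "metrics": {
    "task_completion_rate": 0.83,
    "time_to_complete_seconds": 412,
    "operator_baseline_time_seconds": 348
  }
}
"""


def time_overhead(metrics: dict) -> float:
    """Fractional slowdown of the policy relative to the operator baseline."""
    baseline = metrics["operator_baseline_time_seconds"]
    if baseline <= 0:
        raise ValueError("baseline time must be positive")
    return (metrics["time_to_complete_seconds"] - baseline) / baseline


report = json.loads(REPORT_JSON)
overhead = time_overhead(report["metrics"])
print(f"policy is {overhead:.1%} slower than the operator baseline")
```

For the example numbers (412 s against a 348 s baseline) the policy comes out about 18% slower than the operator.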

// ITERATION CYCLE

The bottleneck is not data. It is the iteration cycle.

Real-world evaluation in industrial robotics is brutal. A single evaluation pass across one vertical's workflows takes hundreds of operator hours on one robot station in one plant. Statistically meaningful comparisons across model checkpoints require many such passes. Foundation labs hit a wall not because data is missing, but because they cannot score candidate recipes fast enough to iterate. Simulation lifts that cap. It turns the development cycle from a wall-clock problem into a compute problem.
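
To make "statistically meaningful" concrete, here is a small sketch using the Wilson score interval for a success rate. The choice of the Wilson interval and the run counts are illustrative assumptions, not a description of how the harness computes significance.

```python
import math


def wilson_interval(successes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = successes / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return centre - half, centre + half


# 50 successes in 60 runs is roughly a 0.83 completion rate.
lo_60, hi_60 = wilson_interval(50, 60)
# The same observed rate with ten times as many runs.
lo_600, hi_600 = wilson_interval(500, 600)
```

With 60 runs the interval spans roughly 0.72 to 0.91, wide enough that two checkpoints a few points apart cannot be told apart. Narrowing it takes many more runs, which is cheap in simulation and expensive on a plant floor.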

// A

Evaluation, not data generation, is the entry point.

We start with simulation as the evaluation engine. The bet is that closed-loop simulation evaluation correlates with on-hardware rollouts. Once that correlation is established, every candidate policy, every checkpoint, every architecture sweep can be scored against the same vertical-specific benchmark without operator time. The first job of our simulator is not to generate synthetic data. It is to be the harness that grades real-world policies against operator runs from the same plant.
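
A minimal sketch of how the sim-to-hardware correlation might be checked, assuming each checkpoint has one simulated score and one on-hardware score. The `pearson` helper is written out by hand for clarity, and the score lists are made-up placeholders, not measurements.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (sx * sy)


# Hypothetical success rates for five checkpoints (placeholders only).
sim_scores = [0.42, 0.55, 0.61, 0.70, 0.83]
real_scores = [0.38, 0.50, 0.63, 0.66, 0.80]

r = pearson(sim_scores, real_scores)
```

A high `r` on held-out checkpoints would support using the simulator to rank candidates. A rank correlation could be substituted if only the ordering of checkpoints matters.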

Precedent: Genesis AI on iteration-cycle compute. Waymo built scalable simulation evaluation first, then learned from it.

// B

We close the sim-to-real gap from real plant capture, not from priors.

Trustworthy simulation requires every layer of the stack to match reality: hardware and system identification, control, compiler, physics, assets, rendering. We do not build the simulator from synthetic priors. We build it from real captured episodes (the same captures our data layer ships) and from real operator benchmarks. The harness scores a policy against the operator who did the same task on the same plant floor.

Approach: zero-shot real-to-sim. Visual fidelity, robot kinematics, and low-level control timing all calibrated from captured episodes.
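
As one illustration of calibrating low-level control timing from captured episodes, the sketch below estimates a fixed actuation delay by finding the sample lag that best aligns a commanded signal with the measured response. This is an assumed, deliberately simple technique (brute-force lag search on mean squared error), and `estimate_delay` and the signals are hypothetical. A real calibration would involve far more than a single delay parameter.

```python
def estimate_delay(commanded: list[float], measured: list[float], max_lag: int) -> int:
    """Return the lag, in samples, that best aligns `measured` with `commanded`."""
    best_lag, best_err = 0, float("inf")
    for lag in range(max_lag + 1):
        # Shift the measured signal earlier by `lag` samples and compare overlap.
        pairs = list(zip(commanded, measured[lag:]))
        if not pairs:
            break
        err = sum((c - m) ** 2 for c, m in pairs) / len(pairs)
        if err < best_err:
            best_lag, best_err = lag, err
    return best_lag


# Synthetic check: the measured signal is the command delayed by 3 samples.
commanded = [0.0, 0.0, 1.0, 3.0, 2.0, 5.0, 4.0, 1.0, 0.0, 2.0, 6.0, 3.0]
measured = [0.0] * 3 + commanded[:-3]

delay = estimate_delay(commanded, measured, max_lag=5)
```

The recovered lag, multiplied by the capture sample period, would then be set as the simulated actuator delay so that simulated and real control timing line up.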

// C

One vertical's physics beats a general-purpose engine.

General-purpose simulators (Isaac, MuJoCo, Genesis World) are excellent foundations. None of them ship vertical-specific plant fluid dynamics, real SOP sequencing, or operator benchmarks for a specific industrial workflow. We are not competing with the platform layer. We are building the vertical-specific layer on top: the plant-accurate scene meshes, the SOP graphs, the operator benchmark runs. The platform stays open. The vertical specificity becomes the moat.

// ONE VERTICAL FIRST

One vertical first. The pull picks the winner.

We are evaluating three verticals with design partners: discrete manufacturing, construction and inspection, and oil and gas. The selection criteria are concrete: lab pull (which foundation labs are training toward this vertical now), operator pull (which industrial operators are giving us facility access and commercial commitments), regulatory clarity, capture feasibility, and vertical TAM. The vertical that pulls hardest gets the simulator and the evaluation harness first. Then it gets the deployment partner relationship.

If you are a lab building toward a specific vertical, the fastest way to influence which one we pick is to talk to us before we pick.

Help us pick.

Tell us what your model is training on and where you want to deploy. We will tell you whether the sim and eval stack for your vertical is coming with you or being built without you.