Planning with the Views

Can VLMs predict how each camera move changes the view, and plan many such moves ahead?

Kangrui Wang¹, Linjie Li², Zhengyuan Yang³, Shiqi Chen⁴, Zihan Wang¹

Li Fei-Fei⁵, Jiajun Wu⁵, Leonidas Guibas⁵, Lijuan Wang³, Manling Li¹

¹Northwestern University, ²University of Washington, ³Microsoft, ⁴University of Oxford, ⁵Stanford University

Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this ability view planning: using camera moves as planning primitives to find a target view in 3D. We study it in ViewSuite, a 6-DoF environment on real ScanNet scenes, and decompose it into two abilities: tracking how given camera actions change the view, and composing a path that localizes an unseen target view.

Across 13 frontier VLMs, a sharp planning gap emerges: models track local view transitions but collapse when they must plan toward an unseen target view. This inability cannot simply be fixed by reinforcement learning (RL): with success near 2.5%, reward is too sparse for RL to bootstrap.

Our key insight is that on-policy self-exploration already yields valid view transitions: we aggregate them into a view graph and distill that graph into supervised demonstrations. With no stronger teacher, this lifts Qwen2.5-VL-7B from 2.5% → 47.8% on interactive view planning, surpassing GPT-5.4 Pro (19.9%) and Gemini 3.1 Pro (21.3%). View planning is a clean probe for prospective spatial reasoning: looking ahead, predicting how future viewpoint changes reshape observation, and inferring a target view's camera pose before it is fully observed. Frontier VLMs still lack this capability.

Three Diagnostic Tasks

ViewSuite probes the two coupled abilities of view planning: tracking how camera actions change the view, and composing them into a multi-turn plan that localizes an unseen target view.

**Overview of ViewSuite.** *Left*: a point-cloud environment built on ScanNet, with a top-down view marking the initial (blue) and target (orange) viewpoints and their rendered first-person views. *Middle*: Path-to-View (P2V) and View-to-Path (V2P), both single-turn. *Right*: Interactive View Planning (IVP), a multi-turn task where the agent issues 6-DoF camera actions to localize the target view and submits its camera pose.

P2V

Path‑to‑View

Prompt

Given the initial view [Image 1] and a top-down reference [Image 2], after you execute the following action sequence (translation step = 0.5 m; rotation step = 30.0 degrees per step):
[turn_left, look_up, move_forward, move_right]
which of the following images corresponds to the result?
A. [Image 3]
B. [Image 4]
C. [Image 5]
D. [Image 6]

(Note: after every rotation, the Euler angles (rx, ry, rz) are rounded to the nearest integer multiples of the rotation step along each axis.)

Model response

<think>Turning left makes the current center features shift right, and looking up should show more upper wall/ceiling. After the small move, the shelf/door area becomes the main view while the papasan chair moves out of frame. That matches image C.</think><action>answer(C)</action>

V2P

View‑to‑Path

Prompt

Given the initial view [Image 1] and a top-down reference [Image 2], which action sequence will reach the target view [Image 3]?
(Action semantics: translation step = 0.5 m; rotation step = 30.0 degrees per step.)
A. [turn_left, turn_left, turn_left, look_down, move_left, move_left]
B. [turn_left, turn_left, turn_left, turn_left, turn_left, turn_left, move_right, move_up]
C. [turn_left, turn_left, turn_left, turn_left, turn_left, turn_left, move_backward, move_backward, move_right, move_down]
D. [turn_right, turn_right, turn_right, turn_right, look_up, move_backward, move_backward, move_left, move_left]

(Note: after every rotation, the Euler angles (rx, ry, rz) are rounded to the nearest integer multiples of the rotation step along each axis.)

Model response

<think>From the initial view, the camera is near the left-side chair/stool area, facing toward the left wall. The target view is a downward-looking view over the rug near the lower-middle/right side of the room. The only sequence that sensibly rotates toward that area and matches the downward pitch is: turn left 90°, look down, then translate sideways across the rug.</think><action>answer(A)</action>
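The action semantics quoted in both prompts (0.5 m translation steps, 30° rotation steps, Euler angles rounded to the nearest multiple of the rotation step after every rotation) can be sketched as follows. This is a minimal illustration rather than ViewSuite's implementation. The `Pose` class, the `apply_actions` helper, the axis convention (x forward and y left at yaw 0, z up), translating in the yaw-rotated horizontal frame, and rounding ties upward are all assumptions made for the example.

```python
import math
from dataclasses import dataclass, replace

TRANSLATION_STEP = 0.5  # metres per move action (stated in the prompts)
ROTATION_STEP = 30.0    # degrees per rotation action (stated in the prompts)


def snap_angle(angle_deg, step=ROTATION_STEP):
    """Round to the nearest multiple of `step` (ties round up), wrapped to [-180, 180)."""
    snapped = math.floor(angle_deg / step + 0.5) * step
    return (snapped + 180.0) % 360.0 - 180.0


@dataclass(frozen=True)
class Pose:
    """Hypothetical camera pose; only yaw and pitch are modelled here."""
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0
    yaw: float = 0.0    # degrees, positive = turned left (assumed convention)
    pitch: float = 0.0  # degrees, positive = looking up (assumed convention)


def apply_action(pose, action):
    """Return the pose reached after one discrete camera action."""
    rotations = {"turn_left": ("yaw", 1), "turn_right": ("yaw", -1),
                 "look_up": ("pitch", 1), "look_down": ("pitch", -1)}
    if action in rotations:
        axis, sign = rotations[action]
        # Snap after every rotation, as the prompt's note describes.
        snapped = snap_angle(getattr(pose, axis) + sign * ROTATION_STEP)
        return replace(pose, **{axis: snapped})

    yaw = math.radians(pose.yaw)
    fwd = (math.cos(yaw), math.sin(yaw))    # assumed forward direction
    left = (-math.sin(yaw), math.cos(yaw))  # assumed left direction
    moves = {"move_forward": (fwd[0], fwd[1], 0.0),
             "move_backward": (-fwd[0], -fwd[1], 0.0),
             "move_left": (left[0], left[1], 0.0),
             "move_right": (-left[0], -left[1], 0.0),
             "move_up": (0.0, 0.0, 1.0),
             "move_down": (0.0, 0.0, -1.0)}
    if action not in moves:
        raise ValueError(f"unknown action: {action}")
    dx, dy, dz = moves[action]
    s = TRANSLATION_STEP
    return replace(pose, x=pose.x + s * dx, y=pose.y + s * dy, z=pose.z + s * dz)


def apply_actions(pose, actions):
    """Apply a whole action sequence, e.g. the one in the P2V prompt."""
    for action in actions:
        pose = apply_action(pose, action)
    return pose
```

Under these assumptions, the three `turn_left` actions that open option A in the V2P prompt give a yaw of 90°, which matches the "turn left 90°" in the model response.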

IVP

Interactive View Planning

Interactive rollout viewer: steps through each turn's user prompt and model response, plus the system prompt.

286 ScanNet scenes

~55K view pairs

~165K task instances

6-DoF actions

0.5 m / 30° IVP success threshold

Self-Exploration with View Graph Distillation

Each iteration alternates two stages. In the self-exploration stage, the agent interacts with ViewSuite environments and its trajectories are incrementally compressed into a view graph. In the view graph distillation stage, paths are sampled from this graph and reformulated into diverse view-planning demonstrations used to fine-tune the policy. The resulting model initializes the next self-exploration stage.

Iterative training pipeline (self-exploration + view graph distillation)

1. RL Stage

The agent runs IVP rollouts on ViewSuite environments with PPO. Reward is sparse: +1 when the submitted target estimate is within 0.5 m / 30° of the ground truth, plus a small format reward. Even with a success rate near 2.5%, every rollout is useful, since it streams into the graph builder.

2. Graph Construction

A background process incrementally compresses every completed trajectory into a view graph. Nodes are viewpoints (with their rendered views); edges are actions between viewpoints. Nodes and edges are deduplicated via viewpoint similarity, so success and failure alike contribute to one shared structured graph.

3. Task Reformulation

Any path P = (v₀, a₁, v₁, …, a_K, v_K) in the graph yields a valid IVP demonstration regardless of whether the original episode succeeded: end node → target, start node → initial view, action chain → labeled plan. This is the lever that lets us learn from failed episodes.

4. SFT Stage

Sampled paths are reformulated into supervised view-planning demonstrations and used to fine-tune the policy with standard cross-entropy. The resulting model initializes the next RL stage, which starts the next iteration. The stages alternate: RL → SFT → RL → SFT.
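The graph construction and task reformulation stages can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' code. The `ViewGraph` class, `node_key`, `sample_path`, and `to_demo` are hypothetical names, and deduplication by quantizing poses to the 0.5 m / 30° grid is only a stand-in for the viewpoint-similarity test, which the text does not specify.

```python
import random
from collections import defaultdict

GRID_POS = 0.5   # metres; assumed dedup cell size
GRID_ROT = 30.0  # degrees; assumed dedup cell size


def node_key(pose):
    """Quantize (x, y, z, yaw, pitch) so that near-identical viewpoints share a node."""
    x, y, z, yaw, pitch = pose
    return (round(x / GRID_POS), round(y / GRID_POS), round(z / GRID_POS),
            round(yaw / GRID_ROT) % 12, round(pitch / GRID_ROT))


class ViewGraph:
    """Nodes are deduplicated viewpoints; edges are actions between them."""

    def __init__(self):
        self.edges = defaultdict(dict)  # node -> {action: next_node}

    def add_trajectory(self, poses, actions):
        """Merge one rollout (len(poses) == len(actions) + 1), successful or not."""
        keys = [node_key(p) for p in poses]
        for src, act, dst in zip(keys, actions, keys[1:]):
            self.edges[src][act] = dst
            self.edges.setdefault(dst, {})  # make sure the end node exists

    def sample_path(self, max_len, rng):
        """Random walk of up to `max_len` edges from a node with outgoing edges."""
        starts = sorted(n for n, out in self.edges.items() if out)
        node = rng.choice(starts)
        nodes, actions = [node], []
        for _ in range(max_len):
            out = self.edges[node]
            if not out:
                break
            act = rng.choice(sorted(out))
            node = out[act]
            nodes.append(node)
            actions.append(act)
        return nodes, actions


def to_demo(nodes, actions):
    """Reformulate a path: start node -> initial view, end node -> target, actions -> plan."""
    return {"initial_view": nodes[0], "target_view": nodes[-1], "plan": list(actions)}
```

Because a path is valid regardless of which episode contributed each edge, two failed rollouts that pass through the same viewpoint can be stitched into a demonstration that neither rollout contained on its own.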

Results

Frontier VLM benchmark on ViewSuite-5K test (530 view pairs)

Accuracy / success rate (%) on the Short (d < 3) and Long (d ≥ 3) splits.

Model	Path-to-View			View-to-Path			Interactive View Planning			Overall
Model	Short	Long	All	Short	Long	All	Short	Long	All	Overall
Random Response	20.7	24.6	23.3	24.3	26.5	25.7	2.2	0.0	0.8	16.6
Proprietary Models
GPT-5.4 Pro	70.8	43.8	53.2	72.4	38.8	50.6	34.8	11.7	19.9	41.2
Gemini 3.1 Pro	63.8	40.9	48.9	53.0	47.5	49.4	28.6	17.4	21.3	39.9
GPT-5.4	57.3	42.9	47.9	60.5	37.4	45.5	33.5	7.5	16.6	36.7
Grok 4.20 Beta	61.6	38.0	46.2	44.9	44.3	44.5	17.3	2.9	7.9	32.9
GPT-5.1	60.5	35.1	44.0	52.4	33.3	40.0	11.9	3.2	6.2	30.1
Claude Opus 4.6	46.5	28.4	34.7	47.6	38.3	41.5	23.8	3.8	10.8	29.0
Gemini 3 Pro	50.3	31.0	37.7	44.9	35.4	38.7	13.5	7.0	9.2	28.5
Open-Weight Models
Qwen3.5-397B	57.8	30.1	39.8	44.3	31.0	35.7	12.4	0.0	4.3	26.6
GLM-4.6V	36.4	23.2	27.8	31.4	29.7	30.2	9.2	1.2	4.0	20.7
Qwen2.5-VL-72B	28.1	29.3	28.9	35.7	30.1	32.1	2.2	0.6	1.1	20.7
Qwen3-VL-32B	27.0	27.5	27.4	41.1	28.7	33.0	4.3	0.0	1.5	20.6
Kimi K2.5	36.2	24.6	28.7	18.4	29.3	25.5	4.9	1.2	2.5	18.9
Qwen2.5-VL-7B	23.8	32.5	29.4	27.0	22.9	24.3	7.0	0.0	2.5	18.7

GPT-5.4 Pro refuses 23 of the 530 IVP instances under its content policy; its IVP rates are computed over the remaining 507 valid instances (101 / 507 = 19.9%). All other models are evaluated on the full 530.

Results on Interactive View Planning

Success rate (%) under the calibrated 0.5 m / 30° threshold. Our framework lifts a 7B model from 2.5% → 47.8%, beating every proprietary VLM evaluated.

Method	Short	Long	All
Prompting Baselines
Qwen2.5-VL-7B-Instruct	7.0	0.0	2.5
GPT-5.4 Pro	34.8	11.7	19.9
Gemini 3.1 Pro	28.6	17.4	21.3
Training Baselines
Direct PPO	7.0	1.2	3.2
Direct GRPO (filter)	10.8	2.2	5.2
Success-Only Bootstrapping	14.0	2.0	6.2
Ablations of Our Framework
Random-graph	25.4	6.4	13.0
1 iter + RL	24.3	5.4	12.0
2 iter + RL	49.7	16.2	27.9
Ours (Self-Exploration + View Graph Distillation, 3 iters)
Qwen2.5-VL-7B-Instruct	67.2	36.9	47.8
Qwen3-VL-8B-Instruct	56.8	19.4	32.5

Findings & Analyses

Every analysis from the paper, packaged into one section: where models fail, what bottlenecks IVP, what training actually learns, and how the priors transfer.

1 · Single-turn tracking vs. multi-turn planning

Single-turn tracking ≫ multi-turn planning. The best VLMs reach ~70% on short-horizon P2V/V2P but collapse to at most 21.3% on Interactive View Planning. Most models score below 10%; on long-horizon samples most fall below 3%. Open-weight models stay below 5% on IVP.

70.8%

GPT-5.4 Pro · P2V short

21.3%

Best IVP (Gemini 3.1 Pro)

<5%

Every open-weight model · IVP

<3%

Most models · long-horizon IVP

Takeaway. Local view-action knowledge does not compose into multi-turn plans.

1b · When models succeed, are they reasoning or just matching?

A competent spatial reasoner need not see the target to localize it: after a few informative moves it could infer where the target lies and submit its pose without ever visiting it. We test whether frontier successes work this way. For every successful IVP rollout, we check whether the agent observed any view within the success threshold (0.5 m / 30°) of the target before answering. Across all five models, at least 90% of successes (up to 99.1% for Gemini 3.1 Pro) follow such a visual encounter. Genuine inference of a correct pose, without ever visiting a threshold-close view, accounts for at most ~10% of successes.

IVP successes: visual encounter vs. genuine inference

Model	#Success	Visited target view	Inferred (no visit)
GPT-5.4 Pro	101	99 (98.0%)	2 (2.0%)
Gemini 3.1 Pro	113	112 (99.1%)	1 (0.9%)
GPT-5.4	88	83 (94.3%)	5 (5.7%)
Grok 4.20 Beta	42	38 (90.5%)	4 (9.5%)
Claude Opus 4.6	57	54 (94.7%)	3 (5.3%)
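The encounter check behind this table can be sketched as follows. It is an illustration under assumptions: the pose layout `(x, y, z, rx, ry, rz)`, the per-axis Euler comparison, the inclusive `<=` threshold, and the helper names are not specified in the text.

```python
import math

POS_THRESH_M = 0.5     # IVP success threshold on position (from the text)
ROT_THRESH_DEG = 30.0  # IVP success threshold on rotation (from the text)


def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def within_threshold(pose, target):
    """True if `pose` is threshold-close to `target` (assumed per-axis Euler test)."""
    pos_err = math.dist(pose[:3], target[:3])
    rot_err = max(angle_diff(p, t) for p, t in zip(pose[3:], target[3:]))
    return pos_err <= POS_THRESH_M and rot_err <= ROT_THRESH_DEG


def classify_success(observed_poses, target):
    """Label a successful rollout by whether it saw a threshold-close view before answering."""
    if any(within_threshold(p, target) for p in observed_poses):
        return "visited"
    return "inferred"
```

A rollout labelled `"inferred"` is one where the submitted pose was correct even though no observed view came within the threshold, which is the rare case in the table above.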

Takeaway. The planning gap is really a cognitive gap: models mostly succeed by view matching after they reach the target, not by localizing it in advance.

2 · Difficulty along rotation and translation axes

Decomposing view distance into rotation and translation axes flips the difficulty signal. P2V/V2P degrade primarily with rotation distance (cumulative rotations are hard to mentally simulate); IVP reverses this — success collapses with position distance, since 3D translation needs spatial-layout understanding and path planning.

Dual axis: P2V/V2P degrade with rotation, IVP collapses with translation

P2V/V2P (left two): accuracy falls along the rotation axis. IVP (right): success collapses along the position axis (~7× drop for GPT-5.4 Pro).

Takeaway. The two task families are bottlenecked by different spatial reasoning skills.

2b · Which per-sample factors correlate with success?

Spearman correlation between 12 per-sample factors (geometric distance, visual overlap, directional geometry) and each model's binary success makes the difficulty signal even sharper. P2V / V2P success correlates most with orientation_agreement (ρ up to +0.30, same-facing camera pairs are easier). IVP success collapses with pos_dist (ρ down to −0.42 for GPT-5.4 Pro), with rotation barely registering. Visual-overlap factors help single-turn tasks but contribute almost nothing to IVP.

Spearman correlation between sample factors and per-model success

Rows: 12 sample-level factors. Columns: per-model binary success on P2V, V2P, IVP. Blue cells are negative correlations (factor makes the sample harder), red cells positive. The strong negative column under pos_dist in the IVP panel is the position-bottleneck signature.
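For reference, the Spearman correlation between a continuous factor and binary success can be computed as the Pearson correlation of average ranks. The sketch below uses only the standard library; the function names are illustrative and the data in the usage note is made up, not taken from the paper.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman's rho as Pearson correlation of tie-averaged ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

With binary success, every success shares one rank and every failure another, so tie handling matters. For example, `spearman([0.5, 1.0, 1.5, 2.0, 2.5, 3.0], [1, 1, 1, 0, 0, 0])` is negative, the sign a position bottleneck would produce.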

Takeaway. Position distance is to IVP what orientation agreement is to P2V / V2P — the single dominant predictor of per-sample difficulty.

3 · Does the turn budget bottleneck IVP?

Doubling the turn budget from 10 → 20 helps every model (Claude Opus 4.6 rises from 10.8% to 17.7%), but going from 20 → 30 adds little (at most +1.9 points). Models exhaust their effective strategies before the turn limit runs out.

IVP All-split accuracy (%)

Model	B = 10	B = 20	B = 30	Δ 10→30
Gemini 3.1 Pro	21.3	23.0	23.2	+1.9
GPT-5.4	16.6	19.2	20.2	+3.6
Claude Opus 4.6	10.8	17.7	19.6	+8.8
Grok 4.20 Beta	7.9	11.7	11.7	+3.8

Takeaway. IVP is bottlenecked by planning ability, not by horizon length.

4 · Does higher-fidelity rendering change the picture?

We re-render the test set with 3D Gaussian Splatting (GS), then re-evaluate at budget = 10. IVP improves only marginally. P2V/V2P show mixed and sometimes large swings — Gemini 3.1 Pro gains +6.4 on P2V, while GPT-5.4 and Grok 4.20 Beta lose 14.4 and 13.0 points on V2P.

Same scene, same target: three independent runs on scene0518_00, all of which succeed.

**Gemini 3.1 Pro · Point Cloud**: success · 10 turns · default point-cloud render

**Gemini 3.1 Pro · Gaussian Splat**: success · 4 turns · higher-fidelity neural render

**GPT-5.4 · Point Cloud**: success · 6 turns · different model, same render

Each GIF cycles through target → initial → agent's per-turn views. The three runs vary on two axes (model and renderer); all reach the target within the unified-distance threshold, but turn counts differ. Higher fidelity occasionally accelerates planning (4 turns vs. 10), yet, as the table below shows, it does not close the broader IVP gap.

Gaussian-splat re-render, B = 10 (% accuracy)

Model	P2V	V2P	IVP	Δ Overall vs Point Cloud
Gemini 3.1 Pro	55.3	49.4	23.2	+2.7
GPT-5.4	43.8	31.1	18.5	−5.6
Claude Opus 4.6	35.3	41.3	12.3	+0.6
Grok 4.20 Beta	28.3	31.5	8.1	−10.3

Takeaway. The bottleneck is composing view changes, not the visual fidelity of each observation.

5 · Comparing training recipes for IVP

Direct PPO plateaus at 3.2%; GRPO with reward-variance filtering reaches 5.2%; iterating PPO with SFT on only successful trajectories (Success-Only Bootstrapping) gets to 6.2%. The breakthrough is recognizing that even failed trajectories encode valid view transitions: A → B is supervision regardless of the original goal. Compressing all exploration into a graph and reformulating sampled paths into view-planning demos takes Qwen2.5-VL-7B from 2.5% → 47.8%.
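One way to see why sparse reward starves GRPO is to look at group-normalized advantages. The sketch below reads "reward-variance filtering" as dropping prompt groups whose rollouts all received the same reward; that reading, the function names, and the group layout are assumptions, since the text does not spell out the filter.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps) within one prompt group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def filter_groups(groups, min_std=1e-8):
    """Keep only groups whose rewards vary; constant groups give all-zero advantages."""
    return {gid: rewards for gid, rewards in groups.items()
            if statistics.pstdev(rewards) > min_std}


# Hypothetical rollout groups: only "b" carries a learning signal.
example_groups = {
    "a": [0.0, 0.0, 0.0, 0.0],
    "b": [0.0, 1.0, 0.0, 0.0],
    "c": [1.0, 1.0, 1.0, 1.0],
}
kept = filter_groups(example_groups)
```

As a rough illustration, if rollouts succeeded independently with probability 2.5% and groups had a hypothetical size of 8, about 0.975^8 ≈ 82% of groups would be all-failure and would be filtered out, leaving little data to learn from.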

IVP success rate, Qwen2.5-VL-7B base (%)

Method	Short	Long	All
Base model (prompting)	7.0	0.0	2.5
Direct PPO	7.0	1.2	3.2
Direct GRPO (filter)	10.8	2.2	5.2
Success-Only Bootstrapping	14.0	2.0	6.2
Random-graph (ablation)	25.4	6.4	13.0
1 iter + RL	24.3	5.4	12.0
2 iter + RL	49.7	16.2	27.9
Ours · Qwen2.5-VL-7B (3 iters)	67.2	36.9	47.8
Ours · Qwen3-VL-8B (3 iters)	56.8	19.4	32.5

Takeaway. Useful supervision comes from the geometry recorded by failed exploration, not from filtering for successes.

5b · How does the ranking shift under No-Snap and No-Submit?

Two evaluation knobs could in principle inflate our IVP numbers: rotation snapping to the discrete 30° grid, and the submit requirement. We re-evaluate under No-Snap (raw rotation magnitudes executed as-is, no on-grid rounding) and No-Submit (success the moment the pose enters the threshold). The ordering between models is unchanged under all three protocols: our trained models stay ahead of the proprietary baselines, although the margin narrows under No-Snap (19.6% vs. 15.7% for the best of each).

IVP All-split success rate (%), per protocol

Method	Default	No-Snap	No-Submit
Gemini 3.1 Pro	21.3	15.7	31.5
GPT-5.4	16.6	13.0	31.3
Ours · Qwen2.5-VL-7B	47.8	19.6	60.2
Ours · Qwen3-VL-8B	32.5	18.5	48.3

No-Snap lowers every model: without rounding, per-step rotation residuals accumulate over 10 turns and the agent drifts off the on-grid pose distribution from which targets are drawn. No-Submit raises every model because no commitment to a final answer is required. Under both variants, our trained models remain ahead of the proprietary baselines.

Takeaway. The 47.8% headline number isn't an artefact of rotation snapping or of the submit step — relax either one and the ranking is preserved.

6 · How does the trained agent's coverage evolve over turns?

Tracked 3D point-cloud coverage reveals a clean two-phase strategy: scene coverage grows rapidly in early turns as the agent looks around, then plateaus while the target-intersection ratio accelerates in the middle turns as the agent moves toward the target (peaking near 55%). Base and frontier models show flat or erratic target coverage instead.

Two-phase exploration: scene coverage then target intersection

Left: scene coverage ratio. Right: target intersection ratio. Our trained model is the only one with sustained monotonic growth on the target axis.
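The two curves can be sketched with sets of point IDs. The definitions here are assumptions, since the text does not give formulas: scene coverage is taken as the cumulative share of scene points seen so far, and target intersection as the share of the target view's visible points that appear in the current turn's view. `coverage_curves` and its arguments are hypothetical names.

```python
def coverage_curves(per_turn_visible, scene_points, target_points):
    """Return (scene_coverage, target_intersection), one value per turn.

    per_turn_visible: list of sets of point IDs visible at each turn.
    scene_points:     set of all point IDs in the scene.
    target_points:    set of point IDs visible from the target view.
    """
    seen = set()
    scene_cov, target_cov = [], []
    for visible in per_turn_visible:
        visible = set(visible) & scene_points
        seen |= visible  # cumulative coverage (assumed definition)
        scene_cov.append(len(seen) / len(scene_points))
        # Per-turn overlap with the target view (assumed definition).
        target_cov.append(len(visible & target_points) / len(target_points))
    return scene_cov, target_cov
```

In a two-phase rollout, the first list rises early and flattens, while the second stays near zero during look-around turns and climbs once the agent heads toward the target.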

Full model comparison (all 15 models)

Takeaway. The trained policy follows a goal-directed two-phase trajectory; baselines do not.

7 · How does training reshape per-layer image attention?

Image-attention fraction (the share of response-token attention pointed at image tokens) reveals two patterns. Layer-wise: our trained model attends more to images in early layers (L0–L4) and less in deep layers (L8+) than the base — it grounds visually early, then operates in text space. Turn-wise: image attention decreases monotonically across turns, consistent with progressive information accumulation, while the base model stays flat.

Figure: image-attention fraction per layer × turn, for all 28 layers. Our trained model front-loads visual grounding and then drops off; the base model is flatter across both axes.
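The image-attention fraction for one layer can be sketched as the attention mass that response tokens place on image tokens, divided by their total attention mass. The input layout below (one head-averaged attention row per response token, plus a boolean image-token mask over the context) is an assumption for illustration.

```python
def image_attention_fraction(attn_rows, is_image_token):
    """Share of response-token attention that lands on image tokens.

    attn_rows:      list of rows, one per response token; each row holds that
                    token's attention weights over the context tokens.
    is_image_token: list of booleans, True where the context token is an image token.
    """
    image_mass = 0.0
    total_mass = 0.0
    for row in attn_rows:
        for weight, is_image in zip(row, is_image_token):
            total_mass += weight
            if is_image:
                image_mass += weight
    return image_mass / total_mass
```

Computing this value per layer and per turn gives the layer × turn grid described above.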

Takeaway. Training reshapes how the VLM uses its visual stream, not just what it outputs.

8 · How is turn-usage distributed across models?

The base Qwen and GPT-5.4 Pro terminate most rollouts in a single turn (no exploration). Our trained model and Gemini 3.1 Pro use the full 10-turn budget. Of the rollouts that do use all turns, our model maintains a much higher success rate on harder episodes.

Turn distribution and success by turn count

(a) Total rollouts by turn count. (b) Successful rollouts. (c) Success rate by turns used.

Takeaway. Frontier VLMs that look like they're solving IVP often submit instantly without planning.

9 · Do the learned priors transfer to other view-related tasks?

Under identical GRPO post-training, our trained model ends up ahead of the base on both internal (P2V, V2P) and external view-dependent tasks, even though it starts lower on P2V (25.7% vs. 32.1%). On the external MindCube benchmark (no shared scenes, actions, or rendering pipeline) we gain ~10 points (66.2% vs. 56.3%).

Spatial-prior transfer under identical GRPO post-training (% accuracy)

Model	P2V init	P2V +GRPO	V2P init	V2P +GRPO	MindCube init	MindCube +GRPO
Base Qwen2.5-VL-7B	32.1	45.1	29.2	44.8	33.0	56.3
Ours (after IVP training)	25.7	57.3	31.6	52.8	33.1	66.2

Takeaway. Interactive view planning is not a narrow skill — its priors strengthen view-dependent reasoning both within and beyond ViewSuite.

Cite

If you use ViewSuite or its trained models, please cite our paper.

@misc{wang2026planning,
  title={Planning with the Views},
  author={Kangrui Wang and Linjie Li and Zhengyuan Yang and Shiqi Chen and Zihan Wang and Li Fei-Fei and Jiajun Wu and Leonidas Guibas and Lijuan Wang and Manling Li},
  year={2026},
  eprint={2605.29563},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2605.29563},
}