ViewSuite

Can VLMs predict how each camera move changes the view, and plan many such moves ahead?

Kangrui Wang¹, Linjie Li², Zhengyuan Yang³, Shiqi Chen⁴, Zihan Wang¹

Li Fei-Fei⁵, Jiajun Wu⁵, Leonidas Guibas⁵, Lijuan Wang³, Manling Li¹

¹Northwestern University, ²University of Washington, ³Microsoft, ⁴University of Oxford, ⁵Stanford University

Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this ability view planning: using camera moves as planning primitives to find a target view in 3D. We study it in ViewSuite, a 6-DoF environment on real ScanNet scenes, and decompose it into two abilities: tracking how given camera actions change the view, and composing a path that localizes an unseen target view.

Across 13 frontier VLMs, a sharp planning gap emerges: models track local view transitions but collapse when they must plan toward an unseen target view. This inability cannot simply be fixed by reinforcement learning (RL): with success near 2.5%, reward is too sparse for RL to bootstrap.

Our key insight is to harvest valid view transitions from on-policy self-exploration, aggregate them into a view graph, and distill that graph into supervised demonstrations. With no stronger teacher, this lifts Qwen2.5-VL-7B from 2.5% → 47.8% on interactive view planning, surpassing GPT-5.4 Pro (19.9%) and Gemini 3.1 Pro (21.3%). View planning is a clean probe for prospective spatial reasoning: looking ahead, predicting how future viewpoint changes reshape observation, and inferring a target view's camera pose before it is fully observed, a capability frontier VLMs still lack.

Three Diagnostic Tasks

ViewSuite probes the two coupled abilities of view planning: tracking how camera actions change the view, and composing them into a multi-turn plan that localizes an unseen target view.

Overview of ViewSuite. Left: a point-cloud environment built on ScanNet, with a top-down view marking the initial (blue) and target (orange) viewpoints and their rendered first-person views. Middle: Path-to-View (P2V) and View-to-Path (V2P), both single-turn. Right: Interactive View Planning (IVP), a multi-turn task where the agent issues 6-DoF camera actions to localize the target view and submits its camera pose.
P2V: Path-to-View

[Images: initial view, top-down reference, and candidate views A, B, C, D]

Prompt:
Given the initial view [Image 1] and a top-down reference [Image 2], after you execute the following action sequence (translation step = 0.5 m; rotation step = 30.0 degrees per step):
[turn_left, look_up, move_forward, move_right]
which of the following images corresponds to the result?
A. [Image 3]
B. [Image 4]
C. [Image 5]
D. [Image 6]

(Note: after every rotation, the Euler angles (rx, ry, rz) are rounded to the nearest integer multiples of the rotation step along each axis.)
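The note in the prompt describes a snapping rule applied to the Euler angles after each rotation. A minimal sketch of that rule follows, assuming angles in degrees. The function names are hypothetical, and the tie-breaking (exact half-steps round up) is an assumption, since the prompt does not say how ties are resolved.

```python
import math

ROTATION_STEP_DEG = 30.0  # rotation step quoted in the prompt


def snap_angle(angle_deg: float, step: float = ROTATION_STEP_DEG) -> float:
    """Round one angle to the nearest integer multiple of `step`.

    Exact half-steps are rounded up here (assumption, not from the source).
    """
    return math.floor(angle_deg / step + 0.5) * step


def snap_euler(rx: float, ry: float, rz: float, step: float = ROTATION_STEP_DEG):
    """Snap each Euler angle independently, as the prompt note describes."""
    return tuple(snap_angle(a, step) for a in (rx, ry, rz))
```

For example, `snap_euler(28.7, -61.2, 0.4)` yields `(30.0, -60.0, 0.0)`: each axis moves to its closest multiple of 30°.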
Model response:
<think>Turning left makes the current center features shift right, and looking up should show more upper wall/ceiling. After the small move, the shelf/door area becomes the main view while the papasan chair moves out of frame. That matches image C.</think><action>answer(C)</action>
V2P: View-to-Path

[Images: initial view, top-down reference, and target view]

Prompt:
Given the initial view [Image 1] and a top-down reference [Image 2], which action sequence will reach the target view [Image 3]?
(Action semantics: translation step = 0.5 m; rotation step = 30.0 degrees per step.)
A. [turn_left, turn_left, turn_left, look_down, move_left, move_left]
B. [turn_left, turn_left, turn_left, turn_left, turn_left, turn_left, move_right, move_up]
C. [turn_left, turn_left, turn_left, turn_left, turn_left, turn_left, move_backward, move_backward, move_right, move_down]
D. [turn_right, turn_right, turn_right, turn_right, look_up, move_backward, move_backward, move_left, move_left]

(Note: after every rotation, the Euler angles (rx, ry, rz) are rounded to the nearest integer multiples of the rotation step along each axis.)
Model response:
<think>From the initial view, the camera is near the left-side chair/stool area, facing toward the left wall. The target view is a downward-looking view over the rug near the lower-middle/right side of the room. The only sequence that sensibly rotates toward that area and matches the downward pitch is: turn left 90°, look down, then translate sideways across the rug.</think><action>answer(A)</action>
IVP: Interactive View Planning

[Interactive demo: initial view, top-down reference, target view, and the agent's current view at each step, with the user prompt and model response for each turn]
ScanNet scenes: 286
View pairs: ~55K
Task instances: ~165K
6-DoF actions: 12
IVP success threshold: 0.5 m / 30°

Self-Exploration with View Graph Distillation

Each iteration alternates two stages. In the self-exploration stage, the agent interacts with ViewSuite environments and its trajectories are incrementally compressed into a view graph. In the view graph distillation stage, paths are sampled from this graph and reformulated into diverse view-planning demonstrations used to fine-tune the policy. The resulting model initializes the next self-exploration stage.

Iterative training pipeline (self-exploration + view graph distillation)

1. RL Stage

The agent runs IVP rollouts on ViewSuite environments with PPO. Reward is sparse: +1 when the submitted target estimate is within 0.5 m / 30° of the ground truth, plus a small format reward. Even with success rate near 2.5%, every rollout is useful, since it streams into the graph builder.
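The sparse reward described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the size of the format bonus (`FORMAT_BONUS`) is hypothetical because the text only calls it "small", the thresholds are treated as inclusive, and the rotation error is passed in as a precomputed angle because the page does not define the rotation metric.

```python
import math

POS_THRESHOLD_M = 0.5      # from the text
ROT_THRESHOLD_DEG = 30.0   # from the text
FORMAT_BONUS = 0.1         # hypothetical value; the text only says "small"


def ivp_reward(pred_pos, target_pos, rot_error_deg, well_formatted):
    """Return the sparse success reward plus an optional format bonus."""
    pos_error = math.dist(pred_pos, target_pos)  # Euclidean distance in metres
    success = pos_error <= POS_THRESHOLD_M and rot_error_deg <= ROT_THRESHOLD_DEG
    reward = 1.0 if success else 0.0
    if well_formatted:
        reward += FORMAT_BONUS
    return reward
```

With a success rate near 2.5%, almost every rollout would receive only the format bonus under this scheme, which is the sparsity the text refers to.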

2. Graph Construction

A background process incrementally compresses every completed trajectory into a view graph. Nodes are viewpoints (with their rendered views); edges are actions between viewpoints. Nodes and edges are deduplicated via viewpoint similarity, so success and failure alike contribute to one shared structured graph.
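One way to picture the graph builder is sketched below. It is a simplified stand-in: "viewpoint similarity" is approximated by quantizing each pose onto a 0.5 m / 30° grid, rendered views are omitted, and the class and method names are hypothetical.

```python
from collections import defaultdict


class ViewGraph:
    """Toy view graph: nodes are quantized poses, edges are actions."""

    def __init__(self, pos_cell=0.5, rot_cell=30.0):
        self.pos_cell = pos_cell
        self.rot_cell = rot_cell
        self.edges = defaultdict(dict)  # node -> {action: next node}

    def key(self, pose):
        """Map (x, y, z, rx, ry, rz) to a grid cell; nearby poses share a key."""
        x, y, z, rx, ry, rz = pose
        bins = round(360 / self.rot_cell)
        pos = tuple(round(v / self.pos_cell) for v in (x, y, z))
        rot = tuple(round(a / self.rot_cell) % bins for a in (rx, ry, rz))
        return pos + rot

    def add_trajectory(self, poses, actions):
        """Merge one trajectory; `poses` has one more entry than `actions`."""
        keys = [self.key(p) for p in poses]
        for src, action, dst in zip(keys, actions, keys[1:]):
            self.edges[src][action] = dst     # duplicate edges collapse
            self.edges.setdefault(dst, {})    # register the end node too
```

Because the success flag of an episode never enters `add_trajectory`, successful and failed rollouts are merged into the same structure.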

3. Task Reformulation

Any path P = (v₀, a₁, v₁, …, aK, vK) in the graph yields a valid IVP demonstration regardless of whether the original episode succeeded: end node → target, start node → initial view, action chain → labeled plan. This is the lever that lets us learn from failed episodes.
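The reformulation above can be illustrated with a short random walk over a graph stored as a plain dictionary. The sampling strategy (uniform random walk) and the function name are assumptions made for this sketch; the page does not specify how paths are sampled.

```python
import random


def sample_demo(edges, max_len, rng):
    """Sample a path and relabel it as an IVP demonstration.

    edges: {node: {action: next_node}}; at least one node needs an outgoing edge.
    """
    start = rng.choice(sorted(n for n, out in edges.items() if out))
    node, plan = start, []
    for _ in range(max_len):
        out = edges.get(node, {})
        if not out:
            break  # dead end: stop the walk here
        action = rng.choice(sorted(out))
        plan.append(action)
        node = out[action]
    # start node -> initial view, end node -> target, action chain -> labeled plan
    return {"initial": start, "target": node, "plan": plan}
```

Replaying `plan` from `initial` always ends at `target`, so the demonstration is valid whether or not the episodes that contributed its edges succeeded.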

4. SFT Stage

Sampled paths are reformulated into supervised view-planning demonstrations and used to fine-tune the policy with standard cross-entropy. The resulting model initializes the next RL stage, which starts the next iteration. Stages alternate RL → SFT → RL → SFT.

Results

Frontier VLM benchmark on ViewSuite-5K test (530 view pairs)

Accuracy / Success Rate (%) on Short (d < 3) and Long (d ≥ 3) splits.

                                Path-to-View               View-to-Path  Interactive View Planning
Model                Short     Long      All    Short     Long      All    Short     Long      All  Overall
Random Response       20.7     24.6     23.3     24.3     26.5     25.7      2.2      0.0      0.8     16.6
Proprietary Models
GPT-5.4 Pro           70.8     43.8     53.2     72.4     38.8     50.6     34.8     11.7     19.9     41.2
Gemini 3.1 Pro        63.8     40.9     48.9     53.0     47.5     49.4     28.6     17.4     21.3     39.9
GPT-5.4               57.3     42.9     47.9     60.5     37.4     45.5     33.5      7.5     16.6     36.7
Grok 4.20 Beta        61.6     38.0     46.2     44.9     44.3     44.5     17.3      2.9      7.9     32.9
GPT-5.1               60.5     35.1     44.0     52.4     33.3     40.0     11.9      3.2      6.2     30.1
Claude Opus 4.6       46.5     28.4     34.7     47.6     38.3     41.5     23.8      3.8     10.8     29.0
Gemini 3 Pro          50.3     31.0     37.7     44.9     35.4     38.7     13.5      7.0      9.2     28.5
Open-Weight Models
Qwen3.5-397B          57.8     30.1     39.8     44.3     31.0     35.7     12.4      0.0      4.3     26.6
GLM-4.6V              36.4     23.2     27.8     31.4     29.7     30.2      9.2      1.2      4.0     20.7
Qwen2.5-VL-72B        28.1     29.3     28.9     35.7     30.1     32.1      2.2      0.6      1.1     20.7
Qwen3-VL-32B          27.0     27.5     27.4     41.1     28.7     33.0      4.3      0.0      1.5     20.6
Kimi K2.5             36.2     24.6     28.7     18.4     29.3     25.5      4.9      1.2      2.5     18.9
Qwen2.5-VL-7B         23.8     32.5     29.4     27.0     22.9     24.3      7.0      0.0      2.5     18.7

GPT-5.4 Pro refuses 23 of the 530 IVP instances under its content policy; its IVP rates are computed over the remaining 507 valid instances (101 / 507 = 19.9%). All other models are evaluated on the full 530.
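The adjusted denominator in the footnote can be reproduced with a few lines of arithmetic. The helper name is hypothetical; the counts are the ones quoted above.

```python
def success_rate(successes: int, total: int, refused: int = 0) -> float:
    """Success rate in percent over the instances the model actually attempted."""
    valid = total - refused
    return 100.0 * successes / valid


# GPT-5.4 Pro: 101 successes, 530 instances, 23 refusals -> 507 valid instances.
gpt54_pro_ivp = round(success_rate(101, 530, refused=23), 1)  # 19.9
```

Scoring the same 101 successes against all 530 instances would give about 19.1%, so excluding refusals raises the reported rate by under one point.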

Results on Interactive View Planning

Success rate (%) under the calibrated 0.5 m / 30° threshold. Our framework lifts a 7B model from 2.5% → 47.8%, beating every proprietary VLM evaluated.

Method                        Short   Long    All
Prompting Baselines
Qwen2.5-VL-7B-Instruct          7.0    0.0    2.5
GPT-5.4 Pro                    34.8   11.7   19.9
Gemini 3.1 Pro                 28.6   17.4   21.3
Training Baselines
Direct PPO                      7.0    1.2    3.2
Direct GRPO (filter)           10.8    2.2    5.2
Success-Only Bootstrapping     14.0    2.0    6.2
Ablations of Our Framework
Random-graph                   25.4    6.4   13.0
1 iter + RL                    24.3    5.4   12.0
2 iter + RL                    49.7   16.2   27.9
Ours (Self-Exploration + View Graph Distillation, 3 iters)
Qwen2.5-VL-7B-Instruct         67.2   36.9   47.8
Qwen3-VL-8B-Instruct           56.8   19.4   32.5

Findings & Analyses

Every analysis from the paper, packaged into one section: where models fail, what bottlenecks IVP, what training actually learns, and how the priors transfer.

1 · Single-turn tracking vs. multi-turn planning

Single-turn tracking ≫ multi-turn planning. The best VLMs reach ~70% on short-horizon P2V/V2P but collapse to at most 21.3% on Interactive View Planning. Most models score below 10%; on long-horizon samples most fall below 3%. Open-weight models stay below 5% on IVP.

GPT-5.4 Pro · P2V short: 70.8%
Best IVP (Gemini 3.1 Pro): 21.3%
Every open-weight model · IVP: <5%
Most models · long-horizon IVP: <3%

Takeaway. Local view-action knowledge does not compose into multi-turn plans.

1b · When models succeed, are they reasoning or just matching?

A competent spatial reasoner need not see the target to localize it: after a few informative moves it could infer where the target lies and submit its pose without ever visiting it. We test whether frontier successes work this way. For every successful IVP rollout, we check whether the agent observed any view within the success threshold (0.5 m / 30°) of the target before answering. Across all five models, at least 90% of successes (up to 99.1% for Gemini 3.1 Pro) follow such a visual encounter. Genuine inference of a correct pose, without ever visiting a threshold-close view, accounts for at most ~10% of successes.
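The encounter check described above can be sketched as follows. The pose layout `(x, y, z, rx, ry, rz)`, the per-axis Euler comparison with wrap-around, and the function names are assumptions for illustration; the page states only the 0.5 m / 30° threshold, not the exact rotation metric.

```python
import math


def within_threshold(pose, target, pos_thr=0.5, rot_thr=30.0):
    """True if `pose` is threshold-close to `target` in position and rotation."""
    pos_ok = math.dist(pose[:3], target[:3]) <= pos_thr
    # Smallest signed difference per Euler axis, wrapped into [-180, 180).
    rot_ok = all(
        abs((a - b + 180.0) % 360.0 - 180.0) <= rot_thr
        for a, b in zip(pose[3:], target[3:])
    )
    return pos_ok and rot_ok


def classify_success(visited_poses, target):
    """Label a successful rollout by whether it ever saw a threshold-close view."""
    if any(within_threshold(p, target) for p in visited_poses):
        return "visited"
    return "inferred"
```

A rollout labeled "inferred" submitted a correct pose without any visited viewpoint falling inside the threshold, which is the rarer case in the counts reported for this analysis.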

IVP successes: visual encounter vs. genuine inference
Model            #Success  Visited target view  Inferred (no visit)
GPT-5.4 Pro      101       99 (98.0%)           2 (2.0%)
Gemini 3.1 Pro   113       112 (99.1%)          1 (0.9%)
GPT-5.4          88        83 (94.3%)           5 (5.7%)
Grok 4.20 Beta   42        38 (90.5%)           4 (9.5%)
Claude Opus 4.6  57        54 (94.7%)           3 (5.3%)

Takeaway. The planning gap is really a cognitive gap: models mostly succeed by view matching after they reach the target, not by localizing it in advance.

2 · Difficulty along rotation and translation axes

Decomposing view distance into rotation and translation axes flips the difficulty signal. P2V/V2P degrade primarily with rotation distance (cumulative rotations are hard to mentally simulate); IVP reverses this — success collapses with position distance, since 3D translation needs spatial-layout understanding and path planning.

Dual axis: P2V/V2P degrade with rotation, IVP collapses with translation

P2V/V2P (left two): accuracy falls along the rotation axis. IVP (right): success collapses along the position axis (~7× drop for GPT-5.4 Pro).

Takeaway. The two task families are bottlenecked by different spatial reasoning skills.

2b · Which per-sample factors correlate with success?

Spearman correlation between 12 per-sample factors (geometric distance, visual overlap, directional geometry) and each model's binary success makes the difficulty signal even sharper. P2V / V2P success correlates most with orientation_agreement (ρ up to +0.30, same-facing camera pairs are easier). IVP success collapses with pos_dist (ρ down to −0.42 for GPT-5.4 Pro), with rotation barely registering. Visual-overlap factors help single-turn tasks but contribute almost nothing to IVP.
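For readers who want to reproduce this kind of analysis, the sketch below computes a Spearman correlation between one per-sample factor and binary success using only the standard library. It uses average ranks for ties, which matter because success is binary. The data in the usage note are made up for illustration and are not from the paper.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho = Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))
```

With toy inputs such as `pos_dist = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]` and `success = [1, 1, 1, 0, 0, 0]`, `spearman(pos_dist, success)` is strongly negative, the same sign as the position-distance effect reported for IVP.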

Spearman correlation between sample factors and per-model success

Rows: 12 sample-level factors. Columns: per-model binary success on P2V, V2P, IVP. Blue cells are negative correlations (factor makes the sample harder), red cells positive. The strong negative column under pos_dist in the IVP panel is the position-bottleneck signature.

Takeaway. Position distance is to IVP what orientation agreement is to P2V / V2P — the single dominant predictor of per-sample difficulty.

3 · Does the turn budget bottleneck IVP?

Doubling the turn budget from 10 → 20 helps every model (Claude Opus 4.6 gains the most, rising from 10.8% to 17.7%), but going from 20 → 30 adds at most 1.9 points. Models exhaust their effective strategies before the turn limit runs out.

IVP All-split accuracy (%)
Model               B = 10   B = 20   B = 30  Δ 10→30
Gemini 3.1 Pro        21.3     23.0     23.2     +1.9
GPT-5.4               16.6     19.2     20.2     +3.6
Claude Opus 4.6       10.8     17.7     19.6     +8.8
Grok 4.20 Beta         7.9     11.7     11.7     +3.8

Takeaway. IVP is bottlenecked by planning ability, not by horizon length.

4 · Does higher-fidelity rendering change the picture?

We re-render the test set with 3D Gaussian Splatting (GS), then re-evaluate at budget = 10. IVP improves only marginally. P2V/V2P show mixed and sometimes large swings — Gemini 3.1 Pro gains +6.4 on P2V, while GPT-5.4 and Grok 4.20 Beta lose 14.4 and 13.0 points on V2P.

Same scene, same target: three independent runs on scene0518_00, all succeed.
Gemini 3.1 Pro · Point Cloud · success · 10 turns (default point-cloud render)
Gemini 3.1 Pro · Gaussian Splat · success · 4 turns (higher-fidelity neural render)
GPT-5.4 · Point Cloud · success · 6 turns (different model, same render)

Each GIF cycles through target → initial → agent's per-turn views. The three runs vary on two axes (model and renderer); all reach the target within the unified-distance threshold, but turn counts differ. Higher fidelity occasionally accelerates planning (4 turns vs. 10), yet — as the table below shows — it does not unlock the broader IVP gap.

Gaussian-splat re-render, B = 10 (% accuracy)
Model                  P2V      V2P      IVP  Δ Overall vs Point Cloud
Gemini 3.1 Pro        55.3     49.4     23.2      +2.7
GPT-5.4               43.8     31.1     18.5      −5.6
Claude Opus 4.6       35.3     41.3     12.3      +0.6
Grok 4.20 Beta        28.3     31.5      8.1     −10.3
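The Δ Overall column is consistent with taking the unweighted mean of the three task scores and subtracting each model's point-cloud Overall from the frontier benchmark table. The sketch below checks that reading. The averaging rule is inferred from the reported numbers rather than stated on the page, and the variable names are hypothetical.

```python
def overall(p2v: float, v2p: float, ivp: float) -> float:
    """Unweighted mean of the three task scores (inferred rule, see lead-in)."""
    return (p2v + v2p + ivp) / 3.0


# Gaussian-splat scores (P2V, V2P, IVP) from the table above.
gs_scores = {
    "Gemini 3.1 Pro": (55.3, 49.4, 23.2),
    "GPT-5.4": (43.8, 31.1, 18.5),
    "Claude Opus 4.6": (35.3, 41.3, 12.3),
    "Grok 4.20 Beta": (28.3, 31.5, 8.1),
}
# Point-cloud Overall values from the frontier benchmark table.
pc_overall = {
    "Gemini 3.1 Pro": 39.9,
    "GPT-5.4": 36.7,
    "Claude Opus 4.6": 29.0,
    "Grok 4.20 Beta": 32.9,
}

delta = {m: round(overall(*gs_scores[m]) - pc_overall[m], 1) for m in gs_scores}
```

Under this rule `delta` reproduces the four reported values (+2.7, −5.6, +0.6, −10.3), which suggests that the swings in Overall are driven mostly by P2V and V2P rather than by IVP.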

Takeaway. The bottleneck is composing view changes, not the visual fidelity of each observation.

5 · Comparing training recipes for IVP

Direct PPO plateaus at 3.2%; GRPO with reward-variance filtering reaches 5.2%; iterating PPO with SFT on only successful trajectories (Success-Only Bootstrapping) gets to 6.2%. The breakthrough is recognizing that even failed trajectories encode valid view transitions: A → B is supervision regardless of the original goal. Compressing all exploration into a graph and reformulating sampled paths into view-planning demos takes Qwen2.5-VL-7B from 2.5% → 47.8%.
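To make the GRPO baseline concrete, here is a minimal sketch of one common form of reward-variance filtering: prompt groups whose sampled rewards are all identical carry no learning signal under group normalization, so they are dropped before advantages are computed. Whether the paper's filter works exactly this way is an assumption, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def grpo_advantages(groups, eps=1e-6):
    """Group-normalized advantages, skipping zero-variance groups.

    groups: {prompt_id: [reward, ...]} with several sampled rollouts per prompt.
    """
    advantages = {}
    for prompt_id, rewards in groups.items():
        std = pstdev(rewards)
        if std == 0.0:
            continue  # all rollouts scored the same: nothing to learn from
        mu = mean(rewards)
        advantages[prompt_id] = [(r - mu) / (std + eps) for r in rewards]
    return advantages
```

When success is as rare as 2.5%, most groups contain only failures and are filtered out, which is one way to see why this baseline improves on direct PPO yet still stalls at a low success rate.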

IVP success rate, Qwen2.5-VL-7B base (%)
Method                            Short   Long    All
Base model (prompting)              7.0    0.0    2.5
Direct PPO                          7.0    1.2    3.2
Direct GRPO (filter)               10.8    2.2    5.2
Success-Only Bootstrapping         14.0    2.0    6.2
Random-graph (ablation)            25.4    6.4   13.0
1 iter + RL                        24.3    5.4   12.0
2 iter + RL                        49.7   16.2   27.9
Ours · Qwen2.5-VL-7B (3 iters)     67.2   36.9   47.8
Ours · Qwen3-VL-8B (3 iters)       56.8   19.4   32.5

Takeaway. Useful supervision comes from the geometry recorded by failed exploration, not from filtering for successes.

5b · How does the ranking shift under No-Snap and No-Submit?

Two evaluation knobs could in principle inflate our IVP numbers: rotation snapping to the discrete 30° grid, and the submit requirement. We re-evaluate under No-Snap (raw rotation magnitudes executed as-is, no on-grid rounding) and No-Submit (success the moment the pose enters the threshold). The ordering between models is unchanged under all three protocols: our trained models stay ahead of the proprietary baselines in every column, although the margin narrows under No-Snap (19.6% vs. 15.7% for the best model on each side).

IVP All-split success rate (%), per protocol
Method                  Default    No-Snap  No-Submit
Gemini 3.1 Pro             21.3       15.7       31.5
GPT-5.4                    16.6       13.0       31.3
Ours · Qwen2.5-VL-7B       47.8       19.6       60.2
Ours · Qwen3-VL-8B         32.5       18.5       48.3

No-Snap lowers every model: without rounding, per-step rotation residuals accumulate over 10 turns and the agent drifts off the on-grid pose distribution from which targets are drawn. No-Submit raises every model because no commit to a final answer is required. Across both relaxations, our framework's gains transfer cleanly.
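The difference between the default protocol and No-Submit can be made concrete with a small scoring sketch. Errors are given as `(position_error_m, rotation_error_deg)` pairs. The function names and the inclusive thresholds are assumptions for illustration.

```python
POS_THR_M = 0.5
ROT_THR_DEG = 30.0


def _close(error):
    pos_err, rot_err = error
    return pos_err <= POS_THR_M and rot_err <= ROT_THR_DEG


def default_success(submitted_error):
    """Default protocol: only the submitted pose estimate is scored."""
    return _close(submitted_error)


def no_submit_success(visited_errors):
    """No-Submit: success as soon as any visited pose enters the threshold."""
    return any(_close(e) for e in visited_errors)
```

A rollout that passes within the threshold at some turn but then drifts away and submits a poor estimate fails under the default protocol and succeeds under No-Submit, which matches the direction of the shift in the No-Submit column.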

Takeaway. The 47.8% headline number isn't an artefact of rotation snapping or of the submit step — relax either one and the ranking is preserved.

6 · How does the trained agent's coverage evolve over turns?

Tracked 3D point-cloud coverage reveals a clean two-phase strategy: scene coverage grows rapidly in early turns as the agent looks around, then plateaus while the target-intersection ratio accelerates in the middle turns as the agent moves toward the target (peaking near 55%). Base and frontier models show flat or erratic target coverage instead.
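A rough sketch of how such curves could be computed from per-turn visibility sets is given below. Both definitions are assumptions made for illustration: scene coverage is taken as the cumulative fraction of scene points seen so far, and the target-intersection ratio as the fraction of the target view's visible points that are also visible in the current turn. The page does not spell out either formula.

```python
def coverage_curves(visible_per_turn, scene_points, target_points):
    """Return (scene_coverage, target_intersection) lists, one value per turn.

    visible_per_turn: list of sets of point ids visible at each turn.
    scene_points:     set of all point ids in the scene.
    target_points:    set of point ids visible from the target view.
    """
    seen = set()
    scene_cov, target_cov = [], []
    for visible in visible_per_turn:
        seen |= visible & scene_points
        scene_cov.append(len(seen) / len(scene_points))               # cumulative
        target_cov.append(len(visible & target_points) / len(target_points))  # per turn
    return scene_cov, target_cov
```

Under these definitions a two-phase strategy would appear as a scene-coverage curve that rises and then flattens, followed by a target-intersection curve that climbs as the agent closes in on the target.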

Two-phase exploration: scene coverage then target intersection

Left: scene coverage ratio. Right: target intersection ratio. Our trained model is the only one with sustained monotonic growth on the target axis.

Full model comparison: coverage curves across all 15 models

Takeaway. The trained policy follows a goal-directed two-phase trajectory; baselines do not.

7 · How does training reshape per-layer image attention?

Image-attention fraction (the share of response-token attention pointed at image tokens) reveals two patterns. Layer-wise: our trained model attends more to images in early layers (L0–L4) and less in deep layers (L8+) than the base — it grounds visually early, then operates in text space. Turn-wise: image attention decreases monotonically across turns, consistent with progressive information accumulation, while the base model stays flat.
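The image-attention fraction defined above can be computed per layer from an attention matrix restricted to response-token rows. The sketch below assumes the weights have already been averaged over heads and uses plain lists in place of tensors. The function name and input layout are hypothetical.

```python
def image_attention_fraction(attn_rows, is_image):
    """Share of response-token attention mass that lands on image tokens.

    attn_rows: one row per response token, each a list of attention weights
               over all context tokens (assumed head-averaged).
    is_image:  list of booleans marking which context tokens are image tokens.
    """
    image_mass = sum(
        w for row in attn_rows for w, img in zip(row, is_image) if img
    )
    total_mass = sum(w for row in attn_rows for w in row)
    return image_mass / total_mass
```

Evaluating this once per layer and per turn would give a layer × turn grid of the kind the analysis describes.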

Per-layer image-attention breakdown, full 28 layers: image-attention fraction across all 28 layers, per turn

Image-attention fraction per layer × turn for all 28 layers. Our trained model front-loads visual grounding then drops off; the base model is flatter across both axes.

Takeaway. Training reshapes how the VLM uses its visual stream, not just what it outputs.

8 · How is turn-usage distributed across models?

The base Qwen and GPT-5.4 Pro terminate most rollouts in a single turn (no exploration). Our trained model and Gemini 3.1 Pro use the full 10-turn budget. Of the rollouts that do use all turns, our model maintains a much higher success rate on harder episodes.

Turn distribution and success by turn count

(a) Total rollouts by turn count.  (b) Successful rollouts.  (c) Success rate by turns used.
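The three panels correspond to a simple tally over rollouts, sketched below with hypothetical names and made-up input data. Each rollout is represented as a `(turns_used, success)` pair.

```python
from collections import Counter


def success_by_turns(rollouts):
    """Return {turns: (total, successes, success_rate)} sorted by turn count."""
    totals, wins = Counter(), Counter()
    for turns, success in rollouts:
        totals[turns] += 1
        wins[turns] += int(success)
    return {t: (totals[t], wins[t], wins[t] / totals[t]) for t in sorted(totals)}


# Toy data: three one-turn rollouts and two full-budget rollouts.
stats = success_by_turns([(1, False), (1, False), (10, True), (10, False), (1, True)])
```

A model that submits instantly would concentrate almost all of its mass in the one-turn bucket of panel (a), regardless of how it fares in panel (c).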

Takeaway. Frontier VLMs that look like they're solving IVP often submit instantly without planning.

9 · Do the learned priors transfer to other view-related tasks?

Under identical GRPO post-training, our trained model beats the base on both internal and external view-dependent tasks. On the external MindCube benchmark (no shared scenes / actions / rendering pipeline) we gain ~10 points.

Spatial-prior transfer under identical GRPO post-training (% accuracy)
Model                      P2V init  P2V +GRPO  V2P init  V2P +GRPO  MindCube init  MindCube +GRPO
Base Qwen2.5-VL-7B             32.1       45.1      29.2       44.8           33.0            56.3
Ours (after IVP training)      25.7       57.3      31.6       52.8           33.1            66.2

Takeaway. Interactive view planning is not a narrow skill — its priors strengthen view-dependent reasoning both within and beyond ViewSuite.

Cite

If you use ViewSuite or its trained models, please cite our paper.

@misc{wang2026planning,
  title={Planning with the Views},
  author={Kangrui Wang and Linjie Li and Zhengyuan Yang and Shiqi Chen and Zihan Wang and Li Fei-Fei and Jiajun Wu and Leonidas Guibas and Lijuan Wang and Manling Li},
  year={2026},
  eprint={2605.29563},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2605.29563},
}