1Northwestern University,
2University of Washington,
3Microsoft,
4University of Oxford,
5Stanford University
Can VLMs predict how each camera move changes the view, and plan many such moves ahead?
We call this ability view planning: using camera moves as planning primitives to find a
target view in 3D. We study it in ViewSuite, a 6-DoF environment on real ScanNet
scenes, and decompose it into two abilities: tracking how given camera actions change
the view, and composing a path that localizes an unseen target view.
Across 13 frontier VLMs, a sharp planning gap emerges: models track local view transitions but
collapse when they must plan toward an unseen target view. This inability cannot simply be fixed by
reinforcement learning (RL): with success near 2.5%, reward is too sparse for RL to
bootstrap.
Our key insight is to distill valid view transitions from on-policy self-exploration,
aggregating them into a view graph and distilling it into supervised demonstrations.
With no stronger teacher, this lifts Qwen2.5-VL-7B from 2.5% → 47.8% on
interactive view planning, surpassing GPT-5.4 Pro (19.9%) and Gemini 3.1 Pro (21.3%). View planning
is a clean probe for prospective spatial reasoning: looking ahead, predicting how future viewpoint
changes reshape observation, and inferring a target view's camera pose before it is fully observed,
a capability frontier VLMs still lack.
Three Diagnostic Tasks
ViewSuite probes the two coupled abilities of view planning:
tracking how camera actions change the view, and
composing them into a multi-turn plan that localizes an unseen target view.
Overview of ViewSuite.Left: a point-cloud environment built on ScanNet, with a top-down view marking the initial (blue) and target (orange) viewpoints and their rendered first-person views. Middle: Path-to-View (P2V) and View-to-Path (V2P), both single-turn. Right: Interactive View Planning (IVP), a multi-turn task where the agent issues 6-DoF camera actions to localize the target view and submits its camera pose.
P2V
Path‑to‑View
InitialTop-downABCD
Show full prompt
Given the initial view [Image 1] and a top-down reference [Image 2], after you execute the following action sequence (translation step = 0.5 m; rotation step = 30.0 degrees per step):
[turn_left, look_up, move_forward, move_right]
which of the following images corresponds to the result?
A. [Image 3]
B. [Image 4]
C. [Image 5]
D. [Image 6]
(Note: after every rotation, the Euler angles (rx, ry, rz) are rounded to the nearest integer multiples of the rotation step along each axis.)
Show model's full response
<think>Turning left makes the current center features shift right, and looking up should show more upper wall/ceiling. After the small move, the shelf/door area becomes the main view while the papasan chair moves out of frame. That matches image C.</think><action>answer(C)</action>
V2P
View‑to‑Path
InitialTop-downTarget
Show full prompt
Given the initial view [Image 1] and a top-down reference [Image 2], which action sequence will reach the target view [Image 3]?
(Action semantics: translation step = 0.5 m; rotation step = 30.0 degrees per step.)
A. [turn_left, turn_left, turn_left, look_down, move_left, move_left]
B. [turn_left, turn_left, turn_left, turn_left, turn_left, turn_left, move_right, move_up]
C. [turn_left, turn_left, turn_left, turn_left, turn_left, turn_left, move_backward, move_backward, move_right, move_down]
D. [turn_right, turn_right, turn_right, turn_right, look_up, move_backward, move_backward, move_left, move_left]
(Note: after every rotation, the Euler angles (rx, ry, rz) are rounded to the nearest integer multiples of the rotation step along each axis.)
Show model's full response
<think>From the initial view, the camera is near the left-side chair/stool area, facing toward the left wall. The target view is a downward-looking view over the rug near the lower-middle/right side of the room. The only sequence that sensibly rotates toward that area and matches the downward pitch is: turn left 90°, look down, then translate sideways across the rug.</think><action>answer(A)</action>
IVP
Interactive View Planning
InitialTop-downTarget
Submitted answer
Step 0 / 0
User prompt at this turn
—
Model response at this turn
—
Show system prompt
—
286
ScanNet scenes
~55K
view pairs
~165K
task instances
12
6-DoF actions
0.5m / 30°
IVP success threshold
Self-Exploration with View Graph Distillation
Each iteration alternates two stages. In the self-exploration stage, the agent
interacts with ViewSuite environments and its trajectories are incrementally compressed into a
view graph. In the view graph distillation stage, paths are sampled
from this graph and reformulated into diverse view-planning demonstrations used to fine-tune the
policy. The resulting model initializes the next self-exploration stage.
1. RL Stage
The agent runs IVP rollouts on ViewSuite environments with PPO. Reward is sparse:
+1 when the submitted target estimate is within 0.5 m / 30° of
the ground truth, plus a small format reward. Even with success rate near 2.5%,
every rollout is useful, since it streams into the graph builder.
2. Graph Construction
A background process incrementally compresses every completed trajectory into a
view graph. Nodes are viewpoints (with their rendered views);
edges are actions between viewpoints. Nodes and
edges are deduplicated via viewpoint similarity, so success and failure
alike contribute to one shared structured graph.
3. Task Reformulation
Any path P = (v₀, a₁, v₁, …, aK, vK) in the graph yields a
valid IVP demonstration regardless of whether the original episode succeeded: end node →
target, start node → initial view, action chain → labeled plan. This is the lever that
lets us learn from failed episodes.
4. SFT Stage
Sampled paths are reformulated into supervised view-planning demonstrations and used to
fine-tune the policy with standard cross-entropy. The resulting model initializes the
next RL stage, kicking off the iteration. Stages alternate RL → SFT → RL → SFT.
Results
Frontier VLM benchmark on ViewSuite-5K test (530 view pairs)
Accuracy / Success Rate (%) on Short (d < 3) and Long (d ≥ 3) splits.
Best in each column is bold.
Model
Path-to-View
View-to-Path
Interactive View Planning
Overall
Short
Long
All
Short
Long
All
Short
Long
All
Random Response
20.7
24.6
23.3
24.3
26.5
25.7
2.2
0.0
0.8
16.6
Proprietary Models
GPT-5.4 Pro
70.8
43.8
53.2
72.4
38.8
50.6
34.8
11.7
19.9
41.2
Gemini 3.1 Pro
63.8
40.9
48.9
53.0
47.5
49.4
28.6
17.4
21.3
39.9
GPT-5.4
57.3
42.9
47.9
60.5
37.4
45.5
33.5
7.5
16.6
36.7
Grok 4.20 Beta
61.6
38.0
46.2
44.9
44.3
44.5
17.3
2.9
7.9
32.9
GPT-5.1
60.5
35.1
44.0
52.4
33.3
40.0
11.9
3.2
6.2
30.1
Claude Opus 4.6
46.5
28.4
34.7
47.6
38.3
41.5
23.8
3.8
10.8
29.0
Gemini 3 Pro
50.3
31.0
37.7
44.9
35.4
38.7
13.5
7.0
9.2
28.5
Open-Weight Models
Qwen3.5-397B
57.8
30.1
39.8
44.3
31.0
35.7
12.4
0.0
4.3
26.6
GLM-4.6V
36.4
23.2
27.8
31.4
29.7
30.2
9.2
1.2
4.0
20.7
Qwen2.5-VL-72B
28.1
29.3
28.9
35.7
30.1
32.1
2.2
0.6
1.1
20.7
Qwen3-VL-32B
27.0
27.5
27.4
41.1
28.7
33.0
4.3
0.0
1.5
20.6
Kimi K2.5
36.2
24.6
28.7
18.4
29.3
25.5
4.9
1.2
2.5
18.9
Qwen2.5-VL-7B
23.8
32.5
29.4
27.0
22.9
24.3
7.0
0.0
2.5
18.7
GPT-5.4 Pro refuses 23 of the 530 IVP instances under its content policy; its IVP rates are computed
over the remaining 507 valid instances (101 / 507 = 19.9%). All other models are evaluated on the full 530.
Results on Interactive View Planning
Success rate (%) under the calibrated 0.5 m / 30° threshold. Our framework lifts a 7B model
from 2.5% → 47.8%, beating every proprietary VLM evaluated.
Every analysis from the paper, packaged into one section: where models fail, what bottlenecks IVP, what
training actually learns, and how the priors transfer.
1 · Single-turn tracking vs. multi-turn planning
Single-turn tracking ≫ multi-turn planning. The best VLMs reach ~70% on short-horizon P2V/V2P
but collapse to at most 21.3% on Interactive View Planning. Most models score below 10%; on
long-horizon samples most fall below 3%. Open-weight models stay below 5% on IVP.
70.8%
GPT-5.4 Pro · P2V short
21.3%
Best IVP (Gemini 3.1 Pro)
<5%
Every open-weight model · IVP
<3%
Most models · long-horizon IVP
Takeaway. Local view-action knowledge does not compose into multi-turn plans.
1b · When models succeed, are they reasoning or just matching?
A competent spatial reasoner need not see the target to localize it: after a few informative moves it
could infer where the target lies and submit its pose without ever visiting it. We test whether frontier
successes work this way. For every successful IVP rollout, we check whether the agent observed any view
within the success threshold (0.5 m / 30°) of the target before answering. Across all five models, at
least 90% of successes (up to 99.1% for Gemini 3.1 Pro) follow such a
visual encounter. Genuine inference of a correct pose, without ever visiting a threshold-close view,
accounts for at most ~10% of successes.
IVP successes: visual encounter vs. genuine inference
Model
#Success
Visited target view
Inferred (no visit)
GPT-5.4 Pro
101
99 (98.0%)
2 (2.0%)
Gemini 3.1 Pro
113
112 (99.1%)
1 (0.9%)
GPT-5.4
88
83 (94.3%)
5 (5.7%)
Grok 4.20 Beta
42
38 (90.5%)
4 (9.5%)
Claude Opus 4.6
57
54 (94.7%)
3 (5.3%)
Takeaway. The planning gap is really a cognitive gap: models mostly succeed by view matching after they reach the target, not by localizing it in advance.
2 · Difficulty along rotation and translation axes
Decomposing view distance into rotation and translation axes flips the difficulty signal. P2V/V2P
degrade primarily with rotation distance (cumulative rotations are hard to mentally simulate); IVP
reverses this — success collapses with position distance, since 3D translation needs spatial-layout
understanding and path planning.
P2V/V2P (left two): accuracy falls along the rotation axis. IVP (right): success collapses along the
position axis (~7× drop for GPT-5.4 Pro).
Takeaway. The two task families are bottlenecked by different spatial reasoning skills.
2b · Which per-sample factors correlate with success?
Spearman correlation between 12 per-sample factors (geometric distance, visual overlap, directional
geometry) and each model's binary success makes the difficulty signal even sharper. P2V / V2P
success correlates most with orientation_agreement (ρ up to +0.30, same-facing camera pairs
are easier). IVP success collapses with pos_dist (ρ down to −0.42 for
GPT-5.4 Pro), with rotation barely registering. Visual-overlap factors help single-turn tasks but
contribute almost nothing to IVP.
Rows: 12 sample-level factors. Columns: per-model binary success on P2V, V2P, IVP. Blue cells are
negative correlations (factor makes the sample harder), red cells positive. The strong negative
column under pos_dist in the IVP panel is the position-bottleneck signature.
Takeaway. Position distance is to IVP what orientation agreement is to P2V / V2P — the single dominant predictor of per-sample difficulty.
3 · Does the turn budget bottleneck IVP?
Doubling the turn budget from 10 → 20 helps every model (Claude Opus 4.6 nearly doubles), but 20 → 30
is essentially free. Models exhaust their effective strategies before the turn limit runs out.
IVP All-split accuracy (%)
Model
B = 10
B = 20
B = 30
Δ 10→30
Gemini 3.1 Pro
21.3
23.0
23.2
+1.9
GPT-5.4
16.6
19.2
20.2
+3.6
Claude Opus 4.6
10.8
17.7
19.6
+8.8
Grok 4.20 Beta
7.9
11.7
11.7
+3.8
Takeaway. IVP is bottlenecked by planning ability, not by horizon length.
4 · Does higher-fidelity rendering change the picture?
We re-render the test set with 3D Gaussian Splatting (GS), then re-evaluate at budget = 10. IVP
improves only marginally. P2V/V2P show mixed and sometimes large swings — Gemini 3.1 Pro gains +6.4
on P2V, while GPT-5.4 and Grok 4.20 Beta lose 14.4 and 13.0 points on V2P.
Same scene, same target — three independent runs on scene0518_00, all succeed.
Gemini 3.1 Pro · Point Cloudsuccess · 10 turnsdefault point-cloud renderGemini 3.1 Pro · Gaussian Splatsuccess · 4 turnshigher-fidelity neural renderGPT-5.4 · Point Cloudsuccess · 6 turnsdifferent model, same render
Each GIF cycles through target → initial → agent's per-turn views. The three runs vary on two axes
(model and renderer); all reach the target within the unified-distance threshold, but turn counts
differ. Higher fidelity occasionally accelerates planning (4 turns vs. 10), yet — as the table
below shows — it does not unlock the broader IVP gap.
Gaussian-splat re-render, B = 10 (% accuracy)
Model
P2V
V2P
IVP
Δ Overall vs Point Cloud
Gemini 3.1 Pro
55.3
49.4
23.2
+2.7
GPT-5.4
43.8
31.1
18.5
−5.6
Claude Opus 4.6
35.3
41.3
12.3
+0.6
Grok 4.20 Beta
28.3
31.5
8.1
−10.3
Takeaway. The bottleneck is composing view changes, not the visual fidelity of each observation.
5 · Comparing training recipes for IVP
Direct PPO plateaus at 3.2%; GRPO with reward-variance filtering reaches
5.2%; iterating PPO with SFT on only successful trajectories (Success-Only
Bootstrapping) gets to 6.2%. The breakthrough is recognizing that even
failed trajectories encode valid view transitions: A → B is supervision regardless of the
original goal. Compressing all exploration into a graph and reformulating sampled paths into
view-planning demos takes Qwen2.5-VL-7B from 2.5% → 47.8%.
IVP success rate, Qwen2.5-VL-7B base (%)
Method
Short
Long
All
Base model (prompting)
7.0
0.0
2.5
Direct PPO
7.0
1.2
3.2
Direct GRPO (filter)
10.8
2.2
5.2
Success-Only Bootstrapping
14.0
2.0
6.2
Random-graph (ablation)
25.4
6.4
13.0
1 iter + RL
24.3
5.4
12.0
2 iter + RL
49.7
16.2
27.9
Ours · Qwen2.5-VL-7B (3 iters)
67.2
36.9
47.8
Ours · Qwen3-VL-8B (3 iters)
56.8
19.4
32.5
Takeaway. Useful supervision comes from the geometry recorded by failed exploration, not from filtering for successes.
5b · How does the ranking shift under No-Snap and No-Submit?
Two evaluation knobs could in principle inflate our IVP numbers: rotation snapping to the
discrete 30° grid, and the submit requirement. We re-evaluate under
No-Snap (raw rotation magnitudes executed as-is, no on-grid rounding) and
No-Submit (success the moment the pose enters the threshold). The ordering between
models is unchanged under all three protocols — our trained models continue to dominate the
proprietary baselines by a wide margin.
IVP All-split success rate (%), per protocol
Method
Default
No-Snap
No-Submit
Gemini 3.1 Pro
21.3
15.7
31.5
GPT-5.4
16.6
13.0
31.3
Ours · Qwen2.5-VL-7B
47.8
19.6
60.2
Ours · Qwen3-VL-8B
32.5
18.5
48.3
No-Snap lowers every model: without rounding, per-step rotation residuals accumulate over
10 turns and the agent drifts off the on-grid pose distribution from which targets are drawn.
No-Submit raises every model because no commit to a final answer is required. Across both
relaxations, our framework's gains transfer cleanly.
Takeaway. The 47.8% headline number isn't an artefact of rotation snapping or of the submit step — relax either one and the ranking is preserved.
6 · How does the trained agent's coverage evolve over turns?
Tracked 3D point-cloud coverage reveals a clean two-phase strategy: scene coverage grows rapidly in
early turns as the agent looks around, then plateaus while the target-intersection ratio accelerates
in the middle turns as the agent moves toward the target (peaking near 55%). Base and frontier models
show flat or erratic target coverage instead.
Left: scene coverage ratio. Right: target intersection ratio. Our trained model is the only one
with sustained monotonic growth on the target axis.
Full model comparison (all 15 models)
Takeaway. The trained policy follows a goal-directed two-phase trajectory; baselines do not.
7 · How does training reshape per-layer image attention?
Image-attention fraction (the share of response-token attention pointed at image tokens) reveals two
patterns. Layer-wise: our trained model attends more to images in early layers
(L0–L4) and less in deep layers (L8+) than the base — it grounds visually early, then operates in
text space. Turn-wise: image attention decreases monotonically across turns,
consistent with progressive information accumulation, while the base model stays flat.
Per-layer image-attention breakdown — full 28 layers (click to collapse)
Image-attention fraction per layer × turn for all 28 layers. Our trained model front-loads visual
grounding then drops off; the base model is flatter across both axes. Click the figure to open at
native resolution.
Takeaway. Training reshapes how the VLM uses its visual stream, not just what it outputs.
8 · How is turn-usage distributed across models?
The base Qwen and GPT-5.4 Pro terminate most rollouts in a single turn (no exploration). Our trained
model and Gemini 3.1 Pro use the full 10-turn budget. Of the rollouts that do use all turns,
our model maintains a much higher success rate on harder episodes.
(a) Total rollouts by turn count. (b) Successful rollouts. (c) Success rate by turns used.
Takeaway. Frontier VLMs that look like they're solving IVP often submit instantly without planning.
9 · Do the learned priors transfer to other view-related tasks?
Under identical GRPO post-training, our trained model beats the base on both internal and external
view-dependent tasks. On the external MindCube benchmark (no shared scenes / actions /
rendering pipeline) we gain ~10 points.
Spatial-prior transfer under identical GRPO post-training (% accuracy)
Model
P2V init
P2V +GRPO
V2P init
V2P +GRPO
MindCube init
MindCube +GRPO
Base Qwen2.5-VL-7B
32.1
45.1
29.2
44.8
33.0
56.3
Ours (after IVP training)
25.7
57.3
31.6
52.8
33.1
66.2
Takeaway. Interactive view planning is not a narrow skill — its priors strengthen view-dependent reasoning both within and beyond ViewSuite.
Cite
If you use ViewSuite or its trained models, please cite our paper.
@misc{wang2026planning,
title={Planning with the Views},
author={Kangrui Wang and Linjie Li and Zhengyuan Yang and Shiqi Chen and Zihan Wang and Li Fei-Fei and Jiajun Wu and Leonidas Guibas and Lijuan Wang and Manling Li},
year={2026},
eprint={2605.29563},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2605.29563},
}