EgoCogNav: Cognition-aware Human Egocentric Navigation

¹ Cornell University ² Georgia Institute of Technology

European Conference on Computer Vision (ECCV) 2026


Abstract

Modeling the cognitive and experiential factors of human navigation is central to deepening our understanding of human–environment interaction and to enabling safe social navigation and effective assistive wayfinding. Most existing methods focus on forecasting motion in fully observed scenes and often neglect the human factors that capture how people feel about and respond to space. To address this gap, we propose EgoCogNav, a multimodal egocentric navigation framework that jointly forecasts perceived path uncertainty, trajectories, and head motion from egocentric video, gaze, and motion history. To facilitate research in the field, we introduce the Cognition-aware Egocentric Navigation (CEN) dataset, consisting of 6 hours of real-world egocentric recordings that capture diverse navigation behaviors. Experiments show that EgoCogNav learns a perceived-uncertainty estimate that correlates strongly with human-like behaviors such as scanning, hesitation, and backtracking, while generalizing to unseen environments.

Keywords: Egocentric Navigation · Trajectory Prediction · Perceived Uncertainty · Multimodal Learning

Method

EgoCogNav architecture. Perception and action streams are encoded independently with self-attention and fused by late concatenation; a cognition module predicts perceived uncertainty, retrieves context from learnable memory, and conditions decoding to forecast trajectory and head motion.

EgoCogNav is organized into three modules. A perception module extracts spatio-temporal features from the recent RGB frames with a frozen, pre-trained DINOv2 vision transformer. An action module encodes the past body-frame motion, 6D head rotations, and gaze, together with the body-frame navigation goal. The two streams are encoded independently with self-attention and fused by late concatenation into a shared representation.
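As an illustration of the "6D head rotations" mentioned above, the following is a minimal sketch that assumes the term refers to the widely used continuous 6D rotation representation, in which a rotation is stored as the first two columns of its 3x3 matrix and recovered by Gram-Schmidt orthogonalization. The function names (rot6d_to_matrix, matrix_to_rot6d) and the column ordering are assumptions made for this example, not details confirmed by the source.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def rot6d_to_matrix(r6):
    """Map a 6D vector (a1, a2) to a 3x3 rotation matrix with columns b1, b2, b3.

    Assumed convention: a1 and a2 are (possibly unnormalized) estimates of the
    first two matrix columns; Gram-Schmidt makes them orthonormal.
    """
    a1, a2 = list(r6[:3]), list(r6[3:])
    b1 = normalize(a1)
    proj = dot(b1, a2)
    b2 = normalize([y - proj * x for x, y in zip(b1, a2)])
    b3 = cross(b1, b2)  # right-handed third axis
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


def matrix_to_rot6d(R):
    """Keep only the first two columns of a rotation matrix."""
    return [R[0][0], R[1][0], R[2][0], R[0][1], R[1][1], R[2][1]]


# A non-orthonormal 6D input is projected onto a valid rotation
# (here the identity, up to floating-point error).
R = rot6d_to_matrix([2.0, 0.0, 0.0, 1.0, 3.0, 0.0])
```

One motivation for this kind of representation is that any 6D vector with linearly independent halves decodes to a valid rotation, which makes it convenient as a regression target for a head-motion sequence.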

A cognition module sits at the core. It (i) predicts the current perceived uncertainty from the shared features, with the supervision gradient propagated back through the encoder; (ii) retrieves situation-relevant context from a small set of 16 learnable navigation memory patterns via cross-attention; and (iii) applies uncertainty-conditioned decoding (UCD), which modulates the latent representation with adaptive layer normalization conditioned on the predicted uncertainty. Two heads then forecast a 3-DOF body-frame trajectory and a 6D head-motion sequence, alongside a cognition head that estimates perceived uncertainty.
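The memory retrieval and uncertainty-conditioned normalization steps can be sketched in a few lines. This is a dependency-free toy under stated assumptions, not the authors' implementation: it uses single-head attention with no learned query, key, or value projections, random values standing in for learned parameters, a feature size of 8, and a simple linear mapping from the scalar uncertainty to the scale and shift. All names (retrieve_memory, ada_layer_norm, w_scale, w_shift) are hypothetical.

```python
import math
import random


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def retrieve_memory(query, memory):
    """Single-head cross-attention: one query vector attends over memory slots.

    Simplification: each slot serves as both key and value.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, slot)) / math.sqrt(d) for slot in memory]
    weights = softmax(scores)
    context = [sum(w * slot[j] for w, slot in zip(weights, memory)) for j in range(d)]
    return context, weights


def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def ada_layer_norm(x, u, w_scale, w_shift):
    """Adaptive LayerNorm: scale and shift depend on the scalar uncertainty u.

    Assumed form: scale = 1 + u * w_scale, shift = u * w_shift.
    """
    normed = layer_norm(x)
    return [n * (1.0 + u * ws) + u * wb for n, ws, wb in zip(normed, w_scale, w_shift)]


rng = random.Random(0)
D, NUM_SLOTS = 8, 16  # 16 memory patterns as in the text; D is an arbitrary toy size
memory = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(NUM_SLOTS)]
fused = [rng.gauss(0, 1) for _ in range(D)]  # stand-in for the shared representation
w_scale = [rng.gauss(0, 0.1) for _ in range(D)]
w_shift = [rng.gauss(0, 0.1) for _ in range(D)]

context, weights = retrieve_memory(fused, memory)
latent = [f + c for f, c in zip(fused, context)]  # residual combination (assumed)
calm = ada_layer_norm(latent, 0.0, w_scale, w_shift)    # u = 0: plain LayerNorm
unsure = ada_layer_norm(latent, 0.9, w_scale, w_shift)  # high u: modulated latent
```

With this parameterization, an uncertainty of zero leaves the normalized latent unchanged, so the decoder input departs from plain layer normalization only as predicted uncertainty grows.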

CEN Dataset

To support cognition-aware navigation research, we collect the Cognition-aware Egocentric Navigation (CEN) dataset, a multimodal egocentric corpus that pairs rich sensory streams with moment-to-moment, self-reported annotations of perceived path uncertainty.

6 h of recordings

participants

distinct sites

226K RGB frames

Recordings span diverse indoor and outdoor settings (university campuses, healthcare facilities, urban commercial streets, and natural routes). Outdoor sessions are captured with Tobii Pro Glasses plus GPS, and indoor sessions with Project Aria glasses (with SLAM), while participants continuously self-report perceived uncertainty on a normalized [0, 1] scale. Every session is annotated with environment types (junctions, occlusion, multi-level transitions, crowds, spatial changes) and with trajectory and head-movement behaviors (trajectory: hesitation, wrong turn, backtrack; head movement: scanning, confirmation, look-back). All data is de-identified by blurring faces and removing audio.

Results

On a held-out test set of environments unseen during training, EgoCogNav achieves the best trajectory and head-motion accuracy, reducing ADE / FDE by 3.8% / 5.0% relative to the strongest baseline. More importantly, its learned perceived uncertainty correlates strongly with human ratings (Spearman ρ = 0.788), far surpassing hand-crafted entropy- and heuristic-based proxies (ρ = 0.08 and 0.20, respectively). Qualitatively, uncertainty rises before hesitation and scanning at multi-way junctions and peaks before backtracking under occlusion, while staying low in clear, well-specified corridors.
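For readers unfamiliar with the metrics above, here is a minimal sketch of how ADE, FDE, and Spearman's ρ are commonly computed. It assumes trajectories are sequences of 2D positions and uses the standard definitions (mean and final-step Euclidean error; Pearson correlation of average ranks). The function names and toy inputs are illustrative and are not taken from the paper's evaluation code or data.

```python
import math


def ade_fde(pred, gt):
    """Average and final displacement error between two equal-length paths."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists), dists[-1]


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy example: per-step errors are 0, 1, and 2, so ADE = 1.0 and FDE = 2.0.
ade, fde = ade_fde([(0, 0), (1, 0), (2, 0)], [(0, 0), (1, 1), (2, 2)])

# Toy example: predicted and rated uncertainty share the same ordering, so rho is 1.
rho = spearman_rho([0.1, 0.4, 0.2, 0.9], [0.2, 0.6, 0.3, 0.8])
```

Because Spearman's ρ depends only on rank order, it rewards a predicted uncertainty that rises and falls with the human ratings even if the two are not on the same numeric scale.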

Qualitative visualizations. For each scenario, a bird’s-eye-view overlay shows the past trajectory (gray), ground-truth and predicted futures, and overlaid path uncertainty; the time-aligned egocentric frames (t+1 s to t+3 s) show ground-truth (red) vs. predicted (green) head positions, with predicted uncertainty intensity color-coded.

Failure Cases

Two failure cases highlighting limits in long-horizon scene memory: the model predicts a return along the same segment instead of backtracking to an earlier decision point, and misses a brief hesitation / look-back triggered by changing scene context.