Visual Navigation

Visual navigation in complex scenes is a core capability of embodied AI. We study how agents that rely solely on visual sensors can interpret high-level instructions (e.g., natural language) and then plan and execute safe, efficient movement in open scenes full of uncertainty (e.g., complex layouts, dynamic pedestrians, unknown traversability). The core challenge is to build navigation systems that integrate semantic understanding, motion prediction, and environmental affordance inference.

Semantic Understanding and Spatial Reasoning

We explore how agents parse the spatial semantics of natural language instructions, ground them in visual observations, and reason about space by building a "mental map." The goal is to move beyond the limits of the first-person view and construct cross-modal, multi-scale spatial representations that support reasoning.

Our research enables agents to understand complex instructions like 'go to the red chair next to the window' and navigate accordingly.
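
As a rough illustration of what grounding such an instruction involves, the sketch below resolves "the red chair next to the window" against a list of object detections. It is a toy example under stated assumptions, not the method described here: the `Detection` record, the `ground_target` function, and the `max_dist` threshold are all hypothetical, and the instruction is assumed to have already been parsed into a target label, a color, and an anchor object.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Detection:
    """Hypothetical detector output: a labeled object at a 2D map position (meters)."""
    label: str
    color: str
    x: float
    y: float


def ground_target(detections: List[Detection], label: str, color: str,
                  anchor_label: str, max_dist: float = 1.5) -> Optional[Detection]:
    """Pick the object matching label and color that is closest to any anchor object.

    Returns None when no candidate lies within max_dist of an anchor.
    """
    anchors = [d for d in detections if d.label == anchor_label]
    candidates = [d for d in detections if d.label == label and d.color == color]
    best, best_dist = None, max_dist
    for cand in candidates:
        for anchor in anchors:
            dist = math.hypot(cand.x - anchor.x, cand.y - anchor.y)
            if dist <= best_dist:
                best, best_dist = cand, dist
    return best


# "go to the red chair next to the window"
scene = [
    Detection("chair", "red", 0.0, 0.0),
    Detection("chair", "red", 5.0, 5.0),
    Detection("chair", "blue", 4.5, 5.0),
    Detection("window", "none", 5.0, 6.0),
]
target = ground_target(scene, label="chair", color="red", anchor_label="window")
```

Here the red chair at (5.0, 5.0) is selected because it is the only red chair within `max_dist` of the window. A real system would have to infer the relation "next to" from visual and spatial context rather than from a fixed distance threshold.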

Motion Prediction

To enable safe navigation in dynamic environments, agents need the ability to predict the future trajectories of surrounding agents (e.g., pedestrians, vehicles). Our research develops algorithms that can anticipate movement patterns and adjust navigation strategies accordingly.

This capability is crucial for applications in crowded environments where collision avoidance is essential.
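
To make the idea concrete, here is a minimal sketch of the simplest baseline for trajectory prediction: constant-velocity extrapolation, followed by a check of the robot's plan against the predicted pedestrian positions. It stands in for the learned predictors this research targets and is not the algorithm described above; the function names, the time step `dt`, and the `safety_radius` are assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def predict_constant_velocity(track: List[Point], dt: float, horizon: int) -> List[Point]:
    """Extrapolate future positions from the last two observed positions."""
    if len(track) < 2:
        raise ValueError("need at least two observations to estimate velocity")
    (x0, y0), (x1, y1) = track[-2], track[-1]
    vx, vy = (x1 - x0) / dt, (y1 - y0) / dt
    return [(x1 + vx * dt * k, y1 + vy * dt * k) for k in range(1, horizon + 1)]


def first_conflict_step(robot_plan: List[Point], predicted: List[Point],
                        safety_radius: float) -> Optional[int]:
    """Return the first step at which the plan comes within safety_radius
    of the predicted pedestrian position, or None if there is no conflict."""
    for k, ((rx, ry), (px, py)) in enumerate(zip(robot_plan, predicted)):
        if (rx - px) ** 2 + (ry - py) ** 2 < safety_radius ** 2:
            return k
    return None


# A pedestrian observed walking along the x-axis at 1 m per step.
pedestrian_track = [(0.0, 0.0), (1.0, 0.0)]
predicted = predict_constant_velocity(pedestrian_track, dt=1.0, horizon=3)

# A robot plan that converges on the pedestrian's predicted path.
robot_plan = [(2.0, 3.0), (3.0, 1.5), (4.0, 0.2)]
conflict = first_conflict_step(robot_plan, predicted, safety_radius=0.5)
```

When `first_conflict_step` returns a step index, a planner can slow down or replan before that step. Constant velocity ignores social interaction and intent, which is where learned prediction models are expected to help.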

Affordance Inference

We study how agents understand the "affordances" of the environment, that is, the possibilities for action that the environment offers them. These include terrain traversability, object interactivity, and the functional properties of target objects.

This understanding enables agents to make informed decisions about which paths are navigable and which objects can be interacted with.
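
One way to see how traversability estimates feed into path decisions is the sketch below. It assumes a perception module has already produced a grid of traversability scores in [0, 1]; the grid, the `min_trav` cutoff, and the cost formula are illustrative assumptions rather than part of the research described here. Cells below the cutoff are treated as blocked, and Dijkstra's algorithm finds the cheapest 4-connected path, preferring cells with higher scores.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]


def plan_path(trav: List[List[float]], start: Cell, goal: Cell,
              min_trav: float = 0.3) -> Optional[List[Cell]]:
    """Dijkstra over a grid of traversability scores (1.0 = easy, 0.0 = impassable).

    Entering a cell costs 1 + (1 - score); cells scoring below min_trav are blocked.
    Returns the list of cells from start to goal, or None if the goal is unreachable.
    """
    rows, cols = len(trav), len(trav[0])
    best: Dict[Cell, float] = {start: 0.0}
    parent: Dict[Cell, Cell] = {}
    frontier = [(0.0, start)]
    while frontier:
        dist, cell = heapq.heappop(frontier)
        if cell == goal:
            path = [cell]
            while path[-1] != start:
                path.append(parent[path[-1]])
            return path[::-1]
        if dist > best.get(cell, float("inf")):
            continue  # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            score = trav[nr][nc]
            if score < min_trav:
                continue  # judged non-traversable
            new_dist = dist + 1.0 + (1.0 - score)
            if new_dist < best.get((nr, nc), float("inf")):
                best[(nr, nc)] = new_dist
                parent[(nr, nc)] = cell
                heapq.heappush(frontier, (new_dist, (nr, nc)))
    return None


# The middle column is impassable except in the bottom row.
grid = [
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
path = plan_path(grid, start=(0, 0), goal=(0, 2))
```

In this grid the only route runs down the left column, across the bottom row, and up the right column. The hard part in practice is producing reliable scores from vision alone; the planner merely consumes them.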

Evolutionary Learning for Navigation

We explore how agents can autonomously evolve their navigation capabilities during the testing phase (i.e., when performing tasks in new environments) through online adaptation or memory-reflection mechanisms, to overcome the distribution shift between training and testing scenarios.

This approach enables agents to continuously improve their performance in novel environments without requiring extensive retraining.
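
A very small sketch of online adaptation is shown below. It is not the memory-reflection mechanism described above; it swaps in a plain exponential moving average so the idea fits in a few lines. The agent starts with per-terrain cost estimates carried over from training and corrects them with costs it observes at test time. The class name `OnlineCostModel`, the terrain labels, and the step size `alpha` are hypothetical.

```python
from typing import Dict, Iterable


class OnlineCostModel:
    """Per-terrain cost estimates that are updated from test-time experience."""

    def __init__(self, prior: Dict[str, float], alpha: float = 0.5) -> None:
        self.estimates = dict(prior)  # estimates carried over from training
        self.alpha = alpha            # how strongly new evidence overrides the prior

    def update(self, terrain: str, observed_cost: float) -> float:
        """Blend an observed cost into the estimate with an exponential moving average."""
        old = self.estimates.get(terrain, observed_cost)
        self.estimates[terrain] = (1 - self.alpha) * old + self.alpha * observed_cost
        return self.estimates[terrain]

    def preferred(self, options: Iterable[str]) -> str:
        """Choose the terrain with the lowest current estimate (unknown terrain ranks last)."""
        return min(options, key=lambda t: self.estimates.get(t, float("inf")))


# Training suggested grass is cheaper than gravel.
model = OnlineCostModel({"grass": 1.0, "gravel": 2.0})
choice_before = model.preferred(["grass", "gravel"])

# In the new environment the grass is wet and slow to cross.
model.update("grass", 3.0)
model.update("grass", 3.0)
choice_after = model.preferred(["grass", "gravel"])
```

After two slow traversals the grass estimate rises above the gravel estimate, so the preference flips without any retraining. A memory-based approach would store richer records of each episode, but the same principle applies: test-time experience revises what was learned offline.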