[Introduction]
This graduate course examines the principles and advanced research topics of embodied intelligence, focusing on how autonomous agents perceive, learn, and act in complex environments. The course covers 3D perception, neural scene representations (NeRF, Gaussian Splatting), motion prediction, planning and control, robotic manipulation, SLAM, mobile autonomy, simulation-based learning, sim-to-real transfer, and emerging robot foundation models. The lectures are organized as a progressive pipeline: starting with perception, moving through learning and action, and culminating in (robot) foundation models. The class is structured as a research seminar: students are required to present papers, lead discussions, and participate in weekly critiques of state-of-the-art research. Guest speakers from academia and industry may be invited to provide additional perspectives on real-world embodied AI systems.
Course Goals:
- To introduce students to the core concepts linking perception, learning, and action in autonomous systems (e.g., robots, autonomous driving).
- To expose students to cutting-edge research in 3D perception, neural scene representations, robot learning, and autonomous systems.
- To develop students’ abilities to critically read, analyze, and evaluate research papers.
- To strengthen presentation and scientific communication skills through weekly student-led sessions.
- To prepare students for conducting independent research in embodied perception, robot learning, and related fields.
- To develop students’ ability to evaluate industry opportunities for autonomy technologies.
Lecturer: Winston Hsu (office: R512, CSIE Building)
TA: Posheng Chen <d13922035@ntu.edu.tw> (office hour: 10am-12pm, Monday; @R501)
Time: 2:20pm – 5:10pm, Monday
Location: R546, CSIE Building
Info: Course number: CSIE 5453; course identification code: 922 U5030
Lecture Format: Research seminar consisting of lectures, student paper presentations, and weekly critiques/discussions of state-of-the-art research.
Assessment (tentative):
- Course participation & paper summarization: 70% (examples of paper summaries will be provided)
- Final Project: 30%
Resources:
- The lecture slides, homework descriptions, and datasets will be posted on NTU Cool. Only students registered for the course can access them.
- Discussions and questions will also be hosted on NTU Cool.
Requirements: Background in image processing (or related signal-processing courses), machine learning, deep learning, computer vision, etc. Experience with robots is useful but not required.
Textbook: None. We will cover active research areas not yet included in any mature textbook; instead, we will provide an extensive set of papers and reference books.
[Course Outline]
- [02/23] Lecture 01 – Introduction: Embodied AI Landscape
- Readings:
- S. Keshav. How to Read a Paper. SIGCOMM Comput. Commun. Rev. 37(3), Jul. 2007, 83–84. [m, must read]
- Liu et al. Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI. IEEE/ASME Transactions on Mechatronics, Dec. 2025. [m]
- Duan et al. Benchmarking Deep Reinforcement Learning for Continuous Control. ICML 2016. [o, optional]
- Tips for presentations (see Henning’s page)
- [03/02] Lecture 02 – Sensors, 3D Perception & Representation Learning
- Readings:
- Qi et al. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. CVPR 2017.
- Qi et al. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. NeurIPS 2017.
- Zhou & Tuzel. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. CVPR 2018. [m]
- He et al. MAE: Masked Autoencoders Are Scalable Vision Learners. CVPR 2022.
- Caesar et al. nuScenes: A Multimodal Dataset for Autonomous Driving. CVPR 2020.
- [03/09] Lecture 03 – Autonomous Driving & Embodied Perception in the Wild
- Readings:
- Bojarski et al. End to End Learning for Self-Driving Cars. arXiv 2016.
- Bansal et al. ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst. RSS 2019.
- Gao et al. VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation. CVPR 2020.
- Sun et al. Scalability in Perception for Autonomous Driving: Waymo Open Dataset. CVPR 2020.
- Ettinger et al. Large Scale Interactive Motion Forecasting for Autonomous Driving: The Waymo Open Motion Dataset. ICCV 2021.
- Caesar et al. nuPlan: A Closed-Loop ML-Based Planning Benchmark for Autonomous Vehicles. CVPR Workshop 2021.
- Cao & de la Charette. MonoScene: Monocular 3D Semantic Scene Completion. CVPR 2022.
- Liu et al. BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation. ICRA 2023.
- Huang et al. TPVFormer: Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction. CVPR 2023.
- Hu et al. Planning-Oriented Autonomous Driving (UniAD). CVPR 2023 (Best Paper Award). [m]
- Cui et al. DriveLLM: Charting the Path Toward Full Autonomous Driving with Large Language Models. IEEE Transactions on Intelligent Vehicles, 2023.
- Wang et al. Drive Anywhere: Generalizable End-to-End Autonomous Driving with Multi-modal Foundation Models. arXiv 2023.
- [03/16] Lecture 04 – Scene Representations: Neural Fields & Gaussian Splatting
- [03/23] Lecture 05 – SLAM + Semantic Navigation
- [03/30] Lecture 06 – Guest Speaker: ADAS & Autonomous Systems
- [04/06] Lecture 07 – Holiday: Tomb-Sweeping Day (No Class)
- [04/13] Lecture 08 – Learning-Based Decision Making (RL & IL)
- [04/20] Lecture 09 – LLM for Robot Planning
- [04/27] Lecture 10 – Vision-Language-Action (VLA) Models
- [05/04] Lecture 11 – Embodied Reasoning
- [05/11] Lecture 12 – World Models for Robot Learning
- [05/18] Lecture 13 – Sim-to-Real Transfer
- [05/25] Lecture 14 – Robotic Manipulation: Tabletop to Mobile
- [06/01] Lecture 15 – Final Project — Demo & Presentation
- [06/08] Lecture 16 – Backup Session