[Introduction]
This graduate course examines the principles and advanced research topics of embodied intelligence, focusing on how autonomous agents perceive, learn, and act in complex environments. The course covers 3D perception, neural scene representations (NeRF, Gaussian Splatting), motion prediction, planning and control, robotic manipulation, SLAM, mobile autonomy, simulation-based learning, sim-to-real transfer, and emerging robot foundation models. The lectures are organized as a progressive pipeline: starting with perception, moving through learning and action, and culminating in (robot) foundation models. The class is structured as a research seminar: students are required to present papers, lead discussions, and participate in weekly critiques of state-of-the-art research. Guest speakers from academia and industry may be invited to provide additional perspectives on real-world embodied AI systems.
Course Goals:
- To introduce students to the core concepts linking perception, learning, and action in autonomous systems (e.g., robots, autonomous driving).
- To expose students to cutting-edge research in 3D perception, neural scene representations, robot learning, and autonomous systems.
- To develop students’ abilities to critically read, analyze, and evaluate research papers.
- To strengthen presentation and scientific communication skills through weekly student-led sessions.
- To prepare students for conducting independent research in embodied perception, robot learning, and related fields.
- To develop students’ ability to evaluate industry opportunities for autonomy technologies.
Lecturer: Winston Hsu (office: R512, CSIE Building)
TA: Posheng Chen <d13922035@ntu.edu.tw> (office hour: 10am-12pm, Monday; @R501)
Time: 2:20pm – 5:10pm, Monday
Location: R546, CSIE Building
Info: Course number: CSIE 5453; course identification code: 922 U5030
Lecture Format: Research seminar consisting of lectures, student paper presentations, and weekly critiques/discussions of state-of-the-art research.
Assessment (tentative):
- Course participation & paper summarization: 70% (examples of paper summaries will be provided)
- Final Project: 30%
Resources:
- The lecture slides, homework descriptions, and datasets will be posted on NTU Cool. Only students registered for the course can access them.
- Discussions and questions will also be hosted on NTU Cool.
Requirements: Background in image processing (or related signal-processing courses), machine learning, deep learning, computer vision, etc. Experience with robots is useful but not required.
Textbook: None. We will cover active research areas not yet included in any mature textbook; instead, we will provide an extensive set of papers and reference books.
[Course Outline]
- [02/23] Lecture 01 – Introduction: Embodied AI Landscape
- Readings:
- S. Keshav. How to Read a Paper. SIGCOMM Comput. Commun. Rev. 37(3), Jul. 2007, 83–84. [m, must read]
- Liu et al. Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI. IEEE/ASME Transactions on Mechatronics, Dec. 2025. [m]
- Duan et al. Benchmarking Deep Reinforcement Learning for Continuous Control. ICML 2016. [o, optional]
- Tips for presentations (see Henning’s page)
- [03/02] Lecture 02 – Sensors, 3D Perception & Representation Learning
- Readings:
- Qi et al. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. CVPR 2017.
- Qi et al. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. NeurIPS 2017.
- Zhou & Tuzel. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. CVPR 2018. [m]
- He et al. MAE: Masked Autoencoders Are Scalable Vision Learners. CVPR 2022.
- Caesar et al. nuScenes: A Multimodal Dataset for Autonomous Driving. CVPR 2020.
- [03/09] Lecture 03 – Autonomous Driving & Embodied Perception in the Wild
- Readings:
- Bojarski et al. End to End Learning for Self-Driving Cars. arXiv 2016.
- Bansal et al. ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst. RSS 2019.
- Gao et al. VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation. CVPR 2020.
- Sun et al. Scalability in Perception for Autonomous Driving: Waymo Open Dataset. CVPR 2020.
- Ettinger et al. Large Scale Interactive Motion Forecasting for Autonomous Driving: The Waymo Open Motion Dataset. ICCV 2021.
- Caesar et al. nuPlan: A Closed-Loop ML-Based Planning Benchmark for Autonomous Vehicles. CVPR Workshop 2021.
- Cao & de la Charette. MonoScene: Monocular 3D Semantic Scene Completion. CVPR 2022.
- Liu et al. BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation. ICRA 2023.
- Huang et al. TPVFormer: Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction. CVPR 2023.
- Hu et al. Planning-Oriented Autonomous Driving (UniAD). CVPR 2023 (Best Paper Award). [m]
- Cui et al. DriveLLM: Charting the Path Toward Full Autonomous Driving with Large Language Models. IEEE Transactions on Intelligent Vehicles, 2023.
- Wang et al. Drive Anywhere: Generalizable End-to-End Autonomous Driving with Multi-modal Foundation Models. arXiv 2023.
- [03/16] Lecture 04 – Scene Representations: Neural Fields & Gaussian Splatting
- [03/23] Lecture 05 – SLAM + Semantic Navigation
- [03/30] Lecture 06 – Guest Speaker: ADAS & Autonomous Systems
- [04/06] Lecture 07 – Holiday: Tomb-Sweeping Day (No Class)
- [04/13] Lecture 08 – Learning-Based Decision Making (RL & IL)
- [04/20] Lecture 09 – LLM for Robot Planning
- [04/27] Lecture 10 – Vision-Language-Action (VLA) Models
- [05/04] Lecture 11 – Embodied Reasoning
- [05/11] Lecture 12 – World Models for Robot Learning
- [05/18] Lecture 13 – Sim-to-Real Transfer
- [05/25] Lecture 14 – Robotic Manipulation: Tabletop to Mobile
- [06/01] Lecture 15 – Final Project — Demo & Presentation
- [06/08] Lecture 16 – Backup Session