[Introduction]

Course Goal: Cognitive computing refers to systems that learn at scale, reason with purpose, and interact with humans naturally, drawing on numerous emerging sensors and signals. Cognitive computing systems are trained to sense, predict, infer, and, in some ways, reason, using machine learning algorithms that operate over large-scale, noisy, and unstructured data streams.

The topic will be essential for current and future industrial needs and academic research opportunities.

We will go through insightful and informative learning strategies for visual signals (e.g., image, video, 3D) and show how to discover and design novel learning methods for such emerging learning frameworks, especially neural networks.

We aim to introduce state-of-the-art and essential machine learning algorithms for numerous core problems in cognitive computing. We investigate methods for machine perception and the subsequent action planning. We need to deal with noisy, unstructured, high-dimensional data in a rigorous and efficient manner.

We emphasize hands-on experience throughout the course via programming and experimental assignments, a midterm, and final projects. The lecture content will be organized around the state of the art, and the reading materials will be drawn mostly from the literature of top conferences.

Lecturer: Winston Hsu (office: R512, CSIE Building)

TA: Posheng Chen <d13922035@ntu.edu.tw> (office hour: 10am-12pm, Monday; @R501)

Time: 2:20pm – 5:10pm, Tuesday

Location: R105, CSIE Building

Info: Course number: CSIE 5420; Course serial number: 922 U4460

Assessment (tentative):

  • Assignments: 20-30%
  • Midterm Exam: 30-40%
  • Final Project: 30-40%

Resources:

  • The lecture slides, homework descriptions, and datasets will be posted on NTU Cool. Only students registered for the course can access them.
  • Discussions (questions) will also be hosted on NTU Cool.

Requirements: Background in image processing (or signal processing related courses), probability, and linear algebra. Experience with machine learning or statistical pattern recognition will be useful but not required.

Textbook: None. We will cover active research areas not yet included in any mature textbook. Nevertheless, we will provide a rich set of papers and reference books.

[Course Outline]

  • [09/02] Lecture 01 – introduction to the topic and course planning
    • Readings:
      • How to read a paper. S Keshav, SIGCOMM Comput. Commun. Rev. 37, 3 (Jul. 2007), 83-84. [m, must read]
      • Elsworth et al. Measuring the environmental impact of delivering AI at Google Scale. Google, August 2025 [o, optional]
      • Deep learning. Yann LeCun, Yoshua Bengio, Geoffrey Hinton, Nature 521, 436–444 (28 May 2015) [o]
      • Harnessing A.I. for Augmenting Creativity: Application to Movie Trailer Creation. John R Smith, Dhiraj Joshi, Benoit Huet, Winston Hsu, and Zef Cota. ACM Multimedia 2017. [o]
  • [09/09] Lecture 02 – understanding image, video and sensors
    • image sensor, video, compression, and video structure and syntax; an intuitive explanation of how images/videos are created and recorded.
    • Readings:
      • D. Le Gall, “MPEG: A Video Compression Standard for Multimedia Applications,” Communications of ACM, April 1991, Vol 34, No. 4, pp. 46-58. [m]
      • R. Ramanath et al. “Color Image Processing Pipeline: a general survey of digital still cameras”, IEEE Signal Processing Magazine, Jan 2005. [m]
      • Liang et al. Raw Image Deblurring. IEEE Trans. on Multimedia, 2022
      • S. Uchihashi and J. Foote, “Summarizing Video Using a Shot Importance Measure and a Frame-Packing Algorithm,” ICASSP 1999. 
      • Michael Brown, “Understanding the In-Camera Image Processing Pipeline for Computer Vision,” CVPR 2016. A tutorial.
      • Chang et al. Free-form Video Inpainting with 3D Gated Convolution and Temporal Patch GAN. ICCV 2019. (GitHub)
      • I. Koprinska, S. Carrato, “Temporal video segmentation: a survey,” Signal Processing: Image Communication, vol. 16, pp. 477–500, 2001. (Sec. 3.2 & 3.3, skipped)
      • Sony. The Basics of Camera Technology. (PDF)
  • [09/16] Lecture 03 – visual features: color + texture + shape
    • low-level visual features: color, texture, and shape
    • Readings:
      • “Texture features for browsing and retrieval of image data,” B. S. Manjunath and W.Y. Ma, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol.18, no.8, pp.837-42, Aug 1996.
      • “Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories,” S. Lazebnik, C. Schmid, and J. Ponce, CVPR 2006. [m]
      • “Color and Texture Descriptors,” B. S. Manjunath, Jens-Rainer Ohm, Vinod V. Vasudevan, Akio Yamada, IEEE Transactions on Circuits and Systems for Video Technology, Vol 11, No. 6, June 2001.
      • “MPEG-7 visual shape descriptors,” Miroslaw Bober, IEEE Transactions on Circuits and Systems for Video Technology, Vol 11, No. 6, June 2001. [m]
      • “Representing shape with a spatial pyramid kernel,” A. Bosch, et al., CIVR 2007.
  • [09/23] Lecture 04 – visual features: local features
    • Local features and visual words
    • Readings:
      • “Video Google: A Text Retrieval Approach to Object Matching in Videos,” J. Sivic, and A. Zisserman, ICCV, 2003. [m]
      • “Aggregating local descriptors into a compact image representation,” H. Jégou, M. Douze, C. Schmid, and P. Pérez. CVPR 2010. [m]
      • “Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories,” S. Lazebnik, C. Schmid, and J. Ponce, CVPR 2006.
      • “Distinctive Image Features from Scale-Invariant Keypoints,” Lowe, IJCV, 2004.
      • “A Performance Evaluation of Local Descriptors,” Mikolajczyk, PAMI 2005.
      • “A Comparison of Affine Region Detectors,” Mikolajczyk, IJCV, 2004.
      • “Scale & Affine Invariant Interest Point Detectors,” Mikolajczyk, IJCV, 2004.
      • “Scalable Face Image Retrieval using Attribute-Enhanced Sparse Codewords,” Bor-Chun Chen, Yan-Ying Chen, Yin-Hsi Kuo, Winston H. Hsu, IEEE Transactions on Multimedia, 2013.
      • Chi-Ming Chung, Yang-Che Tseng, Ya-Ching Hsu, Xiang-Qian Shi, Yun-Hung Hua, Jia-Fong Yeh, Wen-Chin Chen, Yi-Ting Chen, Winston H Hsu. Orbeez-SLAM: A Real-time Monocular Visual SLAM with ORB Features and NeRF-realized Mapping. ICRA 2023.
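    • As a preview of the bag-of-visual-words idea covered in this lecture (e.g., in the Video Google paper), here is a minimal numpy-only sketch. It assumes a pre-built codebook of visual words; in practice the codebook is learned by k-means over local descriptors (e.g., SIFT) pooled from many training images, and the toy data below is random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for 128-D SIFT-like local descriptors from one image.
descriptors = rng.normal(size=(200, 128))

# A "visual vocabulary": k codewords, normally learned by k-means
# over descriptors pooled from many training images.
k = 16
codebook = rng.normal(size=(k, 128))

# Quantize: assign each descriptor to its nearest codeword (Euclidean).
dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
assignments = dists.argmin(axis=1)

# The image representation is the L1-normalized codeword histogram,
# which can then be indexed and matched like a text document.
bow = np.bincount(assignments, minlength=k).astype(float)
bow /= bow.sum()

print(bow.shape)  # (16,)
```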
  • [09/30] Lecture 05 – visual features: face
    • early face detection, facial attributes, and face recognition; the new deep-learning paradigm will be introduced in the CNN sessions.
    • Readings:
      • Wang et al. Deep Face Recognition: A Survey. arXiv 2019
      • “Robust real-time face detection,” P. Viola and M. Jones, IJCV 57(2), 2004.
      • “An extended set of Haar-like features for rapid object detection,” Lienhart, R. and Maydt, J., ICIP, 2002.
      • “FaceTracer: A Search Engine for Large Collections of Images with Faces,” Neeraj Kumar, Peter N. Belhumeur, Shree K. Nayar, ECCV, 2008. [m]
      • Bor-Chun Chen, Chu-Song Chen, Winston H. Hsu, “Cross-Age Reference Coding for Age-Invariant Face Recognition and Retrieval,” ECCV 2014 [m]
      • Wu et al., A Light CNN for Deep Face Representation With Noisy Labels. IEEE Transactions on Information Forensics and Security, 2018
      • “Face recognition using eigenfaces,” M. A. Turk, A.P. Pentland, CVPR, 1991.
      • “Face Recognition with Local Binary Patterns,” Timo Ahonen, ECCV, 2004.
      • “Scalable Face Image Retrieval using Attribute-Enhanced Sparse Codewords,” Bor-Chun Chen, Yan-Ying Chen, Yin-Hsi Kuo, Winston H. Hsu, IEEE Transactions on Multimedia, 2013.
      • “Toward Large-Scale Face Recognition Using Social Network Context,” Z. Stone, T. Zickler, and T. Darrell, Proceedings of the IEEE, 2010.
      • “Discovering Informative Social Subgraphs and Predicting Pairwise Relationships from Group Photos,” Yan-Ying Chen, Winston H. Hsu, Hong-Yuan Mark Liao, ACM Multimedia 2012.
  • [10/07] Lecture 06 – ranking and learning to hash
    • Reviewing key hash-based indexing methods
    • Readings:
      • Kristen Grauman. Efficiently Searching for Similar Images. Communications of the ACM, 2009. (good paper) [m]
      • Malcolm Slaney and Michael Casey, “Locality-Sensitive Hashing for Finding Nearest Neighbors,” IEEE Signal Processing Magazine, 2008. [m]
      • Alexandr Andoni and Piotr Indyk, “Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions,” Communications of the ACM, 2008. (good paper)
      • Wang et al., “Learning to Hash for Indexing Big Data—A Survey,” Proceedings of the IEEE, 2016. 
      • M. Datar, et al. Locality-sensitive hashing scheme based on p-stable distributions. SoCG 2004.
      • P. Indyk et al. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998
      • Scalable object detection by filter compression with regularized sparse coding. Ting-Hsuan Chao, Yen-Liang Lin, Yin-Hsi Kuo, Winston H. Hsu. CVPR 2015.
      • Junfeng He et al. Mobile Product Search with Bag of Hash Bits and Boundary Reranking. CVPR 2012.
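    • To make the hash-based indexing idea concrete, here is a minimal numpy-only sketch of random-hyperplane LSH for angular similarity (the SimHash family); the dimensions, bit count, and toy vectors below are illustrative assumptions, not from any of the papers above.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_bits = 64, 16
# Random hyperplanes: each bit records which side of a hyperplane
# the input vector falls on, so similar vectors get similar codes.
hyperplanes = rng.normal(size=(d, n_bits))

def lsh_code(x):
    """Map a d-dim vector to an n_bits binary hash code."""
    return (x @ hyperplanes > 0).astype(np.uint8)

def hamming(a, b):
    """Number of differing bits between two codes."""
    return int((a != b).sum())

x = rng.normal(size=d)
near = x + 0.01 * rng.normal(size=d)   # small perturbation of x
far = rng.normal(size=d)               # unrelated vector

# Hamming distance between codes tracks angular distance between vectors;
# candidates sharing a code (or bucket) can be retrieved in sub-linear time.
print(hamming(lsh_code(x), lsh_code(near)), hamming(lsh_code(x), lsh_code(far)))
```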
  • [10/14] Lecture 07 – feature reduction and manifold
    • Readings:
      • Ella Bingham and Heikki Mannila. “Random projection in dimensionality reduction: Applications to image and text data”. KDD 2001. (a very good paper)
      • “Graph Embedding and Extensions: A General Framework for Dimensionality Reduction,” Shuicheng Yan et al., PAMI 2007. (a very good paper)
      • “Eigenfaces for recognition,” M Turk, A Pentland – Journal of Cognitive Neuroscience, 1991. (page 72, The Eigenface Approach, to page 76 ONLY)
      • “Nonlinear dimensionality reduction by locally linear embedding,” Roweis & Saul, Science, 2000.
      • Vittorio Castelli, “Multidimensional Indexing Structures for Content-based Retrieval,” IBM Research Report, 2001. (overview paper, section 1 & 2 ONLY) [problems in high-dim features listed here]
      • Laurens van der Maaten, Geoffrey Hinton. Visualizing Data using t-SNE; The Journal of Machine Learning Research, 2008. [m]
  • [10/21] Lecture 08 – midterm
    • Coverage: TBA
    • Closed book
    • 2:20pm, R105
  • [10/28] Lecture 09 – visual learning with neural networks (I)
    • Applications in segmentation, classification, detection, etc.
  • [11/04] Lecture 10 – visual learning with neural networks (II)
    • Neural networks beyond convolutions
    • Varying effective cost functions for neural networks
  • [11/11] Lecture 11 – 3D learning
  • [11/18] Lecture 12 – aesthetic learning
  • [11/25] Lecture 13 – sentiment/emotion learning
  • [12/02] Lecture 14 – visual comprehension & system 2
  • [12/09] Lecture 15 – final project presentation
    • Workshop-style presentations; drinks and snacks will be provided (but bring your own mugs)
    • Each team will have 8–10 minutes (to be finalized)
    • Two Best Awards (Technical and Presentation) will be selected