The prevalence of capture devices (e.g., mobile phones, egocentric devices, drones, vehicles) and the advent of media-sharing services have drastically increased the volume of image and video collections. This gives rise to strong needs for effective recognition, mining, forecasting, and search. We are devoted to machine intelligence over large-scale multimodal data streams (e.g., images, videos, text).
Recently, we have focused on deep convolutional neural network (CNN) methods for large-scale image/video recognition, retrieval, and mining, in projects sponsored by leading industry partners (Microsoft Research, IBM Watson, NVIDIA, MediaTek, HTC, etc.). In particular, we aim for effective CNN methods for image/video applications such as image search, video event detection, face recognition, facial/clothing attribute prediction, and super-resolution.
Beyond these exciting applications, which demand intensive research in machine intelligence, we further identify several core challenges and respond to each:
- Semantic gap – bridging low-level visual features and semantic needs by proposing semantic ontologies and learning semantic representations automatically;
- User gap – helping users issue proper queries that express their intentions across application scenarios and mobile devices (e.g., by sketch, attribute, snapshot, speech, or touch);
- Volume gap – learning from ultra-large-scale photo and video collections via distributed learning and efficient high-dimensional indexing (e.g., hash-based methods) for real-time query response over big photo/video data, and balancing the technical strengths of mobile devices and cloud servers;
- Privacy – conducting privacy-preserving mining for large-scale photos and videos and addressing privacy concerns when sharing sensitive photos and videos (e.g., family albums) in public clouds;
- Industry needs – beyond rigorous algorithms for academic research, we also investigate practical methods to meet the needs of industrial development (e.g., technology transfer).
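To illustrate the hash-based indexing mentioned under the volume gap, here is a minimal sketch of random-hyperplane locality-sensitive hashing (LSH) for high-dimensional features; this is a generic textbook technique for illustration only, not the group's actual indexing system, and all names, dimensions, and data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM = 128      # descriptor dimensionality (illustrative)
N_BITS = 16    # length of the binary hash code

# Random hyperplanes: each bit of the code is the sign of one projection.
planes = rng.standard_normal((N_BITS, DIM))

def hash_code(vec):
    """Map a feature vector to a compact binary code (as an int)."""
    bits = (planes @ vec) > 0
    return int("".join("1" if b else "0" for b in bits), 2)

# Index a toy database of feature vectors into hash buckets.
db = rng.standard_normal((1000, DIM))
buckets = {}
for i, v in enumerate(db):
    buckets.setdefault(hash_code(v), []).append(i)

def query(vec, top_k=5):
    """Probe only the query's bucket, then rank candidates by cosine similarity."""
    cand = buckets.get(hash_code(vec), [])
    sims = [(i, float(db[i] @ vec) /
             (np.linalg.norm(db[i]) * np.linalg.norm(vec)))
            for i in cand]
    return sorted(sims, key=lambda t: -t[1])[:top_k]
```

Because nearby vectors tend to fall on the same side of the random hyperplanes, a query touches only one small bucket instead of scanning all 1000 vectors, which is what makes real-time response over big photo/video collections feasible; production systems would use multiple hash tables and multi-probe strategies to raise recall.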

Note that the project is kindly sponsored by
In this multi-year project, DeepTutor, we leverage very large-scale and diverse data streams (e.g., speech, video, images, text, geo-locations) to investigate advanced deep learning algorithms and intelligent human-computer interfaces (HCI) for a brand-new, proactive question-answering (QA) platform. Beyond plain QA in a passive manner, DeepTutor will proactively question and answer in a self-taught, reinforcement-driven manner, and will further take the role of a tutor that raises questions to guide its users (students). We need to investigate scalable and in-depth learning algorithms for inference over multimodal and noisy data streams. In a novel and unique aspect, we will devise brand-new HCI techniques for two-way interaction covering both QA and tutoring. DeepTutor will automatically generate questions for users to practice and then explain; for example, asking "What is the nearest planet to Earth?" after parsing a Discovery video. DeepTutor will also leverage how-to videos to answer and further guide users in a new augmented-reality (AR) environment, e.g., how to fix a flat bike tire. We will investigate deep reinforcement learning for new tangible interfaces and present answers and actionable tutoring steps in high quality and in an AR manner.
Collaborating with a leading university hospital in Taichung, we are advancing to another major tumor type, with data from thousands of patients provided but quality annotation/segmentation tools still lacking.
We have been investigating real-time systems for large-scale image retrieval since 2007 and have derived numerous indexing methods (hash-based and inverted-index-based) for various conditions. We further improve on a challenging issue, recall rate, by mining semantically relevant auxiliary visual features through visual and textual clusters in an unsupervised and scalable manner, yielding significantly better accuracy than traditional models. The results have been published in top venues (e.g., IEEE TMM'14, CVPR'11, ACM MM'11, ECIR'15). The PhD candidate, 
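The inverted-index style of retrieval mentioned above can be sketched as follows; this is a minimal bag-of-visual-words illustration under assumed toy data, not the group's published system. In a real pipeline the integer "visual words" would come from quantizing local descriptors against a learned codebook.

```python
from collections import Counter, defaultdict

# Toy corpus: each image is a bag of quantized visual-word IDs (assumed data).
images = {
    "img_a": [3, 7, 7, 42, 19],
    "img_b": [7, 42, 42, 8],
    "img_c": [1, 2, 3, 4],
}

# Inverted index: visual word -> postings list of (image, term frequency).
index = defaultdict(list)
for img, words in images.items():
    for w, tf in Counter(words).items():
        index[w].append((img, tf))

def retrieve(query_words, top_k=2):
    """Score only the images that share at least one visual word with the query."""
    scores = Counter()
    for w, q_tf in Counter(query_words).items():
        for img, tf in index.get(w, []):
            scores[img] += q_tf * tf   # simple dot-product scoring
    return scores.most_common(top_k)
```

A query of `[7, 42]` touches only the postings of those two words, so `img_c` is never scored at all; this sparsity is why inverted indexing scales to very large image collections, and tf-idf weighting or stop-word removal would typically replace the raw counts used here.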



Photos containing people are confirmed to be the most memorable for users. We propose a novel paradigm, called search by impression, which helps users compose their intention (a photo layout of persons) for the target images to search. We leverage contextual cues (in terms of human attributes and face locations) to index and search the photos. The method has received several recognitions, including a full paper in SIGIR'12 and
There are rich social contexts and human activities in the ever-growing body of user-contributed photos. We effectively mine the demographics (e.g., gender, age, race) of different locations and travel paths for personalized travel recommendation (ACM MM'11, IEEE TMM'13). Meanwhile, from the huge number of group photos, there arise strong needs for automatically understanding group types (e.g., family vs. classmates) for recommendation services, and even for predicting pairwise relationships to mine the implicit social connections in the photos. Our proposed graph-based method achieves a 30.5% relative improvement over prior low-level features (ACM MM'12). The main contributor, Yan-Ying Chen, was awarded the ACM Multimedia 2012 Doctoral Symposium Best Paper Award.
Semantic understanding via mobile visual recognition is essential for effectively understanding context and manipulating images/videos. More ambitiously, we proposed a brand-new mobile-compliant visual recognition framework, detailed in IEEE TMM'14, and a scalable recognition model in CVPR'15.
Mobile devices and servers differ greatly in their technical characteristics, while the communication between them is always limited. In this project, we aim to balance the strengths of mobiles and servers and to design bandwidth-friendly learning algorithms.

To make photos more visually appealing, users often apply filters to them. However, due to the growing number of filter types, choosing a proper filter is cumbersome. To address this issue, we propose a brand-new problem: filter recommendation for photo aesthetics.