This project is kindly sponsored by the NVIDIA AI Lab and by Ministry of Science and Technology (MOST) AI grants.
With the advance of social media (e.g., Flickr and Instagram) and capture devices, the volume of user-contributed photos has increased dramatically. So has the number of online videos, which grow rapidly and serve diverse purposes such as lecturing, experience sharing, commenting, and how-to guides. Among them, knowledge-related videos (e.g., Discovery, History), how-to videos (e.g., Howcast, wikiHow), and online courses (e.g., Coursera, Udacity, edX) are all freely available knowledge sources.
Despite huge progress in deep-learning-based image/video search and semantic understanding in recent years, such techniques can only match visually similar instances or map low-level signals to pre-defined labels. They are still far from human cognitive capabilities: the ability to comprehend, organize, and memorize very large-scale visual content and then infer and reason about answers and intentions.
In this multi-year project, DeepTutor, we leverage very large-scale and diverse data streams (e.g., speech, video, images, text, and geo-locations) to investigate advanced deep learning algorithms and intelligent human-computer interfaces (HCI) that enable a brand-new, proactive question answering (QA) platform. Beyond plain QA in a passive manner, DeepTutor will proactively ask and answer questions in a self-taught and reinforcement manner, and will further take the role of a tutor that guides users (students) and raises questions for them. We will investigate scalable and in-depth learning algorithms for inference over multi-modal and noisy data streams. In a novel and unique aspect, we will devise brand-new HCI techniques that enable two-way interactions: QA and tutoring. DeepTutor will automatically generate questions for users to practice and then explain the answers; for example, asking “what is the nearest planet to Earth” after parsing a Discovery video. DeepTutor will also leverage how-to videos to answer questions and further guide users in a new augmented-reality (AR) environment, e.g., showing how to fix a flat bike tire. We will investigate deep reinforcement learning in these new tangible interfaces and present answers and actionable tutoring steps in high quality and in AR.
QA systems are an advanced technique serving numerous high-value applications in e-commerce, customer support, education, manufacturing, healthcare, etc. Current solutions are restricted to text only, while huge amounts of freely available online data in the form of images, videos, comments, geo-locations, etc., remain unexploited. Beyond that, we aim to accept questions posed as images or videos: for example, a picture as a query on how to operate a new appliance, or a golf swing video asking the QA system how to adjust the swing posture.
Despite the progress of deep learning methods in text-based QA systems, exploiting these rich and multimodal data streams remains very challenging. Effective semantic understanding is required, and we further need to memorize and perform inference across cross-modal and high-dimensional data streams. Such scalable, multimodal, and end-to-end deep learning QA systems are vital but still missing. While tackling this essential research problem, we shall contribute our curated benchmark data to the international research community.
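As a rough illustration of what an end-to-end multimodal QA model of this kind might look like, the sketch below fuses pre-extracted video-frame features with an encoded question and scores a fixed set of candidate answers. It is a minimal sketch under assumed design choices: the class name ToyMultimodalQA, the layer types, the feature dimensions, and the concatenation-based fusion are all hypothetical and are not the project's actual architecture.

```python
# Minimal multimodal (video + text) QA sketch in PyTorch.
# All names, dimensions, and design choices here are illustrative assumptions,
# not the DeepTutor project's actual model.
import torch
import torch.nn as nn


class ToyMultimodalQA(nn.Module):
    """Encodes video frames and a question, fuses them, and scores candidate answers."""

    def __init__(self, frame_dim=2048, vocab_size=10000, embed_dim=256,
                 hidden_dim=512, num_answers=1000):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, hidden_dim)      # project per-frame visual features
        self.word_embed = nn.Embedding(vocab_size, embed_dim)   # question word embeddings
        self.question_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),                 # one score per candidate answer
        )

    def forward(self, frame_feats, question_tokens):
        # frame_feats: (batch, num_frames, frame_dim); question_tokens: (batch, seq_len)
        video_repr = self.frame_proj(frame_feats).mean(dim=1)   # average-pool over frames
        _, q_hidden = self.question_rnn(self.word_embed(question_tokens))
        fused = torch.cat([video_repr, q_hidden[-1]], dim=-1)   # simple concatenation fusion
        return self.classifier(fused)


# Toy usage with random tensors standing in for real frame features and token ids.
model = ToyMultimodalQA()
scores = model(torch.randn(2, 16, 2048), torch.randint(0, 10000, (2, 12)))
print(scores.shape)  # torch.Size([2, 1000])
```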
This multi-year project is conducted in collaboration with Prof. Yung-Yu Chuang, Prof. Robin Bing-Yu Chen, and Prof. Hung-Yi Lee.