This project is kindly sponsored by the NVIDIA AI Lab and by Ministry of Science and Technology (MOST) AI grants.
With the advance of social media (e.g., Flickr and Instagram) and capture devices, the volume of user-contributed photos has increased dramatically. So has the number of online videos, which grow rapidly and serve diverse purposes such as lecturing, experience sharing, commenting, and how-to guides. Among them, knowledge-related videos (e.g., Discovery, History), how-to videos (e.g., Howcast, wikiHow), and online courses (e.g., Coursera, Udacity, edX) are all freely available knowledge sources.
Despite huge progress in deep-learning-based image/video search and semantic understanding in recent years, such techniques can only match visually similar instances or map low-level signals to pre-defined labels. They are still far from human cognitive capabilities: the ability to comprehend, organize, and memorize very large-scale visual content and then infer and reason about answers and intentions.
In this multi-year project, DeepTutor, we leverage very large-scale and diverse data streams (e.g., speech, video, images, text, and geo-locations) to investigate advanced deep learning algorithms and intelligent human-computer interfaces (HCI) that enable a brand-new, proactive question answering (QA) platform. Beyond plain QA in a passive manner, DeepTutor will proactively ask and answer questions in a self-taught and reinforcement manner, and will further take the role of a tutor that guides users (students) and raises questions for them. We will investigate scalable and in-depth learning algorithms for inference over multi-modal and noisy data streams. In a novel and unique aspect, we will devise brand-new HCI techniques that enable two-way interactions: QA and tutoring. DeepTutor will automatically generate questions for users to practice and then explain the answers; for example, asking “what is the nearest planet to Earth” after parsing a Discovery video. DeepTutor will also leverage how-to videos to answer questions and further guide users in a new augmented-reality (AR) environment, e.g., showing how to fix a flat bike tire. We will investigate deep reinforcement learning in these new tangible interfaces and present answers and actionable tutoring steps in high quality and in AR.
QA systems are an advanced technique serving numerous high-value applications in e-commerce, customer support, education, manufacturing, healthcare, etc. Current solutions are restricted to text only, while huge amounts of freely available online data in the form of images, videos, comments, geo-locations, etc., remain unexploited. Beyond that, we aim to accept questions posed as images or videos: for example, a picture as a query on how to operate a new appliance, or a golf swing video asking the QA system how to adjust the swing posture.
Despite the progress of deep learning methods in text-based QA systems, exploiting these rich and multimodal data streams remains very challenging. Effective semantic understanding is required, and we further need to memorize and perform inference across cross-modal and high-dimensional data streams. Such scalable, multimodal, and end-to-end deep learning QA systems are vital but still missing. While tackling this essential research problem, we shall contribute our curated benchmark data to the international research community.
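As a rough illustration of what an end-to-end multimodal QA model of this kind might look like, the sketch below fuses pre-extracted video-frame features with an encoded question and scores a fixed set of candidate answers. It is a minimal sketch under assumed design choices: the class name ToyMultimodalQA, the layer types, the feature dimensions, and the concatenation-based fusion are all hypothetical and are not the project's actual architecture.

```python
# Minimal multimodal (video + text) QA sketch in PyTorch.
# All names, dimensions, and design choices here are illustrative assumptions,
# not the DeepTutor project's actual model.
import torch
import torch.nn as nn


class ToyMultimodalQA(nn.Module):
    """Encodes video frames and a question, fuses them, and scores candidate answers."""

    def __init__(self, frame_dim=2048, vocab_size=10000, embed_dim=256,
                 hidden_dim=512, num_answers=1000):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, hidden_dim)      # project per-frame visual features
        self.word_embed = nn.Embedding(vocab_size, embed_dim)   # question word embeddings
        self.question_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),                 # one score per candidate answer
        )

    def forward(self, frame_feats, question_tokens):
        # frame_feats: (batch, num_frames, frame_dim); question_tokens: (batch, seq_len)
        video_repr = self.frame_proj(frame_feats).mean(dim=1)   # average-pool over frames
        _, q_hidden = self.question_rnn(self.word_embed(question_tokens))
        fused = torch.cat([video_repr, q_hidden[-1]], dim=-1)   # simple concatenation fusion
        return self.classifier(fused)


# Toy usage with random tensors standing in for real frame features and token ids.
model = ToyMultimodalQA()
scores = model(torch.randn(2, 16, 2048), torch.randint(0, 10000, (2, 12)))
print(scores.shape)  # torch.Size([2, 1000])
```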
This multi-year project is conducted in collaboration with Prof. Yung-Yu Chuang, Prof. Robin Bing-Yu Chen, and Prof. Hung-Yi Lee.