Touch-based display devices have enabled rich interactions between users and videos. Objects appearing in videos often prompt users to seek related knowledge about them. In this paper, we propose a video playback system that lets users interactively query objects of interest in videos. Because the text accompanying a video may not be strongly related to the object of interest, we adopt visual appearance as the feature for retrieving similar objects from large image collections. The tags associated with the retrieved images are then used to reveal information about the object of interest, which can be exploited to find further related knowledge. Querying with only a single viewpoint of the object is not robust, since it is sensitive to pose variations and occlusions. We therefore present a novel video object segmentation approach, based on a 3D graph cut framework, that extracts the object region across multiple frames to improve retrieval precision. To ensure prompt response and effectiveness, we augment the algorithm with compressed-domain motion vectors; compared with the prior method, our approach is significantly faster. Experiments on community-contributed videos demonstrate the effectiveness of our multi-frame object region query and the resulting improvement in retrieval precision.