Volume 5 Issue 3
Winter 2009
ISSN 1937-7266

Multi-modal Surrogates for Retrieving and Making Sense of Videos:
Is Synchronization Always the Best and Necessary?

Yaxiao Song

School of Information and Library Science
University of North Carolina at Chapel Hill
yaxiaos@email.unc.edu

ABSTRACT

Video surrogates can help people quickly make sense of the content of a video before downloading or seeking more detailed information. Visual and audio features of a video are primary information carriers and can become important components of video retrieval and video sensemaking. In past decades, most research and development efforts on video surrogates have focused on visual features of videos; comparatively little work has been done on audio surrogates or on examining their effectiveness in aiding users’ retrieval and sensemaking of digital videos. Even less work has been done on multi-modal surrogates, where more than one modality is employed for consuming the surrogates, for example, the audio (i.e., hearing) and visual (i.e., seeing) modalities. One goal of my study is to assess the potential value of audio surrogates, either alone or combined with visual surrogates, for retrieving and making sense of videos. Another goal is to examine the effectiveness of a number of multi-modal surrogates and to investigate whether synchronization between the audio and visual channels is always necessary and best.

1. Introduction

We live in a digital information world. Most videos today are produced in digital form. Computing technologies have promoted the creation, availability, and distribution of massive amounts of digital video. As digital video cameras and webcams become common household appliances, making videos has become both easier and less expensive. As a result, large collections of digital videos (e.g., YouTube, Internet Archive, Open Video) are increasingly available for people to download and use on devices ranging from desktop computers to small devices such as cell phones and PDAs. The tremendous volume of digital video, in turn, requires effective and efficient access to those videos. While digital video is becoming increasingly ubiquitous, the usability of web-based video retrieval is often quite poor. One problem is that searchers need better summaries, excerpts, or other highly condensed representations of the videos, both to judge whether to download and view the full videos and to locate the most relevant minutes from thousands of hours of video content. We refer to these human-consumable summaries as surrogates.

Surrogates, as condensed representations of primary information objects, are crucial for browsing and retrieving information in large digital collections. Video surrogates, in particular, aim to help people quickly make sense of the content of a video before downloading or seeking further detailed information about it. A number of usability studies (Ding et al., 1999; Yang et al., 2003; Wildemuth et al., 2002, 2003) and several real-life video retrieval and search applications (e.g., YouTube, Internet Archive, Open Video) have shown that people can quickly make sense of videos by viewing abbreviated video surrogates.

Different surrogates have different advantages and disadvantages, and the unique advantages of different surrogates can be selectively applied in video retrieval systems. The bulk of surrogates in today’s video retrieval systems are still text-based, although visual surrogates that represent visual cues of the videos, such as poster frames, storyboards, and fast forwards, have become available in some digital video repositories (e.g., Internet Archive, Open Video). Many years of TRECVID results demonstrate that linguistic data generally lead to better performance in video retrieval than visual features (Smeaton et al., 2004). In 2005, some groups showed better performance with visual features than with linguistic features, but under very difficult linguistic conditions, where translation across multiple languages was automated (Over et al., 2005). Nevertheless, some studies report that people like to have visual surrogates regardless of their effects on performance. For example, He et al. (2000) compared four ways of summarizing presentations: presentation slides, text transcripts, transcripts with highlighted points, and a manually created audio-video summary. They reported that users preferred the audio-video summaries to both text transcripts and presentation slides, even though the four summarization methods produced comparable results.

Keep in mind that videos are inherently multi-modal presentations: they combine a series of moving pictures playing at a constant speed (i.e., 25 to 30 frames per second) with a synchronized audio track. The bulk of surrogates in today’s video retrieval systems, however, are single-modal: they are either text-based surrogates or visual surrogates, whereas the notion of audio surrogates has not been well discussed or examined by researchers. Not surprisingly, few systems in practice use audio surrogates to aid video browsing and retrieval. Yet audio surrogates for digital videos, either stand-alone or combined with visual surrogates, can be powerful and promising, because they engage humans’ natural ability to hear and require no training for sensemaking. If reasonably designed, audio surrogates can effectively assist people in understanding the content of videos. Not only can audio surrogates be downloaded much faster than videos, but they also require fewer system resources and minimal screen real estate, and thus have the potential to be successful surrogate alternatives for the limited displays of small devices (e.g., PDAs, cell phones). Moreover, audio surrogates can be used in situations where people cannot use vision to work with videos, for example, when driving or walking on the street. Most importantly, audio surrogates can be even more powerful if successfully combined with visual surrogates to leverage multiple sensory channels without significantly increasing people’s cognitive load. Video surrogates that leverage multiple sensory channels are multi-modal surrogates. For instance, slide shows can be augmented with audio narration to leverage both the hearing and seeing modalities. Movie trailers are another example of multi-modal surrogates for videos; however, they are usually extremely well made by professionals and are very expensive to produce. To make surrogates available for large video collections, we need inexpensive surrogates that can be created automatically.

My study will assess the potential value of audio surrogates, either alone or combined with visual surrogates, for retrieving and making sense of videos. In particular, spoken audio, either extracted from the video or generated by a speech synthesizer, will be of interest. Besides examining audio-alone surrogates, another goal of my study is to investigate the effectiveness of several multi-modal surrogates, which leverage both the auditory and visual sensory channels during consumption. The obvious gain from using multiple sensory channels is increased usability: the weaknesses of one modality can be offset by the strengths of another. More importantly, the necessity of synchronization between the audio and visual channels of the surrogates will be studied. With surrogates consisting of audio and visual abstracts from the video, is it always necessary to carefully synchronize the audio and visual channels?
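To illustrate how inexpensive a synthesized spoken description can be to produce, here is a minimal sketch assuming the open-source pyttsx3 text-to-speech library; the description text and output file name are placeholders for illustration, not artifacts of my studies:

    # Minimal sketch: turn a textual video description into a spoken-audio
    # surrogate. Assumes the pyttsx3 text-to-speech library; the description
    # text and output path below are placeholders.
    import pyttsx3

    description = ("This NASA Connect episode shows how engineers use ratios "
                   "and proportions when designing experimental aircraft.")

    engine = pyttsx3.init()
    engine.setProperty("rate", 170)  # speaking rate (words per minute)
    # Save the synthesized narration to a file instead of playing it aloud.
    engine.save_to_file(description, "spoken_description.wav")
    engine.runAndWait()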

According to Paivio’s Dual Coding Theory (Paivio, 1986), verbal and visual information are simultaneously processed in two separate subsystems of human cognition. Paivio (1972) suggested that sensory modalities, for example the visual and auditory modalities, may have additive effects on recall as well. The additive effects of multiple sensory modalities were confirmed by Thompson and Paivio (1994): hearing the sound and seeing the picture of an audiovisual object (e.g., a phone) yields an additive effect on object recall compared to a single-modality stimulus. Moreover, the additive effects of multiple modalities were greater than those of simple repetition within the same modality.

Additive effects of dual coding have been found in other sensory modalities as well. Lyman and McDaniel (1990) found that seeing pictures (i.e., visual) and smelling odors (i.e., olfactory) at the same time led to higher recognition memory and free verbal recall of odors, while combining auditory, visual, and olfactory stimuli increased recall further.

Over the years, there have been many criticisms of the Dual Coding Theory (DCT). Some criticisms focus on inconsistent experimental findings; for example, pictures, which are more readily dually coded than words, are not always recalled better than words, especially by young children (Dilley and Paivio, 1968). Paivio (2006) suggested this might be explained by children’s difficulties in labeling the pictures. Experiments by Cole et al. (1971) supported Paivio’s explanation of the inconsistency: they explicitly asked children in Grades 1 through 8 to name the pictures, and found that the children did recall pictures better than words, owing to the additive effects of verbal and nonverbal dual coding. Also, as summarized in (Paivio, 2006), the superiority of pictures over words diminishes as people age, such that for older adults the dual-coding advantage of pictures over words may not be found. For a detailed summary of the critiques and rejoinders of DCT, please refer to (Paivio, 2006, chapter 4, pp. 82–86).

As Paivio (2006, p. 58) wrote:

“The verbal and nonverbal systems, although functionally independent, must coordinate their activities to achieve common goals... Independence means that the systems can be active separately or together. Cooperation is possible because each system can activate the other via their interconnections.”

Although many researchers accept that, for primary information objects, coordinated media channels lead to better understanding, retention, and satisfaction, there is as yet no firm conclusion on whether this coordination requirement also applies to highly abbreviated, condensed information objects such as video surrogates. In fact, my two previous studies (Song and Marchionini, 2007; Marchionini et al., 2009) are among the first to examine multi-modal surrogates and the necessity of synchronization, and I will discuss them in more detail in the next section.

In my study, in addition to the synchronization issues between the audio and visual channels, the strategies for sampling the most salient abstracts from the two channels will also be of interest. I will distinguish pre-processed integration for video surrogates, where the audio and visual channels are sampled simultaneously and thus are pre-coordinated at indexing time, from user-centered integration, where the audio and visual channels are sampled independently (i.e., are uncoordinated) and must be integrated in the user’s head at consumption time. Even though pre-coordination at indexing time may be desirable from users’ affective point of view, because we are more used to synchronized presentations, my hypothesis is that, because the most salient abstracts from the audio channel do not necessarily align temporally with the most salient abstracts from the visual channel, independently sampled abstracts may carry more information than temporally coordinated ones; therefore, sampling the most salient excerpts from each channel independently and letting users integrate the uncoordinated channels at consumption time may offer greater sensemaking potential.
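To make the two integration strategies concrete, here is a minimal sketch assuming each channel has per-segment salience scores; the scores, segment counts, and the joint-salience rule are hypothetical illustrations, not the method of my studies:

    # Contrast of the two integration strategies, assuming per-segment
    # salience scores for each channel (all values hypothetical).

    def top_segments(scores, k):
        """Return the indices of the k most salient segments."""
        return sorted(range(len(scores)), key=lambda i: scores[i],
                      reverse=True)[:k]

    audio_salience  = [0.2, 0.9, 0.1, 0.7, 0.3]  # e.g., speech informativeness
    visual_salience = [0.8, 0.1, 0.6, 0.2, 0.9]  # e.g., frame distinctiveness
    k = 2

    # Pre-coordinated at indexing time: one set of time points, chosen by
    # joint salience, drives BOTH channels, so audio and visual stay in sync.
    joint = [a + v for a, v in zip(audio_salience, visual_salience)]
    coordinated = top_segments(joint, k)            # -> [4, 0]

    # User-centered integration: each channel keeps its own most salient
    # segments; the chosen time points need not align across channels.
    audio_pick  = top_segments(audio_salience, k)   # -> [1, 3]
    visual_pick = top_segments(visual_salience, k)  # -> [4, 0]

Note that in the independent case the audio excerpts (segments 1 and 3) capture salient speech that the coordinated selection (segments 4 and 0) misses entirely, which is exactly the intuition behind the hypothesis above.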

With these research questions in mind, a series of usability studies has been designed to examine the efficacy of audio-alone surrogates and the effectiveness of multi-modal surrogates that use both audio and visual channels. The benefits of synchronized surrogate channels, as well as the benefits of independent sampling across channels, will be investigated carefully. The next section discusses two of my preliminary studies and the results achieved so far (Song and Marchionini, 2007; Marchionini et al., 2009). Section 3 discusses the usability study I propose to conduct as a follow-up to the first two studies, so as to answer the questions about synchronization and sampling strategies. Section 4 concludes the paper and summarizes the contributions of my studies.

2. Previous Work

I conducted two usability studies, in 2006 and 2007 respectively, to examine the effectiveness of audio-alone and multi-modal surrogates for making sense of videos.

2.1 The First Study

Our 2006 user study (Song and Marchionini, 2007) investigated the effectiveness of three different surrogates for making sense of digital videos in digital video libraries. One visual-only, one audio-only, and one combined audio-and-visual surrogate condition were examined (see Table 1).

Table 1: Three surrogate conditions for the 1st study

The following five tasks were used to investigate the effectiveness of the surrogates:

  • Task 1: Written Gist Determination Task
  • Task 2: Keyword Recognition Task
  • Task 3: Title Selection Task
  • Task 4: Keyframe Recognition Task
  • Task 5: Verbal Gist Recognition Task

The order of the tasks is important. For example, the open-ended gist-writing task should be completed first so that participants do not gain extra information from the other tasks; in particular, the gist selection task, which includes actual descriptions of the video, should come after all the other tasks. See Table 9 in the Appendix for detailed explanations of all the tasks used in this series of user studies.

To evaluate the effectiveness of the three surrogate conditions, the measures included performance, confidence, time to consume the surrogates, time to complete the tasks, and a suite of affective measures. Qualitative comments were also used to enrich the interpretation of the results.

As shown in Table 2, the results demonstrate that the combined surrogates are more effective than both of the individual surrogates and strongly preferred, without penalizing efficiency. Nevertheless, spoken descriptions alone were found to be almost as good as the combined surrogate, and much better than visual storyboards alone, for video gisting.

Table 2: Result summary for the 1st study.
(* For Surrogate Consumption Time and Task Completion Time, the shorter the time, the more efficient the surrogate.)

Note that for surrogate consumption time, there were very small differences between the audio-only and the combined conditions. Examination of the log data suggested that participants replayed the audio portion frequently because they found the audio “hard to understand”; it took participants longer to consume the combined condition than the visual-only condition not because they had two channels of information to integrate, but merely because the audio part was played for a longer time. The small time differences between the audio-only and the combined conditions suggest that people are able to integrate two distinct surrogates that use different sensory channels even when they are not temporally coordinated at all. Thus, the expectation that the temporal coordination between visual and audio channels desired for primary information objects is also required for visual and audio surrogates was not borne out.

The open-ended comments from the participants reinforced the quantitative results. Thirty-one of the 36 participants selected the combined condition as their favorite of the three surrogates. As one participant noted, “with the two together, the surrogate is more efficient, and understanding the surrogates becomes simpler than when they are apart.” The comments also reconfirmed the power of words in carrying the semantic information in the videos. One participant noted: “Even though a picture is worth a thousand words, a few selected pictures cannot explain the deeper meaning of the subject – audio connected the dots.” Furthermore, participants also commented on the value the visual surrogates added, stating that “the storyboard was fun and engaging” and that the “images... helping me be more focused”.

This study has important implications for the design of video retrieval and video library user interfaces. The results recommend incorporating multi-modal surrogates into video retrieval user interfaces, and suggest that audio-alone surrogates may have great value for video retrieval, especially in small-display interfaces. However, it is worth noting that one third of the participants complained that the audio was hard to understand, and one third complained that the keyframes in the storyboards were too small to see. Therefore, another design implication from the study is that the audio and visual quality of the surrogates is important: the audio should be clearly articulated and the visuals should be easily viewable.

2.2 The Second Study

In a follow-up study (Marchionini et al., 2009) done in 2007, we compared several types of non-textual surrogates (fast forwards, spoken descriptions, and spoken keywords) alone and in combination (see Table 3).

Table 3: Five surrogate conditions for the 2nd study

Note that in the combined conditions, the audio and the visual start playing at the same time but may end at different time points. When replayed, the fast forward and the spoken description both start playing from the beginning.

This study investigated the effectiveness of manually and automatically generated spoken descriptions and keywords on six video gisting and recognition tasks. The automatically generated descriptions and keywords were created using the text analysis tool that is part of IBM’s MAGIC system (Li et al., 2005).

Note that Task 3 (the title selection task) from the 1st study was eliminated from the 2nd study because no performance differences were found among the three surrogate conditions on this task. Two new tasks (Tasks 4 and 5 below) were designed and added to the remaining four tasks from the 1st study, in order to investigate how people articulate gist from different multimedia evidence. Details of the two new tasks are available in Table 9 in the Appendix.

  • Task 1: Gist written task
  • Task 2: Keyword recognition task
  • Task 3: Keyframe recognition task
  • Task 4: Visual excerpt selection task
  • Task 5: Audio excerpt selection task
  • Task 6: Verbal gist recognition task

We had four specific research questions related to spoken surrogates for video retrieval purposes:

  1. Do automatically generated description and keyword surrogates approach the effectiveness of manually generated ones?
  2. How do spoken descriptions and spoken keywords compare for gisting tasks?
  3. How do fast forward surrogates compare with the spoken surrogates?
  4. What are the effects of combining fast forwards and spoken surrogates?

For question 4, the 1st user study (Song and Marchionini, 2007) suggested that combining two surrogate media channels must be done carefully. “On one hand, there is a possibility of interference and on the other there is the possibility of leveraging the independent perceptual channels to gather more gisting evidence in the same amount of time. We expected that the possibility of higher cognitive load would rule, and the participants would not like the combined condition, though task accuracy might be higher using the combined surrogates than using the individual surrogates” (Marchionini et al., 2009).

Table 4 summarizes some of the experimental results from the 2nd user study (for more detail, please refer to Marchionini et al., 2009). The results demonstrate that manually generated spoken descriptions are reliably better than manually generated spoken keywords and fast forwards for video gisting, whereas automatically generated spoken descriptions and keywords, either alone or combined with fast forwards, are inferior to fast forwards alone. Furthermore, manually generated spoken descriptions led to gisting almost as good as that of the spoken descriptions combined with the fast forwards. Participants also commented that the combined surrogates were effective but sometimes annoying, because the audio and visual were not synchronized. The study recommends incorporating spoken description surrogates of good summarizing quality into video retrieval systems.

Table 4: Result summary for the 2nd study.
(* For Task Completion Time, the shorter the time, the more efficient the surrogate.)

Note that in both user studies, the audio and visual channels of the combined surrogates were not temporally coordinated, and participants had mixed feelings about the unsynchronized combined surrogates. In the first study, because the storyboard is a static representation of the video, augmenting it with a constantly changing audio surrogate, such as the spoken narration of the description, did not make the combination very distracting for users; in other words, the audio and visual were not terribly out of sync. Therefore, a storyboard with a spoken description can be a good example of a multi-modal surrogate whose audio and visual channels are not synchronized. In the second study, however, because the fast forward is itself rapidly and constantly changing, playing it together with the spoken descriptions or spoken keywords put the two pieces completely out of sync, and the combined surrogates became quite distracting for users. The majority of the participants reported that they could not focus on both channels simultaneously; often, they had to close their eyes when listening to the spoken description, and take off their headphones when viewing the fast forward. Thus, a fast forward with spoken descriptions may be a poor example of an unsynchronized multi-modal surrogate for videos.

These two studies both confirm the effectiveness of multi-modal surrogates for video retrieval and sensemaking, but they seem to support different conclusions about the necessity of synchronized surrogate channels. Therefore, a third study has been designed as a follow-up to investigate the synchronization issues more carefully. The third study is discussed in the next section.

3. Method

With the synchronization question in mind, I propose to conduct a usability study to evaluate several multi-modal surrogates for digital videos in relation to their usefulness and usability in accomplishing certain gisting and retrieval tasks.

In this study, selected audio surrogates will be combined with selected visual surrogates, either carefully coordinated or not. When the audio and visual channels are not carefully coordinated, the most salient samples can be taken from the two channels separately, and the most salient samples across the channels often do not occur at the same temporal points. Although users must then integrate the two channels in their heads at consumption time, which increases cognitive load, perhaps more sensemaking is possible with this user-centered integration than with pre-coordination at indexing time.

3.1 Surrogates

Table 5 lists the methods of extracting excerpts from the audio and visual channels to be investigated in the next study. The audio extracts and the visual extracts will be combined to create multi-modal surrogates. Therefore, all of the surrogates investigated in this study will be multi-modal surrogates with both audio and visual channels.

Table 5: Audio and Visual channels to be used in the next study.

Table 6 lists the multi-modal surrogate conditions to be examined in this study, where A denotes audio and V denotes visual. In particular, we will examine two perfectly temporally coordinated surrogate conditions and two uncoordinated surrogate conditions.

Note that we are eliminating two uncoordinated conditions. The condition [Sub-sampling A + Storyboard (V)] is eliminated because the results in (Marchionini et al., 2009) predict that it will be outperformed by [Magic A + Storyboard (V)]. The condition [Sub-sampling A + Magic V] does not make much sense because the Magic summaries are extracted using textual information, i.e., the transcript, not the visual attributes.

Table 6: Multi-modal surrogate conditions in the next study.

The two uncoordinated multi-modal surrogates are examples in which the most salient samples are independently extracted from the audio channel and the visual channel. Ideally, because the most salient information carried in the audio channel may differ from the most salient information carried in the visual channel, combining the two channels may provide nearly twice as much information about the videos. However, given the limited effectiveness of the storyboard and the lack of intelligence of the systematic sub-sampling scheme on the visual channel, the information carried in the visual channels of the uncoordinated surrogates may be insufficient compared with the visual information of the surrogate condition [Magic A+V], where at least some human intelligence is employed in extracting the text descriptions. To make sure that we do not arrive at wrong conclusions about synchronization and independent sampling because of this uneven comparison, in addition to the above four surrogate conditions we will also include one gold standard condition of perfectly coordinated multi-modal surrogates and one gold standard condition of uncoordinated multi-modal surrogates (see Table 7).
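For concreteness, the systematic sub-sampling scheme referred to above can be as simple as taking evenly spaced excerpts regardless of content; the following minimal sketch (durations and excerpt counts are hypothetical) shows why such a scheme carries no intelligence about salience:

    # Minimal sketch of systematic sub-sampling: evenly spaced excerpts,
    # chosen with no knowledge of content salience. All parameter values
    # below are hypothetical.

    def systematic_subsample(duration_s, n_excerpts, excerpt_len_s):
        """Return (start, end) times, in seconds, of n evenly spaced excerpts."""
        stride = duration_s / n_excerpts
        return [(i * stride, min(i * stride + excerpt_len_s, duration_s))
                for i in range(n_excerpts)]

    # A 30-minute video sampled as eight 5-second excerpts:
    clips = systematic_subsample(duration_s=1800, n_excerpts=8, excerpt_len_s=5)
    # -> [(0.0, 5.0), (225.0, 230.0), (450.0, 455.0), ...]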

Table 7: Gold Standard Surrogate Conditions.

Given the space limitations of this paper, I will leave out the details of the study in which human participants manually create the gold standard surrogates. Interested readers can refer to my forthcoming dissertation.

3.2 Participants

Forty-eight participants will be recruited for the study by posting mass emails to our university-wide LISTSERV. Participants will be adult native English speakers with self-assessed adequate hearing and vision, who use computers daily and have at least occasional experience searching for videos on computers.

3.3 Test Videos

To evaluate and compare the above surrogate conditions, a set of 25 comparable videos will be selected as test videos for the study. We will use videos from the NASA Connect collection, which targets middle school science students. The lengths of the videos will be similar, around 30 minutes each. All 25 videos will have a common structural format and be at a similar conceptual level, on various middle school science topics with which university students and staff would typically have basic familiarity.

3.4 Tasks

The same six recognition and sensemaking tasks from my second user study (Marchionini et al., 2009) will be used in this study: the gist written task, keyword recognition task, keyframe recognition task, visual excerpt selection task, audio excerpt selection task, and verbal gist recognition task.

In addition to the above six tasks, a newly designed retrieval task (see Table 8) will be introduced in this study to test the efficacy of the surrogates on a task that more closely resembles real-life retrieval.

Table 8: The newly added retrieval task

Task 7: Retrieval from a Video Collection:
Based on the surrogate they have experienced, the participants will be asked to find the 4 most relevant videos (one of which should be the target video being gisted) on the Open Video Repository and list their titles.

3.5 Study Procedure

This study is designed to record both quantitative data and user opinions regarding the tasks. First, an online protocol system will be developed to administer the study, collect data, and manage the counterbalancing of the surrogate condition orderings. The study participants will be run in groups of 6 to 10 people, based on participant availability. The participants will be seated at alternating workstations in a computer laboratory with enough identical workstations that participants cannot see each other’s screens. Headphones will be provided, and the participants will be asked to wear them during the study.
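One common way to manage such counterbalancing is a balanced Latin square; the sketch below is a hypothetical illustration (the condition labels are placeholders, and the actual protocol system may assign orderings differently):

    # Sketch: balanced Latin square for counterbalancing the order in which
    # participants experience the surrogate conditions. Labels are placeholders.

    def balanced_latin_square(conditions):
        """Each condition appears in each serial position equally often, and
        each condition immediately precedes every other condition exactly once
        (this balance holds when the number of conditions is even)."""
        n = len(conditions)
        square = []
        for row in range(n):
            order, fwd, back = [], 0, 0
            for i in range(n):
                if i % 2 == 0:
                    order.append(conditions[(row + fwd) % n])
                    fwd += 1
                else:
                    order.append(conditions[(row + n - back - 1) % n])
                    back += 1
            square.append(order)
        return square

    # Six conditions yield six orderings; with 48 participants, each ordering
    # is used by exactly eight participants.
    orderings = balanced_latin_square(["C1", "C2", "C3", "C4", "C5", "C6"])
    for participant_id in range(48):
        assigned_order = orderings[participant_id % len(orderings)]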

The investigator will brief participants on the study procedure, and participants will be asked to read and sign the consent forms. The participants will then fill out a pre-session questionnaire about themselves and their computer and video experience.

During the study, each participant will interact with several browser-like interfaces presenting different surrogate conditions on a computer. The ordering of the surrogate conditions will be counterbalanced across participants. Using one interface at a time, the participant will complete a set of tasks designed for selected video segments. Upon completing the tasks, the participant will fill out a short questionnaire about the experience. The other interfaces will then be presented one by one, and the same process will be repeated with different sets of video segments.

The first video a participant experiences in each surrogate condition will be treated as a training video segment and discarded from the data analysis to reduce the first-object effect. During the test, the investigator will be available to give instructions on the test procedure and to answer any procedural questions. After completing all the tasks and the final short survey, participants will be thanked and given $20 for their participation. The investigator will analyze the data once the required number of participants has successfully completed the study.

4. Conclusions and Contributions

This paper summarizes a series of usability studies on audio surrogates and multi-modal surrogates for making sense of videos from a digital video library. The usability of digital video libraries and of video retrieval and browsing systems may be improved by the careful design of video surrogates for quickly deriving the gist and relevance of the available content. These studies pave the way toward answering some of the research questions proposed earlier, though new questions may also arise along the way. By running these studies and carefully analyzing the results, I hope to obtain a fuller picture of practical video surrogates, especially audio surrogates and multi-modal surrogates, in relation to their usefulness and usability in video-related applications such as video retrieval, browsing, and relevance judgment.

We have shown that audio surrogates such as spoken descriptions, even when created by a text-to-speech synthesizer, are effective surrogates for video gisting tasks. Spoken descriptions of high summarizing quality are more effective than spoken keywords, and participants have higher confidence in their task responses when given these spoken descriptions as video surrogates.

My studies and their results have design implications for video retrieval user interfaces. The results recommend using combined surrogates rather than single-channel surrogates. They also recommend adding spoken descriptions as audio surrogates to video retrieval systems, alone if not in combination with visual surrogates. This seems especially appropriate for small-display devices, where screen real estate is minimal. Audio has great potential for mobile devices, given their limited screen real estate, storage, and bandwidth, with earphones and earpieces (wireless or wired) allowing private listening and control. Furthermore, we recommend the use of spoken descriptions of good summarizing quality as video summaries. When high-quality descriptions are not available, system designers should consider using fast forwards or automatically generated keywords instead. In addition, designers must pay attention to visual and audio quality when creating the surrogates to make them usable: visual surrogates must be easily viewable without consuming the entire screen, and audio must be clearly articulated and easy to replay.

Finally, investigations of the usability of, and satisfaction with, unsynchronized media in surrogates are both novel and important. From my third usability study, we expect to draw conclusions on how necessary synchronization between the surrogate channels is, and on whether unsynchronized sampling of the individual channels provides more useful information than synchronized sampling. The new gisting and retrieval tasks introduced in these studies offer new approaches to evaluating retrieval techniques and may be useful to other researchers.

References

[1] Cole, M., Frankel, F., and Sharp, D. (1971). Development of free recall learning in children. Developmental Psychology. 4:109–123.
 
[2] Dilley, M. and Paivio, A. (1968). Pictures and words as stimulus and response items in paired-associate learning of young children. Journal of Experimental Child Psychology. 6:231–240.
 
[3] Ding, W., Marchionini, G., and Soergel, D. (1999). Multi-modal surrogates for video browsing. In DL ’99: Proceedings of the fourth ACM conference on Digital libraries, pages 85–93, New York, NY, USA. ACM.
 
[4] He, L., Sanocki, E., Gupta, A., and Grudin, J. (2000). Comparing presentation summaries: Slides vs. reading vs. listening. In Proc. CHI ’00.
 
[5] Li, Y., Dorai, C., and Farrell, R. (2005). Creating magic: system for generating learning object metadata for instructional content. In MULTIMEDIA ’05: Proceedings of the 13th annual ACM international conference on Multimedia, pages 367–370, New York, NY, USA. ACM.
 
[6] Lyman, B. and McDaniel, M. (1990). Memory for odors and odor names: Modalities of elaboration and imagery. Journal of Experimental Psychology: Learning, Memory, and Cognition. 16:656–664.
 
[7] Marchionini, G., Song, Y., and Farrell, R. (2009). Multimedia surrogates for video retrieval: Toward combining spoken words and imagery. Information Processing & Management. http://dx.doi.org/10.1016/j.ipm.2009.05.007
 
[8] Over, P., Kraaij, W., and Smeaton, A. (2005). TRECVID 2005: An introduction. In Proc. TRECVID 2005, pages 1–14. http://www.cdvp.dcu.ie/Papers/TRECVid2005_Overview.pdf
 
[9] Paivio, A. (1972). Symbolic and sensory modalities of memory. In M.E. Meyer (Ed.), The Third Western symposium on learning: Cognitive learning. Bellingham, WA: Western Washington State College.
 
[10] Paivio, A. (1986). Mental Representations: A Dual Coding Approach. Oxford: Oxford University Press.

[11] Paivio, A. (2006). Mind and Its Evolution: A Dual Coding Theoretical Approach. Lawrence Erlbaum.
 
[12] Smeaton, A. F., Over, P., and Kraaij, W. (2004). TRECVID: Evaluating the effectiveness of information retrieval tasks on digital video. In MULTIMEDIA ’04: Proceedings of the 12th annual ACM international conference on Multimedia, pages 652–655, New York, NY, USA. ACM.
 
[13] Song, Y. and Marchionini, G. (2007). Effects of audio and visual surrogates for making sense of digital video. In CHI ’07: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 867–876, New York, NY, USA. ACM.
 
[14] Thompson, V. and Paivio, A. (1994). Memory for pictures and sounds: Independence of auditory and visual codes. Canadian Journal of Experimental Psychology/Revue canadienne de psychologie expérimentale. 48:380–398.
 
[15] Wildemuth, B. M., Marchionini, G., Wilkens, T., Yang, M., Geisler, G., Fowler, B., Hughes, A., and Mu, X. (2002). Alternative surrogates for video objects in a digital library: Users’ perspectives on their relative usability. In ECDL ’02: Proceedings of the 6th European Conference on Research and Advanced Technology for Digital Libraries, pages 493–507, London, UK. Springer-Verlag.
 
[16] Wildemuth, B. M., Marchionini, G., Yang, M., Geisler, G., Wilkens, T., Hughes, A., and Gruss, R. (2003). How fast is too fast?: Evaluating fast forward surrogates for digital video. In JCDL ’03: Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries, pages 221–230, Washington, DC, USA. IEEE Computer Society.
 
[17] Yang, M., Wildemuth, B. M., Marchionini, G., Wilkens, T., Geisler, G., Hughes, A., Gruss, R., and Webster, C. (2003). Measures of user performance in video retrieval research. SILS technical report, University of North Carolina at Chapel Hill.
 

APPENDIX

Table 9: Gisting & Recognition tasks used in my studies

Written Gist Determination Task (open-ended):
Write a short summary of the video based on the surrogate they experience in the study.
Keyword Recognition Task:
Select keywords that are appropriate for the video from a set of words.
Title Selection Task (Eliminated from the 2nd and 3rd study):
Select the most appropriate title for the video segment that the surrogate represents.
Keyframe Recognition Task:
Select appropriate keyframes that they think come from the video from a set of keyframes.
Visual Excerpt Selection Task (New in the 2nd and 3rd study):
Select one visual excerpt (with no audio track) that they think comes from the video, from a set of four visual excerpts, based on the surrogate they have experienced.
Audio Excerpt Selection Task (New in the 2nd and 3rd study):
Select one audio excerpt (without visual features) that they think comes from the video, from a set of four audio excerpts, based on the surrogate they have experienced.
Verbal Gist Recognition Task:
Participants select the best description for the video from a set of four descriptions based on the surrogate they experience.
Retrieval from a Video Collection (New in the 3rd study):
Based on the surrogate they have experienced, the participants will be asked to find the 4 most relevant videos (one of which should be the target video being gisted) on the Open Video Repository and list their titles.