Paper Information - Report of 2017 NSF Workshop on Multimedia Challenges, Opportunities and Research Roadmaps


With transformative technologies and a rapidly changing global R&D landscape, the multimedia and multimodal (MM) community now faces many new opportunities and uncertainties. Open-source dissemination platforms and pervasive computing resources allow new research results to be produced at an unprecedented pace. In addition, the rapid exchange and influence of ideas across traditional disciplinary boundaries have made MM research even more important than before. To seize these opportunities and respond to the challenges, we organized a workshop to identify and brainstorm the challenges, opportunities, and research roadmaps for MM research. The two-day workshop, held on March 30 and 31, 2017 in Washington, DC, was sponsored by the Information and Intelligent Systems Division of the National Science Foundation of the United States. Twenty-three invited participants were asked to review and identify the research areas in the MM field that will be most important over the next 10 to 15 years. Important topics were selected through discussion and consensus, and then discussed in depth in breakout groups. The breakout groups reported their initial results to the whole group, which continued with further extensive deliberation. For each identified topic, a summary was produced after the workshop to describe the main findings, including the state of the art, challenges, and research roadmaps planned for the next 5, 10, and 15 years in that area.
