MotoSum: A Video Summarization Experiment

By Luis Vitor Zerkowski, Dr. Simon Hecker, Prof. Dr. Flavio Soares, and Prof. Dr. Luc Van Gool

Files and Presentations

Proposal

In a world as connected and data-rich as ours, we are exposed to an enormous amount of content all the time. In this context, the ability to summarize information is extremely important: it lets information reach us more directly, and its consumption, in summarized form, is more meaningful and less time-consuming. This project addresses a sub-domain of this problem, aiming to develop a video summarization pipeline for the motorcycle environment. Several different networks are studied, trained on the MotoSum benchmark, and then combined to build a final summarization network. The agent incorporates both objective and subjective analyses of the interesting parts of the video to compile the key segments into a summary. For objectivity, the main criteria used to evaluate the results are diversity, representativeness, and image quality assessment. The criterion for subjectivity, on the other hand, is the similarity in feature space between the summary and a text query written by the author of the video. The technical aspects of this work concern not only the networks themselves but also their optimization, since the project was developed to be deployed in an app. Finally, a simpler filtering model based on the speed and roll angle of the motorcycle is developed to give the user more control over the content of the final summary.
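As a rough illustration of the subjective criterion, similarity between video segments and the text query can be computed as cosine similarity in a shared feature space. The sketch below assumes segment and query embeddings are already available as vectors; the embedding networks themselves are not shown, and the function names are illustrative, not the project's actual API:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_segments_by_query(segment_embeddings, query_embedding):
    """Return segment indices ordered from most to least similar
    to the text query, in the shared feature space."""
    scores = [cosine_similarity(e, query_embedding) for e in segment_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy 2-D embeddings: segment 0 matches the query exactly, segment 2 partially
segments = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
query = np.array([1.0, 0.0])
print(rank_segments_by_query(segments, query))  # [0, 2, 1]
```

The highest-ranked segments could then be kept for the summary, subject to the duration budget.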

Formal Definition and Restrictions

Even with the proposal made explicit, the very nature of the video summarization task leaves several concepts open and raises questions about what the project really is. To resolve some of these questions, the following definitions are posed:

  1. A video V is a sequence of frames allocated one after the other, preferably in an order that makes semantic-cognitive sense to a viewer watching it.
  2. A summary of a video V is another video V' composed solely and exclusively of segments from V, allocated so as not to disrespect the temporal hierarchy of V: if sequence X comes before sequence Y in V, then sequence Y cannot come before sequence X in V'.
  3. Segments are subsets of the frames of V, temporally allocated so as not to disrupt the original time allocation: if frame X comes before frame Y in V, a segment using both frames must keep the order X and then Y.
  4. What counts as "significantly reducing" the duration is relative. For this project, assuming a video V of duration L, a summary V' is considered valid in terms of duration if its duration L' is less than or equal to 15% of L.
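Definitions 2-4 can be condensed into a small validity check. The sketch below is only an illustration of the constraints, assuming a summary is represented as a list of (start, end) frame-index segments of V; this representation is hypothetical, not the project's actual data structure:

```python
def is_valid_summary(segments, video_num_frames, max_ratio=0.15):
    """Check definitions 2-4: segments must lie inside V, must not overlap,
    must respect the temporal order of V, and the total duration must be
    at most max_ratio (15%) of V."""
    prev_end = -1
    total_frames = 0
    for start, end in segments:  # inclusive frame indices into V
        if not (0 <= start <= end < video_num_frames):
            return False  # segment falls outside the original video
        if start <= prev_end:
            return False  # temporal order of V violated (or overlap)
        total_frames += end - start + 1
        prev_end = end
    return total_frames <= max_ratio * video_num_frames

# A 1000-frame video: two ordered segments totalling 120 frames (12%) are valid
print(is_valid_summary([(100, 159), (400, 459)], 1000))  # True
print(is_valid_summary([(400, 459), (100, 159)], 1000))  # False: order violated
```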

Even though the definitions help to better systematize the problem, the main challenge of the field remains: summarizing a video in the best possible way. Understanding why this is so difficult, however, is quite simple: the optimality of a summary is subjective. Handing a video V to 100 different people and asking each of them to manually build the best summary of V is an experiment that can easily result in almost 100 different summaries.

Schedule

TCC Schedule - 27 hours a week on average

Sep/2021
Creation of the project proposal

Initial research on video summarization:
  • Unsupervised, supervised, and adversarial learning
  • Temporal segmentation techniques
Phase: Proposal and Studies.

Oct/2021
Implementation of the first models:
  • Uniform video sampling
  • The DSNet network with GoogLeNet as the feature extraction network

Study of techniques and implementations of the DSNet network

Study of the ERFNet semantic segmentation network and tests replacing GoogLeNet with it for the feature extraction task

Building the video summarization pipeline:
  • Virtual machine setup on Azure
  • Inference library setup
Phase: Prototyping, Studies, and Deployment.

Nov/2021
Study and implementation of model evaluation techniques:
  • Label-based
  • Using agents previously trained on other tasks
  • Image quality
Evaluation of the results:
  • Qualitative analysis of the scores obtained for the summaries
  • Comparison of the uniform sampling model with the DSNet+GoogLeNet network
Pipeline optimization tests:
  • Inference optimization with TensorRT
  • Video reading and pre-processing with Nvidia DALI
Phase: Studies, Evaluation, and Optimization.

Dec/2021
Evaluation of the results:
  • Comparison of the DSNet+GoogLeNet network with the DSNet+ERFNet network
Complementary activities:
  • DCNv2 network optimization with TensorRT
  • vis4d network optimization with TensorRT
  • Study and development of a Docker image for the vis4d inference environment
  • Study of object detection network deployment on an Nvidia Jetson
Phase: Evaluation and Complementary Activities.

Jan/2022
Review of everything done in the project so far:
  • Revisit steps
  • Study content gaps
Writing and formalizing the proposal:
  • Draft of the first version of the Course Completion Thesis (TCC)
Phase: Review and Documentation.

Feb/2022
Writing and formalizing the proposal:
  • Continuing the draft of the first version of the TCC
Review and restructuring of the company's database:
  • Formulation of the cloud data pipeline
  • Design of the resources to be used
  • Implementation of query scripts
Development of the MotoSum database:
  • Video organization and compilation
  • Development of a data annotation tool for individual use and for use on AMT
Phase: Documentation, Complementary Activities, and Prototyping.

Mar/2022
Writing and formalizing the proposal:
  • Continuing the draft of the first version of the TCC
Benchmark:
  • Finalizing the implementation of the different agent evaluation techniques
  • Comparison between the scores of the baseline model and the first learning-based agent
Development of the MotoSum database:
  • Leadership of the annotation team
  • Annotation of the videos
  • Evaluation of annotation quality and human consistency
First version of the semantic embedding space
Phase: Documentation, Evaluation, Management, and Prototyping.

Apr/2022
Development of the first version of the TCC website

Development of the MotoSum database:
  • Leadership of the annotation team
  • Evaluation of annotation quality and human consistency
Development of the textual query tool for summarization:
  • Second version of the semantic embedding space
  • Implementation of video filtering based on a textual query
First DSNet retraining with the partially annotated MotoSum

Feature filter development
Phase: Documentation, Management, and Prototyping.

May/2022
Third version of the semantic embedding space

DSNet retraining with the fully annotated MotoSum

DSNet training with ERFNet as the feature extractor
Phase: Prototyping.

Jun/2022
Benchmark:
  • Calculating all scores for all models
  • Discussion and documentation of the results
Final adjustments to integrate the models into the app pipeline
Phase: Evaluation, Documentation, and Deployment.

Jul/2022
Finalizing the TCC report

Finalizing the TCC website

Preparing the TCC presentation

Organizing the project repository
Phase: Documentation.