MotoSum: A Video Summarization Experiment

By Luis Vitor Zerkowski, Dr. Simon Hecker, Prof. Dr. Flavio Soares, and Prof. Dr. Luc Van Gool

Files and Presentations

Proposal

In a world as connected and data-rich as ours, we are exposed to an enormous amount of content all the time. In this context, the ability to summarize information is extremely important: it lets information reach us more directly, and its consumption, in summarized form, is more meaningful and less time-consuming. This project addresses a sub-domain of this problem, aiming to develop a video summarization pipeline for the motorcycle environment. Several different networks are studied, trained on the MotoSum benchmark, and then combined to build a final summarization network. The agent incorporates both objective and subjective analyses of the interesting parts of the video to compile the key segments into a summary. For objectivity, the main criteria used to evaluate the results are diversity, representativeness, and image quality assessment. The criterion for subjectivity, on the other hand, is the similarity in feature space between the summary and a text query written by the author of the video. The technical aspects of this work concern not only the networks themselves but also their optimization, since the project was developed to be deployed in an app. Finally, a simpler filtering model based on the speed and roll angle of the motorcycle is developed to give the user more control over the content of the final summary.
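As a rough illustration of the subjective criterion, similarity between video segments and the text query can be computed as cosine similarity in a shared feature space. The sketch below assumes segment and query embeddings are already available as vectors; the embedding networks themselves are not shown, and the function names are illustrative, not the project's actual API:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_segments_by_query(segment_embeddings, query_embedding):
    """Return segment indices ordered from most to least similar
    to the text query, in the shared feature space."""
    scores = [cosine_similarity(e, query_embedding) for e in segment_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy 2-D embeddings: segment 0 matches the query exactly, segment 2 partially
segments = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
query = np.array([1.0, 0.0])
print(rank_segments_by_query(segments, query))  # [0, 2, 1]
```

The highest-ranked segments could then be kept for the summary, subject to the duration budget.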

Formal Definition and Restrictions

Even with the proposal made explicit, the very nature of the video summarization task leaves several concepts open and raises questions about what the project really is. To resolve some of these questions, the following definitions are posed:

  1. A video V is a sequence of frames allocated one after the other, preferably in an order that makes semantic-cognitive sense to a viewer watching it.
  2. A summary of a video V is another video V' composed solely and exclusively of segments from V, allocated so as not to disrespect the temporal hierarchy of V: if sequence X comes before sequence Y in V, then sequence Y cannot come before sequence X in V'.
  3. Segments are subsets of the frames of V, temporally allocated so as not to disrupt the original time allocation: if frame X comes before frame Y in V, a segment using both frames must keep the order X and then Y.
  4. What counts as "significantly reducing" the duration is relative. For this project, assuming a video V of duration L, a summary V' is considered valid in terms of duration if its duration L' is less than or equal to 15% of L.
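Definitions 2-4 can be condensed into a small validity check. The sketch below is only an illustration of the constraints, assuming a summary is represented as a list of (start, end) frame-index segments of V; this representation is hypothetical, not the project's actual data structure:

```python
def is_valid_summary(segments, video_num_frames, max_ratio=0.15):
    """Check definitions 2-4: segments must lie inside V, must not overlap,
    must respect the temporal order of V, and the total duration must be
    at most max_ratio (15%) of V."""
    prev_end = -1
    total_frames = 0
    for start, end in segments:  # inclusive frame indices into V
        if not (0 <= start <= end < video_num_frames):
            return False  # segment falls outside the original video
        if start <= prev_end:
            return False  # temporal order of V violated (or overlap)
        total_frames += end - start + 1
        prev_end = end
    return total_frames <= max_ratio * video_num_frames

# A 1000-frame video: two ordered segments totalling 120 frames (12%) are valid
print(is_valid_summary([(100, 159), (400, 459)], 1000))  # True
print(is_valid_summary([(400, 459), (100, 159)], 1000))  # False: order violated
```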

Even though the definitions help to better systematize the problem, the main challenge of the field remains: summarizing a video in the best possible way. Understanding why this is so difficult, however, is quite simple: the optimality of a summary is subjective. Handing a video V to 100 different people and asking each of them to manually build the best summary of V is an experiment that can easily result in almost 100 different summaries.

Schedule

TCC Schedule - 27 hours a week on average

Sep/2021
Creation of the project proposal

Initial research on video summarization:
  • Unsupervised, supervised, and adversarial learning
  • Temporal segmentation techniques
Phase: Proposal and Studies.

Oct/2021
Implementation of the first models:
  • Uniform video sampling
  • The DSNet network with GoogLeNet as the feature extraction network

Study of techniques and implementations of the DSNet network

Study of the ERFNet semantic segmentation network and tests replacing GoogLeNet with it for the feature extraction task

Building the video summarization pipeline:
  • Virtual machine setup on Azure
  • Inference library setup
Phase: Prototyping, Studies, and Deployment.

Nov/2021
Study and implementation of model evaluation techniques:
  • Label-based
  • Using agents previously trained on other tasks
  • Image quality
Evaluation of the results:
  • Qualitative analysis of the scores obtained for the summaries
  • Comparison of the uniform sampling model with the DSNet+GoogLeNet network
Pipeline optimization tests:
  • Inference optimization with TensorRT
  • Video reading and pre-processing with Nvidia DALI
Phase: Studies, Evaluation, and Optimization.

Dec/2021
Evaluation of the results:
  • Comparison of the DSNet+GoogLeNet network with the DSNet+ERFNet network
Complementary activities:
  • DCNv2 network optimization with TensorRT
  • vis4d network optimization with TensorRT
  • Study and development of a Docker image for the vis4d inference environment
  • Study of object detection network deployment on an Nvidia Jetson
Phase: Evaluation and Complementary Activities.

Jan/2022
Review of everything done in the project so far:
  • Revisit steps
  • Study content gaps
Writing and formalizing the proposal:
  • Draft of the first version of the Course Completion Thesis (TCC)
Phase: Review and Documentation.

Feb/2022
Writing and formalizing the proposal:
  • Continuing the draft of the first version of the TCC
Review and restructuring of the company's database:
  • Formulation of the cloud data pipeline
  • Design of the resources to be used
  • Implementation of query scripts
Development of the MotoSum database:
  • Video organization and compilation
  • Development of a data annotation tool for individual use and for use on AMT
Phase: Documentation, Complementary Activities, and Prototyping.

Mar/2022
Writing and formalizing the proposal:
  • Continuing the draft of the first version of the TCC
Benchmark:
  • Finalizing the implementation of the different agent evaluation techniques
  • Comparison between the scores of the baseline model and the first learning-based agent
Development of the MotoSum database:
  • Leadership of the annotation team
  • Annotation of the videos
  • Evaluation of annotation quality and human consistency
First version of the semantic embedding space
Phase: Documentation, Evaluation, Management, and Prototyping.

Apr/2022
Development of the first version of the TCC website

Development of the MotoSum database:
  • Leadership of the annotation team
  • Evaluation of annotation quality and human consistency
Development of the textual query tool for summarization:
  • Second version of the semantic embedding space
  • Implementation of video filtering based on a textual query
First DSNet retraining with the partially annotated MotoSum

Feature filter development
Phase: Documentation, Management, and Prototyping.

May/2022
Third version of the semantic embedding space

DSNet retraining with the fully annotated MotoSum

DSNet training with ERFNet as the feature extractor
Phase: Prototyping.

Jun/2022
Benchmark:
  • Calculating all scores for all models
  • Discussion and documentation of the results
Final adjustments to integrate the models into the app pipeline
Phase: Evaluation, Documentation, and Deployment.

Jul/2022
Finalizing the TCC report

Finalizing the TCC website

Preparing the TCC presentation

Organizing the project repository
Phase: Documentation.