By Luis Vitor Zerkowski, Dr. Simon Hecker, Prof. Dr. Flavio Soares and Prof. Dr. Luc Van Gool
In a world as connected and data-rich as ours, we are constantly exposed to content. The ability to summarize information is therefore essential: it lets information reach us more objectively, and its consumption, in summarized form, becomes more meaningful and less time-consuming. This project addresses a sub-domain of this problem, aiming to develop a video summarization pipeline for the motorcycle environment. Several different networks are studied, trained on the MotoSum benchmark, and then combined to build a final summarization network. The agent incorporates both objective and subjective analyses of the interesting parts of a video to compile the key segments into a summary. For objectivity, the main criteria used to evaluate the results are diversity, representativeness, and image quality assessment. The criterion for subjectivity, in turn, is the similarity in feature space between the summary and a text query written by the author of the video. The technical aspects of this paper concern not only the networks themselves but also their optimization, since the project was developed to be deployed in an app. Finally, a simpler filtering model based on the speed and roll angle of the motorcycle is developed to give the user more control over the content of the final summary.
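The objective criteria above (diversity, representativeness) and the subjective one (feature-space similarity to a text query) can be illustrated with a minimal sketch. This is not the project's actual scoring code; the function names and the additive combination at the end are illustrative assumptions, and the frame features and query embedding are random toy data standing in for real extracted features.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale rows to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def diversity_score(features):
    """Mean pairwise dissimilarity (1 - cosine) among selected summary frames."""
    f = l2_normalize(features)
    sim = f @ f.T
    off_diag = sim[~np.eye(len(f), dtype=bool)]
    return 1.0 - off_diag.mean()

def representativeness_score(summary_feats, video_feats):
    """Mean similarity of each video frame to its nearest summary frame."""
    s = l2_normalize(summary_feats)
    v = l2_normalize(video_feats)
    return (v @ s.T).max(axis=1).mean()

def query_similarity(summary_feats, query_embedding):
    """Subjective term: cosine similarity between the pooled summary
    representation and a text-query embedding (assumed to share the space)."""
    s = l2_normalize(summary_feats).mean(axis=0)
    q = query_embedding / np.linalg.norm(query_embedding)
    return float(s @ q / np.linalg.norm(s))

# Toy data: 20 "frames" with 8-dim features, summary = every 5th frame.
rng = np.random.default_rng(0)
video = rng.normal(size=(20, 8))
summary = video[::5]
query = rng.normal(size=8)
score = (diversity_score(summary)
         + representativeness_score(summary, video)
         + query_similarity(summary, query))
```

In practice the three terms would be weighted rather than simply summed, and the features would come from a trained extractor rather than random draws.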
Even though the project proposal is made explicit, the very nature of the video summarization task leaves several concepts open and raises questions about what the project really is. To resolve some of these questions, the following definitions are posed:
Even though the definitions help to better systematize the problem, the main challenge of the field remains: summarizing a video in the best possible way. Understanding why this is so difficult, however, is quite simple: the optimality of a summary is inherently subjective. Handing a video V to 100 different people and asking each of them to manually build the best summary of V is an experiment that can easily result in almost 100 different summaries.
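The disagreement in the thought experiment above can be quantified. A common device in video summarization evaluation is the F1 score between binary frame-selection vectors; averaging it over all annotator pairs gives a rough agreement measure, where low values indicate high subjectivity. The sketch below is illustrative, not taken from the project, and the random "annotators" are toy stand-ins for real human selections.

```python
import numpy as np
from itertools import combinations

def f_score(a, b):
    """F1 between two binary frame-selection vectors of equal length."""
    overlap = np.logical_and(a, b).sum()
    if overlap == 0:
        return 0.0
    precision = overlap / a.sum()
    recall = overlap / b.sum()
    return 2 * precision * recall / (precision + recall)

def mean_pairwise_agreement(selections):
    """Average F1 over all annotator pairs; near 0 means the annotators
    chose almost entirely different summaries."""
    return float(np.mean([f_score(a, b)
                          for a, b in combinations(selections, 2)]))

# Toy data: 3 "annotators" each selecting ~30% of a 10-frame video.
rng = np.random.default_rng(1)
annotators = rng.random((3, 10)) < 0.3
agreement = mean_pairwise_agreement(annotators)
```

With 100 annotators, as in the experiment described, this statistic would make the spread of "best" summaries concrete.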
Month | Activities | Phase |
---|---|---|
Sep/2021 | Creation of the project proposal; initial research on video summarization | Proposal and Studies. |
Oct/2021 | Implementation of first models; study of the ERFNet semantic segmentation network and tests replacing the GoogLeNet network for the feature extraction task; building the video summarization pipeline | Prototyping, Studies, and Deployment. |
Nov/2021 | Study and implementation of model evaluation techniques | Studies, Evaluation and Optimization. |
Dec/2021 | Evaluation of the results | Evaluation and Complementary Activities. |
Jan/2022 | Review of everything that had been done in the project | Review and Documentation. |
Feb/2022 | Writing and formalizing the proposal | Documentation, Complementary Activities and Prototyping. |
Mar/2022 | Writing and formalizing the proposal | Documentation, Evaluation, Management and Prototyping. |
Apr/2022 | Development of the first version of the TCC website; MotoSum database development; feature filter development | Documentation, Management and Prototyping. |
May/2022 | Third version of the semantic embedding space; DSNet retraining with the fully annotated MotoSum; DSNet training with ERFNet as feature extractor | Prototyping. |
Jun/2022 | Benchmark | Evaluation, Documentation and Deployment. |
Jul/2022 | Finalizing the TCC report; finalizing the TCC website; preparing the TCC presentation; organizing the project repository | Documentation. |