The 10th International Congress on Information and Communication Technology, held concurrently with the ICT Excellence Awards (ICICT 2025), will take place in London, United Kingdom | February 18-21, 2025.
Authors - Francisco Seipel-Soubrier, Jonathan Cyriax Brast, Eicke Godehardt, Jorg Schafer
Abstract - We propose a proof-of-concept architecture for automated video summarization and evaluate its performance, addressing the challenges posed by the increasing prevalence of video content. The research focuses on a multi-modal approach that integrates audio and visual analysis techniques to generate comprehensive video descriptions. Evaluation of the system across various video genres revealed that while video-based large language models improve on image-only models, they still struggle to capture nuanced visual narratives, producing generalized output for videos without a strong speech-based narrative. The multi-modal approach generated useful short summaries for most video types, but for speech-heavy videos in particular it offers minimal advantages over speech-only processing. The generation of textual alternatives and descriptive transcripts showed promise, although it is currently stable primarily for speech-heavy videos. Further investigation into refinement techniques and potential advances in video-based large language models may improve performance.