The 10th International Congress on Information and Communication Technology, held concurrently with the ICT Excellence Awards (ICICT 2025), will take place in London, United Kingdom, February 18–21, 2025.
Authors - Quang-Vinh Dang
Abstract - Product retrieval in e-commerce systems has traditionally relied on text-based matching between user queries and product descriptions. While recent advances have introduced image-based search capabilities leveraging deep learning techniques, existing systems typically operate in isolation, processing either textual or visual queries independently. However, contemporary user behavior increasingly demonstrates the need for multimodal search capabilities, particularly as smartphones enable users to seamlessly combine photographic content with textual descriptions in their product searches. This paper presents a novel multimodal retrieval-augmented generation (RAG) framework that unifies text and image inputs for enhanced product discovery. Our approach addresses the limitations of conventional single-modality systems by simultaneously processing and correlating both visual and textual features. By leveraging the complementary nature of these modalities, our system achieves more nuanced and contextually aware product matching. Experimental results demonstrate that our multimodal RAG framework significantly improves search accuracy and relevance compared to traditional single-modality approaches. Furthermore, user studies indicate enhanced satisfaction and reduced search friction, suggesting meaningful improvements to the e-commerce user experience. Our findings contribute to the growing body of research on multimodal information retrieval and offer practical insights for implementing more sophisticated product search systems in commercial applications.
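The core idea of combining visual and textual features for retrieval can be illustrated with a minimal late-fusion sketch. This is not the paper's implementation: the actual encoders are not specified in the abstract, so toy hash-based "embeddings" stand in for real text and image encoders, and the image side is assumed to have already been converted to tags by an upstream visual model. All function names and the catalog are illustrative.

```python
import math
import zlib

DIM = 16  # toy embedding dimensionality


def _bucket(s: str) -> int:
    # Stable hash (crc32) so results are reproducible across runs.
    return zlib.crc32(s.encode("utf-8")) % DIM


def embed_text(text: str) -> list[float]:
    """Toy text encoder: hash character trigrams into a fixed-size vector.
    Stands in for whatever text encoder the framework actually uses."""
    vec = [0.0] * DIM
    for i in range(len(text) - 2):
        vec[_bucket(text[i:i + 3])] += 1.0
    return vec


def embed_image_tags(tags: list[str]) -> list[float]:
    """Toy visual encoder: assumes an upstream model already described the
    image as tags; hashes each tag into the same vector space."""
    vec = [0.0] * DIM
    for tag in tags:
        vec[_bucket(tag)] += 1.0
    return vec


def _normalize(v: list[float]) -> list[float]:
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def fuse(text_vec: list[float], image_vec: list[float],
         alpha: float = 0.5) -> list[float]:
    """Late fusion: weighted sum of the L2-normalized modality vectors."""
    t, i = _normalize(text_vec), _normalize(image_vec)
    return [alpha * a + (1 - alpha) * b for a, b in zip(t, i)]


def cosine(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(_normalize(u), _normalize(v)))


def retrieve(catalog, query_text, query_tags, top_k=1):
    """Rank catalog items by cosine similarity of fused query vs. fused item."""
    q = fuse(embed_text(query_text), embed_image_tags(query_tags))
    scored = [(cosine(q, fuse(embed_text(desc), embed_image_tags(tags))), desc)
              for desc, tags in catalog]
    return [desc for _, desc in sorted(scored, reverse=True)[:top_k]]


# Hypothetical catalog: (description, tags from a visual encoder).
catalog = [
    ("red leather handbag", ["red", "leather", "bag"]),
    ("blue denim jacket", ["blue", "denim", "jacket"]),
    ("black running shoes", ["black", "shoe", "sport"]),
]

print(retrieve(catalog, "red handbag", ["red", "leather"]))
```

The fusion weight `alpha` controls how much the textual query dominates the visual one; a production system would instead learn a joint embedding space (e.g., via a contrastively trained vision-language model) rather than hashing, but the retrieval-by-similarity structure is the same.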