Large language models (LLMs) have emerged as game changers in natural language processing and are quickly becoming part of our daily lives. The most famous example is ChatGPT, which by now needs little introduction: almost everyone knows about it, and many of us use it daily.
LLMs are characterized by their enormous size and their capacity to learn from vast amounts of textual data, which enables them to generate coherent, contextually relevant, human-like text. These models are built on deep learning architectures, such as GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers), which use attention mechanisms to capture long-range dependencies in language.
By leveraging pretraining on large-scale datasets and fine-tuning on specific tasks, LLMs have shown remarkable performance in various language-related tasks, including text generation, sentiment analysis, machine translation, and question answering. As LLMs continue to improve, they hold enormous potential to revolutionize natural language understanding and generation, bridging the gap between machines and human-like language processing.
Still, many researchers felt that LLMs limited to text input were not reaching their full potential and have worked to extend them beyond language. Several studies have successfully integrated LLMs with other input signals, such as images, videos, speech, and audio, to build powerful multimodal chatbots.
However, there is still a long way to go, as most of these models lack an understanding of the relationship between visual objects and other modalities. While visually enhanced LLMs can generate high-quality descriptions, they do so in a black-box fashion without explicitly relating the output to the visual context.
Establishing an explicit and informative correspondence between text and other modalities in multimodal LLMs can improve the user experience and enable a new set of applications for these models. Enter BuboGPT, which tackles this limitation.
BuboGPT is the first attempt to incorporate visual grounding into LLMs by connecting visual objects with other modalities. BuboGPT enables joint multimodal understanding and chat for text, vision, and sound by learning a shared representational space that matches well with pretrained LLMs.
Visual grounding is not an easy task, and it plays a crucial role in BuboGPT's pipeline. To achieve it, BuboGPT builds a pipeline around a self-attention mechanism that establishes fine-grained relationships between visual objects and other modalities.
The pipeline includes three modules: a tagging module, a grounding module, and an entity-matching module. The tagging module generates relevant text tags/labels for the input image, the grounding module locates semantic masks or boxes for each tag, and the entity-matching module uses LLM reasoning to retrieve the entities from the tags that match the generated description. By connecting visual objects and other modalities through language, BuboGPT deepens its understanding of multimodal input.
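The three-module flow can be sketched as follows. This is a toy illustration of the interfaces only, not the actual models: all function bodies here are hypothetical stand-ins (the real pipeline uses dedicated tagging, grounding, and LLM components).

```python
# Hypothetical sketch of BuboGPT's three-module grounding pipeline.
# Every implementation below is a toy stand-in for illustration.

def tagging_module(image_description: str) -> list[str]:
    """Generate text tags for the input image (toy: keyword extraction)."""
    return [w.strip(".,").lower() for w in image_description.split() if len(w) >= 3]

def grounding_module(tags: list[str]) -> dict[str, tuple[int, int, int, int]]:
    """Locate a bounding box for each tag (toy: dummy boxes)."""
    return {tag: (0, 0, 10 * (i + 1), 10 * (i + 1)) for i, tag in enumerate(tags)}

def entity_matching_module(tags: list[str], llm_answer: str) -> list[str]:
    """Keep the tags that also appear as entities in the LLM's answer."""
    answer_words = {w.strip(".,").lower() for w in llm_answer.split()}
    return [t for t in tags if t in answer_words]

tags = tagging_module("A brown dog chasing a red frisbee in a park")
boxes = grounding_module(tags)
matched = entity_matching_module(tags, "The image shows a dog catching a frisbee.")
print(matched)  # ['dog', 'frisbee'] — these entities get linked to their boxes
```

The key design point is the final step: only entities that appear in both the tags and the generated text are grounded, which is what lets the model point at concrete regions instead of describing the image as a black box.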
To enable multimodal understanding of arbitrary combinations of inputs, BuboGPT uses a two-stage training regimen similar to Mini-GPT4. In the first stage, it uses ImageBind as the audio encoder, BLIP-2 as the vision encoder, and Vicuna as the LLM, learning a Q-Former that aligns vision or audio features with language. In the second stage, it performs multimodal instruction tuning on a high-quality instruction-following dataset.
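The stage-1 alignment idea can be sketched numerically. In this hedged illustration, a plain linear projection stands in for the learned alignment module, mapping frozen encoder features into the LLM's embedding space; the dimensions are assumptions for illustration, not BuboGPT's actual sizes.

```python
import numpy as np

# Sketch of stage-1 alignment: trainable projections (standing in for the
# Q-Former) map frozen encoder outputs into the LLM's token-embedding space.
# All dimensions below are illustrative assumptions.
VISION_DIM, AUDIO_DIM, LLM_DIM = 768, 1024, 4096

rng = np.random.default_rng(0)

# Frozen encoder outputs (vision encoder for images, audio encoder for sound).
vision_feat = rng.standard_normal(VISION_DIM)
audio_feat = rng.standard_normal(AUDIO_DIM)

# Trainable projections learned during stage 1 (encoders and LLM stay frozen).
W_vision = rng.standard_normal((LLM_DIM, VISION_DIM)) * 0.01
W_audio = rng.standard_normal((LLM_DIM, AUDIO_DIM)) * 0.01

# Projected features act as pseudo-tokens the LLM can attend to alongside text.
vision_token = W_vision @ vision_feat
audio_token = W_audio @ audio_feat

print(vision_token.shape, audio_token.shape)  # (4096,) (4096,)
```

Because both modalities land in the same embedding space, the frozen LLM can consume any combination of image, audio, and text inputs without retraining its own weights.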
The construction of this dataset is crucial for the LLM to recognize which modalities are provided and whether the inputs are well matched. Therefore, BuboGPT builds a new high-quality dataset with subsets for visual instruction, audio instruction, audio localization with positive image-audio pairs, and image-audio captioning with negative pairs for semantic reasoning. By introducing negative image-audio pairs, BuboGPT learns better multimodal alignment and exhibits stronger joint comprehension skills.
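To make the positive/negative pairing concrete, here is an illustrative sketch of what such dataset entries might look like. The field names, file names, and responses are all assumptions for illustration, not the actual dataset schema.

```python
# Hypothetical instruction-dataset entries (field names are assumptions).

# Positive pair: image and audio describe the same scene.
positive_pair = {
    "image": "dog_park.jpg",
    "audio": "dog_barking.wav",
    "instruction": "Describe what you see and hear.",
    "response": "A dog is playing in a park, and barking can be heard.",
    "matched": True,
}

# Negative pair: deliberately mismatched modalities, forcing the model
# to reason about whether the inputs actually belong together.
negative_pair = {
    "image": "city_street.jpg",
    "audio": "ocean_waves.wav",
    "instruction": "Do the image and the sound come from the same scene?",
    "response": "No. The image shows a city street, but the audio is ocean waves.",
    "matched": False,
}

dataset = [positive_pair, negative_pair]
print([entry["matched"] for entry in dataset])  # [True, False]
```

Training on negatives like the second entry is what teaches the model to detect mismatched inputs rather than hallucinating a connection between them.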
Check out the Paper, GitHub, and Project page. All credit for this research goes to the researchers on this project.
Ekrem Çetinkaya received his B.Sc. in 2018, and M.Sc. in 2019 from Ozyegin University, Istanbul, Turkey. He wrote his M.Sc. thesis on Image Decomposition Using Deep Convolutional Networks. He got his Ph.D. degree in 2023 from the University of Klagenfurt, Austria, with his thesis titled “Video Coding Enhancements for HTTP Adaptive Streaming Using Machine Learning.” His research interests include deep learning, computer vision, video coding and multimedia networks.