Large Multimodal Model

Open-Source LLaVA challenges GPT-4 in Multimodal Language and Vision Understanding


Sven

October 9th, 2023

~ 3 min read

In the rapidly evolving field of language and vision understanding, researchers are constantly pushing the boundaries of what is possible. A recent development called LLaVA (Large Language and Vision Assistant) has been making waves with its impressive capabilities. In this blog post, we will explore the groundbreaking features of LLaVA, its performance in chat applications and science reasoning, and its potential impact on the field of multimodal AI.

LLaVA: Combining Visual and Language Understanding

LLaVA is a novel, end-to-end trained large multimodal model that connects a vision encoder with the Vicuna language model for general-purpose visual and language understanding. Developed by researchers from the University of Wisconsin-Madison, Microsoft Research, and Columbia University, LLaVA aims to bridge the gap between language-only models and multimodal models.
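To make the architecture concrete, here is a minimal sketch (in PyTorch) of the connector idea: features from a frozen vision encoder are projected into the language model's embedding space so Vicuna can consume them as extra tokens. The class name, dimensions, and tensor shapes below are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class LlavaStyleConnector(nn.Module):
    """Minimal sketch of the LLaVA idea: project frozen vision-encoder
    patch features into the language model's token embedding space.
    The dimensions used here are illustrative assumptions."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # A simple linear projection maps visual features to LLM embeddings.
        self.projection = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the vision encoder.
        # Returns visual "tokens" that can be prepended to the text embeddings.
        return self.projection(patch_features)

# Usage: the projected visual tokens are concatenated with text embeddings
# and fed into Vicuna alongside the instruction.
connector = LlavaStyleConnector()
fake_patches = torch.randn(1, 256, 1024)   # stand-in for vision-encoder output
visual_tokens = connector(fake_patches)    # shape: (1, 256, 4096)
```

The design choice worth noting is how lightweight the bridge is: only the projection needs to be learned from scratch, while the vision encoder and the language model bring their pretrained capabilities with them.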

Improving Zero-Shot Capabilities in Multimodal Field

One of the key contributions of LLaVA is its ability to improve zero-shot capabilities in the multimodal domain. While instruction tuning large language models (LLMs) using machine-generated instruction-following data has shown promise in the language domain, it has been less explored in the multimodal field. LLaVA presents the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data.
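The trick is that a text-only model can still "see" an image if the image is described symbolically, for example through captions and object bounding boxes. The sketch below illustrates how such a prompt might be assembled; the prompt wording and the example values are assumptions for illustration, not the authors' exact templates.

```python
def build_instruction_prompt(captions: list[str], boxes: list[str]) -> str:
    """Sketch of prompting a language-only GPT-4 to generate multimodal
    instruction-following data: the image is represented symbolically
    (captions plus object bounding boxes). The wording here is an
    illustrative assumption, not the authors' exact prompt."""
    context = "\n".join(captions + boxes)
    return (
        "You are given a symbolic description of an image.\n"
        f"{context}\n\n"
        "Write a conversation between a user asking questions about the image "
        "and an assistant answering them, as if the assistant could see the "
        "image directly."
    )

# Hypothetical symbolic representation for one image:
prompt = build_instruction_prompt(
    captions=["A group of people standing around a food truck."],
    boxes=["person: [0.12, 0.30, 0.25, 0.90]", "truck: [0.40, 0.10, 0.95, 0.85]"],
)
# `prompt` would then be sent to a text-only GPT-4 endpoint to obtain
# conversation, detailed-description, and complex-reasoning samples.
```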

Impressive Chat Capabilities and State-of-the-Art Accuracy

LLaVA's early experiments show impressive multimodal chat abilities, sometimes exhibiting behavior similar to multimodal GPT-4 on unseen images and instructions. It achieves an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the combination of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.

Open-Source Accessibility

In the spirit of collaboration and transparency, the researchers have made GPT-4 generated visual instruction tuning data, the LLaVA model, and their codebase publicly available. This open-source approach allows other researchers and developers to build upon and contribute to the advancement of multimodal AI.

Building Multimodal GPT-4 Level Chatbot

LLaVA is fine-tuned on the researchers' machine-generated multimodal instruction-following data, targeting everyday user-oriented applications. The evaluation set consists of 30 unseen images, each paired with conversation, detailed-description, and complex-reasoning instructions. LLaVA achieves an impressive 85.1% relative score compared with GPT-4, showing the effectiveness of the proposed self-instruct method in multimodal settings.
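As a rough sketch of how such a relative score can be computed: a judge (text-only GPT-4) rates each answer on a numeric scale, and the candidate model's total is reported as a percentage of the reference model's total. The ratings below are hypothetical and only illustrate the arithmetic.

```python
def relative_score(model_ratings: list[float], reference_ratings: list[float]) -> float:
    """Illustrative computation of a 'relative score': the candidate's
    judge-assigned ratings summed and expressed as a percentage of the
    reference model's summed ratings."""
    return 100.0 * sum(model_ratings) / sum(reference_ratings)

# Hypothetical judge ratings (1-10 scale) over three questions:
llava_ratings = [8.0, 7.5, 9.0]
gpt4_ratings = [9.0, 9.0, 9.5]
print(f"relative score: {relative_score(llava_ratings, gpt4_ratings):.1f}%")
```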

Pushing the Boundaries of State-of-the-Art

In the science domain, LLaVA alone reaches 90.92% accuracy on the ScienceQA multimodal reasoning benchmark. To push performance further, the researchers employ a "GPT-4 as judge" scheme: text-only GPT-4 is asked to predict the final answer given its own previous answer and LLaVA's answer. This ensembling approach yields a new state-of-the-art accuracy of 92.53%.
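Conceptually, the judging step is another prompt to text-only GPT-4. The sketch below shows one way such a prompt could be framed; the wording is an assumption for illustration, not the authors' exact template.

```python
def gpt4_as_judge_prompt(question: str, gpt4_answer: str, llava_answer: str) -> str:
    """Sketch of the 'GPT-4 as judge' ensembling step: text-only GPT-4 is
    shown its own answer and LLaVA's answer and asked to decide on a final
    answer. The prompt wording is an illustrative assumption."""
    return (
        f"Question: {question}\n"
        f"Answer 1 (GPT-4): {gpt4_answer}\n"
        f"Answer 2 (LLaVA): {llava_answer}\n"
        "The two answers above may disagree. Reason about which is more "
        "likely correct and state the final answer."
    )

# The returned prompt would be sent to text-only GPT-4; its response is
# taken as the ensemble's final prediction for that question.
```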

Conclusion

LLaVA represents a significant breakthrough in multimodal language and vision understanding. Its combination of a vision encoder and Vicuna, along with its impressive chat capabilities and state-of-the-art accuracy, makes it a promising tool for a wide range of AI applications. By openly sharing their data and codebase, the researchers behind LLaVA are encouraging collaboration and fostering innovation in the multimodal AI community. For more details, see the arXiv paper.