Let's unlock Multi-modal Large Language Models!

This project explores what a multimodal large language model (MLLM) is and how to apply one to the task of visual question answering (VQA).

About the project

In the rapidly evolving landscape of artificial intelligence, one area that has garnered significant attention in recent years is multimodal large language models (MLLMs). These models can process and generate text as well as other forms of data, such as images, audio, and video. By leveraging machine learning and deep learning algorithms, MLLMs can capture the nuances of human communication across multiple modalities, unlocking new possibilities for applications in natural language processing, computer vision, and multimedia analysis.

A multimodal large language model (MLLM) is a type of artificial intelligence (AI) model that can process and generate text, as well as other forms of data, such as:

  • Images: MLLMs can understand the content of images, including objects, scenes, and actions.
  • Audio: MLLMs can analyze audio recordings, including speech, music, and sounds.
  • Video: MLLMs can process video content, including visual and auditory information.

MLLMs are designed to handle the complexities of multi-modal data, which can include:

  • Semantics: understanding the meaning of words, phrases, and sentences in multiple languages.
  • Syntax: analyzing the structure of language, including grammar and sentence structure.
  • Vision: recognizing objects, scenes, and actions in images and videos.
  • Audition: analyzing speech, music, and sounds.

The BLIP-2 model architecture processes both a question and an image as input; its output is an answer grounded in the question and the image context. BLIP-2 is composed of the following components (a minimal inference sketch follows the list):

  • Image Encoder: CLIP ViT (Vision Transformer) is used to extract visual features from the input image.
  • Input Projector: Q-Former projects the question and image features into a unified representation suitable for the LLM.
  • LLM: Flan-T5 or OPT serves as the language model core, generating the final answer from the combined question and image context.
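
As a rough illustration of how these components fit together at inference time, below is a minimal VQA sketch using the Hugging Face transformers implementation of BLIP-2. The checkpoint name (Salesforce/blip2-opt-2.7b), the example image path, and the prompt wording are assumptions for illustration only; the Flan-T5 variant can be swapped in the same way.

```python
# Minimal BLIP-2 VQA inference sketch (assumed checkpoint and image path).
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# The processor bundles the CLIP ViT image preprocessing and the LLM tokenizer.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

# Load an image and pose a question in the BLIP-2 prompt format.
image = Image.open("example.jpg").convert("RGB")  # hypothetical image path
prompt = "Question: how many people are in the picture? Answer:"

# Pixel values go to the image encoder, input ids go to the LLM;
# the Q-Former bridges the two inside the model.
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, dtype)

generated_ids = model.generate(**inputs, max_new_tokens=20)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(answer)
```

Running the sketch requires the torch, transformers, and pillow packages; a GPU runtime is recommended for checkpoints of this size.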

Read more at: https://medium.com/@minhle_0210/lets-unlock-multi-modal-large-language-models-9ced1c9dde75

Usage

At the moment, the project is provided in notebook form.

Contact us:
