This AI newsletter is all you need #68
What happened this week in AI by Louie

This week, we witnessed the introduction of LLaVA v1.5, a new open-source multimodal model that positions itself as a contender against the multimodal capabilities of GPT-4. It uses a simple projection matrix to connect the pre-trained CLIP ViT-L/14 vision encoder with the Vicuna LLM, resulting in a robust model that can handle both images and text. The model is trained in two stages: first, the projection matrix is updated on a subset of CC3M for better alignment between image and text features; then, the model is fine-tuned end-to-end for two specific use cases, Visual Chat and Science QA, which resulted in state-of-the-art accuracy on the latter benchmark.
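To make the "simple projection matrix" idea concrete, here is a minimal sketch in plain Python. It is not LLaVA's actual code: the function names (`make_projection`, `project`, `build_input_sequence`), the toy dimensions, and the random stand-in features are all assumptions for illustration. It only shows the general shape of the technique, where patch features from a vision encoder are multiplied by a matrix W so they land in the LLM's embedding space and can be placed in front of the text token embeddings.

```python
import random


def make_projection(vision_dim, llm_dim, seed=0):
    """Randomly initialised projection matrix W with shape (vision_dim, llm_dim)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(llm_dim)] for _ in range(vision_dim)]


def project(patch_features, W):
    """Map each patch feature vector into the LLM embedding space (H = Z @ W)."""
    llm_dim = len(W[0])
    return [
        [sum(z * W[i][j] for i, z in enumerate(patch)) for j in range(llm_dim)]
        for patch in patch_features
    ]


def build_input_sequence(visual_tokens, text_embeddings):
    """Place the projected visual tokens in front of the text token embeddings."""
    return visual_tokens + text_embeddings


# Toy sizes, chosen only for illustration; real encoders and LLMs are far larger.
VISION_DIM, LLM_DIM = 4, 6
rng = random.Random(1)

# Hypothetical stand-ins for the vision encoder output (3 image patches)
# and for the LLM's embeddings of a 5-token text prompt.
patch_features = [[rng.random() for _ in range(VISION_DIM)] for _ in range(3)]
text_embeddings = [[rng.random() for _ in range(LLM_DIM)] for _ in range(5)]

W = make_projection(VISION_DIM, LLM_DIM)
visual_tokens = project(patch_features, W)
sequence = build_input_sequence(visual_tokens, text_embeddings)

print(len(sequence), len(sequence[0]))  # 8 tokens, each of width 6
```

In the first training stage described above, W would be the only part that gets updated; the training loop itself is omitted here.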