Vision Transformer Models Encoder/Decoder Structure

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Abstract: We present Florence-VL, a new family of multimodal large language models (MLLMs) with enriched visual representations produced by Florence-2 [45], a generative vision foundation model.

Some results have been hidden because they may be inaccessible to you

Show inaccessible results

Feedback

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Trending now