Vision models · Ollama

gemma4

Gemma 4 models are designed to deliver frontier-level performance at each size. They are well-suited for reasoning, agentic workflows, coding, and multimodal understanding.

vision tools thinking audio cloud e2b e4b 12b 26b 31b

19.6M Pulls 49 Tags Updated 3 weeks ago

qwen3.5

Qwen 3.5 is a family of open-source multimodal models that delivers exceptional utility and performance.

vision tools thinking cloud 0.8b 2b 4b 9b 27b 35b 122b

16.3M Pulls 64 Tags Updated 2 months ago

qwen3.6

Qwen3.6 delivers substantial upgrades in agentic coding and thinking preservation over previous Qwen models.

vision tools thinking 27b 35b

4.6M Pulls 30 Tags Updated 1 month ago

glm-ocr

GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture.

vision tools

6.2M Pulls 3 Tags Updated 5 months ago

nemotron3

NVIDIA Nemotron 3 Nano Omni is a multimodal large language model that unifies video, audio, image, and text understanding to support enterprise-grade Q&A, summarization, transcription, and document intelligence workflows.

vision tools thinking audio 33b

624.2K Pulls 4 Tags Updated 3 months ago

minimax-m3

MiniMax M3: Coding & Agentic Frontier. 1M context window. Native Multimodality.

vision tools thinking cloud

316.2K Pulls 1 Tag Updated 1 month ago

kimi-k2.7-code

Kimi K2.7 Code is Moonshot AI's coding-focused agentic model built upon Kimi K2.6, with substantial improvements on real-world long-horizon coding tasks and roughly 30% lower thinking-token usage.

vision tools thinking cloud

188.3K Pulls 1 Tag Updated 1 month ago

kimi-k2.6

Kimi K2.6 is an open-source, native multimodal agentic model that advances practical capabilities in long-horizon coding, coding-driven design, proactive autonomous execution, and swarm-based task orchestration.

vision tools thinking cloud

400.6K Pulls 1 Tag Updated 3 months ago

mistral-medium-3.5

Mistral Medium 3.5 is Mistral AI's first flagship model to merge instruction-following, reasoning, and coding in a single set of 128B weights.

vision tools thinking 128b

243K Pulls 5 Tags Updated 2 months ago

medgemma

MedGemma is a collection of Gemma 3 variants trained for medical text and image comprehension.

vision 4b 27b

205.6K Pulls 9 Tags Updated 3 months ago

medgemma1.5

MedGemma 1.5 4B is an updated version of the MedGemma 4B model.

vision 4b

96.9K Pulls 5 Tags Updated 3 months ago

minicpm-v4.6

A Pocket-Sized MLLM for Ultra-Efficient Image and Video Understanding on Your Phone

vision 1b

21.9K Pulls 13 Tags Updated 1 month ago

minicpm-v4.5

A GPT-4o Level MLLM for Single Image, Multi Image and High-FPS Video Understanding on Your Phone

vision 8b

17.5K Pulls 13 Tags Updated 1 month ago

qwen3-vl

The most powerful vision-language model in the Qwen model family to date.

vision tools thinking 2b 4b 8b 30b 32b 235b

4.7M Pulls 57 Tags Updated 9 months ago

translategemma

A new collection of open translation models built on Gemma 3, helping people communicate across 55 languages.

vision 4b 12b 27b

1.9M Pulls 13 Tags Updated 6 months ago

gemini-3-flash-preview

Gemini 3 Flash offers frontier intelligence built for speed at a fraction of the cost.

vision tools thinking cloud

2.3M Pulls 1 Tag Updated 7 months ago

ministral-3

The Ministral 3 family is designed for edge deployment, capable of running on a wide range of hardware.

vision tools 3b 8b 14b

1.3M Pulls 13 Tags Updated 7 months ago

devstral-small-2

A 24B model that excels at using tools to explore codebases, edit multiple files, and power software engineering agents.

vision tools 24b

916.7K Pulls 6 Tags Updated 7 months ago

kimi-k2.5

Kimi K2.5 is an open-source, native multimodal agentic model that seamlessly integrates vision and language understanding with advanced agentic capabilities, instant and thinking modes, as well as conversational and agentic paradigms.

vision tools thinking cloud

369.2K Pulls 1 Tag Updated 6 months ago

deepseek-ocr

DeepSeek-OCR is a vision-language model that can perform token-efficient OCR.

vision 3b

498K Pulls 3 Tags Updated 8 months ago