Google Gemma 4 12B: Run multimodal AI locally with an encoder-free architecture

Gemma 4 12B is Google DeepMind’s newest open-source model. It processes text, images, and audio natively on consumer hardware, running on just 16GB of VRAM.

Most multimodal models carry a hidden memory tax: separate encoder stacks for vision and audio that inflate overhead before a single token is generated. Gemma 4 12B removes the encoders entirely. Vision runs through a lightweight embedding module, audio is projected as raw signal directly into the token space, and the LLM backbone handles the rest.
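To make the idea of projecting raw audio into the token space concrete, here is a toy sketch in plain Python. It is not Gemma's implementation: the frame length, embedding width, random weights, and the function names `frame_signal`, `project`, and `build_sequence` are all hypothetical, chosen only to show how a waveform could be cut into frames, mapped linearly to embedding-sized vectors, and placed in the same sequence as text embeddings without a separate encoder stack.

```python
import random

FRAME_LEN = 4   # hypothetical number of audio samples per frame
D_MODEL = 3     # hypothetical embedding width shared with text tokens


def frame_signal(samples, frame_len=FRAME_LEN):
    """Split a waveform into non-overlapping frames, dropping any ragged tail."""
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def project(frames, weight):
    """Linearly map each frame to one embedding vector.

    `weight` holds one row of frame-length coefficients per output dimension,
    so each frame becomes a vector with len(weight) entries.
    """
    return [
        [sum(x * w for x, w in zip(frame, row)) for row in weight]
        for frame in frames
    ]


def build_sequence(text_embeddings, audio_embeddings):
    """Concatenate text and audio embeddings into one backbone input sequence."""
    return list(text_embeddings) + list(audio_embeddings)


rng = random.Random(0)
# Hypothetical projection weights; a real model would learn these.
WEIGHT = [[rng.uniform(-1, 1) for _ in range(FRAME_LEN)] for _ in range(D_MODEL)]

waveform = [rng.uniform(-1, 1) for _ in range(10)]        # stand-in raw audio
text_embeddings = [[0.0] * D_MODEL for _ in range(2)]     # stand-in text tokens

audio_embeddings = project(frame_signal(waveform), WEIGHT)
sequence = build_sequence(text_embeddings, audio_embeddings)
```

The point of the sketch is only that the audio path is a single projection into the shared embedding width, so the backbone sees one sequence of same-sized vectors regardless of modality.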

The result is a model that benchmarks close to Google’s larger 26B MoE variant while fitting comfortably on a consumer PC.

Key features include:

🧠 Encoder-free architecture for native text, vision, and audio processing
💻 Runs locally on 16GB VRAM or unified memory
🤖 Reasoning performance approaching the 26B MoE Gemma model
⚡ Multi-Token Prediction drafters for reduced local inference latency
📦 Apache 2.0 license, available now on Hugging Face and Kaggle
🛠️ Compatible with Ollama, LM Studio, llama.cpp, vLLM, and HF Transformers
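As a rough sanity check on the 16GB figure, the weight footprint can be estimated from the parameter count alone. The sketch below assumes exactly 12 billion parameters (taken from the model name) and three common storage precisions. The post does not say which precision the 16GB claim refers to, and the estimate ignores the KV cache, activations, and runtime overhead, so treat it as a lower bound.

```python
# Rough weight-memory estimate for a 12B-parameter model.
# Assumptions: exactly 12e9 parameters (from the model name) and the
# precisions listed below; KV cache and activations are NOT counted.

PARAMS = 12e9
GIB = 1024 ** 3

BITS_PER_WEIGHT = {"bf16": 16, "int8": 8, "int4": 4}


def weight_gib(params: float, bits: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return params * bits / 8 / GIB


def fits(params: float, bits: int, budget_gib: float = 16.0) -> bool:
    """True if the weights alone fit within the memory budget."""
    return weight_gib(params, bits) <= budget_gib


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        size = weight_gib(PARAMS, bits)
        print(f"{name}: {size:.1f} GiB, fits in 16 GiB: {fits(PARAMS, bits)}")
```

Under these assumptions the weights take about 22 GiB at 16-bit precision, about 11 GiB at 8-bit, and under 6 GiB at 4-bit, so a 16GB device would need a quantized checkpoint to leave headroom for the cache and activations.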

It’s built for ML engineers and AI developers building on-device or edge applications that need multimodal capability without a cloud API dependency. Download the weights on Hugging Face or Kaggle and start building today.

P.S. I hunt the latest and greatest launches in tech, SaaS and AI. Follow to be notified → @rohanrecommends
