Google Gemma 4 12B: Run multimodal AI locally with an encoder-free architecture



Gemma 4 12B is Google DeepMind's newest open-source model. It processes text, images, and audio natively on consumer hardware, running on just 16GB of VRAM.

Most multimodal models carry a hidden memory tax: separate encoder stacks for vision and audio that inflate overhead before a single token is generated. Gemma 4 12B removes the encoders entirely. Vision runs through a lightweight embedding module, audio is projected as raw signal directly into the token space, and the LLM backbone handles the rest.

The result is a model that benchmarks close to Google's larger 26B MoE variant while fitting comfortably on a consumer laptop.
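To put the 16GB figure in context, here is a rough back-of-the-envelope sketch of how much memory the weights alone would need at different precisions. It assumes "12B" means roughly 12 billion parameters and counts weights only (no KV cache, activations, or runtime overhead). The announcement does not say which precision or quantization the 16GB figure refers to, so treat the numbers as illustrative rather than official.

```python
# Rough weight-memory estimate for a dense model.
# Assumptions (not from the announcement): ~12e9 parameters, weights only,
# and a "16GB" budget treated as 16 GiB.

GIB = 1024 ** 3  # bytes in one GiB


def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB


def fits(num_params: float, bits_per_param: int, budget_gib: float) -> bool:
    """True if the weights alone fit inside the given memory budget."""
    return weight_memory_gib(num_params, bits_per_param) <= budget_gib


if __name__ == "__main__":
    params = 12e9       # assumed parameter count
    budget = 16.0       # assumed budget in GiB
    for bits in (16, 8, 4):
        size = weight_memory_gib(params, bits)
        verdict = "fits" if fits(params, bits, budget) else "does not fit"
        print(f"{bits:>2}-bit weights: {size:5.2f} GiB ({verdict} in {budget:.0f} GiB)")
```

Under these assumptions, 16-bit weights alone would exceed a 16 GiB budget, while 8-bit or 4-bit weights leave headroom for the KV cache and activations. Any memory an architecture saves by not loading separate vision and audio encoders goes straight into that headroom.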

Key features include:

  • 🧠 Encoder-free architecture for native text, vision, and audio processing

  • 💻 Runs locally on 16GB VRAM or unified memory

  • 🤖 Reasoning performance nearing the 26B MoE Gemma model

  • ⚡ Multi-Token Prediction drafters for reduced local inference latency

  • 📦 Apache 2.0 license, available now on Hugging Face and Kaggle

  • 🛠️ Compatible with Ollama, LM Studio, llama.cpp, vLLM, and HF Transformers

It's built for ML engineers and AI developers building on-device or edge applications that need multimodal capability without a cloud API dependency. Download the weights on Hugging Face or Kaggle and start building today.

P.S. I hunt the latest and greatest launches in tech, SaaS and AI. Follow to be notified → @rohanrecommends

