Gemma 4 12B is Google DeepMind’s newest open-source type that processes textual content, pictures, and audio natively on client {hardware}, working on simply 16GB of VRAM.
Maximum multimodal fashions lift a hidden reminiscence tax: separate encoder stacks for imaginative and prescient and audio that inflate overhead earlier than a unmarried token is generated. Gemma 4 12B gets rid of the encoders completely. Imaginative and prescient runs thru a light-weight embedding module, audio is projected as uncooked sign without delay into the token house, and the LLM spine handles the remaining.
The result’s a type that benchmarks with reference to Google’s higher 26B MoE variant whilst becoming with ease on a shopper pc.
Key features come with:
-
đź§ Encoder-free structure for local textual content, imaginative and prescient, and audio processing
-
đź’» Runs in the community on 16GB VRAM or unified reminiscence
-
🤖 Reasoning efficiency nearing the 26B MoE Gemma type
-
⚡ Multi-Token Prediction drafters for lowered native inference latency
-
📦 Apache 2.0 license, to be had now on Hugging Face and Kaggle
-
🛠️ Suitable with Ollama, LM Studio, llama.cpp, vLLM, and HF Transformers
It’s constructed for ML engineers and AI builders development on-device or edge packages that want multimodal capacity and not using a cloud API dependency. Obtain the weights on Hugging Face or Kaggle and get started development as of late.
P.S. I hunt the most recent and biggest launches in tech, SaaS and AI, practice to be notified → @rohanrecommends



