I changed Cursor with a fully native VS Code setup, and I overlooked lower than anticipated

Making an allowance for that AI equipment can take on mind-numbingly bogus duties, there’s no denying that they’re a godsend for productiveness. However with nearly each cloud-based platform charging common subscription charges for his or her clanker-powered services and products, we’re beginning to get to some degree the place you want to finally end up paying loads of greenbacks to steer clear of hitting price limits on coding platforms.

In truth, the hugely limited token utilization at the loose variations of Cursor and Antigravity has made me steer clear of their choices, with the latter’s timeouts particularly being a significant turnoff for initiatives the place I wish to question the LLMs more than one occasions to get one thing significant out of them. In the meantime, I’ve began experimenting with MoE fashions, and with the appropriate extension on VS Code, they’ve solely changed their cloud opposite numbers for my dev duties.

Comparable

I ran native LLMs on Intel’s most cost-effective iGPU, and the consequences had been unusually respectable

It ain’t no fit for a devoted GPU, however you’ll run some mild LLMs at the N100

The llama-vscode extension fuels my coding escapades

I’ve paired it with the MoE fashions I host on my house lab nodes

Again after I first dove into LLMs, I caught with 9B and 12B fashions for probably the most phase. And whilst they’re beautiful respectable for producing OCR textual content or growing tags for my paperwork, links, and notes, they’re a ways from supreme for coding duties – and now not simply vibe-coding, both. The most typical use case for LLMs in my house lab is querying them about failed initiatives, inspecting terminal logs, and accomplishing vulnerability scans on my code. The small-sized fashions that’d are compatible in consumer-tier GPUs lack the sheer computational prowess for those duties, particularly whenever you pit them in opposition to the reasoning powerhouses you’ll leverage with Cursor and Antigravity.

Then again, Combination-of-Mavens fashions turn the entire scenario on its head. In any case, with the ability to host cumbersome 35B fashions on vulnerable 12GB VRAM GPUs with out taking large efficiency hits or turning down the quantization price makes them a power to be reckoned with. And having examined GPT-OSS-20B, Gemma-4-26B-A4B, and Qwen3.6-35B-A3B with my VS Code example over the last couple of months, I will be able to verify that they’re easiest for dev duties, with Qwen3.6 conserving its personal in opposition to its cloud-based opponents.

As for my coding toolkit, VS Code – the very software that Cursor and Antigravity are forked from – serves as the center-piece of my setup. I’d to begin with used Proceed right through my Ollama days, however after you have a style of MoE fashions, I’ve since shifted to llama-vscode, which pairs extremely smartly with the llama-server circumstances working on my Proxmox server and gaming workstation.

Because the llama-vscode extension accepts the whole lot from code recordsdata to random paperwork, the potential of my LLMs hallucinating is diminished even additional. Pair it with the appropriate LLM, and it might generate totally purposeful code snippets, whilst its auto-completion options are simply as dependable. That mentioned, I’ve had higher good fortune with Qwen 2.5 Coder (the decrease parameter variants) because the auto-completion type, as Qwen3.6 and Gemma 4 would take a few seconds to generate code. However for easy RAG-based chat or troubleshooting help, those LLMs generally tend to supply correct leads to underneath a minute.

Comparable

Your previous GPU can nonetheless run giant LLMs – you simply want the appropriate tweaks

There is a lot you’ll do with those fashions

Together with the utilities uncovered by the use of MCP servers

Any other neat side of llama-vscode is that it helps agentic workflows, and the default agent is flexible sufficient to conform to maximum coding eventualities. Then again, the true amusing starts while you get started growing brokers for devoted duties. There’s even an agent designed to create different brokers (and sub-agents), and it really works smartly so long as I give it an in depth description of what I would like within the chat phase.

Likewise, llama-vscode additionally we could me fine-tune the other facets of an agent, and I will be able to select the precise collection of equipment at its disposal. Talking of equipment, llama-vscode works with MCP servers, which means I will be able to use my LLMs to regulate further packages, as a substitute of simply depending on them for coding duties.

The most efficient phase? I don’t need to pay any subscription charges for this setup

A laugh truth: Burst inference duties don’t devour numerous power

In comparison to cloud LLMs that may generate whole code recordsdata in a handful of seconds, the relatively longer time taken by means of my MoE fashions to respond to queries isn’t dangerous in any respect. If the rest, I’d take this slight efficiency problem over working out of price limits any day, particularly since my native LLMs spare me from paying additional subscription charges each month.

When you’re questioning concerning the energy intake of my LLM-hosting workstations, then no, my self-hosted fashions slightly give a contribution to my power expenses. You spot, there’s an enormous false impression about LLM utilization within the tinkering group – whilst AI fashions can siphon an ungodly quantity of energy right through the learning segment, inference duties are a unique tale altogether. After I run LLM-powered duties, my GPUs spring to existence for a couple of seconds, procedure the duties, and return to an idle state. If the rest, working the servers 24/7 drains extra watts than the inference duties, however I already use one workstation for my Proxmox experiments, whilst the opposite is my major gaming/video-editing/coding device.

Then there’s the privateness good thing about hooking native LLMs as much as my house lab paperwork, early-access codebases, and confidential initiatives. In point of fact, the few efficiency tradeoffs are definitely worth the personal and subscription-free nature of my native VS Code setup.

I changed Cursor with a fully native VS Code setup, and I overlooked lower than anticipated

I ran native LLMs on Intel’s most cost-effective iGPU, and the consequences had been unusually respectable

The llama-vscode extension fuels my coding escapades

I’ve paired it with the MoE fashions I host on my house lab nodes

Your previous GPU can nonetheless run giant LLMs – you simply want the appropriate tweaks

Together with the utilities uncovered by the use of MCP servers

The most efficient phase? I don’t need to pay any subscription charges for this setup

A laugh truth: Burst inference duties don’t devour numerous power

Leave a Comment Cancel Reply

Sign up to receive email updates, fresh news and more!

I ran native LLMs on Intel’s most cost-effective iGPU, and the consequences had been unusually respectable

The llama-vscode extension fuels my coding escapades

I’ve paired it with the MoE fashions I host on my house lab nodes

Your previous GPU can nonetheless run giant LLMs – you simply want the appropriate tweaks

Together with the utilities uncovered by the use of MCP servers

The most efficient phase? I don’t need to pay any subscription charges for this setup

A laugh truth: Burst inference duties don’t devour numerous power

Related Posts

Leave a Comment Cancel Reply