April 2026 TLDR Setup for Ollama and Gemma 4 26B on a Mac mini

(gist.github.com)

125 points | by greenstevester 4 hours ago

11 comments

mark_l_watson 19 minutes ago
The article has a few good tips for using Ollama. Perhaps it should note that the Gemma 4 models are not really trained for strong performance with coding agents like OpenCode, Claude Code, pi, etc. The Gemma 4 models are excellent for applications requiring tool use, data extraction to JSON, etc. I asked Gemini Pro about this earlier and Gemini Pro recommended qwen 3.5 models specifically for coding, and backed that up with interesting material on training. This makes sense, and is something that I do: use strong models to build effective applications using small efficient models.
anonyfox 10 minutes ago
M5 air here with 32gb ram and 10/10 cores. Anyone got some luck with mlx builds on oMLX so far? Not at my machine right now and would love to know if these models already work including tool calling
milchek 1 hour ago
I tested briefly with a MacBook Pro m4 with 36gb. Run in LM Studio with open code as the frontend and it failed over and over on tool calls. Switched back to qwen. Anyone else on similar setup have better luck?
[-]
- internet101010 25 minutes ago
  I failed to run in LM Studio on M5 with 32gb at even half max context. Literally locked up computer and had to reboot.
  Ran gemma-4-26B-A4B-it-GGUF:Q4_K_M just fine with llama.cpp though. First time in a long time that I have been impressed by a local model. Both speed (~38t/s) and quality are very nice.
- jasonjmcghee 19 minutes ago
  Haven't had time to try yet, but heard from others that they needed to update both the main and runtime versions for things to work.
aetherspawn 1 hour ago
Which harness (IDE) works with this if any? Can I use it for local coding right now?
[-]
- lambda 1 hour ago
  Yes, you can use it for local coding. Most harnesses can be pointed at a local endpoint which provides an OpenAI compatible API, though I've had some trouble using recent versions of Codex with llama.cpp due to an API incompatibility (Codex uses the newer "responses" API, but in a way that llama.cpp hasn't fully supported).
  I personally prefer Pi as I like the fact that it's minimalist and extensible. But some people just use Claude Code, some OpenCode, there are a ton of options out there and most of them can be used with local models.
boutell 2 hours ago
Last night I had to install the VO.20 pre-release of ollama to use this model. So I'm wondering if these instructions are accurate.
redrove 3 hours ago
There is virtually no reason to use Ollama over LM Studio or the myriad of other alternatives.
Ollama is slower and they started out as a shameless llama.cpp ripoff without giving credit and now they “ported” it to Go which means they’re just vibe code translating llama.cpp, bugs included.
[-]
- faitswulff 2 hours ago
  Does LM Studio have an equivalent to the ollama launch command? i.e. `ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4`
  [-]
  - DiabloD3 1 hour ago
    I don't think it does, but llama.cpp does, and can load models off HuggingFace directly (so, not limited to ollama's unofficial model mirror like ollama is).
    There is no reason to ever use ollama.
    [-]
    - ffsm8 1 hour ago
      > I don't think it does, but llama.cpp does
      I just checked their docs and can't see anything like it.
      Did you mistake the command to just download and load the model?
      [-]
      - u8080 30 minutes ago
        -hf ModelName:Q4_K_M
        [-]
        ffsm8 11 minutes ago
        Did you mistake the command to just download and load the model too?
        Actually that shouldn't be a question, you clearly did.
        Hint: it also opens Claude code configured to use that model
    - beanjuiceII 1 hour ago
      sure there's a reason...it works fine thats the reason
- alifeinbinary 3 hours ago
  I really like LM Studio when I can use it under Windows but for people like me with Intel Macs + AMD gpu ollama is the only option because it can leverage the gpu using MoltenVK aka Vulkan, unofficially. We're still testing it, hoping to get the Vulkan support in the main branch soon. It works perfectly for single GPUs but some edge cases when using multiple GPUs are unsupported until upstream support from MoltenVK comes through. But yeah, I agree, it wasn't cool to repackage Georgi's work like that.
- gen6acd60af 2 hours ago
  LM Studio is closed source.
  And didn't Ollama independently ship a vision pipeline for some multimodal models months before llama.cpp supported it?
- meltyness 2 hours ago
  I feel like the READMEs for these 3 large popular packages already illustrate tradeoffs better than hacker news argument
- logicallee 36 minutes ago
  >Ollama is slower
  I've benchmarked this on an actual Mac Mini M4 with 24 GB of RAM, and averaged 24.4 t/s on Ollama and 19.45 t/s on LM Studio for the same ~10 GB model (gemma4:e4b), a difference which was repeated across three runs and with both models warmed up beforehand. Unless there is an error in my methodology, which is easy to repeat[1], it means Ollama is a full 25% faster. That's an enormous difference. Try it for yourself before making such claims.
  [1] script at: https://pastebin.com/EwcRqLUm but it warms up both and keeps them in memory, so you'll want to close almost all other applications first. Install both ollama and LM Studio and download the models, change the path to where you installed the model. Interestingly I had to go through 3 different AI's to write this script: ChatGPT (on which I'm a Pro subscriber) thought about doing so then returned nothing (shenanigans since I was benchmarking a competitor?), I had run out of my weekly session limit on Pro Max 20x credits on Claude (wonder why I need a local coding agent!) and then Google rose to the challenge and wrote the benchmark for me. I didn't try writing a benchmark like this locally, I'll try that next and report back.
  [-]
  - dminik 25 minutes ago
    It depends on the hardware, backend and options. I've recently tried running some local AIs (Qwen3.5 9B for the numbers here) on an older AMD 8GB VRAM GPU (so vulkan) and found that:
    llama.cpp is about 10% faster than LM studio with the same options.
    LM studio is 3x faster than ollama with the same options (~13t/s vs ~38t/s), but messes up tool calls.
    Ollama ended up slowest on the 9B, Queen3.5 35B and some random other 8B model.
    Note that this isn't some rigorous study or performance benchmarking. I just found ollama unnaceptably slow and wanted to try out the other options.
- iLoveOncall 3 hours ago
  > There is virtually no reason to use Ollama over LM Studio or the myriad of other alternatives.
  Hmm, the fact that Ollama is open-source, can run in Docker, etc.?
  [-]
  - DiabloD3 1 hour ago
    Ollama is quasi-open source.
    In some places in the source code they claim sole ownership of the code, when it is highly derivative of that in llama.cpp (having started its life as a llama.cpp frontend). They keep it the same license, however, MIT.
    There is no reason to use Ollama as an alternative to llama.cpp, just use the real thing instead.
    [-]
    - simondotau 44 minutes ago
      If it’s MIT code derived from MIT code, in what way is its openness ”quasi”? Issues of attribution and crediting diminish the karma of the derived project, but I don’t see how it diminishes the level of openness.
- lousken 3 hours ago
  lm studio is not opensource and you can't use it on the server and connect clients to it?
  [-]
  - jedisct1 2 hours ago
    LM Studio can absolutely run as as server.
    [-]
    - walthamstow 2 hours ago
      IIRC it does so as default too. I have loads of stuff pointing at LM Studio on localhost
logicallee 2 hours ago
In case someone would like to know what these are like on this hardware, I tested Gemma 4 32b (the ~20 GB model, the largest Gemma model Google published) and Gemma 4 gemma4:e4b (the ~10 GB model) on this exact setup (Mac Mini M4 with 24 GB of RAM using Ollama), I livestreamed it:
https://www.youtube.com/live/G5OVcKO70ns
The ~10 GB model is super speedy, loading in a few seconds and giving responses almost instantly. If you just want to see its performance, it says hello around the 2 minute mark in the video (and fast!) and the ~20 GB model says hello around 5 minutes 45 seconds in the video. You can see the difference in their loading times and speed, which is a substantial difference. I also had each of them complete a difficult coding task, they both got it correct but the 20 GB model was much slower. It's a bit too slow to use on this setup day to day, plus it would take almost all the memory. The 10 GB model could fit comfortably on a Mac Mini 24 GB with plenty of RAM left for everything else, and it seems like you can use it for small-size useful coding tasks.
easygenes 3 hours ago
Why is ollama so many people’s go-to? Genuinely curious, I’ve tried it but it feels overly stripped down / dumbed down vs nearly everything else I’ve used.
Lately I’ve been playing with Unsloth Studio and think that’s probably a much better “give it to a beginner” default.
[-]
- diflartle 2 hours ago
  Ollama is good enough to dabble with, and getting a model is as easy as ollama pull <model name> vs figuring it out by yourself on hugging face and trying to make sense on all the goofy letters and numbers between the forty different names of models, and not needing a hugging face account to download.
  So you start there and eventually you want to get off the happy path, then you need to learn more about the server and it's all so much more complicated than just using ollama. You just want to try models, not learn the intricacies of hosting LLMs.
- polotics 3 hours ago
  Ollama got some first-mover advantage at the time when actually building and git pulling llama.cpp was a bit of a moat. The devs' docker past probably made them overestimate how much they could lay claim to mindshare. However, no one really could have known how quickly things would evolve... Now I mostly recommend LM-studio to people.
  What does unsloth-studio bring on top?
  [-]
  - easygenes 3 hours ago
    LM Studio has been around longer. I’ve used it since three years ago. I’d also agree it is generally a better beginner choice then and now.
    Unsloth Studio is more featureful (well integrated tool calling, web search, and code execution being headline features), and comes from the people consistently making some of the best GGUF quants of all popular models. It also is well documented, easy to setup, and also has good fine-tuning support.
    [-]
    - xenophonf 1 hour ago
      LM Studio isn't free/libre/open source software, which misses the point of using open weights and open source LLMs in the first place.
      [-]
      - vonneumannstan 44 minutes ago
        Disagree, there are a lot of reasons to use open source local LLMs that aren't related to free/libre/oss principles. Privacy being a major one.
- DiabloD3 1 hour ago
  Advertising, mostly.
  Ollama's org had people flood various LLM/programming related Reddits and Discords and elsewhere, claiming it was an 'easy frontend for llama.cpp', and tricked people.
  Only way to win is to uninstall it and switch to llama.cpp.
robotswantdata 3 hours ago
Why are you using Ollama? Just use llama.cpp
brew install llama.cpp
use the inbuilt CLI, Server or Chat interface. + Hook it up to any other app
[-]
- Bigsy 2 hours ago
  For MLX I'd guess.
  [-]
  - wronglebowski 1 hour ago
    That also comes upstream from llama.cpp https://github.com/ggml-org/llama.cpp/discussions/4345
  - redrove 1 hour ago
    https://omlx.ai/
    [-]
    - leftnode 1 minute ago
      Does this have a CLI only interface?
volume_tech 1 hour ago
[dead]
greenstevester 4 hours ago
[flagged]
[-]
- krzyk 2 hours ago
  By desk you mean that "Mac mini"? Because it is pricey. In my country it is 1000 USD (from Apple for basic M4 with 24GB). My desk was 1/5th of that price.
  And considering that this Mac mini won't be doing anything else is there a reason why not just buy subscription from Claude, OpenAI, Google, etc.?
  Are those open models more performant compared to Sonnet 4.5/4.6? Or have at least bigger context?
  [-]
  - lambda 1 hour ago
    Right now, open models that run on hardware that costs under $5000 can get up to around the performance of Sonnet 3.7. Maybe a bit better on certain tasks if you fine tune them for that specific task or distill some reasoning ability from Opus, but if you look at a broad range of benchmarks, that's about where they land in performance.
    You can get open models that are competitive with Sonnet 4.6 on benchmarks (though some people say that they focus a bit too heavily on benchmarks, so maybe slightly weaker on real-world tasks than the benchmarks indicate), but you need >500 GiB of VRAM to run even pretty aggressive quantizations (4 bits or less), and to run them at any reasonable speed they need to be on multi-GPU setups rather than the now discontinued Mac Studio 512 GiB.
    The big advantage is that you have full control, and you're not paying a $200/month subscription and still being throttled on tokens, you are guaranteed that your data is not being used to train models, and you're not financially supporting an industry that many people find questionable. Also, if you want to, you can use "abliterated" versions which strip away the censoring that labs do to cause their models to refuse to answer certain questions, or you can use fine-tunes that adapt it for various other purposes, like improving certain coding abilities, making it better for roleplay, etc.
  - zhongwei2049 59 minutes ago
    I have the same setup (M4 Pro, 24GB). The e4b model is surprisingly snappy for quick tasks. The full 26B is usable but not great — loading time alone is enough to break your flow.
    Re: subscriptions vs local — I use both. Cloud for the heavy stuff, local for when I'm iterating fast and don't want to deal with rate limits or network hiccups.