In this article you’ll learn how to perform inference with a 117-billion-parameter model on Macs. I’m an AI Tinkerers organizer for the Ottawa chapter, and I’m happy to share this tutorial, which covers how to use Apple’s MLX-LM library to perform inference on Apple Silicon.
Even a MacBook Air with 8 GB of memory is enough to run a 3-billion-parameter model. I assume many of you have more powerful Macs; on an M4 Max chip with 128 GB of memory, I was able to generate 79 tokens/second with a 117-billion-parameter model (5.1 billion active parameters), which is comparable to some cloud AI chat apps.

Why Should I Run AI Models Locally?
The privacy advantages of running AI locally with open-source software instead of through a cloud service are obvious, especially for an AI Tinkerer. There are other advantages of running AI locally that are often overshadowed by the privacy aspects:
- Run models without an internet connection.
- Free with no usage limits.
- Unlimited customization for Tinkerers.
As time goes on, the list of advantages may continue to grow, so now’s a good time to familiarize yourself with the tools for building your own local AI applications.
Apple’s MLX-LM Package
MLX-LM is a Python package developed by Apple for running AI models on Mac. It uses Apple’s MLX, an array framework much like NumPy or PyTorch but built specifically for Apple Silicon. I believe MLX-LM is the best option for running AI models on Mac, especially long-term, as the underlying code is maintained by Apple. I expect MLX-LM to make better use of Apple’s GPUs through Metal than third-party options.
MLX-LM supports text generation and fine-tuning for large language models. It integrates with Hugging Face’s Model Hub and supports top model architectures. It’s available on PyPI and has a straightforward API. I am very impressed with its performance and, as I mentioned, I was able to perform inference with a 117-billion-parameter model on a MacBook Pro.
Machine Requirements
This repository contains the scripts I used to quantize three models with MLX-LM. I used mixed quantization, keeping sensitive components at 8 bits while quantizing most other supported modules to 4 bits. The models are available on Hugging Face’s Model Hub, and the table below shows the recommended amount of available unified memory for each one. Given that the smallest model needs about 5 GB of memory, it’s possible to run it on a MacBook Air with 8 GB of memory.
| Model | Recommended memory |
|---|---|
| EricFillion/smollm3-3b-mlx | ~5 GB |
| EricFillion/gpt-oss-20b-mlx | ~20 GB |
| EricFillion/gpt-oss-120b-mlx | ~65 GB |
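As a rough cross-check on the table, the weight storage of a quantized model can be approximated as parameters × bits per weight ÷ 8. The sketch below is only a back-of-the-envelope estimate: “estimate_weight_memory_gb” and its “high_fraction” argument are hypothetical names, the share of weights kept at 8 bits is not stated in this article, and the result ignores quantization scales, the KV cache, and activations. Real usage will therefore be higher than these numbers.

```python
def estimate_weight_memory_gb(num_params, low_bits=4, high_bits=8, high_fraction=0.0):
    """Approximate weight storage in GB (10^9 bytes) for mixed quantization.

    high_fraction is the (assumed) share of weights kept at high_bits;
    the rest are stored at low_bits.
    """
    average_bits = high_fraction * high_bits + (1.0 - high_fraction) * low_bits
    return num_params * average_bits / 8 / 1e9


# Parameter counts taken from the article; everything at 4 bits is an assumption.
for label, params in [("3 billion", 3e9), ("117 billion", 117e9)]:
    estimate = estimate_weight_memory_gb(params)
    print(f"{label} parameters at 4 bits: about {estimate:.1f} GB of weights")
```

Raising “high_fraction” moves the estimate toward the 8-bit size, which is one reason the recommended memory sits above the pure 4-bit figure.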
MLX-LM Usage
First off we need to install mlx-lm.
pip install mlx-lm
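Since MLX targets Apple Silicon, it can help to confirm the environment before loading a model. This is a minimal sketch using only the standard library; “can_run_mlx_lm” is a hypothetical helper, and a Python interpreter running under Rosetta may report a different architecture than the hardware actually has.

```python
import importlib.util
import platform


def can_run_mlx_lm():
    """Return True when this looks like an Apple Silicon Mac with mlx-lm installed."""
    is_apple_silicon = platform.system() == "Darwin" and platform.machine() == "arm64"
    is_installed = importlib.util.find_spec("mlx_lm") is not None
    return is_apple_silicon and is_installed


print("Ready for MLX-LM:", can_run_mlx_lm())
```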
We can run the following code to perform inference. The output text will contain thinking tokens, so based on the model type, we split the text and remove them. The “generate()” function accepts a “verbose” argument; setting “VERBOSE” to “True” shows the tokens as they’re generated along with statistics like tokens per second.
# Licensed under Apache-2.0
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

MODEL_NAME = "EricFillion/smollm3-3b-mlx"
# MODEL_NAME = "EricFillion/gpt-oss-20b-mlx"
# MODEL_NAME = "EricFillion/gpt-oss-120b-mlx"
VERBOSE = False  # Set to True to print tokens as they're generated, plus speed statistics

# The model type decides which marker separates the thinking tokens from the answer
model_type = "smollm3" if MODEL_NAME == "EricFillion/smollm3-3b-mlx" else "gpt-oss"
# model_type = "gpt-oss"

model, tokenizer = load(MODEL_NAME)

prompt = "Explain NLP transformer models"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

sampler = make_sampler(temp=0.7, top_p=0.8, top_k=30)
text = generate(model, tokenizer, sampler=sampler, verbose=VERBOSE, max_tokens=2048, prompt=prompt)


def split_text(full_text, separator):
    # Keep the text after the first separator, or everything if it's absent
    parts = full_text.split(separator)
    if len(parts) > 1:
        result = parts[1].lstrip()
    else:
        result = full_text
    return result


if model_type == "smollm3":
    output_text = split_text(text, "</think>")
else:
    output_text = split_text(text, "<|channel|>final<|message|>")

print(output_text)
Output:
NLP Transformer Models: A Comprehensive Explanation
1. Core Concept of Transformers:
Transformers are a type of neural network architecture that excels in Natural Language Processing (NLP) tasks. Introduced in 2017 by Vaswani et al., they differ from Recurrent Neural Networks (RNNs) by leveraging self-attention mechanisms to capture long-range dependencies in sequential data. This allows transformers to process input sequences without relying on sequential order, improving contextual understanding and performance on NLP tasks.
2. Key Components of a Transformer:
- Encoder-Decoder Structure: The input sequence is processed by an encoder, and the output sequence is generated by a decoder…
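The thinking-token removal used in the script can be tried without loading a model. The sketch below repeats the same splitting logic on two made-up strings; the sample texts are illustrative assumptions that only mimic the two output formats, not real model output.

```python
def split_text(full_text, separator):
    """Return the text after the first separator, or the full text if it is absent."""
    parts = full_text.split(separator)
    if len(parts) > 1:
        return parts[1].lstrip()
    return full_text


# Hypothetical samples that mimic the two formats handled by the script
smollm3_sample = "<think>\nThe user wants a greeting.\n</think>\n\nHello there!"
gpt_oss_sample = (
    "<|channel|>analysis<|message|>The user wants a greeting.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>Hello there!"
)

print(split_text(smollm3_sample, "</think>"))
print(split_text(gpt_oss_sample, "<|channel|>final<|message|>"))
print(split_text("No reasoning here.", "</think>"))
```

When the separator is missing, for example because generation stopped at “max_tokens” before the model finished thinking, the function falls back to returning the full text.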
117 Billion Parameters at 79 Tokens per Second
We ran the script above with “VERBOSE” set to “True” on a MacBook Pro with an M4 Max chip using “EricFillion/gpt-oss-120b-mlx”. The performance is shown below. Notice that we generated 79 tokens per second.
Generation: 2048 tokens, 79.166 tokens-per-sec
Peak memory: 63.372 GB
3 Billion Parameters on a MacBook Air
It’s possible to run a 3-billion-parameter quantized model on modest hardware. Below is the result of running the script above with “EricFillion/smollm3-3b-mlx” on a MacBook Air with 16 GB of memory.
Generation: 2008 tokens, 35.518 tokens-per-sec
Peak memory: 2.613 GB
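To put the two results in perspective, throughput can be converted into wall-clock generation time by dividing the token count by the tokens-per-second figure. The sketch below does that arithmetic with the numbers reported above, giving roughly 26 seconds for the larger model and 57 seconds for the smaller one. “generation_seconds” is a hypothetical helper, and the estimate leaves out model loading and prompt processing.

```python
def generation_seconds(num_tokens, tokens_per_second):
    """Time spent generating, excluding model loading and prompt processing."""
    return num_tokens / tokens_per_second


# (tokens generated, tokens per second) as reported in the two runs above
runs = {
    "gpt-oss-120b on M4 Max": (2048, 79.166),
    "smollm3-3b on MacBook Air": (2008, 35.518),
}
for label, (tokens, rate) in runs.items():
    print(f"{label}: {generation_seconds(tokens, rate):.1f} s")
```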
Streaming
We can perform streaming by using MLX-LM’s “stream_generate()” function instead of its “generate()” function. This is great if you want to show live progress to your users.
# Licensed under Apache-2.0
from mlx_lm import load, stream_generate
from mlx_lm.sample_utils import make_sampler

MODEL_NAME = "EricFillion/smollm3-3b-mlx"
# MODEL_NAME = "EricFillion/gpt-oss-20b-mlx"
# MODEL_NAME = "EricFillion/gpt-oss-120b-mlx"

model, tokenizer = load(MODEL_NAME)

prompt = "Explain NLP transformer models"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True,
)

sampler = make_sampler(temp=0.7, top_p=0.8, top_k=30)

# Print each piece of text as soon as it's generated
for response in stream_generate(
    model,
    tokenizer,
    prompt,
    max_tokens=2048,
    sampler=sampler,
):
    print(response.text, end="", flush=True)
print()
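Streamed output includes the thinking tokens, because each piece is printed as soon as it arrives. One possible way to hide them while still streaming is to buffer text until the separator has been seen and only then start yielding. The sketch below is an assumption-laden illustration, not part of MLX-LM: “stream_after_separator” is a hypothetical helper, and it is exercised here with a hard-coded list of chunks rather than a real model.

```python
def stream_after_separator(chunks, separator):
    """Yield only the text that follows the first separator in a chunk stream."""
    buffer = ""      # text seen so far while still searching for the separator
    found = False    # True once the separator has been seen
    started = False  # True once the first non-whitespace answer text is emitted
    for chunk in chunks:
        if not found:
            buffer += chunk
            _, match, tail = buffer.partition(separator)
            if not match:
                continue
            found = True
            chunk = tail
        if not started:
            # Drop leading whitespace, like lstrip() in the non-streaming script
            chunk = chunk.lstrip()
            if not chunk:
                continue
            started = True
        yield chunk
    if not found and buffer:
        # No separator appeared, so fall back to the full text
        yield buffer


# Made-up chunks; note the separator is split across two of them
fake_chunks = ["<think>", "plan", "</th", "ink>", "\n\n", "Hello", " world"]
for piece in stream_after_separator(fake_chunks, "</think>"):
    print(piece, end="", flush=True)
print()
```

With a real model, the chunks could come from the “.text” of each item yielded by “stream_generate()”. The trade-off is that nothing is shown while the model is still thinking, and if the separator never appears the whole text is emitted only at the end.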
Conversational
Let’s edit the first script to make it conversational so you can chat back and forth with the model right from your terminal.
# Licensed under Apache-2.0
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

MODEL_NAME = "EricFillion/smollm3-3b-mlx"
# MODEL_NAME = "EricFillion/gpt-oss-20b-mlx"
# MODEL_NAME = "EricFillion/gpt-oss-120b-mlx"
TEMP = 0.7
TOP_P = 0.8
TOP_K = 30
VERBOSE = False

model_type = "smollm3" if MODEL_NAME == "EricFillion/smollm3-3b-mlx" else "gpt-oss"
# model_type = "gpt-oss"

model, tokenizer = load(MODEL_NAME)
sampler = make_sampler(temp=TEMP, top_p=TOP_P, top_k=TOP_K)

# The conversation history starts with the system prompt only
messages = [{"role": "system", "content": "You are a helpful chatbot"}]

print("Submit 'exit' to end\n\n")


def split_text(full_text, separator):
    # Keep the text after the first separator, or everything if it's absent
    parts = full_text.split(separator)
    if len(parts) > 1:
        result = parts[1].lstrip()
    else:
        result = full_text
    return result


while True:
    user_input = input("You: ")
    if user_input == "exit":
        print("\nEND.")
        break

    # Add the user's message, then generate a reply from the full history
    messages.append({"role": "user", "content": user_input})
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    text = generate(model, tokenizer, sampler=sampler, verbose=VERBOSE, max_tokens=2048, prompt=prompt)

    if model_type == "smollm3":
        output_text = split_text(text, "</think>")
    else:
        output_text = split_text(text, "<|channel|>final<|message|>")

    # Store only the final answer so thinking tokens don't fill the history
    messages.append({"role": "assistant", "content": output_text})
    print("Assistant:", output_text)
Here’s a conversation I had with “EricFillion/smollm3-3b-mlx”:
You: Hi
Assistant: Hello! How can I help you today?
You: Tell me a fun fact about pumpkins
Assistant: Here’s a fun fact about pumpkins: A single pumpkin can have up to 500 seeds! While some pumpkins may have fewer, the number varies depending on the type and size. That’s a lot of potential pumpkin pie toppings! 🎃
You: Another fact
Assistant: Absolutely! Here’s another fun pumpkin fact: Pumpkins are technically fruits, not vegetables. They contain seeds and grow from a flower, making them fall under the category of fruits in the botanical world.
You: exit
END.
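The model can answer “Another fact” sensibly because every turn is appended to the message history, and the whole history is turned into the next prompt. The sketch below isolates that bookkeeping. “chat_turn” and “echo_reply” are hypothetical names, and “echo_reply” is a stand-in for the real model call so the example runs without MLX-LM.

```python
def chat_turn(messages, user_input, reply_fn):
    """Append the user message, get a reply, append it, and return the reply."""
    messages.append({"role": "user", "content": user_input})
    reply = reply_fn(messages)
    messages.append({"role": "assistant", "content": reply})
    return reply


def echo_reply(messages):
    # Hypothetical stand-in for the model: reports how many messages it was given
    return f"I can see {len(messages)} messages so far."


messages = [{"role": "system", "content": "You are a helpful chatbot"}]
chat_turn(messages, "Tell me a fun fact about pumpkins", echo_reply)
chat_turn(messages, "Another fact", echo_reply)

for message in messages:
    print(message["role"], "->", message["content"])
```

Because the history grows with every turn, long conversations produce longer prompts, which take more time to process and more memory.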
Sampling Settings
You can adjust the sampling settings to make the model more or less creative. In some cases, like writing poetry, you may want the model to use low-probability tokens and so it’s okay to increase the temperature, top_p and top_k, provided that the text doesn’t become nonsensical. In other cases, like performing math, you may want to lower those parameters, but if you lower them too much the output will become plain and repetitive.
temp=0.9, top_p=0.9, top_k=50
You: write a short poem about a red bicycle
Assistant: A crimson wheel, a fiery ride,
Red bicycle spins through my mind.
On cobblestone, it screeches and wails,
Rushing through the night, a single trail…
temp=0.3, top_p=0.4, top_k=15
You: what’s 9820 + 3029
Assistant: To find the sum of 9820 and 3029, you simply add the two numbers together.
9820 + 3029 = 12849
So, the sum of 9820 and 3029 is 12849.
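To see why these settings change how adventurous the output is, here is a minimal plain-Python sketch of temperature, top_k and top_p applied to a toy set of four token scores. It is a generic illustration and not MLX-LM’s implementation: “filter_distribution” is a hypothetical helper, the logits are made up, and the order in which the filters are applied is an assumption.

```python
import math


def filter_distribution(logits, temp, top_p, top_k):
    """Return {token_index: probability} for the tokens that remain candidates."""
    # Temperature: dividing by a small temp sharpens the distribution
    scaled = [logit / temp for logit in logits]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    probs = [weight / total for weight in weights]

    # top_k: keep at most the k most likely tokens
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p
    kept = []
    cumulative = 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for four tokens
creative = filter_distribution(toy_logits, temp=0.9, top_p=0.9, top_k=50)
precise = filter_distribution(toy_logits, temp=0.3, top_p=0.4, top_k=15)
print("Candidates with the poem settings:", len(creative))
print("Candidates with the math settings:", len(precise))
```

With the lower settings, fewer tokens survive the filters, so the model has less room to pick an unlikely continuation.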
Conclusion
Apple’s MLX-LM package allows you to perform local inference on Macs with Apple Silicon. With it, you can run a 117-billion-parameter model right from your MacBook Pro. I hope this article provides some inspiration for your next AI project. Perhaps you’ll even present it at your local AI Tinkerers chapter. Keep on tinkering!
Apache-2.0 License
Copyright 2026 Eric Fillion
A copy of the license can be obtained from http://www.apache.org/licenses/LICENSE-2.0
How to Run Open-Source LLMs Locally on a Mac with MLX-LM