Local AI

Download a model once and run every rewrite entirely on your device. After the download, no API key or internet connection is required, and no text is sent to external servers.

Why use Local AI

Reason | Detail
Privacy | Your text never leaves your computer: not to ReWryte, not to any AI provider.
Offline use | Works without internet: on a plane, in a meeting room, anywhere.
Zero cost | No API keys, no tokens, no billing. Download once, use forever.
No rate limits | No requests-per-day (RPD) or requests-per-minute (RPM) caps. Rewrite as fast as your hardware allows.
Sensitive documents | Legal, medical, HR, financial: content that should not go to a third-party cloud.

Trade-off: Local models are slower than cloud on most machines, and the quality ceiling is lower than top-tier cloud models like GPT-4 or Claude Sonnet. For short-to-medium rewrites, the difference in quality is often imperceptible.

System detection

When you open Main Window → Local AI, ReWryte automatically scans your machine and shows you exactly which model fits your hardware — in real time, no manual configuration required.

What gets analyzed

Total RAM | Primary factor; larger models need more RAM to load
Available RAM | Prevents the model from competing with other apps
CPU cores | More cores = faster inference on non-Apple Silicon
Architecture | Apple Silicon (aarch64) enables GPU acceleration via Metal
Free disk space | Model files range from ~940 MB to ~4.4 GB

What you see on the page

Your system specs (chip, total RAM, available RAM, CPU cores, free disk)
The recommended tier for your machine
A "Recommended" badge on the model that matches your tier
Models you do not have enough RAM for are disabled with an explanation
Apple Silicon advantage: Macs with M1, M2, M3, or M4 chips show a "Fast GPU mode" badge. Apple Silicon uses unified memory shared between CPU and GPU — the model runs on the GPU via Apple Metal, giving significantly faster inference than CPU-only x86 machines with equivalent RAM.
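
The "disabled with an explanation" behavior can be pictured with a minimal Python sketch. This is not ReWryte's actual detection code: the `SystemSpecs` and `CatalogModel` types, the `availability` function, and the message wording are all hypothetical, and the free-disk check is an assumption (the page only says that models are disabled for insufficient RAM). The minimum-RAM and download figures are the ones this page lists for the 1.5B and 7B variants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemSpecs:
    """Hypothetical subset of the values shown on the system card."""
    total_ram_gb: float
    free_disk_gb: float


@dataclass(frozen=True)
class CatalogModel:
    """Hypothetical catalog entry; figures come from this page."""
    name: str
    download_gb: float
    min_ram_gb: float


def availability(model: CatalogModel, specs: SystemSpecs) -> tuple[bool, str]:
    """Return (enabled, explanation) for one catalog entry."""
    if specs.total_ram_gb < model.min_ram_gb:
        return False, (
            f"{model.name} needs at least {model.min_ram_gb:g} GB of RAM; "
            f"this machine has {specs.total_ram_gb:g} GB."
        )
    # Assumption: the download is also blocked when the file would not fit on disk.
    if specs.free_disk_gb < model.download_gb:
        return False, (
            f"{model.name} needs {model.download_gb:g} GB of free disk space; "
            f"only {specs.free_disk_gb:g} GB is available."
        )
    return True, ""


CATALOG = [
    CatalogModel("Qwen 2.5 1.5B", download_gb=0.94, min_ram_gb=3),
    CatalogModel("Qwen 2.5 7B", download_gb=4.4, min_ram_gb=10),
]

if __name__ == "__main__":
    laptop = SystemSpecs(total_ram_gb=8, free_disk_gb=50)
    for model in CATALOG:
        enabled, reason = availability(model, laptop)
        print(model.name, "enabled" if enabled else f"disabled: {reason}")
```

Returning the explanation alongside the flag keeps the reason next to the decision, which is what a disabled catalog entry needs to display.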

Tier assignment by RAM

Your RAM | Recommended tier | Model selected
Under 8 GB | Light | Qwen 2.5 1.5B (~940 MB)
8 GB – 15 GB | Light | Qwen 2.5 1.5B (~940 MB)
16 GB – 23 GB | Balanced | Qwen 2.5 3B (~1.9 GB)
24 GB or more | Quality | Qwen 2.5 7B (~4.4 GB)
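
The table above reduces to two thresholds. A minimal Python sketch of that mapping follows; the function name `recommended_tier` is hypothetical, and treating fractional values such as 15.5 GB as belonging to the lower tier is an assumption, since the table only lists whole-number ranges.

```python
def recommended_tier(total_ram_gb: float) -> tuple[str, str]:
    """Map total RAM to (tier, model) using the thresholds in the table."""
    if total_ram_gb >= 24:
        return ("Quality", "Qwen 2.5 7B")
    if total_ram_gb >= 16:
        return ("Balanced", "Qwen 2.5 3B")
    # Both "Under 8 GB" and "8 GB - 15 GB" map to the Light tier.
    return ("Light", "Qwen 2.5 1.5B")


if __name__ == "__main__":
    for ram in (4, 8, 16, 24, 64):
        tier, model = recommended_tier(ram)
        print(f"{ram} GB -> {tier} ({model})")
```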

The three Qwen 2.5 variants

ReWryte ships three variants of Qwen 2.5 Instruct, quantized to Q4_K_M format (4-bit quantization that balances quality and file size). All three run fully on-device via llama.cpp.

Model | Download | RAM (min / rec.) | Context | Best for
Qwen 2.5 1.5B (Light) | ~940 MB | 3 GB / 4 GB | 2,048 tokens | Any laptop, quick rewrites
Qwen 2.5 7B (Quality) | ~4.4 GB | 10 GB / 16 GB | 2,048 tokens | High-RAM Macs, best local quality
Q4_K_M quantization: Model weights are compressed from 16- or 32-bit floats to roughly 4 to 5 bits per weight using a "K-quant" method. This reduces file size and RAM requirement by roughly 3× compared to 16-bit weights (6–7× compared to 32-bit), with quality loss that is barely noticeable for rewriting tasks.
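
As a rough arithmetic check on the quantization note, the sketch below estimates file size from parameter count and average bits per weight. Two inputs are assumptions not stated on this page: the parameter counts (approximate figures for the Qwen 2.5 family) and an average of about 4.85 bits per weight for Q4_K_M. The average sits above 4 bits because K-quants keep some tensors at higher precision and store per-block scale data. The helper names are hypothetical.

```python
# Assumption: approximate average bits per weight for Q4_K_M files.
ASSUMED_Q4_K_M_BPW = 4.85

# Assumption: approximate parameter counts, in billions.
ASSUMED_PARAMS_BILLION = {
    "Qwen 2.5 1.5B": 1.54,
    "Qwen 2.5 3B": 3.09,
    "Qwen 2.5 7B": 7.62,
}


def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight-file size in decimal GB (billions of bytes)."""
    return params_billion * bits_per_weight / 8


def compression_ratio(source_bits: float, bits_per_weight: float) -> float:
    """How many times smaller the quantized weights are than the source format."""
    return source_bits / bits_per_weight


if __name__ == "__main__":
    for name, params in ASSUMED_PARAMS_BILLION.items():
        size = quantized_size_gb(params, ASSUMED_Q4_K_M_BPW)
        print(f"{name}: about {size:.2f} GB")
    for source_bits in (16, 32):
        ratio = compression_ratio(source_bits, ASSUMED_Q4_K_M_BPW)
        print(f"vs {source_bits}-bit weights: about {ratio:.1f}x smaller")
```

Under these assumptions the estimates land close to the download sizes listed on this page, which suggests the listed sizes are dominated by the quantized weights themselves.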

Download and activate

  1. Open Main Window → Local AI. ReWryte analyzes your system and displays the model catalog with a "Recommended" badge on the right model for your hardware.
  2. Read the system card at the top — it shows your chip type, total RAM, available RAM, CPU cores, free disk, and recommended tier. Confirm the recommended model fits your setup.
  3. Click Download on the model with the Recommended badge. A progress bar shows download percentage and MB in real time.
    • Qwen 2.5 1.5B: ~940 MB (~3–5 min on a typical connection)
    • Qwen 2.5 3B: ~1.9 GB (~6–10 min)
    • Qwen 2.5 7B: ~4.4 GB (~15–25 min)
  4. Once downloaded, click Activate. ReWryte automatically switches to local inference mode.
  5. Go to Main Window → Pick your AI to confirm the local model shows as "Active" with an "On your device" tag.
First inference is slower. The model loads into memory on the first rewrite request (~3–8 seconds). Subsequent rewrites are significantly faster as the model stays resident in memory.
You can have multiple models downloaded simultaneously. Only one can be active at a time. To switch between downloaded models: click Activate on another. To free disk space: click Delete — the catalog entry remains so you can re-download anytime.
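
To estimate download time on your own connection, divide file size by bandwidth. The Python sketch below does that arithmetic; `download_minutes` is a hypothetical helper, sizes are treated as decimal megabytes, and the 25 and 40 Mbps figures are illustrative assumptions chosen because they roughly bracket the time ranges quoted in step 3. Real downloads also depend on server throughput and network overhead.

```python
def download_minutes(size_mb: float, bandwidth_mbps: float) -> float:
    """Estimate download time in minutes.

    size_mb is in decimal megabytes; bandwidth_mbps is in megabits per second.
    """
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth_mbps must be positive")
    seconds = size_mb * 8 / bandwidth_mbps  # 8 bits per byte
    return seconds / 60


MODEL_SIZES_MB = {
    "Qwen 2.5 1.5B": 940,
    "Qwen 2.5 3B": 1900,
    "Qwen 2.5 7B": 4400,
}

if __name__ == "__main__":
    for name, size_mb in MODEL_SIZES_MB.items():
        fast = download_minutes(size_mb, 40)  # assumed faster connection
        slow = download_minutes(size_mb, 25)  # assumed slower connection
        print(f"{name}: about {fast:.0f} to {slow:.0f} min")
```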

Switching between cloud and local

You can toggle between cloud and local inference at any time — no restart needed.

Method 1 — Local AI page toggle

Main Window → Local AI → "Where rewrites run" toggle (On your device / In the cloud)

Method 2 — Pick your AI

Main Window → Pick your AI → click "Use this →" on any cloud provider (switches to cloud) or on a downloaded local model (switches to local)

Method 3 — Dashboard

Main Window → Dashboard → "Currently writing with" card → "Change" button → opens Pick your AI

Model comparison

How the Qwen 2.5 family compares to other local models for on-device text rewriting. All ratings use Q4_K_M quantization tested on Apple Silicon M2.

Model | Size (Q4) | RAM needed | Instructions | Multilingual
Llama 3.2 3B (Meta's small model, English-first) | ~2.0 GB | 6–8 GB | Good | Moderate
Llama 3.1 8B (Strong English quality) | ~4.7 GB | 10–16 GB | Very good | Moderate
Phi-3.5 Mini 3.8B (Microsoft's compact model) | ~2.3 GB | 6–8 GB | Good | Moderate
Gemma 2 2B (Google's smallest model) | ~1.4 GB | 4–6 GB | Fair | Limited
Mistral 7B v0.3 (General purpose, good baseline) | ~4.4 GB | 10–16 GB | Good | Moderate
Phi-4 Mini 3.8B (Microsoft, strong reasoning) | ~2.3 GB | 6–8 GB | Very good | Moderate

Blue-bordered rows are models used by ReWryte. Ratings reflect performance on tone-guided rewriting tasks, not general benchmarks.

Why Qwen 2.5 is our choice

We chose the Qwen 2.5 family after testing each option across all four built-in tones in all 10 supported output languages. Here is the reasoning:

Multilingual strength

Qwen 2.5 was trained on a much larger multilingual corpus than Llama or Phi. If you write in any language other than English, Qwen performs significantly better. This matters because ReWryte supports output in 10 languages.

Instruction following at small sizes

The 1.5B and 3B Qwen models follow tone instructions more reliably than comparable-sized Llama or Gemma models. For a rewriting app where the prompt IS the instruction, this consistency is the most important factor.

Efficiency above weight class

Qwen 2.5 achieves competitive quality at lower parameter counts. The 3B and 7B variants punch above their weight relative to file size and RAM usage compared to models of similar size from other families.

Quantization quality

Q4_K_M is one of the most quality-preserving 4-bit quantization formats available. Qwen 2.5's architecture retains more capability after quantization than older model families, meaning less quality loss for the same file size reduction.

When other models might outperform Qwen 2.5: For purely English creative or long-form writing, Llama 3.1 8B and Mistral 7B can produce more natural-sounding prose. For code-adjacent tasks, Phi-4 Mini is stronger. Qwen 2.5 is specifically the right tool for structured, tone-guided rewriting — which is exactly what ReWryte does.

Frequently asked questions

Does Local AI work offline?

Yes. Once the model is downloaded, ReWryte requires no internet connection for rewrites. The model file is stored in the app data directory on your device.

Can I use Local AI and cloud providers at the same time?

Yes. Cloud providers remain connected while you use local inference. If you toggle local off, ReWryte falls back to the active cloud provider instantly. Switch between them any time using Pick your AI or the Local AI page toggle.

How much disk space does the model take?

Qwen 2.5 1.5B is ~940 MB, 3B is ~1.9 GB, and 7B is ~4.4 GB. You can have multiple models downloaded simultaneously. Delete any model from Main Window → Local AI to reclaim disk space — the catalog entry remains so you can re-download anytime.

Is GPU acceleration required?

No. llama.cpp can run entirely on the CPU. However, Apple Silicon Macs use Metal GPU acceleration by default, providing roughly 3–5× faster inference compared to CPU-only mode. On an M1 Mac with 16 GB RAM, Qwen 2.5 3B typically responds in 1–3 seconds per rewrite.

Which model should I use on an Intel Mac?

CPU-only inference is inherently slower on Intel Macs. Use Qwen 2.5 1.5B for the fastest possible response. Expect 5–15 seconds per rewrite depending on RAM and the complexity of the text.

Can I download a model while using the app?

Yes. You can use cloud AI providers for rewrites while a local model is downloading in the background. The download appears as a progress bar on the Local AI page.

Can I resume a stopped download?

No. If a download is cancelled or fails, clicking Download restarts from the beginning. Ensure a stable connection before starting the 3B or 7B models.
