Build a Privacy-First Mobile Browser with Local AI (Kotlin + CoreML)

2026-02-25

Ship a privacy-first mobile browser with on-device AI for summarization and code completion using Kotlin and Core ML.

Build a privacy-first mobile browser with local AI (Kotlin + CoreML) — project walkthrough

You want a browser that respects privacy, ships intelligent features like summarization and code completion, and runs entirely on-device so nothing you browse ever leaves your phone. In 2026, with powerful NPUs on mobile chips and compact, quantized models available, that goal is realistic. This tutorial shows you how to build a production-minded, privacy-first mobile browser that runs a lightweight local AI model for summarization and code completion — using Kotlin on Android and a Core ML pipeline for iOS parity, inspired by Puma’s local-AI approach.

What you'll build — high level

By the end of this article you'll have:

  • A minimal, secure mobile browser UI with a WebView (Android) and instructions for a WKWebView equivalent (iOS)
  • An on-device AI inference path: Android runs an ONNX/ORT-based quantized model called from Kotlin; iOS runs a converted Core ML model
  • Feature examples: page summarization and in-page code completion (dev flow)
  • Production considerations: model conversion, quantization, encryption, performance tuning, update strategy

Why this matters in 2026

Edge AI matured rapidly in 2024–2026. Modern phones include dedicated NPUs (Apple Neural Engine, Qualcomm Hexagon/NPU, MediaTek APU) and frameworks — Core ML continues to get optimizations, and ONNX Runtime Mobile / NNAPI delegates make low-latency inference possible on Android. Meanwhile, users and regulators demand privacy-first apps that minimize cloud calls. That creates an ideal window to ship local AI-powered browsers that rival cloud-driven experiences in usefulness while keeping data private.

  • Widespread mobile NPUs: faster on-device inference and better energy efficiency.
  • Model quantization advances: reliable int8 / 4-bit pathways reducing memory and latency.
  • Tooling for conversion: robust workflows converting PyTorch/ONNX -> Core ML for iOS parity.
  • Privacy-first demand: users increasingly choose local-first products (inspired by apps like Puma).

Project architecture

Keep the architecture simple and modular:

  1. UI Layer: WebView (Android) or WKWebView (iOS) + lightweight controls for AI actions
  2. Extraction Layer: JavaScript bridge to grab page text and code blocks
  3. Prompt Engineering: compact prompts, token budget control, progressive summarization for long pages
  4. Inference Engine: ONNX Runtime Mobile or a native C++ runtime on Android; Core ML on iOS
  5. Storage & Security: encrypted local model files, optional model-store + signed updates

Step 0 — Prerequisites

  • Android Studio (2024.3+), Kotlin 1.9+
  • Xcode 15+ for iOS Core ML workflows (if you build iOS parity)
  • Python 3.10+ with packages: transformers, torch, onnx, onnxruntime, coremltools
  • ONNX Runtime Mobile AAR (for Android) or a prebuilt native runtime
  • Small quantized LLM (community models optimized for mobile)

Step 1 — Choose & prepare a mobile model

Pick a compact model that fits the memory and latency constraints of phones. In 2026, many community-maintained quantized models (4-bit / int8) are available. You can start with a model sized to run under ~2–4 GB of RAM for reasonable speed on modern high-end phones; lower-end devices will need smaller models.

Recommended pipeline:

  1. Start from a PyTorch or Hugging Face-compatible checkpoint.
  2. Convert to ONNX with opset 17 for broad runtime compatibility.
  3. Apply post-training quantization (onnxruntime.quantization) or GPTQ-style quantization where supported.
  4. Convert ONNX -> Core ML using coremltools for iOS.

Example: export PyTorch -> ONNX -> Core ML

Here’s an actionable Python script that shows the main pieces. Adapt it to your model's tokenizer/forward signature.

# export_to_coreml.py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType
import coremltools as ct

model_name = "my-small-llm"  # replace with chosen checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()
model.config.return_dict = False  # torch.onnx.export needs tensor outputs, not a ModelOutput

# Example: a single forward pass for shape inference
input_ids = tokenizer.encode("Hello world", return_tensors="pt")

# Export to ONNX with named inputs and dynamic batch/sequence axes
onnx_path = "model.onnx"
torch.onnx.export(
    model, (input_ids,), onnx_path,
    opset_version=17, do_constant_folding=True,
    input_names=["input_ids"], output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "sequence"}},
)

# Quantize ONNX (dynamic int8 quantization of weights)
quant_path = "model.quant.onnx"
quantize_dynamic(onnx_path, quant_path, weight_type=QuantType.QInt8)

# Convert to Core ML. Note: the ONNX converter was removed in coremltools 6+,
# so pin coremltools<6 for this path — or trace the PyTorch model and call
# ct.convert(...) directly in newer versions.
onnx_model = onnx.load(quant_path)
mlmodel = ct.converters.onnx.convert(onnx_model)
mlmodel.save("LocalLLM.mlpackage")

Notes:

  • Quantization choices affect accuracy vs. memory/latency. Test several (int8, 4-bit).
  • Some models require custom operator support; test early on device emulators.

Step 2 — Android: Kotlin app with WebView + ONNX runtime

App structure

  • MainActivity: hosts a WebView and an overlay bottom sheet for AI results
  • JS bridge: get page HTML/text and send to Kotlin
  • InferenceService: wraps ONNX Runtime session & manages tokenization

Extract page content from WebView — minimal Kotlin

// In your Activity
webView.settings.javaScriptEnabled = true

webView.addJavascriptInterface(object {
    @android.webkit.JavascriptInterface
    fun onContentExtracted(text: String) {
        runOnUiThread {
            // hand the page text to the inference pipeline
            InferenceService.shared.summarize(text) { summary ->
                // show the summary in the overlay UI
            }
        }
    }
}, "AndroidBridge")

// Push page text through the bridge. This is preferable to reading
// evaluateJavascript's return value, which is a size-limited JSON string.
webView.evaluateJavascript(
    "AndroidBridge.onContentExtracted(document.body.innerText);", null
)

ONNX Runtime invocation in Kotlin (conceptual)

Use ONNX Runtime Mobile AAR or a JNI wrapper. The sample below is a conceptual call flow — your runtime API may vary.

// Pseudocode / skeleton for inference — adapt to the ORT Java API version you ship
class InferenceService(private val context: Context) {
    companion object { lateinit var shared: InferenceService }

    private val env: OrtEnvironment = OrtEnvironment.getEnvironment()
    private lateinit var session: OrtSession
    // `tokenizer` is assumed to wrap your model's vocabulary (e.g. SentencePiece)

    fun initialize(modelPath: String) {
        // load the decrypted model file; prefer a memory-mapped buffer for large models
        session = env.createSession(modelPath)
    }

    fun summarize(text: String, callback: (String) -> Unit) {
        // 1) Tokenize
        val tokens: LongArray = tokenizer.encode(text)
        // 2) Create the input tensor (batch of 1) and run the session
        val inputTensor = OnnxTensor.createTensor(env, arrayOf(tokens))
        val output = session.run(mapOf("input_ids" to inputTensor))
        // 3) Decode and post-process
        val summary = tokenizer.decode(outputAsTokens(output))
        callback(summary)
    }
}

Practical tips:

  • Memory-map model files (mmap) to avoid copying large files into RAM.
  • Use NNAPI delegate where possible for device acceleration and lower power.
  • Batch smaller requests; for summarization use a progressive approach for very long pages.
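The memory-mapping tip is easiest to see in plain Python; the same idea applies on Android via Java's `FileChannel.map` or a runtime that reads from a mapped buffer. This is a sketch of the concept, not the ORT loading API:

```python
import mmap
import os
import tempfile

def map_model(path: str) -> mmap.mmap:
    """Memory-map a model file so the OS pages it in lazily instead of
    copying the whole file into the process heap up front."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        return mmap.mmap(fd, size, access=mmap.ACCESS_READ)
    finally:
        os.close(fd)  # the mapping holds its own reference to the file

# Demo with a stand-in "model" file (not a real ONNX model).
with tempfile.NamedTemporaryFile(delete=False, suffix=".onnx") as f:
    f.write(b"\x08\x01" + b"\x00" * 1024)
    fake_path = f.name

mm = map_model(fake_path)
print(len(mm), mm[:2])  # size of the mapped region and its first bytes
mm.close()
os.unlink(fake_path)
```

Pages of the mapping are faulted in on demand and can be evicted under memory pressure, which is exactly the behavior you want for multi-gigabyte model files.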

Step 3 — iOS parity with Core ML

To achieve parity on iOS, convert the same quantized ONNX model to a Core ML package (see the Python script above). On iOS, run it through the MLModel APIs with an MLModelConfiguration whose compute units allow the Apple Neural Engine.

Swift example — run Core ML model

// Swift: Core ML inference (simplified).
// LocalLLM / LocalLLMInput are the classes Xcode generates from the
// .mlpackage; input and output names depend on your converted model.
import CoreML

func runModel(inputIds: [Int]) throws -> [Int] {
    let model = try LocalLLM(configuration: MLModelConfiguration())
    let mlMultiArray = try MLMultiArray(shape: [NSNumber(value: inputIds.count)], dataType: .int32)
    for (i, v) in inputIds.enumerated() { mlMultiArray[i] = NSNumber(value: v) }

    let input = LocalLLMInput(input_ids: mlMultiArray)
    let out = try model.prediction(input: input)
    // decode output tokens back to text elsewhere
    return decodeOutput(out.output_tokens)
}

Notes:

  • Core ML packages (.mlpackage) can include metadata and custom processing layers.
  • Use MLShapedArray / MLFeature providers as needed for custom models.
  • Deploying Core ML on-device benefits from Apple's hardware acceleration by default.

Feature patterns: summarization & code completion

Summarization (page-level)

  1. Extract visible text and important metadata (title, headings).
  2. Chunk long content into manageable token-size pieces (e.g., 1k tokens).
  3. Run summarization per chunk, then compose a concise summary via a second pass.
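The three steps above can be sketched as a small pipeline. `summarize_chunk` is a placeholder for the on-device model call, and the chunker splits on whitespace as a crude token proxy; a real implementation would count tokens with the model's own tokenizer:

```python
from typing import Callable, List

def chunk_text(text: str, max_tokens: int = 1000) -> List[str]:
    """Split page text into roughly token-sized chunks (whitespace proxy)."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

def summarize_page(text: str,
                   summarize_chunk: Callable[[str], str],
                   max_tokens: int = 1000) -> str:
    """Two-pass summarization: summarize each chunk, then the summaries."""
    chunks = chunk_text(text, max_tokens)
    if len(chunks) <= 1:
        return summarize_chunk(text)
    partials = [summarize_chunk(c) for c in chunks]
    # Second pass composes the partial summaries into one concise summary.
    return summarize_chunk(" ".join(partials))

# Stand-in "model": keep the first five words of its input.
fake_model = lambda s: " ".join(s.split()[:5])
long_page = "word " * 2500
print(summarize_page(long_page, fake_model, max_tokens=1000))
```

The second pass keeps the final summary inside the model's context window no matter how long the page is, at the cost of one extra inference call.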

In-page code completion (developer workflow)

  1. Detect code blocks using simple heuristics (pre/code tags or language hints).
  2. Send the code block with a compact prompt: "Continue this code with context X".
  3. Stream tokens where possible to show live completions.

Prompt hygiene example (compact):

Prompt: "You are a local assistant. Continue this JavaScript function without network calls. Input:

<extracted code block>

Provide only the continuation, plus a one-line explanation if the changes are non-trivial."

Privacy, security, and UX considerations

Your app's privacy guarantees depend on implementation details. Here are concrete, actionable rules:

  • No network by default: design the inference stack to run offline. Any model-update flow must be explicit and opt-in.
  • Encrypt model files at rest: use platform keystore (Android Keystore / iOS Keychain) to protect model decryption keys.
  • Local-only logs: store logs locally or offer opt-in crash reports that scrub PII.
  • Consent & transparency: show a clear banner describing on-device AI, the model size, and update/rollback controls.
  • Fail-safe UI: if inference is slow or fails, gracefully provide a non-AI fallback (e.g., plain text reader mode).

Performance tuning & costs

Key knobs to tweak for production:

  • Quantization level: 4-bit vs 8-bit has different tradeoffs. Run A/B tests on real devices.
  • Delegate selection: prefer NNAPI / Core ML / Metal / Apple Neural Engine delegates for low power.
  • Model sharding: load only parts of a model when needed for code completion vs. summarization.
  • Session reuse: warm the runtime session when the app is foregrounded to reduce first inference latency.
  • Streaming: stream token generation to the UI for perceived speed — decode tokens incrementally.
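The streaming knob is worth sketching: rather than waiting for the full response, yield tokens as they are decoded and append them to the UI. Here a Python generator stands in for the runtime's incremental decode loop; `generate_step` and the stand-in model are illustrative names, not a real API:

```python
from typing import Callable, Iterator, Optional

def stream_tokens(generate_step: Callable[[], Optional[str]],
                  max_tokens: int = 64) -> Iterator[str]:
    """Yield decoded tokens one at a time until the model signals stop."""
    for _ in range(max_tokens):
        token = generate_step()   # one forward pass + decode in a real runtime
        if token is None:         # end-of-sequence
            return
        yield token

# Stand-in for the model: emits a fixed reply token by token.
reply = iter(["Local ", "AI ", "keeps ", "data ", "private."])
fake_step = lambda: next(reply, None)

ui_text = ""
for tok in stream_tokens(fake_step):
    ui_text += tok                # in the app: post each token to the UI thread
print(ui_text)                    # → "Local AI keeps data private."
```

Perceived latency drops to the time of the first token, which is usually a fraction of the full response time.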

Edge cases to plan for

  • Device memory pressure — gracefully unload models when backgrounded.
  • Tokenizers that use large vocab tables — consider prebuilt or compressed tokenizers (e.g., an optimized SentencePiece model).
  • Licensing — verify model licenses before embedding in a shipped app.
  • Accessibility — expose summarization results to screen readers and support text scaling.

Testing & metrics

Measure these KPIs during development:

  • First-token latency and full-response latency (ms)
  • Peak memory usage and sustained RAM while model loaded
  • Battery drain per inference minute
  • Summary quality: ROUGE/BLEU approximations + human evaluation
  • Privacy compliance tests and automated audits for outbound requests
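First-token and full-response latency can both be measured around a streaming decode loop. A minimal harness, with a stand-in token generator in place of real inference, might look like:

```python
import time

def measure_latency(token_stream):
    """Return (first_token_ms, total_ms, n_tokens) for a token iterator."""
    start = time.perf_counter()
    first_ms = None
    n = 0
    for _ in token_stream:
        if first_ms is None:
            first_ms = (time.perf_counter() - start) * 1000
        n += 1
    total_ms = (time.perf_counter() - start) * 1000
    return first_ms, total_ms, n

def fake_stream():
    for t in ["a", "b", "c"]:
        time.sleep(0.01)   # stand-in for per-token inference time
        yield t

first_ms, total_ms, n = measure_latency(fake_stream())
print(f"first token: {first_ms:.1f} ms, total: {total_ms:.1f} ms, tokens: {n}")
```

Run the same harness against the real runtime on each target device tier and track both numbers across model and quantization changes.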

Advanced strategies & future-proofing

1. Progressive summarization

Use multi-pass summarization: summarize chunks, then summarize the summaries. This reduces token pressure and works well for long-form pages.

2. Mixed precision & dynamic quantization

Use mixed precision where certain layers remain higher precision while others get aggressive quantization. This balances quality and performance.

3. Federated telemetry for model improvement (opt-in)

Offer opt-in federated learning or aggregated telemetry to improve prompt templates and heuristics without collecting raw browsing data.

4. KMM for shared logic

If you plan both Android and iOS apps, consider Kotlin Multiplatform Mobile (KMM) for shared business logic (prompt composition, chunking, cache policies). Keep runtime-specific code in platform layers.

Production checklist

  1. Verify model license for redistribution.
  2. Measure real-device performance on target devices (low, mid, high tiers).
  3. Encrypt models and implement signed update checks.
  4. Design transparent settings for users to manage models and opt-in telemetry.
  5. Prepare fallback flows and graceful degradation for low-memory devices.
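Item 3's update check has two halves: verifying the signature on the update manifest (use platform crypto APIs for that, e.g. an Ed25519 public key shipped with the app) and checking the downloaded model against the digest pinned in that manifest. The second half can be sketched in a few lines; `verify_model` is an illustrative name:

```python
import hashlib
import hmac

def verify_model(model_bytes: bytes, expected_sha256_hex: str) -> bool:
    """Check a downloaded model against the digest from a signed manifest."""
    digest = hashlib.sha256(model_bytes).hexdigest()
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(digest, expected_sha256_hex)

model = b"\x00" * 16          # stand-in model bytes
pinned = hashlib.sha256(model).hexdigest()
print(verify_model(model, pinned))         # → True
print(verify_model(model + b"!", pinned))  # → False
```

Reject and delete any model file that fails this check before the inference engine ever maps it.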

Case study: Inspired by Puma

Apps such as Puma demonstrated a consumer appetite for local-AI browsers on both iOS and Android — users value a responsive assistant that doesn't send their pages to the cloud. Use Puma-style choices as guidelines: small default models, easy model selection, offline-first UX, and clear metadata on model behavior. That approach reduces friction and increases trust.

Final thoughts & next steps

Building a privacy-first mobile browser with local AI is no longer an experiment — in 2026 it's a viable product strategy. The combination of compact quantized models, improved toolchains (ONNX -> Core ML), and mobile NPUs gives developers the ability to ship useful, private features like summarization and code completion entirely on-device.

Actionable next steps:

  1. Clone a minimal WebView browser template and add a JS bridge.
  2. Pick a compact model and test an ONNX Runtime Mobile inference end-to-end on a flagship device.
  3. Convert the same model to Core ML and test on an iPhone with real pages for parity checks.
  4. Iterate on prompt templates and quantization for the best quality vs. performance tradeoff.

“Local AI in browsers offers a unique combination of privacy and utility — the trick is choosing the right model and building robust inference and UX layers.”

Resources & tooling shorthand (2026)

  • Model conversion: transformers, onnx, coremltools
  • Android runtime: ONNX Runtime Mobile, NNAPI delegate
  • iOS runtime: Core ML (.mlpackage) with Neural Engine acceleration
  • Quantization: onnxruntime.quantization, GPTQ frameworks (where applicable)
  • Security: Android Keystore, Keychain, hardware-backed encryption

Call to Action

Ready to ship a local-AI browser? Start with a minimal WebView + tokenizer + quantized model pipeline, benchmark on real devices, and iterate. If you want, grab our reference repo (we maintain sample code and conversion scripts) to cut months off your rollout. Join our community to share model selection results and device benchmarks — together we can make private, intelligent browsing the default.
