Build a Privacy-First Mobile Browser with Local AI (Kotlin + CoreML)
Ship a privacy-first mobile browser with on-device AI for summarization and code completion using Kotlin and Core ML.
You want a browser that respects privacy, ships intelligent features like summarization and code completion, and runs entirely on-device, so nothing you browse ever leaves your phone. In 2026, with powerful NPUs in mobile chips and compact, quantized models widely available, that goal is realistic. This tutorial shows you how to build a production-minded, privacy-first mobile browser that runs a lightweight local AI model for summarization and code completion, using Kotlin on Android and a Core ML pipeline for iOS parity, inspired by Puma's local-AI approach.
What you'll build
By the end of this article you'll have:
- A minimal, secure mobile browser UI with a WebView (Android) and instructions for a WKWebView equivalent (iOS)
- An on-device AI inference path: Android runs an ONNX/ORT-based quantized model called from Kotlin; iOS runs a converted Core ML model
- Feature examples: page summarization and in-page code completion (dev flow)
- Production considerations: model conversion, quantization, encryption, performance tuning, update strategy
Why this matters in 2026
Edge AI matured rapidly in 2024–2026. Modern phones include dedicated NPUs (Apple Neural Engine, Qualcomm Hexagon/NPU, MediaTek APU) and frameworks — Core ML continues to get optimizations, and ONNX Runtime Mobile / NNAPI delegates make low-latency inference possible on Android. Meanwhile, users and regulators demand privacy-first apps that minimize cloud calls. That creates an ideal window to ship local AI-powered browsers that rival cloud-driven experiences in usefulness while keeping data private.
Key 2026 trends this project leverages
- Widespread mobile NPUs: faster on-device inference and better energy efficiency.
- Model quantization advances: reliable int8 / 4-bit pathways reducing memory and latency.
- Tooling for conversion: robust workflows converting PyTorch/ONNX -> Core ML for iOS parity.
- Privacy-first demand: users increasingly choose local-first products (inspired by apps like Puma).
Project architecture
Keep the architecture simple and modular:
- UI Layer: WebView (Android) or WKWebView (iOS) + lightweight controls for AI actions
- Extraction Layer: JavaScript bridge to grab page text and code blocks
- Prompt Engineering: compact prompts, token budget control, progressive summarization for long pages
- Inference Engine: ONNX Runtime Mobile or a native C++ runtime on Android; Core ML on iOS
- Storage & Security: encrypted local model files, optional model-store + signed updates
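The layer boundaries above can be sketched as small Kotlin interfaces. All names here are illustrative, not from any framework; the fake engine is just a stand-in so you can build the UI before the model is wired up:

```kotlin
// Extraction Layer: pulls readable text out of the current page.
interface ContentExtractor {
    fun extract(onResult: (String) -> Unit)
}

// Inference Engine: platform-specific behind this boundary
// (ONNX Runtime on Android, Core ML on iOS).
interface InferenceEngine {
    fun summarize(text: String): String
    fun completeCode(snippet: String): String
}

// A trivial fake engine, useful while developing the UI layer.
class FakeEngine : InferenceEngine {
    override fun summarize(text: String) =
        text.take(80) + if (text.length > 80) "…" else ""
    override fun completeCode(snippet: String) =
        "$snippet // TODO: model output"
}
```

Keeping the engine behind an interface also makes the iOS parity work later mostly a matter of swapping the platform implementation.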
Step 0 — Prerequisites
- Android Studio (2024.3+), Kotlin 1.9+
- Xcode 15+ for iOS Core ML workflows (if you build iOS parity)
- Python 3.10+ with packages: transformers, torch, onnx, onnxruntime, coremltools
- ONNX Runtime Mobile AAR (for Android) or a prebuilt native runtime
- Small quantized LLM (community models optimized for mobile)
Step 1 — Choose & prepare a mobile model
Pick a compact model that fits the memory and latency constraints of phones. In 2026, many community-maintained quantized models (4-bit / int8) are available. Start with a model whose working set fits in roughly 2–4 GB of RAM for reasonable speed on modern high-end phones; lower-end devices will need smaller models.
Recommended pipeline:
- Start from a PyTorch or Hugging Face-compatible checkpoint.
- Convert to ONNX with opset 17 for broad runtime compatibility.
- Apply post-training quantization (onnxruntime.quantization) or GPTQ-style quantization where supported.
- Convert ONNX -> Core ML using coremltools for iOS.
Example: export PyTorch -> ONNX -> Core ML
Here’s an actionable Python script that shows the main pieces. Adapt it to your model's tokenizer/forward signature.
# export_to_coreml.py
# Pipeline: PyTorch -> ONNX -> quantized ONNX -> Core ML.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType
import coremltools as ct

model_name = "my-small-llm"  # replace with your chosen checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

# Example input for tracing / shape inference
input_ids = tokenizer.encode("Hello world", return_tensors="pt")

# Export to ONNX: name inputs/outputs and mark batch/sequence axes dynamic
onnx_path = "model.onnx"
torch.onnx.export(
    model, (input_ids,), onnx_path,
    opset_version=17, do_constant_folding=True,
    input_names=["input_ids"], output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"}},
)

# Quantize ONNX weights to int8 (dynamic quantization)
quant_path = "model.quant.onnx"
quantize_dynamic(onnx_path, quant_path, weight_type=QuantType.QInt8)

# Convert to Core ML. Note: coremltools 6+ removed the ONNX converter, so
# either pin an older coremltools release for ct.converters.onnx.convert,
# or convert directly from a traced PyTorch model with ct.convert(...).
onnx_model = onnx.load(quant_path)
mlmodel = ct.converters.onnx.convert(onnx_model)
mlmodel.save("LocalLLM.mlpackage")
Notes:
- Quantization choices affect accuracy vs. memory/latency. Test several (int8, 4-bit).
- Some models require custom operator support; test early on emulators and real devices.
Step 2 — Android: Kotlin app with WebView + ONNX runtime
App structure
- MainActivity: hosts a WebView and an overlay bottom sheet for AI results
- JS bridge: get page HTML/text and send to Kotlin
- InferenceService: wraps ONNX Runtime session & manages tokenization
Extract page content from WebView — minimal Kotlin
// In your Activity: enable JS and register a bridge object
webView.settings.javaScriptEnabled = true
webView.addJavascriptInterface(object {
    @android.webkit.JavascriptInterface
    fun onContentExtracted(text: String) {
        runOnUiThread {
            // hand the page text to the inference pipeline
            InferenceService.shared.summarize(text) { summary ->
                // show the summary in the bottom sheet UI
            }
        }
    }
}, "AndroidBridge")

// Trigger extraction: push large content through the bridge rather than
// the evaluateJavascript callback, which JSON-escapes and can truncate
// big payloads.
webView.evaluateJavascript(
    "AndroidBridge.onContentExtracted(document.body.innerText);", null
)
ONNX Runtime invocation in Kotlin (conceptual)
Use ONNX Runtime Mobile AAR or a JNI wrapper. The sample below is a conceptual call flow — your runtime API may vary.
// Skeleton for inference with the ONNX Runtime Java/Android API.
// Exact calls (tokenizer, output handling) depend on your model.
class InferenceService(private val context: Context) {
    companion object { lateinit var shared: InferenceService }

    private val env = OrtEnvironment.getEnvironment()
    private lateinit var session: OrtSession

    fun initialize() {
        // Decrypt the model to an app-private path; ORT memory-maps the file.
        val options = OrtSession.SessionOptions().apply {
            addNnapi() // use the NNAPI delegate where the device supports it
        }
        session = env.createSession(
            context.filesDir.resolve("model.quant.onnx").path, options
        )
    }

    fun summarize(text: String, callback: (String) -> Unit) {
        // 1) Tokenize (the tokenizer is your own wrapper, not part of ORT)
        val tokens: LongArray = tokenizer.encode(text)
        // 2) Build the input tensor (shape [1, seq_len]) and run
        val input = OnnxTensor.createTensor(
            env, java.nio.LongBuffer.wrap(tokens), longArrayOf(1, tokens.size.toLong())
        )
        val output = session.run(mapOf("input_ids" to input))
        // 3) Decode tokens and post-process
        val summary = tokenizer.decode(outputAsTokens(output))
        callback(summary)
    }
}
Practical tips:
- Memory-map model files (mmap) to avoid copying large files into RAM.
- Use NNAPI delegate where possible for device acceleration and lower power.
- Batch smaller requests; for summarization use a progressive approach for very long pages.
Step 3 — iOS parity with Core ML
To achieve parity on iOS, convert the same quantized ONNX model to a Core ML package (see the Python script above). On iOS, load it through the MLModel APIs and let Core ML schedule work across the CPU, GPU, and Apple Neural Engine (configurable via MLModelConfiguration.computeUnits).
Swift example — run Core ML model
// Swift: Core ML inference (simplified)
import CoreML
func runModel(inputIds: [Int]) throws -> [Int] {
let model = try LocalLLM(configuration: MLModelConfiguration())
let mlMultiArray = try MLMultiArray(shape: [NSNumber(value: inputIds.count)], dataType: .int32)
for (i, v) in inputIds.enumerated() { mlMultiArray[i] = NSNumber(value: v) }
let input = LocalLLMInput(input_ids: mlMultiArray)
let out = try model.prediction(input: input)
// decode output tokens
return decodeOutput(out.output_tokens)
}
Notes:
- Core ML packages (.mlpackage) can include metadata and custom processing layers.
- Use MLShapedArray / MLFeature providers as needed for custom models.
- Deploying Core ML on-device benefits from Apple's hardware acceleration by default.
Feature patterns: summarization & code completion
Summarization (page-level)
- Extract visible text and important metadata (title, headings).
- Chunk long content into manageable token-size pieces (e.g., 1k tokens).
- Run summarization per chunk, then compose a concise summary via a second pass.
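The chunk-then-compose steps above can be sketched in a few lines. This sketch approximates tokens with whitespace-separated words; in production, use your real tokenizer's counts, and pass your model call in as the `summarize` lambda (both are assumptions here):

```kotlin
// Split text into chunks of at most `budget` pseudo-tokens (words).
fun chunkByBudget(text: String, budget: Int): List<String> {
    val words = text.split(Regex("\\s+")).filter { it.isNotBlank() }
    return words.chunked(budget).map { it.joinToString(" ") }
}

// Two-pass compose: summarize each chunk, then summarize the joined
// partial summaries into one concise result.
fun progressiveSummarize(
    text: String,
    budget: Int,
    summarize: (String) -> String
): String {
    val chunks = chunkByBudget(text, budget)
    if (chunks.size <= 1) return summarize(text)
    val partials = chunks.map(summarize)
    return summarize(partials.joinToString("\n"))
}
```

The same chunker is reusable for the progressive-summarization strategy discussed later in this article.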
In-page code completion (developer workflow)
- Detect code blocks using simple heuristics (pre/code tags or language hints).
- Send the code block with a compact prompt: "Continue this code with context X".
- Stream tokens where possible to show live completions.
Prompt hygiene example (compact):
Prompt: "You are a local assistant. Continue this JavaScript function without network calls. Input: <code block>. Provide only the continuation, with a one-line explanation if the changes are non-trivial."
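A small builder keeps prompts like this compact and within budget. The sketch below trims from the front of long snippets (recent context matters most for completion) and uses a word count as a rough token proxy; swap in your tokenizer for real counts:

```kotlin
// Build a compact code-completion prompt under a rough word budget.
// `maxWords` is a stand-in for a real token budget.
fun buildCompletionPrompt(
    code: String,
    language: String,
    maxWords: Int = 512
): String {
    // Keep the tail of long snippets: the most recent context matters most.
    val words = code.split(Regex("\\s+")).filter { it.isNotBlank() }
    val trimmed = words.takeLast(maxWords).joinToString(" ")
    return buildString {
        appendLine("You are a local assistant. Continue this $language code without network calls.")
        appendLine("Input:")
        appendLine(trimmed)
        append("Provide only the continuation.")
    }
}
```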
Privacy, security, and UX considerations
Your app's privacy guarantees depend on implementation details. Here are concrete, actionable rules:
- No network by default: design the inference stack to run offline. Any model-update flow must be explicit and opt-in.
- Encrypt model files at rest: use platform keystore (Android Keystore / iOS Keychain) to protect model decryption keys.
- Local-only logs: store logs locally or offer opt-in crash reports that scrub PII.
- Consent & transparency: show a clear banner describing on-device AI, the model size, and update/rollback controls.
- Fail-safe UI: if inference is slow or fails, gracefully provide a non-AI fallback (e.g., plain text reader mode).
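As a minimal sketch of at-rest encryption, the standard JVM AES-GCM APIs are enough to seal and open model files. On Android, generate and hold the key in the Android Keystore ("AndroidKeyStore" provider) so it stays hardware-backed; the plain in-memory key here is only for brevity:

```kotlin
import java.security.SecureRandom
import javax.crypto.Cipher
import javax.crypto.KeyGenerator
import javax.crypto.SecretKey
import javax.crypto.spec.GCMParameterSpec

// Encrypt model bytes with AES-GCM; returns (iv, ciphertext).
fun encrypt(plain: ByteArray, key: SecretKey): Pair<ByteArray, ByteArray> {
    val iv = ByteArray(12).also { SecureRandom().nextBytes(it) } // fresh IV per file
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE, key, GCMParameterSpec(128, iv))
    return iv to cipher.doFinal(plain)
}

// Decrypt; GCM also authenticates, so tampered files throw here.
fun decrypt(iv: ByteArray, sealed: ByteArray, key: SecretKey): ByteArray {
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.DECRYPT_MODE, key, GCMParameterSpec(128, iv))
    return cipher.doFinal(sealed)
}
```

On Android you can also reach for androidx.security's EncryptedFile, which wraps the same pattern with Keystore-managed keys.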
Performance tuning & costs
Key knobs to tweak for production:
- Quantization level: 4-bit vs 8-bit has different tradeoffs. Run A/B tests on real devices.
- Delegate selection: prefer NNAPI / Core ML / Metal / Apple Neural Engine delegates for low power.
- Model sharding: load only parts of a model when needed for code completion vs. summarization.
- Session reuse: warm the runtime session when the app is foregrounded to reduce first inference latency.
- Streaming: stream token generation to the UI for perceived speed — decode tokens incrementally.
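The streaming knob above amounts to emitting the newly decoded suffix after each generated token. A minimal sketch, where `decode` stands in for your tokenizer's detokenizer (a hypothetical signature) and `onPartial` is your UI callback:

```kotlin
// Emit decoded text incrementally as token ids arrive, so the UI can
// render partial output instead of waiting for the full response.
fun streamTokens(
    tokenIds: Sequence<Int>,
    decode: (List<Int>) -> String,
    onPartial: (String) -> Unit
): String {
    val seen = mutableListOf<Int>()
    var last = ""
    for (id in tokenIds) {
        seen += id
        // Re-decode the whole prefix: some tokenizers merge tokens, so
        // decoding one id at a time can produce wrong text.
        val text = decode(seen)
        if (text.length > last.length) {
            onPartial(text.substring(last.length)) // only the new suffix
        }
        last = text
    }
    return last
}
```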
Edge cases to plan for
- Device memory pressure — gracefully unload models when backgrounded.
- Tokenizers that use large vocab tables — consider compact or compressed tokenizers (e.g., optimized SentencePiece models).
- Licensing — verify model licenses before embedding in a shipped app.
- Accessibility — expose summarization results to screen readers and support text scaling.
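The memory-pressure case above is easiest to handle with a holder that lazily loads and can drop the model. A sketch (the loader lambda is hypothetical); on Android, call release() from ComponentCallbacks2.onTrimMemory when the trim level reaches TRIM_MEMORY_UI_HIDDEN or worse:

```kotlin
// Lazily load a model and allow it to be dropped under memory pressure;
// the next get() transparently reloads it.
class ModelHolder<T : Any>(private val loader: () -> T) {
    private var model: T? = null

    fun get(): T = model ?: loader().also { model = it }

    fun release() { model = null } // let GC reclaim weights; reload on demand

    val isLoaded: Boolean get() = model != null
}
```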
Testing & metrics
Measure these KPIs during development:
- First-token latency and full-response latency (ms)
- Peak memory usage and sustained RAM while model loaded
- Battery drain per inference minute
- Summary quality: ROUGE/BLEU approximations + human evaluation
- Privacy compliance tests and automated audits for outbound requests
Advanced strategies & future-proofing
1. Progressive summarization
Use multi-pass summarization: summarize chunks, then summarize the summaries. This reduces token pressure and works well for long-form pages.
2. Mixed precision & dynamic quantization
Use mixed precision where certain layers remain higher precision while others get aggressive quantization. This balances quality and performance.
3. Federated telemetry for model improvement (opt-in)
Offer opt-in federated learning or aggregated telemetry to improve prompt templates and heuristics without collecting raw browsing data.
4. KMM for shared logic
If you plan both Android and iOS apps, consider Kotlin Multiplatform Mobile (KMM) for shared business logic (prompt composition, chunking, cache policies). Keep runtime-specific code in platform layers.
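In a KMM layout, the shared logic can be as simple as a common object for prompt policy plus an interface the platform layers implement. Names below are illustrative:

```kotlin
// commonMain-style shared logic: prompt composition and chunk policy live
// here; tokenizers and runtimes stay platform-specific behind an interface.
object PromptPolicy {
    const val SUMMARY_CHUNK_TOKENS = 1024

    fun summaryPrompt(pageTitle: String, chunk: String): String =
        "Summarize this section of \"$pageTitle\" in 3 sentences:\n$chunk"
}

// Implemented per platform: ONNX Runtime on Android, Core ML on iOS.
interface PlatformRuntime {
    fun generate(prompt: String, maxTokens: Int): String
}
```

Centralizing prompts this way also means a wording fix ships to both platforms at once, instead of drifting apart in two codebases.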
Production checklist
- Verify model license for redistribution.
- Measure real-device performance on target devices (low, mid, high tiers).
- Encrypt models and implement signed update checks.
- Design transparent settings for users to manage models and opt-in telemetry.
- Prepare fallback flows and graceful degradation for low-memory devices.
Case study: Inspired by Puma
Apps such as Puma demonstrated a consumer appetite for local-AI browsers on both iOS and Android — users value a responsive assistant that doesn't send their pages to the cloud. Use Puma-style choices as guidelines: small default models, easy model selection, offline-first UX, and clear metadata on model behavior. That approach reduces friction and increases trust.
Final thoughts & next steps
Building a privacy-first mobile browser with local AI is no longer an experiment — in 2026 it's a viable product strategy. The combination of compact quantized models, improved toolchains (ONNX -> Core ML), and mobile NPUs gives developers the ability to ship useful, private features like summarization and code completion entirely on-device.
Actionable next steps:
- Clone a minimal WebView browser template and add a JS bridge.
- Pick a compact model and test an ONNX Runtime Mobile inference end-to-end on a flagship device.
- Convert the same model to Core ML and test on an iPhone with real pages for parity checks.
- Iterate on prompt templates and quantization for the best quality vs. performance tradeoff.
“Local AI in browsers offers a unique combination of privacy and utility — the trick is choosing the right model and building robust inference and UX layers.”
Resources & tooling shorthand (2026)
- Model conversion: transformers, onnx, coremltools
- Android runtime: ONNX Runtime Mobile, NNAPI delegate
- iOS runtime: Core ML (.mlpackage) with Neural Engine acceleration
- Quantization: onnxruntime.quantization, GPTQ frameworks (where applicable)
- Security: Android Keystore, Keychain, hardware-backed encryption
Call to Action
Ready to ship a local-AI browser? Start with a minimal WebView + tokenizer + quantized model pipeline, benchmark on real devices, and iterate. If you want, grab our reference repo (we maintain sample code and conversion scripts) to cut months off your rollout. Join our community to share model selection results and device benchmarks — together we can make private, intelligent browsing the default.