Diary | Local AI Experiment

Momo: Building a Local First Voice Layer for My Mac

A personal builder diary on creating a local first Mac dictation and voice coaching prototype with Codex, Claude Code, WhisperKit, LM Studio, and local learning.

Experimental note. Momo is an alpha stage local experiment for learning through building. It is not a public product launch, not a download offer, and not a claim that the idea is complete.

SpeakKeep localLearn by doingEvolve into a private assistant

I keep experimenting. But this is not experimentation for its own sake. We are living through a time when artificial intelligence is evolving so quickly that no one can fully know in advance where it will settle.

That raises a practical question for me: how do I keep myself upskilled in such a stage? Training matters. Reading, courses, talks, and demos all help. But I have also realized that training can remain passive unless I force myself to build, test, and use these systems in real situations.

So I am trying to build more: with Codex, Claude Code, local agents, local models, and small product experiments that touch my own daily work. While building, I am also trying to deduce the world around these tools: what is real, what is fragile, what is powerful, what is risky, and what may become normal.

Momo is one such effort: part product experiment, part coding diary, and part learning journey. The honest way to understand this shift is to build, test, fail, improve, and keep evolving.

Build Model

The coding journey and the product journey move together.

Momo is not only an app screen. It is a working loop where I learn by building, test the app in real use, observe the hard parts, and keep adding what the system teaches me.

Coding Journey

Ask

A product question

Can I keep voice local and still make it useful in daily work?

Build

Agent assisted coding

Codex implements, Claude Code reviews, and I keep product judgment in the loop.

Use

Mac reality test

Hotkeys, audio, latency, paste behavior, and permissions reveal what actually works.

Learn

Diary and backlog

Each round becomes a note, a screenshot, a fix, or the next product decision.

Product Pipeline

Transcribe

Voice becomes accurate local text.

Insert

Text moves into the active app safely.

Act

Voice can request small actions with review.

Assist

A refined local agent helps with context, memory, and execution.

Path To A Personal Agent

First transcription. Then action. Then a refined assistant.

The long term direction is not to jump directly into a grand agent. The product has to earn trust one layer at a time. Reliable dictation comes first. Small reviewable actions come next. Only after that should it become a more capable personal assistant. Agent tools already exist in the ecosystem, but the real product challenge is making this feel refined, private, explainable, and useful in daily work.

Transcribe

Voice becomes accurate local text.

Insert

Text moves into the active app safely.

Act

Voice can request small actions with review.

Assist

A refined local agent helps with context, memory, and execution.

Learning Through Doing

In this era, possibility is becoming practical.

The real lesson is not that every experiment will become a product. The lesson is that active building creates a different kind of learning. It lets me move from passive awareness to practical judgment.

AI world shifts

The future is not fully knowable.

Passive training

Useful, but incomplete by itself.

Build with tools

Codex, Claude Code, local agents, local models.

Use in real work

Latency, privacy, UI, workflow, and trust become visible.

Deduce the world

Understanding comes from doing, not watching alone.

Training gives language

Courses, talks, and reading help create vocabulary, but they can remain passive if I do not use the tools myself.

Building reveals reality

When I build with Codex, Claude Code, local agents, and local models, I see the actual limits, tradeoffs, and possibilities.

Testing creates judgment

Real use shows what breaks: latency, UI clarity, permissions, model quality, and whether the idea is genuinely useful.

Iteration builds understanding

The world is moving fast, so the learning loop must also keep moving: build, observe, question, improve, and repeat.

Why Momo

What if my voice did not have to leave my machine?

Modern dictation tools are powerful. Once you get used to speaking and seeing text appear where you are typing, typing starts to feel slower.

But voice is personal. It carries words, tone, hesitation, emotion, names, habits, and context. If a dictation tool becomes part of daily work, it starts to see a very intimate layer of how one thinks.

Can I build a Mac native, local first voice assistant that lets me speak instead of type, while keeping audio and learning on my own machine?

Why Local Matters

Voice is not just input. It is intimate data.

Voice can reveal much more than the words spoken. It can carry identity-like cues, rhythm, mood, hesitation, vocabulary, names, habits, and the work context around a person. That is why I keep coming back to the local first question.

Online tools are useful, but for a personal voice layer I want to understand what can run offline or locally. The goal is not fear of technology. The goal is a more thoughtful balance: use powerful AI, but keep sensitive personal signals close to the user wherever possible.

Voice

Words
Names
Tone
Hesitation
Rhythm
Context
Accent
Habits

Local first

What It Does Today

A simple loop, with many local systems underneath.

01Hold a shortcut
02Speak
03Release
04Local transcription
05Local cleanup and polish
06Text insertion

Momo also has a dashboard for dictation history, dictionary rules, local model status, latency metrics, insertion diagnostics, and early voice coaching insights. The app is not only trying to convert speech into text. It is slowly becoming a personal writing and speaking companion.

Animated Loop

The interesting part is what happens between speaking and seeing text.

The loop below is intentionally simple, but it shows why this is an engineering project and not only a UI experiment. Each stage has to work locally, report its state, and recover when something goes wrong.

Record

Capture starts only when I hold the shortcut.

Transcribe

WhisperKit turns local audio into text.

Polish

Deterministic cleanup and local model polish improve readability.

Paste

The result is inserted into the active app through a controlled route.

Learn

Dictionary rules, snippets, and insights stay visible and local.

Build History

From a small MVP to a fuller local voice system.

The project did not begin as a complete product. It began as a working loop and then kept absorbing real problems: reliability, latency, interface clarity, iconography, diagnostics, and learning.

MVP

Make the voice loop work

The first milestone was deliberately small: hold a shortcut, record audio, run local transcription, clean the text, and insert it into the active app.

Reliability

Stabilize the Mac behavior

The project then moved into the less glamorous but essential work: hotkey reliability, app bundle identity, Accessibility permissions, and paste behavior across different apps.

Experience

Improve the visible product

The floating recording state, waveform feedback, menu bar behavior, dashboard layout, and iconography started becoming part of the product judgment, not decoration.

Intelligence

Add local learning and polish

Dictionary rules, snippets, deterministic cleanup, local LLM polish, and voice coaching ideas turned the prototype from transcription into a personal writing layer.

Observability

Expose what is happening

Momo Doctor, Developer diagnostics, latency by stage, model health, insertion route, and local logs became important because local first systems need to explain themselves.

Local First Architecture

No audio leaves the Mac.

The current design keeps speech to text, polish, learning, and diagnostics local. That constraint shaped almost every architectural choice.

Shortcut

Push-to-talk trigger

Audio

AVAudioEngine capture

WhisperKit

Local speech to text

Text Engine

Cleanup, dictionary, snippets

Local LLM

LM Studio and Qwen polish

Paste

Clipboard insertion

Learning

Local records and rules

Dashboard

Diagnostics and insight

Engineering Insights

The hard learning is in the invisible edges.

The screenshots show a polished alpha surface, but the real education came from the hidden edges: latency, permissions, vocabulary, paste behavior, and trust. These are the places where a voice idea becomes a usable system.

Latency budget

Focus: The product must feel immediate even when many local stages are running.
Design choice: Expose stage timing in the Developer view: capture, STT, polish, paste, and system load.
Learning: A fast model is not enough. The user experiences the whole path, including permissions, app focus, warmup, and paste behavior.

Waveform feedback

Focus: Without feedback, a local voice app feels uncertain because the user cannot see whether audio is being heard.
Design choice: Treat the waveform as product infrastructure, not decoration. It should confirm listening without overwhelming the writing flow.
Learning: Small visual signals create trust. They also reveal microphone or permission problems earlier.

Dictionary and corrections

Focus: Names, acronyms, and personal vocabulary need a memory layer, but silent rewriting is risky.
Design choice: Keep learned phrases visible in the Hub so corrections can be reviewed, pinned, edited, or deleted.
Learning: Local learning should be explainable and reversible. Otherwise the assistant becomes mysterious.

Paste route

Focus: Every target app behaves differently, so insertion is a real engineering problem.
Design choice: Start with clipboard based paste insertion and record route status instead of pretending every app will behave the same.
Learning: The final centimeter into the active app can decide whether the whole product feels reliable.

AI Coding Workflow

Codex as engineering cockpit, Claude Code as reviewer.

This is where the experiment became more than a dictation app. It became a way to learn the new building process itself: how to brief AI tools, how to review their work, how to keep product judgment in human hands, and how to improve by testing in real situations.

I described the product behavior I wanted.
Codex inspected the codebase, implemented changes, and ran checks.
I tested the app in real usage across writing, browsers, terminals, and daily workflows.
Claude Code acted as a second-opinion reviewer on architecture, UI, and edge cases.
The loop repeated: product judgment, engineering, review, testing, and iteration.

Engineering Diary

The difficult parts became the real education.

The visible app is only one part of the story. The deeper learning came from seeing where a simple idea becomes difficult in practice: latency, recording feedback, UI trust, and product identity.

Latency became a product problem

The visible action is simple, but the hidden path includes audio flush, model warmup, local STT, cleanup, optional local polish, paste insertion, and diagnostics. Each stage needs measurement.

The waveform changed the feeling

A voice product must show that it is listening, but without becoming noisy. Waveform feedback helped convert an invisible audio process into a more legible interaction.

UI moved from utility to trust

The dashboard became a trust surface: history, dictionary, health checks, model status, latency, and insertion outcomes help the user understand the local system.

Iconography gave the app identity

Even for an experiment, icons matter. They help separate recording, learning, diagnostics, local model status, and developer tools without making the interface wordy.

Hotkey reliability

Global shortcuts on macOS are not as simple as they look. Momo moved toward a battle-tested shortcut library and a proper app bundle identity.

Audio capture

The capture path uses the hardware-native input format first, then converts separately to a 16 kHz mono WAV segment for WhisperKit.

Transcription quality

The practical default is large-v3-turbo: good enough quality with local latency that can still fit into a daily workflow.

Latency

A simple user action hides many stages: audio flush, speech to text, cleanup, optional local polish, and insertion into the active app.

App insertion

Different apps behave differently. V1 uses clipboard paste insertion, records outcomes, and avoids keystroke-by-keystroke injection.

Learning carefully

Dictionary rules and learning suggestions need to stay visible, reviewable, reversible, and local.

Data Layer

The learning store is part of the product.

Momo is not only a microphone and a model. Once the app starts remembering dictionary entries, corrections, snippets, session history, and diagnostics, the data layer becomes a design decision. A simple local JSON-style record is useful for early learning, but a more structured database can become important as the product matures.

Lightweight local records

The early learning layer can begin with simple local records because the goal is to test behavior before overbuilding infrastructure.

Why structure matters

As sessions, dictionary entries, snippets, corrections, and diagnostics grow, the data model becomes part of the product, not only storage.

Possible next move

A future version can move from simple JSON style records to a more structured local database or Postgres-backed workflow if the product needs richer querying.

Trust Layer

The dashboard matters because local systems need observability.

A local AI product must help the user understand local state: model availability, permissions, target app behavior, audio health, latency, and insertion route.

That is why Momo has a Developer view, Momo Doctor, diagnostics export, and local insight cards. A magical product is good, but a trustworthy product must explain itself when something goes wrong.

Future Roadmap

From dictation toward a private personal agent.

Many people are now building small, personal AI systems around their own workflows. That is the direction that interests me most: tools that remain grounded in the user, respect privacy, and grow from daily use.

Phase 1

Voice to text

Make local dictation dependable: hold a shortcut, speak, transcribe locally, polish locally, and insert text where the cursor is active.

Phase 2

Voice to action

Let spoken intent trigger small, reviewable actions such as opening a note, preparing a message, searching local context, or routing a task.

Phase 3

Personal agent

Add memory, preferences, workflow context, and coaching so the assistant understands how I work without becoming careless with private data.

Phase 4

Refined assistant layer

Move toward a polished local assistant that can coordinate tools, explain its actions, and stay under human control.

The long term imagination is a Jarvis style local assistant: not a dramatic science-fiction promise, but a practical companion that helps with speech, writing, search, context, learning, and everyday execution.

Cleaner first run setup for local models
Better speech to text benchmarking
A more structured settings experience
Audio input health coaching
Stronger dictionary learning
Local memory
Deeper voice coaching
Optional oMLX experiments for longer context local workflows
Eventually, a full local assistant mode

Effort Ledger

AI made building faster, but not effortless.

One important part of this story is the effort behind the build: prompts, reviews, tests, failed paths, debugging, rebuilds, and many rounds of judgment. AI reduced the distance between idea and prototype, but the project still required attention.

Human role: Product owner, tester, reviewer, taste-maker
AI coding loop: Codex for implementation and checks; Claude Code for second-opinion review
Iteration style: MVP first, then reliability, UI, diagnostics, learning, and latency work
Token usage: Substantial, though I have not yet reconstructed exact usage from session logs
Why track it: To understand the real cost of AI assisted building, not only the visible result

Actual Screenshots

The current alpha already has a visible product language.

These screenshots show the current local alpha. They are included as a record of the build, not as a launch promise. Any future screenshot added here should continue to avoid private transcripts, personal documents, secrets, and internal logs.

Developer pipeline screen from the Momo local first voice experiment — **Developer pipeline**The local control room: microphone, speech model, local polish, paste route, latency, and system health.

Console screen from the Momo local first voice experiment — **Console**The central push to talk surface, built around a calm local voice agent interaction.

Settings and Momo Doctor screen from the Momo local first voice experiment — **Settings and Momo Doctor**A setup checklist for the app lane, microphone, Accessibility, WhisperKit, local polish, and first dictation.

Voice insights screen from the Momo local first voice experiment — **Voice insights**Saved local sessions can become a private coaching layer for pace, energy, vocabulary, and speaking patterns.

Hub and dictionary screen from the Momo local first voice experiment — **Hub and dictionary**The learning layer keeps phrases, names, and corrections visible rather than silently rewriting the user's voice.

This page is a record of exploration. It reflects a belief that the future will belong not only to those who consume technology, but also to those who keep learning, keep questioning, and keep building with responsibility.

Follow the Momo build on GitHub