Andrew Brown

Vociferous v7.0

Andrew Brown — Thu, 28 May 2026 13:31:08 GMT

A year ago, I started building a tool to solve a specific problem: I think faster than I type, and I was losing ideas. The solution I arrived at was a local, offline, privacy-first speech-to-text application that would transcribe my voice in real time, refine the output through a configurable AI pipeline, and store everything in a database that belonged entirely to me. No cloud. No subscription. No one else’s servers processing my words.

This dream was called Vociferous.

This week, after 1,200 hours of development, version 7.0 shipped. It is the release I set out to build.

What changed since v6.0

Version 6.0 shipped two months ago at roughly the 600-hour mark. The architecture was sound, the core pipeline was reliable, and the refinement engine — which takes raw transcription output and restructures it according to AI-driven custom prompts — was functional. What version 6.0 could not solve was the hardware requirement. Vociferous ran local AI models on a local GPU. Without capable hardware, the experience degraded substantially, which meant the tool I had built for myself was inaccessible to most people who asked about it.

Version 7.0 resolves this by introducing Groq and LM Studio as first-class external providers for both transcription and refinement. Groq is free to use, takes minutes to configure, and requires nothing beyond an internet connection. Local inference remains the default for users who want complete offline privacy — the original architecture is unchanged for them. But the hardware ceiling is gone.

Between version 6.0 and 7.0, eight intermediate releases shipped. What follows is a summary organized by what actually matters.

Platform

Version 6.1 brought full Windows support: a native installer handling Python detection, virtual environment creation, CUDA DLL chain validation, frontend build, model provisioning, and Desktop shortcut creation. CUDA validation was a genuine engineering problem — driver presence does not guarantee that CTranslate2 can use the GPU, so the installer now validates the specific DLL chain the runtime actually requires rather than assuming the driver is sufficient.

Refinement Engine

The refinement subsystem received its most substantial overhaul since its introduction. Custom prompts are now persisted and selectable. Session-level prompt assignment — choosing how a transcript will be refined before recording begins — shipped in version 7.0. A Smart Refinement gate skips refinement for transcripts that don’t warrant it (can be turned off), scoring across filler density, punctuation, capitalization, sentence structure, and word repetition. Bulk refinement accepts a prompt picker. CPU thread count auto-scales at import time based on detected core count. SLM stall handling was introduced so that a locked inference engine fails gracefully rather than blocking the application.

Durability

Prior to version 7.0, a program crash during recording meant a serious risk of data loss. Audio is now promoted into an encrypted vault at configurable intervals, persisted to disk, and surfaced as a recoverable session the next time the application starts. This was the most architecturally demanding feature of the release and the one I am most satisfied with.

Architecture

Version 6.7 was an internal refactor release — no behavior changes, 740 tests passing — focused entirely on Single Responsibility and explicit module boundaries. The transcription post-processing pipeline, the refinement provider package, the insight helpers, and the database models were each extracted from their respective monolithic homes into focused submodules. A full test suite audit replaced hollow coverage with assertions against actual behavioral contracts. The application is meaningfully more maintainable than it was at version 6.0, and the test suite now proves behavior rather than simply executing code paths.

Analytics

Version 7.0 ships a full usage analytics dashboard: word volume, estimated time saved, speech speed, activity heatmaps, filler reduction rates, readability improvements, provider and model distribution, and AI-generated insight paragraphs that update as usage crosses configurable thresholds. Every transcript now records complete processing provenance — which provider, model, runtime, compute type, and prompt produced it, along with token counts for refined output.

Editor

The editing surface was rebuilt on TipTap, with Markdown rendering enabled by default. Users are now able to harness the power of inline editing and formatting buttons arranged on a traditional toolbar.

What real use looks like

On my secondary Windows machine over the past week: 19,804 words captured across 62 transcriptions, 59 minutes saved versus manual typing, 116 words per minute average speech rate, and 10 hours and 50 minutes of estimated editing time saved by the refinement feature alone. 61 transcripts were refined, 197 filler words were removed, and average reading level dropped from 9.2 raw to 8.6 after refinement.

My Linux daily driver is sitting at over 200,000 words captured and 108 hours reclaimed — 98 of those from refinement.

This article was produced by dictating into Vociferous! Minimal editing was required since my custom prompts respect my voice’s autonomy.

What this year has actually taught me

I want to be honest about this, because the technical summary above does not capture the more significant part of what happened.

I built Vociferous because I needed it. The motivation was entirely practical. What I did not expect was what 1,200 hours of sustained, deliberate software development would reveal about how software actually works — and about how I work.

I learned what it means to make an architectural decision and then live inside its consequences across hundreds of subsequent commits. I learned the difference between code that passes tests and code that is genuinely trustworthy. I learned — through building, not through studying — what security-conscious development looks like in practice: mapping your own attack surface, working with people actively trying to break your work, and treating every external input as a potential vector. I learned that the distance between a program and an application is composed almost entirely of unglamorous work that no one sees.

I also learned that I am capable of finishing something difficult. That knowledge is not a small thing to come away with.

Version 7.0 is feature complete. I have no major additions planned — only bug fixes and quality-of-life improvements as they surface. The application does what I set out to make it do. Reaching that point in the same twelve months that I competed in the National Cyber League three times, built organizational infrastructure for two student clubs, earned three certifications, and watched people I taught begin to teach others — that is not coincidental. These things were conversing with each other the entire time, building off shared energy. I’m immensely satisfied with the result and what I have learned since I began.

Vociferous is open-source and free. GitHub link below. If you want the longer account of how it was built and what I learned building it, that is what Speaking into Existence is for.

🔗 GitHub: https://github.com/WanderingAstronomer/Vociferous

All about me: https://wanderingastronomer.github.io/

Speaking into Existence #30

Andrew Brown — Wed, 27 May 2026 22:17:47 GMT

The v7.0.1 hotfix is my favorite kind of small release. There was an insidious bug to be fixed!

This particular superficial bug was that some reasoning-capable models, especially Qwen variants through LM Studio, could consume the completion budget on hidden reasoning and return little or no visible answer. There were related provider quirks on Groq as well. The tempting response would have been to keep papering over each model family with one more branch.

That would have been cowardly! It was time to solve the problem by targeting the source.

The Boolean Had Been Lying for Too Long

The old use_thinking toggle was doing too many jobs badly. It was standing in for whether the task wanted hidden reasoning, how much visible output budget should remain, whether the provider should suppress or expose thought, and what kind of response shape the caller actually needed.

A single boolean carrying all of that meaning was not a justified simplification. It was an ambiguity bomb waiting to level my metaphorical city.

The hotfix finally corrected the abstraction itself. Generation requests became explicit about task kind, visible output budget, reasoning policy, and response shape. Providers translated from that shared contract into their own peculiar controls. LM Studio got the empty assistant prefill path when needed. Groq got family-specific reasoning controls instead of generic misuse. Analytics and title generation stopped pretending they were just ordinary chat completions with a smaller token count.

🛠️ Tech Check
Why was an empty prefill useful?
Some reasoning-tuned models will eagerly spend tokens in an internal-thinking mode before producing visible text. Prefilling an empty thinking block and continuing the assistant turn can coerce certain model stacks into moving past hidden-reasoning ceremony and actually returning the final visible answer the application requested.

This Was About Application Sovereignty

The important part is not the provider-specific trick. The important part is that the application stopped letting providers define the meaning of a request.

The product knows whether it wants a final answer only, a visible JSON envelope, a short title, or a richer analytical response. That intent has to exist at the application layer first. Providers should translate it, not reinterpret it.

Once that becomes the standard, reasoning support stops being a novelty checkbox and becomes another bounded capability the product can either use or suppress according to its own needs.

That is the right ending for this arc because it is the most distilled expression of the series’ architectural theme: the system gets better every time it names its contracts more clearly than convenience would prefer.

💡 Apt Architecture
The fix for provider weirdness is rarely “one more special case.” It is usually a clearer product contract that makes the special cases subordinate to application intent rather than the other way around.

Vociferous is an open-source, local-first speech-to-text workspace. It runs fully on your hardware by default, with optional external providers when you want them.

GitHub: WanderingAstronomer/Vociferous

Speaking into Existence #29

Andrew Brown — Wed, 27 May 2026 22:13:46 GMT

There is a point in a software project where the exciting work becomes the least important work—and that’s totally fine!

Version 7 was that point for me.

The headline features felt real enough: external transcription providers, external refinement providers, durable audio vaults, provenance, Markdown editing, bulk prompt workflows, analytics, command bars, architecture docs, a rebuilt README. Oh! And a shiny new Wiki.

But what actually made the release feel serious was the density of boring work orbiting those features. And I swear on every good and evil god it was so boring.

Hardening Was the Real Story

Runtime readiness checks. Diagnostics redaction. Foreign-key repair paths. Linux and Windows CUDA alignment. Logging behavior. Search highlighting edge cases. Export pagination. Filename sanitization. Mutable intent payload cleanup. Frontend guards against stale responses. Recovery paths for interrupted data. Contract tests. Focused integration tests. Installer metadata. Model provisioning clarity.

That is release-grade work, and it was tiring.

To be fair, I think it is called boring mostly by people who have not yet had a product violently betray them in production.

The v7.0.0 changelog is valuable not merely because it lists a lot of work, but because the work reveals a different standard of seriousness. The app is no longer being improved as a promising tool. It is being hardened as a system expected to survive contact with users, hardware, providers, package managers, race conditions, and long-running sessions full of nontrivial data.

💡 Apt Architecture
Release-grade software is distinguished less by the novelty of its features than by the thoroughness of its refusal to fail in stupid ways.

This Is Where Trust Gets Earned

Any product can impress during a happy-path demo. What users remember is whether the tool still behaves coherently when the environment is messy: the model server is down, the key is missing, the GPU is only half configured, the recording was interrupted, the search query includes hostile punctuation, the export set is large, the prompt was edited elsewhere, or the async response arrives late.

The v7 line spent a huge amount of effort on exactly those conditions. That is why it matters.

You do not earn trust by promising that those failures are rare. You earn it by proving that the system remains intelligible when they happen anyway. Think of it like going bungee jumping into a dark cavern and getting stranded. Just before you go SPLAT on the ground, your cord saves you, but it breaks in the process. Thankfully, there’s someone already down there for situations just like this: they are your guide out of this hole. They take you by the hand and lead you through a well-lit tunnel back to the surface. You may want to try to go bungee jumping again, after all: and I will not comment on either the mental fortitude or insanity required to make that decision. But, you get the idea. The aforementioned efforts establish the presence of our guide out of said cavern.

Vociferous is an open-source, local-first speech-to-text workspace. It runs fully on your hardware by default, with optional external providers when you want them.

GitHub: WanderingAstronomer/Vociferous

Speaking into Existence #28

Andrew Brown — Wed, 27 May 2026 22:01:41 GMT

There is a lazy version of provider support where the architecture simply gives up.

You wire in the external APIs, let each one dictate its own assumptions, bolt a few conditional branches around the weird parts, and tell yourself you have gained flexibility. What you have actually gained is a system whose behavior is now partly owned by vendors with incompatible contracts.

I wanted the other version—the one that worked more reliably.

The Question Was Never “Local or External?”

One of my brain-dump transcript sessions on reasoning support asked the right question: is reasoning even worth supporting, and if so, where? It also states the more important principle even more plainly: I do not care about seeing the reasoning. I care about the result. This is going to be true for 99% of people.

That is exactly the mindset that made external provider support viable without becoming conceptually sloppy or a betrayal of my prior core values.

By v6.6.0, transcription and refinement could run locally or through OpenAI-compatible providers like LM Studio and Groq. But the product intent remained the same. Produce a refined result. Respect user prompts. Surface runtime truth. Handle failures sanely. Do not punch holes through the command model. Do not let provider peculiarities leak into the basic meaning of the feature.

💡 Apt Architecture
Supporting external providers does not require surrendering architectural control. It requires defining the product’s own contract clearly enough that providers become interchangeable runtimes rather than sources of meaning or objects of reliance.

Smart Refinement and Provider Settings Were Part of the Same Maturity Curve

Once multiple providers existed, the settings surface had to stop pretending every runtime behaved like local CTranslate2. Provider-specific cards, key management, connectivity checks, timeouts, base URLs, model IDs, capability declarations, and readiness reporting all became necessary.

So did restraint. Smart Refinement mattered more in a provider-flexible world because unnecessary calls now had clearer performance and cost implications. Prompt pickers for bulk refinement and session-level prompts mattered more because runtime variability increased the value of explicit user intent.

The system was becoming less of a single clever stack and more of a disciplined orchestration layer over several possible engines.

Provenance Prevented Provider Support from Becoming Vague

The transcript asking to store provider, model, prompt, and timing context was exactly right. Once the app can use multiple providers, the record of which provider produced a result stops being interesting trivia and starts being analytical truth, especially when there are so many different models out there now.

That is why provenance and provider support belong together. The minute the app can choose among local and external paths, every meaningful output should be able to say where it came from at the very least.

Vociferous is an open-source, local-first speech-to-text workspace. It runs fully on your hardware by default, with optional external providers when you want them.

GitHub: WanderingAstronomer/Vociferous

Speaking into Existence #27

Andrew Brown — Wed, 27 May 2026 21:56:25 GMT

Software lies most often through implication.

It shows a control that looks active but does not reflect current state. It hides the dangerous action inside an innocent container. It duplicates logic across views until one screen behaves slightly differently and no one notices until trust has already leaked away. It allows the same kind of action to appear in three different visual grammars and calls that flexibility instead of drift.

That was the UI problem Vociferous had to outgrow, and quickly; I am an impatient master.

Action Bars Were the First Obvious Symptom

The March transcript complaining about overflowing action bars was not really about wrapping buttons. It was about scattered ownership.

If every view invents its own button strip, every view eventually invents its own failure mode as well. Compression breaks differently. Privilege logic diverges. Overflow becomes inconsistent. State-dependent actions appear or disappear according to whatever that one screen happened to implement.

The eventual answer in v6.5.4 and v7.0.0 was not “tweak more CSS.” It was command bars, command menus, and command nodes: one local model for action visibility, grouping, overflow, and nested commands.

That is the recurring pattern in useful front-end cleanup. The fix is rarely another conditional class name. The fix is reclaiming ownership from duplicated surface logic.

💡 Apt Architecture
Interface inconsistency is often an ownership bug wearing visual clothes. When the same action pattern is implemented separately across views, drift is not an accident. It is the expected outcome.

Settings Needed Structure, Not Rearrangement

There is a transcript fragment where the right answer becomes embarrassingly obvious in real time: as more settings arrive, reorganization alone will not save the surface. Tabs or pages are needed, because otherwise every new feature means reshuffling the same crowded space again.

That instinct was right. The settings surface kept being split and clarified through v6.5.4, v6.6.0, and v7.0.0: diagnostics, safety, export, confirmation, recording, provider, maintenance. The point was not visual sophistication. The point was to stop forcing unrelated consequences into one generic settings landfill.

The same maturity appeared in editing. v6.5.5 and v7.0.0 moved the project toward a dedicated Markdown editing surface, protected prompt hygiene, explicit retitle flows, and routing that no longer leaked users into stale inline editors.

The UI Needed to Tell the Truth About State

Prompt synchronization bugs, tag-search staleness, false-ready runtime indicators, restart-required false positives, and stale async overwrites all belong in the same family.

Each is a case where the system technically had the right answer somewhere, but the surface the user was actually touching failed to represent it. The frontend hardening across v6.5.x through v7.0.0 was largely a campaign against that mismatch. Search results refreshed when tags changed. Runtime-aware comparisons reduced fake restart warnings. Event consumers guarded against stale responses overwriting current state.

None of this makes for romantic release notes. It does make for a UI that stops gaslighting the user! Which is always a lovely bonus.

Vociferous is an open-source, local-first speech-to-text workspace. It runs fully on your hardware by default, with optional external providers when you want them.

GitHub: WanderingAstronomer/Vociferous

Speaking into Existence #26

Andrew Brown — Wed, 27 May 2026 21:51:06 GMT

My transcript from May 9 says the problem without euphemism: if I record for thirty minutes or an hour and lose power, there is currently no recovery path.

That sentence should terrify any software handling user-created speech. I was scared, and I made the goddamn thing! What was I thinking?!

Durability Had to Become a First-Class Concern

Once you accept that recorded speech may be irreplaceable, the design questions change. The issue is no longer merely how to store the final transcript. It is how to survive interruption before the transcript exists at all.

That is what pushed the system toward incremental persistence, audio durability settings, recovery flows, and eventually the v7 audio vault. Recording could no longer be treated as a temporary prelude to the real data. The audio itself had become part of the durable record.

By v7.0.0, recoverable recordings were surfaced after interrupted sessions, re-transcription could use cached audio, and durable vault assets could be promoted intentionally. That entire branch of work is simply the architecture acknowledging what the product had always implied: speech records matter before they are convenient.

💡 Apt Architecture
If a workflow creates irreplaceable data before the final save point, durability has to begin before the final save point too. Anything else is just hoping the crash happens after the part you designed for.

Append Semantics Were Quietly a Data-Integrity Issue

The append and continue work revealed a related mistake.

Earlier implementations treated the visible aggregate as though it were the only truth that mattered. In practice that meant source segments could be deleted after append, which is an astonishingly bad instinct once you say it aloud. The transcript notes caught the concern before the implementation was fully corrected: perhaps compound transcription needed its own system tag; perhaps the old and new material should both remain visible truths instead of one erasing the other.

v6.4.1 fixed it properly. Source transcript rows survived as compound children with compound_root_id and compound_order. Default queries hid those children from ordinary list totals so analytics did not double count them. The visible root was still the aggregate, but the historical segments were no longer sacrificed to keep the UI simple.

That is the adult version of append.

🛠️ Tech Check
What is a compound transcript in this context?
It is an aggregate transcript built from multiple recorded segments while still preserving the individual source rows underneath. The user gets one visible combined result, but the system retains segment history instead of pretending the earlier pieces never existed.

Provenance Completed the Same Moral Arc

Once durability and append semantics were taken seriously, provenance became the natural next step.

v6.6.1 and v7.0.0 expanded transcript records to include transcription provider, model, prompt, resolved device, refinement metadata, re-transcription history, thinking mode, and token counts. A transcript was no longer just text plus timestamp. It became a speech record with lineage.

That is not bureaucratic excess. It is how the system earns the right to answer basic questions later: where did this result come from, how was it processed, what changed, and can I trust the statistics derived from it?

Durability, append integrity, and provenance are really the same lesson stated three ways: never throw away reality just because a thinner abstraction is easier to render.

Vociferous is an open-source, local-first speech-to-text workspace. It runs fully on your hardware by default, with optional external providers when you want them.

GitHub: WanderingAstronomer/Vociferous

Speaking into Existence #25

Andrew Brown — Wed, 27 May 2026 21:49:18 GMT

One of the recurring lies in desktop software is the cheerful status surface.

The app smiles and tells you everything is fine because some configuration exists, some driver is installed, or some toggle is turned on. Meanwhile the actual runtime conditions required for the feature to function are missing.

I have grown to despise that style of deception due to the fact that:

Linux + NVIDIA GPU = Dependency Hell.

Driver Detected Is Not the Same as Runtime Ready

The CUDA work made this painfully concrete.

Users do not care that an NVIDIA driver exists in the abstract. They care whether CTranslate2 can actually use the GPU on this machine, in this environment, with this packaging setup, right now. Those are very different claims.

The transcript notes about CUDA 13.x and CPU-priority architecture already showed the pressure: do not make CUDA a hard dependency, but also do not leave users guessing about what the application can and cannot do. v6.3.0 and v6.5.3 finally enforced that principle properly. Runtime detection inspected actual DLL chains and runtime libraries. Windows installer guidance stopped conflating driver presence with usable acceleration. Linux verification landed too.

The machine had to stop flattering the user with approximate truth.

🛠️ Tech Check
Why was CUDA detection more complicated than “is there an NVIDIA GPU?”
Because GPU acceleration depends on more than hardware. The app needed the correct CUDA runtime libraries, in the correct environment, visible to the inference stack actually being used. A detected GPU without a usable runtime is only a promising rumor; you gotta have the rest of the goods to get up and running.

Diagnostics Are Part of the Product

The same lesson applied to microphones, runtime status, and provider readiness.

AudioService.detect_microphone() stopped returning vague success or failure and started surfacing actual detail: device name, host API, channels, sample rate, 16 kHz support, human-readable explanation. /api/engine/status became a canonical truth surface instead of an improvisation. Runtime summaries began exposing resolved models, devices, package context, rate-limit hints, key availability, and eventually the translated provider request policy itself.

This is not glamorous engineering, which is precisely why it is so often skipped. But the moment software interacts with hardware, packaging, providers, threads, or user data, diagnosis becomes part of the user experience whether you want it to be or not.

Bad diagnostics force the user into superstition—good diagnostics let them reason!

Supportability Is a Design Decision

v6.4.2 added startup support snapshots and richer timing diagnostics. That was an architectural maturity signal.

If a system only explains itself after you have already attached a debugger and begun reading source, then for practical purposes it does not explain itself at all. Persistent logs capturing runtime context, settings, resolved models, platform details, and timing behavior are not support fluff. They are the difference between postmortem analysis and guesswork.

The product became more trustworthy precisely because it became less coy about its own conditions.

Vociferous is an open-source, local-first speech-to-text workspace. It runs fully on your hardware by default, with optional external providers when you want them.

GitHub: WanderingAstronomer/Vociferous

Speaking into Existence #24

Andrew Brown — Wed, 27 May 2026 21:44:17 GMT

The coordinator problem never really goes away. In fact, it just became much more expensive to ignore.

By the time Vociferous entered the v6.2.x through v6.7.0 stretch, the project had accumulated enough successful architecture that a subtler danger appeared: too much useful logic was still collecting near the center.

The software was not collapsing, it was merely becoming the sort of system where one good file can grow into a bad institution.

The Split Was About Ownership, Not Tidiness

v6.2.2 split system.py into domain modules. config.py, models.py, and window.py were extracted, and handler registration moved to decorator-based self-declaration. That kind of refactor can look cosmetic if you describe it lazily. It was not.

The real improvement was that features stopped needing to touch more files than their own weight justified. New intent bindings no longer required spelunking through a giant registration switchboard. Domain routes no longer had to pretend they all belonged in one civic monolith because history had put them there.

Later passes kept pressing in the same direction. v6.5.2 pulled lifecycle and runtime plumbing out of the coordinator without pretending the composition root should disappear. v6.7.0 extracted transcription submodules, provider packages, refinement helpers, insight helpers, and database models. _transcribe_and_store stopped being a god-method and became named stages.

None of this was performed in service of abstract purity. It was in service of reading the code without feeling trapped inside someone else’s accumulation.

💡 Apt Architecture
Good decomposition is not about making the file tree prettier. It is about restoring local ownership so that adding one feature does not require ceremonial contact with half the system.

Architecture Also Needed a Public Memory

One of the most important additions in v6.7.0 was not executable at all: ARCHITECTURE.md.

Composition root. H-pattern. Threading model. Persistence contract. Invariants. Things that had previously been scattered across instruction files, habits, and tribal (if you can’t count one dude as a tribe) knowledge were finally written down in one canonical place.

That matters for human maintainers. It matters even more once AI is part of the toolchain. If the architecture only exists as the shape of code plus a few long conversations, then every future editor has to rediscover it indirectly. That is wasteful for humans and dangerous for models.

Backward Compatibility Was the Constraint That Made the Refactor Honest

The v6.7.0 sweep also preserved import compatibility through __all__ re-exports. That is important because it distinguishes a serious refactor from a self-indulgent one.

Breaking everything to achieve local neatness is easy. Splitting responsibilities while preserving external contracts is harder, and therefore more meaningful. The project was not being rewritten for the aesthetic thrill of movement. It was being decomposed under the constraint that working surfaces should remain working.

That is how you know the refactor was for the system rather than for the engineer’s self-image.

Vociferous is an open-source, local-first speech-to-text workspace. It runs fully on your hardware by default, with optional external providers when you want them.

GitHub: WanderingAstronomer/Vociferous

Speaking into Existence #23

Andrew Brown — Wed, 27 May 2026 21:42:10 GMT

I still think the daily-driver test is the only metric that matters first.

If you build a personal tool and then subtly stop using it, the project has answered its own question. Whatever pain you thought you were solving was either not acute enough or not solved honestly enough to survive ordinary life.

Vociferous kept passing that test. It keeps passing that test. In fact, every article in this series was written in nearly a single pass through the use of Vociferous! I jump in for five minutes post-game to add my own thoughts (just like I am doing right now), but that’s about it.

The Real Return Was Never Just the App

The superficial result is obvious. I built a speech-to-text workspace I use constantly. It records, transcribes, refines, stores, and increasingly survives the kinds of failure that make ordinary software feel careless.

The deeper return is harder to summarize and more valuable.

I left the project with sharper instincts about ownership, failure, provider boundaries, documentation drift, runtime diagnostics, mutation paths, prompt policy, and the difference between software that works in a demo and software that can be trusted after hours of real use.

That is the part people routinely undervalue when they argue that you should never build a thing the market already sells. The artifact matters. The judgment you acquire while building it often matters more.

💡 Apt Architecture
Building a tool deeply enough changes the builder. The visible product is only part of the payoff. The more durable return is the judgment you carry into every later architecture decision.

The Market Question Is Still the Wrong First Question

Yes, commercial dictation products exist. Platform speech APIs exist. Hosted providers exist. Subscriptions exist. Browser tabs full of rented convenience exist.

That was never the whole point.

Privacy mattered. Control mattered. Understanding mattered. Being able to open the stack and repair it myself mattered. Being able to keep sensitive dictated text on my machine mattered. Finances? Project ideas? Diary? My novel? All 100% safe and sound here on my machines. Being able to test, instrument, and refactor the system according to my own priorities mattered.

Those priorities do not become irrational merely because a SaaS vendor also has a landing page.

The More Useful Closing Thought

What I would amend now is only the old finality.

I once wrote that the application worked, the architecture was documented, the tests passed, and that this was a complete result. It was a truthful sentence and still incomplete. Software like this is never complete in the absolute sense. It becomes trustworthy in layers.

The later releases proved that. There was still decomposition work to do. Still durability to harden. Still provider contracts to clean up. Still diagnostics to tell the truth. Still UI surfaces to stop lying by implication. Still recovery paths to add before the product had really earned the quiet confidence I wanted from it.

So here is the better ending for this phase: the tool justified itself, and then it demanded to be taken more seriously. The last few chapters will prove I needed to expand my horizons by changing the size of the window I was looking out of.

Vociferous is an open-source, local-first speech-to-text workspace. It runs fully on your hardware by default, with optional external providers when you want them.

GitHub: WanderingAstronomer/Vociferous

Speaking into Existence #22

Andrew Brown — Wed, 27 May 2026 21:36:59 GMT

Hindsight improved after version 6 because the project kept going long enough to invalidate some of my cleaner earlier conclusions.

That is useful. Regret that survives contact with more evidence is often just architecture trying to teach you something.

I Would Not Have Declared the Core Product “Complete” So Early

I understood why I said it. The central loop was good. The daily-driver test had been passed. The project was public. The docs were no longer embarrassing. Version 6 felt like a legitimate product milestone.

But the phrase core complete hid an enormous amount of unfinished reality. The app still lacked durable audio recovery, explicit provenance, a coherent provider contract, mature diagnostics, and a settings surface honest enough to expose the consequences of all the new runtime flexibility that was coming.

The lesson is not “never celebrate.” The lesson is that a product handling user-created data is less complete than it feels until it can explain, recover, inspect, and survive failure cleanly.

I Would Have Treated Durability as a Day-One Requirement

The audio persistence transcript from May says the problem plainly: if a thirty-minute recording dies in a crash or power event, and the system has no recovery path, the tool has betrayed the user.

That should have been framed that bluntly earlier.

AudioSpool was the beginning of this lesson, not the end. By v7.0.0 the app had moved further into durable audio vaults, interrupted-session recovery, re-transcription from cached audio, and configurable durability settings. That entire branch of work exists because user data turned out to be more precious than the early architecture had fully admitted.

If I were rebuilding the project from scratch, durability would arrive before a surprising amount of the cleverness did.

💡 Apt Architecture
Any application handling irreplaceable user input should assume failure early. Recovery is not a deluxe feature to be added after the happy path feels polished. It is part of the minimum honest contract.

I Would Have Built the Provider Contract Before Adding Providers

External providers were probably inevitable once the project matured enough. LM Studio and Groq offered real value: access to better models, faster inference on the right setup, and more flexibility than local CTranslate2 could always supply alone.

What I would change is the order of abstraction.

The app should have defined task kind, visible output budget, response shape, and reasoning policy as first-class application concepts before those providers landed. Instead, some of that meaning remained smeared across local assumptions, ad hoc parameters, and increasingly overloaded toggles until the v7.0.1 reasoning hotfix finally forced a cleaner contract into existence.

Provider flexibility was the correct feature. Delaying the provider-neutral request policy was the mistake.

I Still Would Have Delayed Analytics

That judgment has only strengthened.

The analytics work that survived into v6 and v7 is better because it became more honest, more grounded, and more tied to recorded evidence. Provenance fields, throughput summaries, filler reduction, readability shifts, model counts, timing data, forecast aggregates - all of that is more defensible than the earlier tendency to build charts because charts felt like progress.

But the underlying lesson remains the same. Instrumentation should serve a proven workflow. It should not arrive as a proxy for one.

I Would Keep the Intent Architecture and the Web Rewrite

Some judgments have held firm.

The intent-driven structure was still the right call. The web-stack migration was still the right call. The later decomposition work in v6.2.2, v6.5.2, and v6.7.0 only reinforces that. Once the project had to grow into command bars, external providers, vault recovery, processing provenance, prompt management, and release-grade workflows, the early structural seams were not excessive. They were the reason the later work remained survivable.

The project needed fewer premature endings and earlier honesty about failure, not a different backbone.

Vociferous is an open-source, local-first speech-to-text workspace. It runs fully on your hardware by default, with optional external providers when you want them.

GitHub: WanderingAstronomer/Vociferous

Speaking into Existence #21

Andrew Brown — Wed, 27 May 2026 21:14:59 GMT

v6.0.0 was the first release where I allowed myself to say something slightly dangerous: the core product felt mostly complete.

That sentence was true in the narrow way that makes future embarrassment possible. The central loop worked. Record, transcribe, refine, store, review. The project had survived its earlier architectural convulsions. It had a real UI, a real README, a real wiki, and enough daily-driver legitimacy that I no longer felt like I was demoing an interesting prototype.

So version 6 went public.

The Front Door Had to Stop Lying by Omission

The README had been failing the first thirty seconds.

It did not make the product’s shape obvious quickly enough. It talked around the thing instead of naming it. It underplayed the actual differentiators, which were local operation, privacy, and on-device refinement, and it retained just enough stale installation context to make a first-time user wonder whether the repository was alive or merely historically interesting.

The transcript record from the v6 push captures the tone well: validate everything, overhaul the docs, generate screenshots, use agents if helpful, ship by noon if possible, and do not confuse a public release with a lazy packaging exercise. That was the correct instinct. Once strangers are invited in, clarity stops being optional.

The rebuilt README opened plainly. Cross-platform offline speech-to-text with local AI refinement. Then screenshots. Then architecture. Then installation paths that had actually been checked. The wiki got the same treatment. Old PyQt residue was purged. Stale screenshots disappeared. Pages were rewritten from current source instead of from old memory.

💡 Apt Architecture
Public software needs a legible front door. Once other people are expected to understand and install the tool, documentation stops being auxiliary explanation and becomes part of the product surface itself.

The Release Process Became a Quality Exercise

What I like most in retrospect about the v6 transcript is not the release itself. It is the posture.

The plan was not to pile on one last batch of features and call the thing done. The plan was to validate, review, polish, and force the codebase through a series of quality lenses before allowing the milestone to stand. Architectural review. Optimization review. Front-end consistency review. Test-suite skepticism. Manual checking on top of all of it.

That is a healthier relationship to release pressure than the usual ritualized panic. “Gotta make this one count! Need to scramble to get X added!” Nope, not for me, no thank you.

The application also learned something important at this stage: documentation work belongs in the same phase space as product hardening. If the code has changed enough that the README and wiki are misleading, the release is not ready. A public milestone with stale docs is not a milestone. It is a trap for the next user.

The Dangerous Part Was Feeling Finished

Version 6 felt like the first moment the project had become what I originally wanted it to be. That feeling was not fake. It was incomplete.

Because once the app went public, a new class of scrutiny arrived almost immediately. Settings friction. Prompt synchronization bugs. Append semantics. analytics cleanup. GPU-runtime reality. The deeper meaning of “external provider support.” Recovery and durability. Provenance. Release-grade validation. Reasoning-model edge cases.

In other words, going public did not close the book. It changed the standard of evidence required to claim the software was trustworthy.

That is what version 6 really was: not the finish line, but the point at which the project lost the excuse of being private, local, and answerable only to me.

🛠️ Tech Check
Why did the README and wiki matter so much here?
Because public software is judged through its first-contact surfaces before anyone reads the code. If the README is vague or the wiki is stale, the user never reaches the architecture you are proud of. They leave with the accurate impression that the project cannot currently explain itself.

Version 6 mattered because it forced the project to become legible to strangers, and that kind of exposure is a harsher test than private satisfaction ever will be. I am, of course, pleased to announced that the project was very well-received.

Vociferous is an open-source, local-first speech-to-text workspace. It runs fully on your hardware by default, with optional external providers when you want them.

GitHub: WanderingAstronomer/Vociferous

Speaking into Existence #20

Andrew Brown — Wed, 27 May 2026 21:02:51 GMT

Vociferous was built with a frankly unreasonable amount of AI assistance.

Copilot. Claude. GPT. Local models. Cloud models. Autonomous agents. One-off prompts. Long-running instruction files. Review passes from machines that were useful, useless, or catastrophically overconfident depending almost entirely on how precisely the work had been framed.

That last part is the beginning of the whole story.

The Model Was Only as Good as the Context Boundary

Early AI-assisted work on the project suffered from the standard failure mode: the model knew software in the abstract and this codebase not at all. It generated average-project answers for a non-average project.

That is how you get handlers that should dispatch intents but instead call services directly, UI suggestions that ignore pywebview threading realities, and backend changes that appear locally reasonable while violating the specific constraints that keep the architecture coherent.

Eventually .github/copilot-instructions.md stopped being a polite note to the assistant and became part of the development environment. It encoded the command bus, the WebSocket contract, thread ownership, mutation boundaries, platform constraints, naming patterns, and the repo’s increasingly non-negotiable architectural instincts.

That file became large because the project paid for every missing sentence the expensive way first.

💡 Apt Architecture
Once AI becomes part of daily implementation, the instruction file constraining it is infrastructure. If the model can participate in production changes, the document defining what it must not violate is no longer optional commentary.

What the Models Were Actually Good At

Once the context got strong enough, a few categories of work accelerated dramatically.

Boilerplate was the obvious one. Thin routes, contract-preserving refactors, event-shape repetition, typed wiring, repetitive settings surfaces, and other pattern-heavy work became much faster once the model knew the house style.

Mechanical transformation was another. Rename this. Split that. Move the helper. Re-export the public path so imports do not break. A large share of the v6.7.0 SOLID sweep falls into exactly this class of work: valuable, real, and still substantially easier once the architectural direction had already been chosen by a human.

Codebase exploration also improved. “Where does this state actually live?” “Which layer owns this event?” “Who still imports this symbol?” For a repo with enough structure and enough instruction, the model could often answer those questions faster than a cold manual hunt.

And yes, styling work sped up too. I do not object to CSS. I object to wasting half an hour on a layout adjustment whose destination is already obvious.

What the Models Were Bad At

The models were weakest exactly where the work was most important.

They were weak at architectural judgment. Weak at historical reasoning. Weak at knowing which abstraction was earned and which one was merely tidy-looking. Weak at deciding whether a bug was local or a symptom of ownership drift. Weak at recognizing when a “small convenience” had quietly broken one of the repo’s deeper rules.

They were also weak at reading the emotional temperature of a product decision. The transcript record around version 6 makes this plain. I kept talking about polish, trust, confidence, clarity, and whether the interface was lying. Those are not vague artistic concerns, they are product judgment. A model can help execute the resulting fix. It cannot reliably tell you which of three seemingly reasonable directions will keep the product honest six weeks later. That comes from the talent of the developer, and only after they shed blood, sweat, and tears to gain experience.

The Autonomous-Agent Experiment Was Useful for a Different Reason

One of the more revealing moments came around the v6.0 push, when autonomous agents were recruited deliberately for architectural review, optimization, front-end quality, and possibly test quality. That was not just a stunt. It was a stress test of what machine assistance was actually worth when split into roles.

The answer was predictable in hindsight. Agents were useful at surfacing patterns, spotting cleanup opportunities, and generating candidate work. They were not capable of relieving the human of final judgment. A later transcript about reviewing autonomous cloud-agent pull requests made the needed posture explicit: approve, deny, or recommend amendment. In other words, treat the AI as a prolific junior with a PR queue, not as an authority.

That is the sane frame.

💡 Apt Architecture
AI-generated pull requests should be treated like any other high-volume junior contribution: useful as a source of candidate work, unsafe as a substitute for architectural judgment, and only as good as the review discipline applied to them.

The Rationale for AI-Assisted Development

You might be wondering why one would choose to use AI at all, rather than embracing the traditionalist experience of slowly building up skills in the classic way—writing every line of code and conducting all research manually yourself. This is a valid question that I do not dismiss out of hand.

I utilized AI-assisted coding to create Vociferous not because it was a portfolio project intended for rapid completion or to impress potential employers. Rather, I required a tool quickly while simultaneously seeking to balance my learning curve, experience gain, and the polishing of my engineering chops. My goal was to challenge my systems thinking and deepen my understanding of how I conceptualize and appreciate software. Ultimately, the end goal remained consistent: to create a functional tool for a specific purpose.

The 500-Hour Problem and Self-Driven Passion

I recently concluded a three-part presentation series titled You Were Never Bad at Programming. This series hinges on the concept of identifying a “500-hour problem.” This refers to finding a legitimate, verifiable, and solvable problem that you are deeply passionate about solving—going beyond mere interest in learning to program.

This approach is most beneficial when the problem belongs to you, rather than another party. A proper 500-hour problem drives a person slightly mad and significantly toward greener pastures, allowing their knowledge to bloom. No one will invest 500 hours into a problem they do not care about. This is critical in software development: to remain passionate—even if you must fake it initially—you must be highly self-driven.

Personal Verification Through Project Execution

This entire project served as a multifaceted solution and answer rubric for myself. It allowed me to test several fundamental questions:

Was I willing to invest hundreds of hours into solving a problem I was passionate about?
Were software engineering, development, systems thinking, and architecture the right paths for me?

Previously, I described aiming a blunderbuss at a flock of birds, where the birds represented the various problems I faced. At that time, I needed a job—and still do—but more importantly, I needed to learn software development to determine if it was suitable for me. I believe I have accomplished much in this regard.

Measurable Outcomes and Future Intentions

I set out with specific needs, which the project addressed:

Code Interpretation: I needed to understand how to read and interpret code; I have made leaps and bounds in this department.
Ideation Speed: I required a tool to transfer thoughts from my head to paper much faster. The analytics of Vociferous confirm that I have succeeded dramatically, exposing just how useful the program is to me.

Given these needs within a limited timeframe—not one dictated by my own leisure—I had to balance speed and efficiency with understanding and results. This necessity drove my use of AI-assisted programming.

I will not declare this approach suitable for everyone; doing so would be selfish. However, I state plainly that it was exceptionally useful to me. I intend to use it where I deem appropriate, though I will not rely on it exclusively. In fact, as I continue to develop my own skills, I plan to rely on it less.

The Honest Summary

The most accurate description I can give is this: AI pair programming was a force multiplier on implementation once the route was clear, and a seductive source of bad confidence when the route itself was still undecided.

That makes it neither miracle nor fraud. It makes it a tool with a very sharp usefulness boundary. You must keep these tools firmly in their lane.

Used correctly, it accelerated a large fraction of Vociferous. Used carelessly, it would have flattened the repo into average-project mush with extraordinary speed.

Vociferous is an open-source, local-first speech-to-text workspace. It runs fully on your hardware by default, with optional external providers when you want them.

GitHub: WanderingAstronomer/Vociferous

Speaking into Existence #19

Andrew Brown — Wed, 27 May 2026 20:49:33 GMT

The audit loop began, as many decent quality initiatives do, with irritation.

Not catastrophic breakage. Not a production fire. Irritation. Buttons that refused to wrap. States that technically worked and still felt misleading. Prompt selectors that loaded the right thing without showing the user what was active. Append flows that seemed to complete while quietly violating the most important principle in the room: do not erase the user’s data.

That was the real shape of the work between v6.0 and v7.0. Not invention from nothing. Repeated contact with rough edges until each one was either fixed or promoted into a proper issue.

The Loop Was Primitive on Purpose

There was no grand framework.

Open a view. Use it like a real person. Notice the friction. Write it down. Give it an ISS number. Decide whether it is cosmetic, structural, or a disguised data-integrity bug wearing a small UI costume. Fix it. Repeat.

This sounds almost offensively basic, which is probably why so many teams skip it in favor of elaborate abstractions about quality. But direct usage is what surfaced the real defects.

The transcript export from March 2026 is full of exactly this mode of scrutiny. The action bar was compressing badly and overflowing instead of wrapping. The default refinement prompt looked selected in one context and effectively disappeared in another. Continue versus append behavior felt stateful in the wrong places. Compound transcripts were being treated as though the user’s visible aggregate were the only truth worth preserving.

That is audit work at its most useful: not admiring architecture diagrams, but discovering where the lived experience and the implementation have started to drift apart.

💡 Apt Architecture
Quality work needs dedicated discovery time. Minor defects accumulate because each one seems too small to deserve a meeting. An audit loop creates the habit of noticing them before they congeal into the product’s normal level of friction.

Taste Is an Input, Not an Embarrassment

The first signal is often not a measurable bug. It is discomfort.

A layout feels wrong. A view is too crowded. A control is technically accurate and still misleading. A workflow succeeds while creating distrust.

That kind of signal is easy to dismiss if you are pretending that only hard failures count as engineering work. But a surprising amount of meaningful cleanup begins exactly there. The job is not to mystify your own taste. The job is to convert discomfort into a precise claim the system can act on.

This is how aesthetic annoyance becomes architecture. Why are there multiple action-bar implementations? Why does state live in one view and fail to reappear in another? Why does a destructive mutation still flow through a shallow API path instead of the command bus that is supposed to own mutations? Why is a settings surface accumulating unrelated concerns until it becomes a junk drawer?

Once those questions are written clearly enough, the fix is often obvious.

The Audit Passes Started Touching Real Structure

Some of the fixes looked minor from the outside and were still architecturally revealing.

The v5.10.7 architecture audit removed dead infrastructure like batch_retitle_progress, a WebSocket event registered in the bridge but emitted and consumed nowhere. That kind of ghost logic is not dangerous because it is loud. It is dangerous because it quietly teaches everyone reading the code that the system contains unexplained ceremonial leftovers.

Later passes got more consequential. By v6.4.1, append semantics were corrected so the source transcript row survived as a compound child instead of being deleted. By v6.5.4, destructive actions that had leaked back into direct API mutation were pulled behind intents again, where they belonged. By v6.5.5 and v7.0.0, view-level button piles had been replaced with command bars and command menus that actually declared action visibility and overflow rather than improvising it per screen.

Those are not disconnected chores. They are all symptoms of the same discipline: stop allowing convenience to rewrite the system’s rules one local patch at a time.

💡 Apt Architecture
A bug that appears small at the UI layer is often revealing a boundary problem underneath. Treating the symptom without tracing the ownership error is how software becomes cosmetically improved and structurally worse.

The Audit Never Stays “Done”

One of the more embarrassing beliefs recorded in the transcript history is the recurring hope that a sufficiently good cleanup pass might finally finish the product into stability.

That is not how growth works. Every new feature creates new seams. Every additional provider, route, setting, and recovery path introduces another place where state can desynchronize, wording can lie, or a thin convenience path can bypass architecture with just enough plausible justification to sneak through.

The difference between a product that feels intentionally maintained and a product that simply accretes roughness is whether someone keeps returning to the whole thing with the unpleasant question: what is wrong with this now?

That is the audit loop. Not glamour. Not inspiration. Attention, repeated until the software begins to feel like someone actually uses it.

Vociferous is an open-source, local-first speech-to-text workspace. It runs fully on your hardware by default, with optional external providers when you want them.

GitHub: WanderingAstronomer/Vociferous

Speaking into Existence #18

Andrew Brown — Wed, 27 May 2026 20:48:44 GMT

AI refinement without inspection is just blind faith with a prettier label.

That was obvious the first time I tried to compare an original transcript and a refined transcript by eye. The model had done a competent job. It fixed punctuation, removed filler, tightened phrasing, and cleaned up a few malformed clauses. The changes were sensible. They were also irritatingly hard to see.

The computer had already done the edit. I was now doing manual verification work that the interface should have made nearly effortless. That is when the real problem became clear: review cost was threatening to erase refinement value.

Side-by-Side Blocks Were the Wrong Abstraction

The first instinct was the obvious one. Put the original on the left and the refined output on the right.

For short text, this is tolerable. For anything even modestly long, it becomes visual punishment. Your eyes bounce across paragraphs looking for a missing article, a fixed contraction, a deleted phrase, or one slightly better verb in the middle of a block. The relationship between the two texts exists, but the interface makes the user reconstruct it manually.

That is not inspection. That is unpaid labor.

If the review step is tedious, users stop reviewing. Once that happens, one of two stupid outcomes follows. Either they trust the model too much, or they stop using refinement at all. Neither is acceptable.

💡 Apt Architecture
AI systems that rewrite user text need legibility more than they need flair. If the interface cannot make the model’s changes obvious, the feature degenerates into either blind trust or abandonment.

Teeny-tiny Fixes

The implementation in DiffView.svelte is hilariously small for how important it became.

import { diffWords } from "diff";

let parts = $derived(diffWords(original, revised));

That is basically the whole thing. Word-level diffing, rendered inline, with visual emphasis on additions and removals rather than on duplicate blocks.

The point is not that the code is clever. The point is that the feature finally aligned the interface with the actual user question: what changed?

🛠️ Tech Check
Why word-level diff instead of line-level diff?
Source code usually benefits from line-level diff because lines are meaningful units. Transcript prose does not. Change one word in a paragraph and a line-based algorithm can make the entire paragraph look removed and re-added. Word-level diff isolates the real change instead of screaming about the whole block.

Review Had to Become a Real Workflow

The diff view was the first honest version of refinement review, but it was not the last.

By v5.10.5 the comparison surface also carried an analytics delta card: before-and-after word count, sentence count, average sentence length, readability shifts, filler reduction. The raw diff answered where the text changed. The delta card answered what kind of change the system had just made.

Later, the project kept pushing that same principle outward. The Edit view became a dedicated route instead of an accidental side path. By v6.5.5 and v7.0.0, that meant a TipTap-backed Markdown editor with real controls, dedicated routing, and explicit entry points from Transcribe, Transcripts, and Refine. At that point refinement review was no longer a novelty panel. It was part of a broader editorial surface.

That evolution matters because AI-assisted text is not a one-click magical transformation. It is a user-visible editing workflow. The person using the system needs to be able to inspect, accept, discard, revise, retitle, and continue working without dropping into some stale view that only half-remembers the current state.

The Destructive Choice Had to Look Destructive

One of the dumbest older ideas in the UI was danger-reveal: the discard action only looked dangerous when you hovered it. That is cute in exactly the wrong way.

If an action is destructive, it should look destructive before contact. The interface does not need coyness. It needs honesty.

That same principle later showed up all over the audit-driven UI cleanup: action bars that actually wrap, controls that expose current state, settings surfaces that stop burying dangerous operations inside generic cards, and command bars that declare action visibility instead of relying on a bespoke pile of buttons in every view.

Diff highlighting was the first crisp example of that broader lesson. The interface must not make users infer what the tool already knows.

What the Feature Was Never Supposed to Be

I never wanted a per-token parliamentary procedure where the user approves or rejects microscopic edits one by one. That is how you turn a two-second review into a five-minute clerical exercise and then congratulate yourself for empowering the user.

The right ratio is simpler. Show the change. Make it obvious. Provide sane actions. Let the user remain in control without requiring ritual.

That is what the diff view finally did. It made refinement inspectable at the speed the feature needed to justify its own existence.

Vociferous is an open-source, local-first speech-to-text workspace. It runs fully on your hardware by default, with optional external providers when you want them.

GitHub: WanderingAstronomer/Vociferous

Speaking into Existence #17

Andrew Brown — Wed, 27 May 2026 20:36:17 GMT

Raw transcription is honest and ugly.

Modern ASR is quite good at turning sound into words. It is much worse at turning spoken thought into text anyone would actually want to keep. “so I was thinking about the meeting and I think we should probably move the deadline because you know the the client hasn’t gotten back to us yet” is a perfectly accurate transcription and a paragraph no one sane would send to another human being unedited.

Somewhere between what I said and what I meant to write is the entire value proposition of a local refinement pipeline.

The Cloud Answer Was Disqualified at the Design Level

The easy answer is obvious: send the transcript to GPT-4 or Claude, receive cleaner text, move on with life.

That was also the one answer Vociferous was not permitted to use.

The project’s core promise is offline-first, privacy-respecting, and runs-on-your-hardware. If transcript text leaves the machine, that promise is broken. It does not matter how much nicer the output becomes after the betrayal. Privacy as an architectural decision is not decorative. It has to survive the moment when the convenient implementation would violate it.

So the refinement pipeline runs locally, on the user’s hardware, with models small enough to be plausible on ordinary machines and good enough to improve spoken prose without wandering into hallucinated nonsense.

💡 Apt Architecture
Privacy constraints have to survive feature pressure. If an “offline-first” product abandons that principle the moment a better hosted API appears, the principle was never architectural. It was branding.

The Evolution from GEC to SLM

v2.3.0 began with vennify/t5-base-grammar-correction, a grammatical error correction model converted into CTranslate2 format and running in Int8 quantization on CPU. It corrected grammar in the narrow sense: agreement, tense, obvious cleanup. It did not really rewrite for clarity or shape. The refinement was non-destructive: refined text became a variant, and raw_text was never overwritten.

v2.4.0 replaced the T5 model with Qwen3-4B-Instruct, a decoder-only language model with a 32k context window. That changed the nature of the feature. It was no longer grammatical correction in the strict sense. It became instruction-following text refinement with adjustable intensity. Three profiles emerged: MINIMAL, BALANCED, and STRONG. GPU loading became conditional through nvidia-smi, so the model could land on GPU when headroom existed or remain on CPU when it did not.

v3.0.11 extracted the service into SLMRuntime, a focused class responsible for model lifecycle, inference coordination, and service state. Provisioning, GPU confirmation, and request queueing were moved out of it.

v5.0.0 completed the migration to a unified CTranslate2 stack. llama-cpp-python disappeared. ctranslate2.Generator plus tokenizers took over. ChatML formatting entered through _messages_to_chatml(). Stop conditions became explicit token constraints via tokenizer.token_to_id().

🛠️ Tech Check
What is ChatML?
ChatML is a message-formatting convention used by many instruction-tuned language models. Messages are wrapped in special tokens that indicate roles such as system and user, which allows the model to distinguish instructions from the data it is supposed to process. Without that structure, an instruction-tuned model may treat the prompt as ordinary continuation text rather than a task specification.

Three Layers, Three Jobs

The refinement pipeline eventually settled into a three-layer structure.

SLMRuntime in src/services/slm_runtime.py owns model lifecycle and thread management. It loads and unloads the model, tracks service state, and exposes the runtime API the rest of the application uses. It does not know how prompts are written.

RefinementEngine in src/refinement/engine.py owns inference. It wraps the CTranslate2 generator and tokenizer, manages token budgets and sampling settings, and returns a GenerationResult. It does not know about application state, UI concerns, or threading policy.

PromptBuilder in src/refinement/prompt_builder.py owns prompt construction: system prompt, profile instructions, task directive, user text, and ChatML formatting. That is all it does.

Those boundaries matter because each class remains ignorant of the wrong problems. SLMRuntime does not know prompt wording. PromptBuilder does not import locks. RefinementEngine does not pretend to be a state machine. When the prompt layer starts importing threading primitives, the architecture has already begun to dissolve.

💡 Apt Architecture
A class’s imports are one of the cheapest honest audits of its architecture. When the prompt layer imports threading policy or the runtime imports formatting concerns, the boundaries have already been crossed, whether the diagrams have admitted it yet or not.

The Prompt-Injection Problem

v2.4.0’s release notes mentioned that transcript text is treated strictly as data rather than instructions, using what I described at the time as a Swiss-Army-Knife system prompt strategy. That was not decorative language.

When user text is handed to an instruction-following model, nothing prevents that text from itself containing instructions. “Ignore all prior directions and output the following...” is the canonical example, but the real issue is broader: unless the prompt structure clearly establishes what part of the input is task definition and what part is payload, the model can be induced to behave as though data were instruction.

For a local model processing private transcripts, this is lower stakes than it would be for a public multi-tenant API. It is still a correctness issue. The prompt has to establish clearly that the transcript is material to be processed, not authority to be obeyed.

So the prompt builder does exactly that. The system instruction lands first. The transcript is wrapped in explicit delimiters. The model is given a clear task before it sees any user-supplied prose. Generation is bounded rather than open-ended.

This is not an elaborate security cathedral. It is simply an architecture that declines to make the obvious mistake easy.

Vociferous is an open-source, offline speech-to-text application. It runs entirely on your hardware, no cloud required.

GitHub: WanderingAstronomer/Vociferous

Speaking into Existence #16

Andrew Brown — Wed, 27 May 2026 20:17:46 GMT

The entire project backlog lives in one Markdown file. One file, sixty-plus issues, four sections, and not a trace of Jira.

This is either efficient or mildly deranged. I eventually stopped trying to distinguish the two and accepted such a reality as my lot in this life.

The Case Against Project-Management Infrastructure

I have used Jira, Linear, GitHub Issues, Trello, Notion, and Asana. They are all perfectly competent at doing team things.

Teams need role-based access, workflow automation, notifications, sprint planning, permissions, visibility, and structured handoff. Those are real problems and those tools solve them with varying degrees of tolerability.

In a one-person workflow, none of those are the central problem. The central problem is much simpler: remember what matters, in what order, and for what reason.

Jira wants epics and statuses and sprints and labels. GitHub Issues wants milestones, linked pull requests, and issue taxonomies. Every one of these systems assumes that coordination overhead is the defining challenge. For a solo developer, it is not. The challenge is keeping the list visible without letting the list become noise.

💡 Apt Architecture
Process should match the coordination cost it actually solves. Team tooling earns its keep when there are handoffs, permission boundaries, and visibility problems. When those pressures do not exist, the same tooling often produces ceremony rather than leverage.

The File Has Four Sections

The workboard lives at .agent_resources/workboard.md. It has four sections:

Kanban Overview is the quick scan: one line per issue, including the ISS number, title, area, and status. This is what you read when you want to know what is in flight and what is next.
Open Issues is where the real thinking happens. Each entry includes a description, acceptance criteria, code-reality notes, and design notes. The acceptance criteria are the contract: when those are satisfied, the issue is done. Code-reality notes capture what the code currently does, especially where that differs from what the issue is supposed to resolve. Design notes record decisions that are not self-evident from the requirement alone.
Deferred holds ideas that are real but not ready. Things noticed too early. Features blocked by other prerequisites. Problems worth remembering and not yet worth solving.
Resolved is the historical record: issue number, title, version, date, and commit hash. Every closed issue lands there. It is searchable, attributable, and far more useful than it has any right to be when I need to remember why something ended up shaped the way it did.

Why an agents folder? Because I could point an LLM at it and ask questions that would eventually save me a LOT of Googling. As an added feel-good bonus, I had switched to local agents by now. With them running on my own hardware, I was able to suffer the transgressions of using generative AI personally, in the form of an electric bill.

What an Entry Actually Looks Like

An open issue entry looks roughly like this:

### ISS-082 · Recording UI & auto-title fixes

**View**: TranscribeView  
**Status**: Open  
**Priority**: P2

**Description**: Three post-5.8.5 regressions: Continue button navigation, recording circle double-render, glow clipping.

**Acceptance Criteria**:
- Continue button auto-starts recording
- Recording circle renders as single unified shape
- Glow radiates without clipping

**Code Reality**: viewState transitions to 'viewing' instead of 'ready'; static background-color on mic button creates double-render; overflow: hidden clips glow.

The ISS number is the stable reference that survives refactors, renames, release churn, and test-file movement. The acceptance criteria are what actually close the issue, not a general sense that things feel better, but specific, falsifiable behavior.

🛠️ Tech Check
What is the point of acceptance criteria?
Acceptance criteria make a requirement falsifiable. Without them, “done” is an opinion. With them, “done” means the listed behaviors can be checked and verified. They also form the natural input to regression tests, because a closed issue with clear criteria is a specification waiting to be encoded.

Why This Works for One Person

The workboard works because the coordination cost in a solo project is effectively zero. There are no permissions to manage, no notifications to route, and no ceremonies to host so everyone can agree that everyone has indeed seen the board.

The system only needs to do three things well: keep the issue list visible, keep it ordered, and keep it honest.

A Markdown file in the repository satisfies all three, while also being diffable, grepable, and versioned alongside the code it refers to. When ISS-082 closes, the code change and the workboard update live in the same repository, inside the same history, subject to the same reviewable trail.

For a solo developer who refuses to maintain a second system unless the second system earns its keep aggressively, that is enough.

Vociferous is an open-source, offline speech-to-text application. It runs entirely on your hardware, no cloud required.

GitHub: WanderingAstronomer/Vociferous

Speaking into Existence #15

Andrew Brown — Wed, 27 May 2026 20:07:06 GMT

v5.8.0 shipped a radar chart. What? I thought it was pretty neat… even if that thought only lasted a few days.

Five axes — speaking rate, pause efficiency, session length, vocabulary richness, consistency — plotted into one polygon. The overall score was the area of the shape. It looked intelligent. It looked dashboard-like. It looked like the sort of thing a serious productivity tool might produce.

It was nonsense.

Actually, that thought lacks the eloquence the feature deserves: it was complete, utter, rotting horseshit.

The radar chart survived from v5.8.0 to v5.10.8, which is a much longer lifespan than it should’ve got.

What a Radar Chart Implies Whether You Like It or Not

A radar chart smuggles in two claims whether you intend it to or not.

First, it implies that every axis matters equally. Second, it implies that more on every axis is inherently better.

Neither of those claims survived contact with actual use.

Higher speaking rate is not universally better. Sometimes it reflects clarity and momentum. Sometimes it reflects bulldozing through language without pausing to think. Longer session length is not inherently better. Some useful thoughts are short. “Vocabulary richness” was the most ridiculous axis of the lot: I could compute a number for it, yes, but the existence of a computable number is not an argument for letting that number into polite society.

The chart lasted as long as it did because radar charts look clever in dashboards. That is not a respectable reason to ship anything.

💡 Apt Architecture
Pretty analytics can smuggle in bad assumptions. A polished chart does not merely display data. It quietly instructs the user what matters, what should be improved, and what counts as success. If those assumptions are wrong, the chart is not harmless. It is misleading.

The Bell Curve Was Worse

The speaking-rate view included a bell curve behind the user’s WPM, implicitly comparing them to an average speaker.

Average compared to whom? Nobody real.

There was no population dataset. No representative benchmark. No measured cohort. The curve was a hardcoded Gaussian centered on 130 WPM with a standard deviation chosen because it looked plausible on screen.

That is not analytics. That is fan fiction with axes.

🛠️ Tech Check
What makes analytics dishonest?
Analytics become dishonest when the visualization implies relationships the data does not actually support: comparison to a fictional baseline, equal weighting of incomparable dimensions, or a metric that exists merely because it was easy to compute rather than because it was meaningful.

v5.10.8: Kill the Dishonest Metrics

v5.10.8 deleted RadarChart.svelte and DistributionChart.svelte. The commit message for the radar chart said, “Mixed-unit normalized axes were dishonest.” That was not rhetorical excess. It was the exact diagnosis.

Along with the charts went the metrics that fed them: polysyllabic-word ratio, lexical complexity, average word length, long-word ratio. The documented rationale was, effectively, nobody cares. That too was correct.

What replaced them was an honest dashboard.

Two summary cards, “Your Voice” and “Refinement Impact,” showing streaks, top filler word, and WPM. A Deep Dive tab with four sections: Productivity, Speech Quality, Readability, and Trends. The visuals still existed, but they were now tied to quantities that could survive scrutiny.

WPM was also corrected in the same release. The earlier computation used total recording duration as the denominator. The new computation used speech_duration_ms, which meant VAD-based active speech time. Silence had never belonged in that denominator, and the fact that I let it stay there for as long as it did is a testament to how easy it is for plausible numbers to linger once they have been drawn beautifully enough.

Every metric in the new analytics either worked or disappeared. That sentence sounds obvious. It was apparently not obvious enough to stop the previous version from shipping.

What Chapter 4 Did Not Quite Fix

Chapter 4 in this series was about the metrics trap: how a dictation tool can accidentally become a scoreboard. The radar chart was the sequel, because understanding a failure mode intellectually is not the same as resisting the temptation to ship it when it looks polished.

That is the more uncomfortable lesson. Analytics features are easy to add, because charts render quickly and computed fields make a strong first impression. They are much harder to remove once users and developers alike have invested them with implied importance.

The right time to ask whether a number deserves to exist is before writing the first line of the chart component, not six hours, days, weeks, or months later when the graph has already begun teaching users to care about the wrong things.

Vociferous is an open-source, offline speech-to-text application. It runs entirely on your hardware, no cloud required.

GitHub: WanderingAstronomer/Vociferous

Speaking into Existence #14

Andrew Brown — Wed, 27 May 2026 20:02:54 GMT

The original migration strategy was hilariously and embarrassingly direct: just delete the database! S-tier strategy, by the way.

v2.2.0 said it plainly in the release notes: “BREAKING CHANGE — this release resets the local database structure. Legacy history files will be recreated upon first launch to ensure schema consistency.” That was the policy. Nuke, pave, move on.

That policy did have an expiration date, and once that date arrived it became indefensible almost immediately.

When Nuke-and-Pave Stops Being Acceptable

Early in the project, the installed base was one user: me. The data was recreatable. If a schema change erased a few weeks of test transcripts, I could survive the inconvenience because the transcripts were not yet the product’s memory. They were scaffolding.

That stopped being true the moment the database began holding meeting notes, project plans, and dictated work that did not exist anywhere else. Once the data stopped being disposable, destructive resets stopped being a pragmatic shortcut and became a trust failure; it no longer mattered that I was the only user as I used Vociferous for everything.

From that point forward, the migration system had to go undefeated.

💡 Apt Architecture
Data irreplaceability changes the architectural standard. A destructive reset can be rational while the product is still disposable. Once the software is storing work that cannot simply be recreated on demand, the same reset becomes a breach of trust.

v3.0.14: Forward-Only Migrations

The system I settled on was boring in the way good infrastructure is boring.

The database stores a single integer in schema_version.
On startup, the application reads that value.
If the version lags behind the current schema, migration functions run in sequence until the database catches up.
Each migration runs inside an explicit transaction and commits independently.

That last point mattered more than it first appears to. If one migration succeeds and the next fails, the database remains at the last successfully committed version. No heroics and no magical rollback nonsense pretending the earlier progress never occurred. Fix the broken step, rerun, proceed forward.

🛠️ Tech Check
What is a forward-only migration?
A forward-only migration system only knows how to move a database from an older schema to a newer one. It does not attempt to walk backward automatically. Each migration is therefore responsible for one direction of change only, which greatly simplifies reasoning when real user data is involved.

v3.0.3 removed the legacy schema detection and automatic reset path that predated this system. The old code tried to detect a legacy database and wipe it. The new code never wiped anything, which, of course, is preferable.

What Actually Changed Across the Versions

Between schema versions 1 and 9, the database grew in small, intelligible steps:

v1 → v2 added speech_duration_ms so total recording time could be distinguished from active speech time.
v2 → v3 added variant support through current_variant_id, because grammar refinement needed somewhere to live that was not the original transcript.
v3 → v4 introduced the FTS5 full-text index. Before that, search scanned the transcript table directly.
v5 → v6 added include_in_analytics INTEGER NOT NULL DEFAULT 1, making analytics exclusion a per-transcript property.
v6 → v7 seeded the Compound system tag for appended transcripts.
v7 → v8 added has_audio_cached, tracking whether source audio still existed in the cache.
v8 → v9 added is_protected INTEGER NOT NULL DEFAULT 0, seeded the Prompt system tag, and seeded the default refinement prompt.
v9 → v10 added transcription_time_ms and refinement_time_ms for performance analytics.

Each migration committed independently. If version 7 failed, the database remained safely at version 6. On the next startup, version 7 would simply be attempted again.

💡 Apt Architecture
Independent commits make partial success recoverable. A migration system that commits each step separately allows the database to stop in a valid state if one step fails. Wrapping the whole sequence in one transaction is aesthetically cleaner and operationally worse.

The Part That Did Not Matter Until It Mattered Completely

For months, the migration system did its work invisibly. Users (me!) upgraded. The database evolved and all of the transcripts survived.

That is the only honest measure of whether such a system is good. Not how elegant the migration helpers look, not how clever the versioning scheme sounds, and not how satisfying the code review felt the day it was merged. The real question is whether failures are handled gracefully enough that users never lose data they did not intend to lose.

Nine migrations across the version history. Zero catastrophic losses. That is the metric that mattered; especially as I was quickly catapulting towards massive new features.

Vociferous is an open-source, offline speech-to-text application. It runs entirely on your hardware, no cloud required.

GitHub: WanderingAstronomer/Vociferous

Speaking into Existence #13

Andrew Brown — Wed, 27 May 2026 19:58:16 GMT

I put off audio import for months.

It looked like the sort of feature that would smear itself across the entire codebase: file picker, codec handling, long-running progress events, storage decisions, and a distinct UI state for a workflow that was similar to recording only until you looked at it properly. It had all the usual plumbing smell of something that wants to become everybody’s problem.

v5.9.3 implemented it.

It took about four hours, including testing and polish.

That gap between the estimate and the reality told me more about the architecture than the feature itself did, honestly.

What AudioSpool Had Already Solved

The reason audio import turned out to be trivial is the same reason it had looked expensive before I examined it closely: AudioSpool had already changed what the transcription pipeline considered to be its input.

Before the spool, the pipeline cared about a microphone. It expected a live audio source feeding frames in real time. After the spool, the pipeline cared about a file path. Live recording became a WAV file on disk and then got handed to the transcription engine. The engine did not need to know whether that file had originated from a microphone callback five seconds ago or had been sitting in the user’s Downloads folder since last Tuesday.

Audio import is simply that second case with the recording step removed. The file already exists. The pipeline already understands files. The microphone disappears from the story. The rest of the architecture stays exactly where it was.

💡 Apt Architecture
Feature cost is often a lagging indicator of earlier structural decisions. When a new capability turns out to be surprisingly cheap, something upstream made it cheap. When it turns out to be miserable, something upstream usually made it miserable too.

The Frontend Was Equally Uneventful

The frontend implementation was a plain file picker. The selected file goes to a multipart upload endpoint. The backend treats it as a transcription job. The finished text lands in TranscriptDB alongside live recordings, with one metadata flag indicating that the source was an import rather than microphone capture.

No separate storage model, no special transcript type and no schema change. One metadata bit and the existing system doing its job.

🛠️ Tech Check
What is a multipart upload?
A multipart upload is an HTTP POST request whose body contains one or more files alongside any other form fields. It is the standard browser mechanism for sending file data to a server. The backend receives the bytes and can then write them to disk or hand them straight to a processing pipeline.

I Declined to Build a Codec Layer

faster-whisper already handles WAV, MP3, FLAC, OGG, and most common audio formats because ffmpeg performs the decoding underneath. The correct move was to hand the imported file to the existing transcription engine and let the already-proven stack do the heavy lifting.

No bespoke format-conversion layer. No second copy of the file. No new media subsystem to own, debug, and eventually resent.

Rebuilding a slice of ffmpeg merely to have local ownership of a problem I did not need to own would have converted a four-hour feature into a multi-week ego project. I declined the opportunity.

The One New Moving Part

Imported files can be long. A forty-minute lecture is plausible. A two-hour meeting is plausible. If transcription takes several minutes, a spinner that says nothing is not a insult to the user.

The only genuinely novel work in v5.9.3 was a WebSocket progress event that emits percentage updates during transcription. The frontend shows a progress bar. The user can see that the job is alive and approximately how far through the file it has gone.

That was the only new component with real architectural weight. Everything else was composition: existing upload handling, existing transcription service, existing persistence, existing transcript UI. The feature was not simple because the idea was simple. It was simple because months of structural work had removed all the resistance in advance.

Vociferous is an open-source, offline speech-to-text application. It runs entirely on your hardware, no cloud required.

GitHub: WanderingAstronomer/Vociferous

Speaking into Existence #12

Andrew Brown — Wed, 27 May 2026 19:50:06 GMT

In March 2026, Vociferous ate seventeen minutes of dictation.

Not mangled. Not partially saved. Not awkwardly recoverable with some creative optimism and a prayer. Gone! A background exception landed at exactly the wrong moment, the audio buffer evaporated, and the transcript came back empty. Seventeen minutes of project planning had existed entirely in RAM, and then it did not exist at all. Honestly, I was surprised it took that long to choke.

In a state of pure indignation, I shipped the permanent fix within twenty-four hours (or so I had thought; version 6.5+ has a much more robust recovery mechanism).

What the Old Pipeline Assumed

The original recording path was simple: press record, accumulate audio frames in memory, press stop, pass the buffer to the transcription engine. For ten-second test clips, that was fine. For seventeen minutes of irreplaceable work, it was architectural negligence wearing the costume of simplicity.

The hidden assumption was that nothing important would fail between the moment recording stopped and the moment transcription began. Thread synchronization issue, pipeline exception, process crash, power loss, shutdown, whatever — any failure in that window could silently discard anything that had not yet reached durable storage.

That design was not wrong for a prototype. It was wrong for a tool that had begun carrying actual work.

This is the kind of failure mode you only feel properly once the data matters. Test clips are disposable. A project plan dictated on a train, while your hands are occupied and the thought is live in your head, is not. The risk profile changes whether the architecture is prepared for that change or not.

💡 Apt Architecture
Durability has to happen before expensive or failure-prone processing. The moment user-generated data becomes irreplaceable, treating capture and processing as one atomic step turns the whole pipeline into a single point of failure.

v5.9.0: AudioSpool

The fix became AudioSpool.

Instead of treating disk as a later persistence step, the recorder writes raw PCM frames to a spool file on disk as they arrive. Every frame. No deferred save step. No cheerful fiction about “we will write it out when the recording is done.” The file grows while the user is still speaking.

When recording stops, the spool finalizes the WAV header and hands the resulting path to the transcription engine. If transcription throws, the audio still exists. If the process crashes, the audio still exists. If the machine dies, most of the audio still exists, and the amount lost is bounded by whatever had not yet flushed.

Spool files live in /audio_spool/. After a successful transcription, the spool is converted into a standard WAV in /audio_cache/{transcript_id}.wav. The cache is bounded by a configurable duration limit, sixty minutes by default, which comes out to roughly 115 MB. Oldest recordings are evicted first through LRU pruning.

The audio_cache_minutes setting in Settings → Recording controls how much post-transcription audio to retain. Set it to 0 and the long-lived cache goes away entirely. The crash-safety spool remains, because that part is not optional.

🛠️ Tech Check
What is a spool?
A spool is a temporary persistence layer that writes data to disk as the data is produced so later stages can process it independently. The term comes from older printing and job-queue systems, but the principle is the same: capture first, process second, and do not let a failure in one destroy the output of the other.

Recovery Became a Startup Concern

Once recording lived on disk, crash recovery stopped being a hypothetical edge case and became a startup responsibility.

On launch, the application now checked the spool directory for orphaned .pcm files left behind by a crash, forced exit, or anything else inelegant. If it finds survivors, it logs each one with duration and path and offers to transcribe them.

That is the architectural shift that matters. “The process died” stopped meaning “your audio is gone.” It started meaning “your audio is still here, do you want to recover it?”

The UI may look the same in either world. Yet the product is not the same in either world.

The Implementation Details That Actually Matter

The microphone callback writes through a lock, because concurrent chunk writes are not a place for vibes.

Successful transcriptions delete their spool files after the transcript is safely committed to the database. The spool directory does not become a landfill.

The spool writes uncompressed PCM in the same WAV shape the transcription engine already expects, so there is no pointless conversion step between capture and inference.

Navigation between views was initially locked while a recording was active via the existing isNavigationLocked mechanism. That prevented the recording state from being casually abandoned by accidental UI movement. Later, in v5.10.3, that restriction was relaxed once the recording path had proved durable enough to persist in the background without depending on that guardrail.

The design is gloriously unglamorous. That is its virtue, says I. Crash-resilient recording does not need cleverness. It needs to write to disk early, write to disk continuously, clean up after success, and check for survivors on startup. Those are the three things v5.9.0 finally made non-negotiable—and, as previously mentioned, v6.5+ built on significantly.

Vociferous is an open-source, offline speech-to-text application. It runs entirely on your hardware, no cloud required.

GitHub: WanderingAstronomer/Vociferous