Introducing WithAudio Web Reader: Text-to-Speech That Runs in Your Browser

We’ve built a way for you to turn any webpage into audio with synchronized text highlighting, instantly, right in your browser. All speech synthesis runs on your device, with no usage limits, no subscriptions, and no registration. Free!

How it works

Just take any public webpage URL and add with.audio/ to the beginning:

  • Original: https://example.com/a-public-address-web-page
  • WithAudio: with.audio/https://example.com/a-public-address-web-page

Paste that into your browser, wait for the loading bars next to paragraphs, then hit play. You’ll get synchronized text highlighting and real unlimited text-to-speech directly in your browser tab.
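If you want to script this, the prefixing is trivial. A minimal sketch (the `toWithAudioUrl` helper name is our own for illustration, not part of WithAudio):

```typescript
// Build a WithAudio Web Reader URL from any public page URL.
// Hypothetical helper name -- not part of WithAudio itself.
function toWithAudioUrl(pageUrl: string): string {
  return `with.audio/${pageUrl}`;
}

console.log(toWithAudioUrl("https://example.com/a-public-address-web-page"));
// with.audio/https://example.com/a-public-address-web-page
```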

Try it now

You can try the WithAudio Web Reader at desktop.with.audio/reader.

One-time setup

The first time you use it, your browser will download a voice model. (The size depends on your platform. Kokoro is around 300 MB, Piper is about 100 MB.) Once that’s done, everything runs locally and even works offline.

Which TTS engine will you get?

We’ve set things up so the system automatically selects the best TTS engine for your platform:

  • macOS with Chrome or Safari: You’ll get the Kokoro-TTS model (about 300 MB). This larger model delivers higher-quality audio, but it requires WebGPU support and significant computational power.
  • Other platforms and browsers: You’ll get the Piper TTS engine. While lighter-weight and more broadly compatible, it still provides good-quality text-to-speech.

The main challenge with Kokoro-TTS has been WebGPU compatibility across different browsers and operating systems. Some platforms might technically support it, but we haven’t been able to make it work with good performance yet. That’s why we added Piper as a fallback to ensure everyone can use the service.
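As a rough sketch of that selection logic (our own simplified version with hypothetical names; the real backend check likely inspects more signals than this):

```typescript
type TtsEngine = "kokoro" | "piper";

// Simplified sketch of the engine choice described above: Kokoro-TTS only
// on macOS with Chrome or Safari (and working WebGPU); Piper everywhere
// else. Heuristics and function names are ours, not WithAudio's.
function selectEngine(userAgent: string, hasWebGPU: boolean): TtsEngine {
  const isMac = /Mac OS X|Macintosh/.test(userAgent);
  const isChromeOrSafari = /Chrome|Safari/.test(userAgent);
  return isMac && isChromeOrSafari && hasWebGPU ? "kokoro" : "piper";
}
```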

How it works under the hood

Before diving into the technical details, it’s worth mentioning that we also have a WithAudio Desktop App that’s referenced multiple times throughout this post. The desktop app came first, and a lot of its modules and code are shared with the Web Reader. Creating a desktop application and then porting some of its features into a web application that runs locally in the user’s browser was an interesting technical challenge.

The Web Reader uses two different open-source TTS models: Kokoro-TTS (about 300 MB) and Piper. The backend automatically selects whichever engine is more suitable for your device based on your browser and platform capabilities. The WithAudio Desktop app uses Kokoro-TTS with a larger selection of voices.

The process

Because of browser limitations (CORS restrictions), we can’t fetch HTML content directly from the client side. Here’s what happens when you use the Web Reader:

  1. Request to backend: Your browser sends a request to our backend with the base64-encoded URL and details about your user agent (browser information). We keep the URLs sent to us, along with the other request details, for analytics purposes.

  2. Validation and fetching: The backend performs basic validation checks on the input, then fetches the HTML content of the requested page.

  3. Content extraction: We pass the HTML to Defuddle to extract the readable content. Previously, we used Mozilla Readability, but we switched to Defuddle because it does a better job of extracting clean, readable content.

    We convert the HTML to Markdown. Markdown is the intermediate format we use to display the text in the UI and to process the content in the backend, where we extract the plain text and a mapping between the Markdown representation and the text representation. That mapping is what enables text highlighting once the audio is generated.

  4. Text processing: We process and clean up the text, converting it into a format that can be rendered by the browser text/audio renderer, a lighter version of the one in the WithAudio Desktop app.

  5. Engine selection: The backend response specifies whether your browser should use Kokoro-TTS or fall back to Piper based on your device capabilities.

  6. Local processing: Everything from this point happens locally in your browser. We set up the model and convert text to audio one paragraph at a time. (Ideally, this would be sentence-by-sentence for better responsiveness, but that’s a change we plan to make in the future.)

    This step involves splitting the text representation of each block we got from the backend into multiple sentences. Each sentence is converted to audio with the selected TTS engine, and the results are appended together. This way we can be sure we have a timestamp for each sentence no matter which TTS engine we use.

    As soon as one paragraph is ready, it can be played while the next ones are synthesized in the background. In the web version we currently process paragraphs one at a time to keep the task manager that schedules the text-to-speech tasks simple.

    While the audio is playing, we use the current playback timestamp to find the sentence that should be highlighted and mark it, so users can see what is currently being read. Implementing a highlighting solution that works with our custom text renderer was a challenge, and you will probably still run into bugs that break the Markdown rendering.

    What makes the synchronized text highlighting difficult is that we need to keep three types of data in sync: the Markdown content, the text content, and the audio content. Syncing the Markdown and text content is important because it makes it possible to keep bold, italic, and links in their original format while highlighting the text.
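The per-sentence timestamp bookkeeping described above can be sketched like this (our own simplified illustration, not WithAudio’s actual code; `synthesize` stands in for whichever TTS engine was selected, and here it only reports a duration instead of producing real audio):

```typescript
// Split a paragraph into sentences, synthesize each one separately, and
// record where each sentence starts and ends in the concatenated audio.
interface SentenceSpan {
  text: string;
  startSec: number; // offset in the concatenated paragraph audio
  endSec: number;
}

// Naive sentence splitter -- real text needs more care (abbreviations, etc.).
function splitSentences(paragraph: string): string[] {
  const matches = paragraph.match(/[^.!?]+[.!?]+(\s|$)|[^.!?]+$/g) ?? [];
  return matches.map(s => s.trim()).filter(Boolean);
}

// `synthesize` is a stand-in for the TTS engine: it returns the duration
// of the audio generated for one sentence.
function buildSpans(
  paragraph: string,
  synthesize: (sentence: string) => number
): SentenceSpan[] {
  const spans: SentenceSpan[] = [];
  let cursor = 0;
  for (const text of splitSentences(paragraph)) {
    const duration = synthesize(text);
    spans.push({ text, startSec: cursor, endSec: cursor + duration });
    cursor += duration;
  }
  return spans;
}

// While the paragraph audio plays, find the sentence to highlight
// from the current playback time.
function sentenceAt(spans: SentenceSpan[], timeSec: number): SentenceSpan | undefined {
  return spans.find(s => timeSec >= s.startSec && timeSec < s.endSec);
}
```

Because every engine's output passes through the same per-sentence accumulation, the highlight lookup works the same way whether Kokoro-TTS or Piper produced the audio.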

Privacy considerations

This is where we can’t offer the level of privacy I wish we could. Because of browser limitations, the URL you want to read must be sent to our backend. The Desktop app, in contrast, does all of this locally. Aside from building a browser extension, we haven’t found a way to work around these browser restrictions.

Note that many websites block requests from data center IP ranges because they want to prevent bots. We understand their reasoning, but it means some URLs may not work with the Web Reader. For those cases, the WithAudio Desktop app is the only workaround since it fetches content locally.

Shared architecture

A lot of the modules we use are shared between the Web Reader and the WithAudio Desktop App. However, the Web Reader has more limited capabilities compared to the desktop app. For example, the desktop app supports more voices, export functionality, and queue management.

Kokoro-TTS runs inference directly in your browser using WebGPU, via Transformers.js and ONNX Runtime.

Try it

You can test it at desktop.with.audio/reader or use the URL prefix method on any public webpage.

This is still in early development; it’s buggy and has limitations. We’re looking for feedback on performance, user experience, and ideas for improving cross-browser compatibility.

If you try it and something doesn’t work as expected, or if a URL fails to load, let us know at [email protected].