
🗣️ WTF Transcription

World Transcription Format: a vendor-neutral analysis shape for speech-to-text output.

Draft: draft-howe-vcon-wtf-extension · Extension name: "wtf" (often emitted as "wtf_transcription" by older library code)

What it is

Every speech-to-text provider (Whisper, Deepgram, AssemblyAI, Google, AWS, Azure, ElevenLabs, the next one) has its own JSON output shape. If you want to swap providers, compare them on the same audio, or build downstream tooling that doesn't care who did the transcription, you end up writing adapter code over and over.

The World Transcription Format (WTF) extension defines a single canonical shape for that output. It covers the transcript text, time-aligned segments, optional word-level timing, optional speaker labels, quality metrics, and provider metadata.

WTF data lives in analysis[], not attachments[]: it's derived from the conversation, not supplied alongside it.

When to use it

  • Recording → transcription pipelines where you may switch providers later.

  • Quality benchmarking: same audio, multiple providers, identical downstream code.

  • LLM ingestion: a stable transcript shape means prompts and parsers don't change when the ASR vendor does.

  • Speaker diarization workflows: WTF carries speaker labels in a standard way.

Spec surface

The WTF document is added as an analysis[] entry. The recommended form per the speckit is to use type: "transcript" and identify WTF via the schema: URL; this stays compatible with consumers that just want "any transcript". Older library code (and the draft's own examples) use type: "wtf_transcription"; both forms are valid as long as schema: points at the WTF draft.

{
  "analysis": [
    {
      "type": "transcript",
      "dialog": 0,
      "vendor": "openai-whisper",
      "product": "whisper-large-v3",
      "encoding": "json",
      "schema": "https://datatracker.ietf.org/doc/draft-howe-vcon-wtf/",
      "body": "{\"transcript\":{\"text\":\"Hello, I need help with my account.\",\"language\":\"en\",\"duration\":3.2,\"confidence\":0.95},\"segments\":[{\"id\":0,\"start\":0.0,\"end\":3.2,\"text\":\"Hello, I need help with my account.\",\"confidence\":0.95}],\"metadata\":{\"created_at\":\"2026-05-18T10:00:00Z\",\"provider\":\"whisper\",\"model\":\"whisper-large-v3\"}}"
    }
  ]
}

Required analysis fields:

  • type: "transcript" (recommended) or "wtf_transcription".

  • dialog: index of the dialog this transcription covers.

  • vendor: REQUIRED by the core spec. Identifies the ASR provider (e.g. "openai-whisper", "deepgram", "assemblyai").

  • product: the specific model (e.g. "whisper-large-v3", "nova-2").

  • encoding: "json".

  • schema: URL pointing at the WTF draft, so consumers know how to parse body.

  • body: JSON-encoded WTF document as a string. The body is always a string in vCon; pair it with encoding: "json" to indicate the string is itself JSON.
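The field list above can be checked mechanically before a consumer trusts body. The following is a minimal sketch, not part of any vCon library: decode_wtf_entry is a hypothetical helper name, and the analysis entry is assumed to be a plain Python dict shaped like the JSON example above.

```python
import json

# Fields listed as required for a WTF analysis entry on this page.
REQUIRED_FIELDS = ("type", "dialog", "vendor", "product", "encoding", "schema", "body")
# Both type values are described as valid above.
ACCEPTED_TYPES = ("transcript", "wtf_transcription")


def decode_wtf_entry(entry: dict) -> dict:
    """Check the required fields, then return the decoded WTF document.

    Hypothetical helper for illustration; it does not verify the schema URL.
    """
    missing = [name for name in REQUIRED_FIELDS if name not in entry]
    if missing:
        raise ValueError(f"analysis entry is missing fields: {missing}")
    if entry["type"] not in ACCEPTED_TYPES:
        raise ValueError(f"unexpected type: {entry['type']!r}")
    if entry["encoding"] != "json":
        raise ValueError(f"unexpected encoding: {entry['encoding']!r}")
    if not isinstance(entry["body"], str):
        raise ValueError("body must be a JSON-encoded string")
    return json.loads(entry["body"])


entry = {
    "type": "transcript",
    "dialog": 0,
    "vendor": "openai-whisper",
    "product": "whisper-large-v3",
    "encoding": "json",
    "schema": "https://datatracker.ietf.org/doc/draft-howe-vcon-wtf/",
    "body": json.dumps({
        "transcript": {
            "text": "Hello, I need help with my account.",
            "language": "en",
            "duration": 3.2,
            "confidence": 0.95,
        },
        "segments": [
            {"id": 0, "start": 0.0, "end": 3.2,
             "text": "Hello, I need help with my account.", "confidence": 0.95}
        ],
    }),
}

wtf_doc = decode_wtf_entry(entry)
print(wtf_doc["transcript"]["text"])
```

Raising on a non-string body mirrors the rule that body is always a string in vCon; a more lenient consumer could accept an already-decoded object instead.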

The WTF document shape

Inside body (decoded), the WTF document has four top-level sections: transcript, segments, speakers, and metadata.

transcript and segments are required. speakers is optional (use it when diarization was performed). metadata carries provider/model details and processing context.
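As a rough illustration of those rules, the sketch below checks a decoded WTF document for the two required sections and reports which of the four are present. summarize_sections is a hypothetical name, and treating metadata as non-mandatory is an assumption, since this page only states that transcript and segments are required.

```python
# Sections this page names; only the first two are stated as required.
REQUIRED_SECTIONS = ("transcript", "segments")
KNOWN_SECTIONS = ("transcript", "segments", "speakers", "metadata")


def summarize_sections(wtf_doc: dict) -> dict:
    """Return {section_name: present?} and raise if a required section is absent."""
    missing = [name for name in REQUIRED_SECTIONS if name not in wtf_doc]
    if missing:
        raise ValueError(f"WTF document is missing required sections: {missing}")
    return {name: name in wtf_doc for name in KNOWN_SECTIONS}


# Decoded form of the body from the example analysis entry (no diarization).
wtf_doc = {
    "transcript": {
        "text": "Hello, I need help with my account.",
        "language": "en",
        "duration": 3.2,
        "confidence": 0.95,
    },
    "segments": [
        {"id": 0, "start": 0.0, "end": 3.2,
         "text": "Hello, I need help with my account.", "confidence": 0.95}
    ],
    "metadata": {
        "created_at": "2026-05-18T10:00:00Z",
        "provider": "whisper",
        "model": "whisper-large-v3",
    },
}

print(summarize_sections(wtf_doc))
```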

Word-level timing is optional and lives inside each segment as a words[] array.
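A sketch of what a segment with word-level timing might look like, plus a small consistency check. The per-word field names (text, start, end) are an assumption that mirrors the segment fields shown earlier, and the timings are made up for illustration; check the draft for the authoritative word shape.

```python
# Illustrative segment; word field names and timings are assumptions.
segment = {
    "id": 0,
    "start": 0.0,
    "end": 3.2,
    "text": "Hello, I need help with my account.",
    "confidence": 0.95,
    "words": [
        {"text": "Hello,", "start": 0.0, "end": 0.4},
        {"text": "I", "start": 0.5, "end": 0.6},
        {"text": "need", "start": 0.6, "end": 0.9},
        {"text": "help", "start": 0.9, "end": 1.2},
        {"text": "with", "start": 1.2, "end": 1.4},
        {"text": "my", "start": 1.4, "end": 1.6},
        {"text": "account.", "start": 1.6, "end": 3.2},
    ],
}


def words_fit_segment(seg: dict) -> bool:
    """True if words are in time order and stay inside the segment's span.

    A segment without words[] passes, because word timing is optional.
    """
    previous_end = seg["start"]
    for word in seg.get("words", []):
        if word["start"] < previous_end or word["end"] < word["start"]:
            return False
        previous_end = word["end"]
    return previous_end <= seg["end"]


print(words_fit_segment(segment))
```

Because words[] is optional, downstream code should read it with seg.get("words", []) rather than indexing it directly.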

Don't forget to declare the extension at the top level of the vCon.
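A minimal sketch of that declaration, assuming the vCon is held as a plain dict and that extensions are listed in a top-level extensions array of names (an assumption based on the vCon core draft; confirm the parameter name against the version you target). declare_extension is a hypothetical helper.

```python
def declare_extension(vcon: dict, name: str = "wtf") -> dict:
    """Add the extension name to the top-level list, without duplicating it."""
    extensions = vcon.setdefault("extensions", [])  # assumed top-level key
    if name not in extensions:
        extensions.append(name)
    return vcon


vcon = {"dialog": [], "analysis": []}
declare_extension(vcon)
declare_extension(vcon)  # calling twice leaves a single "wtf" entry
print(vcon["extensions"])
```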

Python helper

The vcon Python library has add_wtf_transcription_attachment(). Be aware of two quirks:

  1. The helper places the transcription as an attachment, not an analysis entry. Spec-compliant code puts WTF data in analysis[]. You can either build the analysis entry manually (shown above) or call the helper and then move the entry.

  2. The helper emits type: "wtf_transcription". You may want to rename this to purpose: (if you keep it as an attachment) or to type: "transcript" (if you move it into analysis[]).
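If you call the helper and then move the entry, the move could look roughly like this. The sketch works on a plain dict and assumes the helper's attachment carries type: "wtf_transcription" and a body; the exact shape the helper emits is not verified here, so treat the field access as an assumption. move_wtf_attachments is a hypothetical name, and dialog, vendor, and product are passed in because an attachment may not carry them.

```python
import json

WTF_SCHEMA = "https://datatracker.ietf.org/doc/draft-howe-vcon-wtf/"


def move_wtf_attachments(vcon: dict, dialog: int, vendor: str, product: str) -> int:
    """Move WTF attachments into analysis[]; return how many were moved."""
    kept, moved = [], 0
    for attachment in vcon.get("attachments", []):
        if attachment.get("type") != "wtf_transcription":
            kept.append(attachment)
            continue
        body = attachment.get("body")
        if not isinstance(body, str):
            body = json.dumps(body)  # body must be a string in vCon
        vcon.setdefault("analysis", []).append({
            "type": "transcript",
            "dialog": dialog,
            "vendor": vendor,
            "product": product,
            "encoding": "json",
            "schema": WTF_SCHEMA,
            "body": body,
        })
        moved += 1
    if "attachments" in vcon:
        vcon["attachments"] = kept
    return moved


vcon = {
    "attachments": [
        {"type": "wtf_transcription",
         "body": {"transcript": {"text": "Hello."}, "segments": []}},
        {"type": "notes", "body": "unrelated attachment"},
    ]
}
moved_count = move_wtf_attachments(vcon, dialog=0, vendor="openai-whisper",
                                   product="whisper-large-v3")
print(moved_count, len(vcon["attachments"]), len(vcon["analysis"]))
```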

The direct-construction pattern is usually simpler.
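One way to construct the entry directly, sketched with plain dicts and the standard json module rather than any vcon library call. add_wtf_analysis is a hypothetical helper name; the field values follow the example entry earlier on this page.

```python
import json

WTF_SCHEMA = "https://datatracker.ietf.org/doc/draft-howe-vcon-wtf/"


def add_wtf_analysis(vcon: dict, wtf_doc: dict, dialog: int,
                     vendor: str, product: str) -> dict:
    """Append a WTF transcript to vcon["analysis"] and return the new entry."""
    entry = {
        "type": "transcript",
        "dialog": dialog,
        "vendor": vendor,
        "product": product,
        "encoding": "json",
        "schema": WTF_SCHEMA,
        "body": json.dumps(wtf_doc),  # body is a JSON string, not a nested object
    }
    vcon.setdefault("analysis", []).append(entry)
    return entry


wtf_doc = {
    "transcript": {"text": "Hello, I need help with my account.",
                   "language": "en", "duration": 3.2, "confidence": 0.95},
    "segments": [{"id": 0, "start": 0.0, "end": 3.2,
                  "text": "Hello, I need help with my account.",
                  "confidence": 0.95}],
    "metadata": {"created_at": "2026-05-18T10:00:00Z",
                 "provider": "whisper", "model": "whisper-large-v3"},
}

vcon = {"dialog": [{}]}  # stand-in for a vCon that already has one dialog
new_entry = add_wtf_analysis(vcon, wtf_doc, dialog=0,
                             vendor="openai-whisper", product="whisper-large-v3")
print(json.dumps(vcon["analysis"], indent=2))
```

Serializing with json.dumps at the last step keeps the WTF document easy to inspect and test as a normal dict until it is attached.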
