PlayKit.ai
Text-to-Speech

Subtitle Timestamps

Request word- or sentence-level timings alongside the audio, and learn the alignment format

In addition to the audio, TTS can return an alignment — the start/end time of each word or sentence. Use it for captions, karaoke-style highlighting, lip-sync, click-to-seek transcripts, and accessibility.

Timestamps are opt-in: to get them, call the with-timestamps variant of synthesis (e.g. synthesizeWithTimestamps in the JavaScript SDK). Plain synthesis returns audio only. The selected model must support subtitles; the default TTS model does.

Granularity

Choose how finely the audio is segmented:

  • word (default) — one entry per word. Best for karaoke highlighting and precise seeking.
  • sentence — one entry per sentence. Good for caption blocks.

Response format

The with-timestamps call returns a JSON envelope:

{
  "audio_base64": "<base64-encoded audio>",
  "format": "mp3",
  "usage_characters": 64,
  "audio_length_ms": 4644,
  "alignment": {
    "granularity": "word",
    "items": [
      { "text": "Hello", "start_ms": 43,  "end_ms": 469,  "text_start": 0,  "text_end": 5 },
      { "text": "world", "start_ms": 469, "end_ms": 725,  "text_start": 6,  "text_end": 11 }
    ]
  }
}

alignment.items is an ordered list. Each item has:

Field                    Meaning
text                     The spoken text of this unit (word or sentence).
start_ms / end_ms        Start and end time in milliseconds, relative to the start of the audio.
text_start / text_end    Character offsets of this unit in the input text (when reported), for mapping back to the source.
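
As a rough sketch of mapping an item back to the source, the snippet below slices the input text by text_start / text_end. It assumes text_end is exclusive, which is what the example response suggests ("Hello" spans 0 to 5), and that offsets count the same units as JavaScript string indices. Neither is confirmed here. The AlignmentItem type and the sourceSpan helper are illustrative names, not part of any SDK.

```typescript
// Shape of one alignment item, taken from the example response.
interface AlignmentItem {
  text: string;
  start_ms: number;
  end_ms: number;
  text_start?: number; // offsets are only present "when reported"
  text_end?: number;
}

// Returns the slice of the input text that an item covers, or null when
// the offsets were not reported.
// Assumption: text_end is exclusive, as "Hello" -> 0..5 in the example suggests.
function sourceSpan(input: string, item: AlignmentItem): string | null {
  if (item.text_start === undefined || item.text_end === undefined) {
    return null;
  }
  return input.slice(item.text_start, item.text_end);
}

const input = "Hello world";
const items: AlignmentItem[] = [
  { text: "Hello", start_ms: 43, end_ms: 469, text_start: 0, text_end: 5 },
  { text: "world", start_ms: 469, end_ms: 725, text_start: 6, text_end: 11 },
];

// One source slice per item, in the same order as the items.
const spans = items.map((item) => sourceSpan(input, item));
console.log(spans);
```

Always handle the null case: offsets are described as optional, so code that highlights the source text should fall back to the item's text field.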

SDKs surface the same data in their idiomatic shape (e.g. the JavaScript SDK decodes audio_base64 into bytes and exposes alignment.items with startMs / endMs).
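
If you call the HTTP API directly instead of an SDK, a similar conversion can be done by hand. The sketch below decodes audio_base64 into bytes and renames the timing fields to camelCase. The raw field names come from the example response. DecodedResult, DecodedItem, and decodeEnvelope are hypothetical names chosen for illustration, and the real SDK types may differ.

```typescript
// Raw envelope, with field names taken from the example response.
interface RawItem {
  text: string;
  start_ms: number;
  end_ms: number;
  text_start?: number;
  text_end?: number;
}
interface RawEnvelope {
  audio_base64: string;
  format: string;
  usage_characters: number;
  audio_length_ms: number;
  alignment: { granularity: string; items: RawItem[] };
}

// Hypothetical decoded shape; not the SDK's actual types.
interface DecodedItem {
  text: string;
  startMs: number;
  endMs: number;
}
interface DecodedResult {
  audio: Uint8Array;
  format: string;
  audioLengthMs: number;
  items: DecodedItem[];
}

function decodeEnvelope(raw: RawEnvelope): DecodedResult {
  // atob yields a binary string in which each char code is one byte.
  const binary = atob(raw.audio_base64);
  const audio = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    audio[i] = binary.charCodeAt(i);
  }
  return {
    audio,
    format: raw.format,
    audioLengthMs: raw.audio_length_ms,
    items: raw.alignment.items.map((it) => ({
      text: it.text,
      startMs: it.start_ms,
      endMs: it.end_ms,
    })),
  };
}

const decoded = decodeEnvelope({
  audio_base64: "SUQz", // base64 of the three bytes "ID3", a stand-in for real audio
  format: "mp3",
  usage_characters: 11,
  audio_length_ms: 725,
  alignment: {
    granularity: "word",
    items: [
      { text: "Hello", start_ms: 43, end_ms: 469, text_start: 0, text_end: 5 },
      { text: "world", start_ms: 469, end_ms: 725, text_start: 6, text_end: 11 },
    ],
  },
});
console.log(decoded.audio.length, decoded.items);
```

The resulting bytes can be wrapped in a Blob for playback in a browser or written to a file on a server.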

Use cases

  • Karaoke / word highlighting — highlight the active word as audio plays by comparing the playback clock to each item's start_ms/end_ms.
  • Captions / subtitles — render sentence-granularity items as timed caption blocks (export to SRT/VTT).
  • Click-to-seek transcripts — make each word clickable; seek the audio to its start_ms.
  • Lip-sync / animation — drive mouth shapes or beats from word timings.
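
Two of the use cases above can be sketched in a few lines. The first function finds the item that is active at a given playback time (for highlighting), and the second renders items as a WebVTT file (for captions). This is a minimal sketch, not SDK code: findActiveIndex, vttTime, and toWebVtt are illustrative names, items are assumed to be sorted by start_ms without overlaps, and the item text is assumed not to contain blank lines or the "-->" sequence, which WebVTT does not allow in cue text.

```typescript
interface AlignmentItem {
  text: string;
  start_ms: number;
  end_ms: number;
}

// Binary search for the item whose [start_ms, end_ms) window contains the
// playback time. Assumes items are sorted by start_ms and do not overlap.
function findActiveIndex(items: AlignmentItem[], timeMs: number): number {
  let lo = 0;
  let hi = items.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const item = items[mid]!;
    if (timeMs < item.start_ms) hi = mid - 1;
    else if (timeMs >= item.end_ms) lo = mid + 1;
    else return mid;
  }
  return -1; // no item is active (silence before, between, or after items)
}

// Formats milliseconds as a WebVTT timestamp, HH:MM:SS.mmm.
function vttTime(ms: number): string {
  const total = Math.max(0, Math.round(ms));
  const h = Math.floor(total / 3600000);
  const m = Math.floor((total % 3600000) / 60000);
  const s = Math.floor((total % 60000) / 1000);
  const frac = total % 1000;
  const pad = (n: number, width: number) => String(n).padStart(width, "0");
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}.${pad(frac, 3)}`;
}

// Renders one cue per item; sentence granularity usually reads best.
function toWebVtt(items: AlignmentItem[]): string {
  const cues = items.map(
    (it) => `${vttTime(it.start_ms)} --> ${vttTime(it.end_ms)}\n${it.text}`
  );
  return ["WEBVTT", ...cues].join("\n\n") + "\n";
}

const items: AlignmentItem[] = [
  { text: "Hello", start_ms: 43, end_ms: 469 },
  { text: "world", start_ms: 469, end_ms: 725 },
];
const activeAt500 = findActiveIndex(items, 500);
const vtt = toWebVtt(items);
console.log(activeAt500);
console.log(vtt);
```

In a browser, the playback clock is typically audio.currentTime, which is in seconds, so multiply by 1000 before the lookup. For click-to-seek, go the other way and set audio.currentTime to start_ms / 1000.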

For working code, see the JavaScript or Unity TTS guide.