Subtitle Timestamps
Request word- or sentence-level timings alongside the audio, and learn the alignment format
In addition to the audio, TTS can return an alignment — the start/end time of each word or sentence. Use it for captions, karaoke-style highlighting, lip-sync, click-to-seek transcripts, and accessibility.
Timestamps are opt-in: you call the with-timestamps variant of synthesis (e.g. synthesizeWithTimestamps in the JavaScript SDK). Plain synthesis returns audio only. The model must support subtitles — the default TTS model does.
Granularity
Choose how finely the audio is segmented:
- word (default): one entry per word. Best for karaoke highlighting and precise seeking.
- sentence: one entry per sentence. Good for caption blocks.
Response format
The with-timestamps call returns a JSON envelope:
{
  "audio_base64": "<base64-encoded audio>",
  "format": "mp3",
  "usage_characters": 64,
  "audio_length_ms": 4644,
  "alignment": {
    "granularity": "word",
    "items": [
      { "text": "Hello", "start_ms": 43, "end_ms": 469, "text_start": 0, "text_end": 5 },
      { "text": "world", "start_ms": 469, "end_ms": 725, "text_start": 6, "text_end": 11 }
    ]
  }
}

alignment.items is an ordered list. Each item has:
| Field | Meaning |
|---|---|
| text | The spoken text of this unit (word or sentence). |
| start_ms / end_ms | Start and end time in milliseconds, relative to the start of the audio. |
| text_start / text_end | Character offsets of this unit in the input text (when reported), for mapping back to the source. |
SDKs surface the same data in their idiomatic shape (e.g. the JavaScript SDK decodes audio_base64 into bytes and exposes alignment.items with startMs / endMs).
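If you consume the raw JSON envelope directly, the shape above can be sketched in TypeScript as follows. This is a minimal illustration, not SDK code: the interface and helper names (TimestampedAudio, sourceSlice, isOrdered) are hypothetical, and only the field names are taken from the sample response. It also assumes that text_start / text_end index the input the same way JavaScript string indices do, which holds for plain ASCII text.

```typescript
// Field names follow the sample response; the type names are illustrative.
interface AlignmentItem {
  text: string;
  start_ms: number;
  end_ms: number;
  text_start?: number; // offsets are only present "when reported"
  text_end?: number;
}

interface TimestampedAudio {
  audio_base64: string;
  format: string;
  usage_characters: number;
  audio_length_ms: number;
  alignment: {
    granularity: "word" | "sentence";
    items: AlignmentItem[];
  };
}

// Map an item back to the input text. Falls back to item.text when the
// offsets were not reported.
function sourceSlice(inputText: string, item: AlignmentItem): string {
  if (item.text_start === undefined || item.text_end === undefined) {
    return item.text;
  }
  return inputText.slice(item.text_start, item.text_end);
}

// Hypothetical sanity check: no item ends before it starts, and start
// times never go backwards.
function isOrdered(items: readonly AlignmentItem[]): boolean {
  let previousStart = 0;
  for (const item of items) {
    if (item.start_ms < previousStart || item.end_ms < item.start_ms) {
      return false;
    }
    previousStart = item.start_ms;
  }
  return true;
}

// The sample envelope from this page, with the audio left as a placeholder.
const raw = `{
  "audio_base64": "<base64-encoded audio>",
  "format": "mp3",
  "usage_characters": 64,
  "audio_length_ms": 4644,
  "alignment": {
    "granularity": "word",
    "items": [
      { "text": "Hello", "start_ms": 43, "end_ms": 469, "text_start": 0, "text_end": 5 },
      { "text": "world", "start_ms": 469, "end_ms": 725, "text_start": 6, "text_end": 11 }
    ]
  }
}`;

const response = JSON.parse(raw) as TimestampedAudio;
const inputText = "Hello world"; // assumed beginning of the synthesized text
const words = response.alignment.items.map((item) => sourceSlice(inputText, item));
console.log(words); // [ 'Hello', 'world' ]
```

Note that the type assertion on JSON.parse does not validate anything at runtime; a production client would likely check the fields before trusting them.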
Use cases
- Karaoke / word highlighting: highlight the active word as audio plays by comparing the playback clock to each item's start_ms/end_ms.
- Captions / subtitles: render sentence-granularity items as timed caption blocks (export to SRT/VTT).
- Click-to-seek transcripts: make each word clickable; seek the audio to its start_ms.
- Lip-sync / animation: drive mouth shapes or beats from word timings.
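The highlighting and click-to-seek cases both reduce to converting between a playback position and an item. The sketch below shows one way to do that with a binary search. It assumes the items are sorted by start time and do not overlap; the names TimedItem, activeIndex, and seekSeconds are hypothetical and not part of any SDK.

```typescript
// Minimal item shape needed for timing lookups (names are illustrative).
interface TimedItem {
  text: string;
  start_ms: number;
  end_ms: number;
}

// Return the index of the item whose half-open range [start_ms, end_ms)
// contains positionMs, or -1 when the position falls in a gap (before the
// first item, between items, or after the last one).
function activeIndex(items: readonly TimedItem[], positionMs: number): number {
  let lo = 0;
  let hi = items.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const item = items[mid];
    if (item === undefined) break; // defensive guard for strict index checks
    if (positionMs < item.start_ms) {
      hi = mid - 1;
    } else if (positionMs >= item.end_ms) {
      lo = mid + 1;
    } else {
      return mid;
    }
  }
  return -1;
}

// Click-to-seek: media elements take seconds, the alignment uses milliseconds.
function seekSeconds(item: TimedItem): number {
  return item.start_ms / 1000;
}

const sampleItems: TimedItem[] = [
  { text: "Hello", start_ms: 43, end_ms: 469 },
  { text: "world", start_ms: 469, end_ms: 725 },
];

console.log(activeIndex(sampleItems, 100)); // 0
console.log(activeIndex(sampleItems, 469)); // 1 (a shared boundary belongs to the later item)
console.log(activeIndex(sampleItems, 10));  // -1 (before the first word)
```

In a browser, the playback position in milliseconds would typically come from an audio element's currentTime multiplied by 1000, and a click handler would assign the result of seekSeconds back to currentTime.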
For working code, see the JavaScript or Unity TTS guide.
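Independently of those guides, the caption export mentioned above can be sketched in a few lines. The example below turns sentence-granularity items into SRT text. The helper names (CaptionItem, srtTime, toSrt) and the sample sentences with their timings are made up for illustration; only the start_ms / end_ms / text fields come from the alignment format.

```typescript
// Minimal item shape needed for caption export (names are illustrative).
interface CaptionItem {
  text: string;
  start_ms: number;
  end_ms: number;
}

// Format milliseconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(ms: number): string {
  const total = Math.max(0, Math.round(ms));
  const pad = (n: number, width: number): string => String(n).padStart(width, "0");
  const hours = Math.floor(total / 3600000);
  const minutes = Math.floor((total % 3600000) / 60000);
  const seconds = Math.floor((total % 60000) / 1000);
  const millis = total % 1000;
  return `${pad(hours, 2)}:${pad(minutes, 2)}:${pad(seconds, 2)},${pad(millis, 3)}`;
}

// Build one numbered SRT cue per item, separated by blank lines.
function toSrt(items: readonly CaptionItem[]): string {
  return items
    .map(
      (item, i) =>
        `${i + 1}\n${srtTime(item.start_ms)} --> ${srtTime(item.end_ms)}\n${item.text}\n`,
    )
    .join("\n");
}

// Illustrative sentence-level items (not real API output).
const captions: CaptionItem[] = [
  { text: "Hello world.", start_ms: 43, end_ms: 725 },
  { text: "How are you?", start_ms: 900, end_ms: 1800 },
];

const srt = toSrt(captions);
console.log(srt);
// 1
// 00:00:00,043 --> 00:00:00,725
// Hello world.
//
// 2
// 00:00:00,900 --> 00:00:01,800
// How are you?
```

WebVTT is close to this: the file starts with a WEBVTT header line and timestamps use a period instead of a comma before the milliseconds, so the same structure can be adapted with small changes.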