Transcript — Architecture
How the transcript module works internally
Overview
The transcript module fetches YouTube captions without using the YouTube Data API v3. Instead, it interacts with YouTube's internal Innertube API — the same API used by the YouTube web player and mobile apps.
Three-phase fetch flow
Every transcribeVideo() call performs up to 3 HTTP requests:
┌─────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  1. Watch   │────▶│   2. Innertube   │────▶│  3. Transcript  │
│    Page     │     │    Player API    │     │       XML       │
│    (GET)    │     │      (POST)      │     │      (GET)      │
└─────────────┘     └──────────────────┘     └─────────────────┘
       │                     │                       │
       │ Extract API key     │ Get caption tracks    │ Parse XML into
       │ from HTML           │ and video details     │ TranscriptLine[]
Phase 1: Watch page
A GET request to https://www.youtube.com/watch?v={videoId} fetches the video's HTML page. The module extracts the INNERTUBE_API_KEY from embedded JavaScript.
const watchUrl = `https://www.youtube.com/watch?v=${identifier}`
const watchRes = await fetch(watchUrl, { headers: { 'User-Agent': userAgent } })
const watchBody = await watchRes.text()
const apiKey = watchBody.match(/"INNERTUBE_API_KEY":"([^"]+)"/)[1]
Phase 2: Innertube Player API
A POST request to https://www.youtube.com/youtubei/v1/player?key={apiKey} is sent using the Android client context. The response contains:
- Caption tracks: available languages with transcript URLs
- Video details: title, author, view count, description (when includeMeta: true)
- Playability status: whether the video is playable
const playerBody = JSON.stringify({
  context: {
    client: {
      clientName: 'ANDROID',
      clientVersion: '20.10.38',
    },
  },
  videoId: identifier,
})
The Android client context is used because it provides caption track URLs in the player response, which some other client contexts do not.
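The hand-off from phase 2 to phase 3 can be sketched as follows. This is a minimal illustration, not the module's actual code: the `pickTranscriptUrl` helper is hypothetical, and the `captions.playerCaptionsTracklistRenderer.captionTracks` path is an assumption about the shape of the player response.

```typescript
// Assumed subset of the player response; only the fields used here are typed.
interface CaptionTrack {
  baseUrl: string
  languageCode: string
}

interface PlayerResponse {
  captions?: {
    playerCaptionsTracklistRenderer?: {
      captionTracks?: CaptionTrack[]
    }
  }
}

// Hypothetical helper: returns the transcript URL for `lang` with the
// `fmt` parameter removed, or null when no track matches.
function pickTranscriptUrl(player: PlayerResponse, lang: string): string | null {
  const tracks = player.captions?.playerCaptionsTracklistRenderer?.captionTracks ?? []
  const track = tracks.find((t) => t.languageCode === lang)
  if (!track) return null

  // Dropping `fmt` asks the endpoint for its raw XML output.
  const url = new URL(track.baseUrl)
  url.searchParams.delete('fmt')
  return url.toString()
}
```

Returning null for a missing language keeps the decision about which error to throw with the caller.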
Phase 3: Transcript XML
A GET request to the caption track URL (extracted from phase 2) downloads the raw XML transcript. The fmt query parameter is stripped to get the raw XML format.
<transcript>
<text start="0" dur="3.36">Hello and welcome to this video.</text>
<text start="3.36" dur="2.64">Today we are going to talk about</text>
</transcript>
Video ID resolution
The module reuses the existing extractVideoId() utility from lyra-sdk/url to support multiple input formats:
import { extractVideoId } from '../utils/url-patterns.js'
// All of these resolve to 'dQw4w9WgXcQ'
resolveVideoId('dQw4w9WgXcQ')
resolveVideoId('https://www.youtube.com/watch?v=dQw4w9WgXcQ')
resolveVideoId('https://youtu.be/dQw4w9WgXcQ')
resolveVideoId('https://www.youtube.com/embed/dQw4w9WgXcQ')
resolveVideoId('https://www.youtube.com/shorts/dQw4w9WgXcQ')
resolveVideoId() first checks whether the input is a raw 11-character ID (/^[a-zA-Z0-9_-]{11}$/), then falls back to URL pattern extraction with extractVideoId().
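The two-step resolution can be approximated with the sketch below. The function name `resolveVideoIdSketch`, the URL patterns, and the null return for unrecognized input are all assumptions made for illustration; the real patterns live in the SDK's URL utilities and may differ.

```typescript
// Raw ID check, as quoted in the text above.
const RAW_ID = /^[a-zA-Z0-9_-]{11}$/

// Hypothetical URL patterns covering the formats listed above.
const URL_PATTERNS: RegExp[] = [
  /[?&]v=([a-zA-Z0-9_-]{11})(?:[&#]|$)/,
  /youtu\.be\/([a-zA-Z0-9_-]{11})(?:[?&#/]|$)/,
  /\/(?:embed|shorts)\/([a-zA-Z0-9_-]{11})(?:[?&#/]|$)/,
]

function resolveVideoIdSketch(input: string): string | null {
  const trimmed = input.trim()

  // Step 1: accept a bare 11-character ID as-is.
  if (RAW_ID.test(trimmed)) return trimmed

  // Step 2: fall back to URL pattern extraction.
  for (const pattern of URL_PATTERNS) {
    const id = pattern.exec(trimmed)?.[1]
    if (id) return id
  }
  return null
}
```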
XML parsing and entity decoding
Transcript XML uses standard XML entities that need decoding:
const XML_ENTITIES = {
  '&amp;': '&',
  '&lt;': '<',
  '&gt;': '>',
  '&quot;': '"',
  '&#39;': "'",
  '&apos;': "'",
}
The parser uses a regex (/<text start="([^"]*)" dur="([^"]*)">([^<]*)<\/text>/g) to extract start, dur, and text content from each <text> element.
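Putting the regex and the entity table together, a parser along these lines would produce the transcript lines. This is a sketch under assumptions: the `TranscriptLine` field names (`text`, `start`, `duration`) and both function names are hypothetical, and the entity list is limited to the six entries named above.

```typescript
// Assumed shape; the real TranscriptLine type may use different field names.
interface TranscriptLine {
  text: string
  start: number
  duration: number
}

const ENTITY_MAP: Record<string, string> = {
  '&amp;': '&',
  '&lt;': '<',
  '&gt;': '>',
  '&quot;': '"',
  '&#39;': "'",
  '&apos;': "'",
}

// Single pass over the input, so "&amp;lt;" decodes to "&lt;" rather than "<".
function decodeXmlEntitiesSketch(text: string): string {
  return text.replace(/&(?:amp|lt|gt|quot|apos|#39);/g, (entity) => ENTITY_MAP[entity] ?? entity)
}

function parseTranscriptXmlSketch(xml: string): TranscriptLine[] {
  // Created per call so the global regex always starts at index 0.
  const pattern = /<text start="([^"]*)" dur="([^"]*)">([^<]*)<\/text>/g
  const lines: TranscriptLine[] = []
  let match: RegExpExecArray | null
  while ((match = pattern.exec(xml)) !== null) {
    lines.push({
      start: Number(match[1]),
      duration: Number(match[2]),
      text: decodeXmlEntitiesSketch(match[3] ?? ''),
    })
  }
  return lines
}
```

Decoding in a single pass avoids the double-decoding that can occur when entities are replaced one after another.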
Module structure
packages/core/src/
modules/
transcript.ts # Public API: TranscriptClient, transcribeVideo, listCaptionTracks
transcript/
constants.ts # Default UA, regex patterns, Innertube config, cache/retry defaults
types.ts # All transcript types + CacheStore interface
errors.ts # 7 error classes extending TranscriptError
parse.ts # resolveVideoId, validateLang, parseTranscriptXml, decodeXmlEntities
fetch.ts # 3-phase HTTP flow, cache integration, retry wrapping
format.ts # toSRT, toVTT, toPlainText converters
retry.ts # fetchWithRetry with exponential backoff
cache/
index.ts # Barrel exports
memory-store.ts # InMemoryCache (Map-based, TTL, maxEntries)
file-store.ts # FsCache (JSON files, TTL, auto-mkdir)
Error detection logic
The module distinguishes several failure scenarios based on the response from each phase:
| Condition | Error thrown |
|---|---|
| Watch page returns non-200 | TranscriptVideoUnavailableError |
| Watch page contains reCAPTCHA | TranscriptRateLimitError |
| No INNERTUBE_API_KEY found in page | TranscriptNotFoundError |
| Player response has no captions + playable | TranscriptDisabledError |
| Player response has no captions + unplayable | TranscriptNotFoundError |
| Captions exist but requested lang missing | TranscriptLanguageError |
| Transcript XML has zero parsed lines | TranscriptNotFoundError |
| Transcript fetch returns 429 | TranscriptRateLimitError |
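The first six rows of the table can be read as two small decision functions, sketched below. The error names come from the table, but everything else is illustrative: the functions return names rather than throwing, the input shapes are hypothetical, and the `class="g-recaptcha"` marker is an assumed way to detect a reCAPTCHA page.

```typescript
type TranscriptErrorName =
  | 'TranscriptVideoUnavailableError'
  | 'TranscriptRateLimitError'
  | 'TranscriptNotFoundError'
  | 'TranscriptDisabledError'
  | 'TranscriptLanguageError'

interface WatchPageResult {
  status: number
  body: string
}

// Phase 1 checks, in table order. Returns null when the page looks usable.
function classifyWatchPage(page: WatchPageResult): TranscriptErrorName | null {
  if (page.status !== 200) return 'TranscriptVideoUnavailableError'
  // Assumed marker for a reCAPTCHA interstitial.
  if (page.body.includes('class="g-recaptcha"')) return 'TranscriptRateLimitError'
  if (!/"INNERTUBE_API_KEY":"[^"]+"/.test(page.body)) return 'TranscriptNotFoundError'
  return null
}

// Hypothetical summary of the player response.
interface PlayerSummary {
  languages: string[]
  playable: boolean
}

// Phase 2 checks: no captions at all, then a missing requested language.
function classifyPlayer(summary: PlayerSummary, lang: string): TranscriptErrorName | null {
  if (summary.languages.length === 0) {
    return summary.playable ? 'TranscriptDisabledError' : 'TranscriptNotFoundError'
  }
  if (!summary.languages.includes(lang)) return 'TranscriptLanguageError'
  return null
}
```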
Caching integration
When a cache option is provided, the module checks cache before making any HTTP requests and stores the result after a successful fetch:
transcribeVideo()
│
├── cache.get(key) → hit? → return parsed JSON
│
├── (miss) → Phase 1 → Phase 2 → Phase 3 → parse XML
│
└── cache.set(key, JSON.stringify(result)) → return result
Cache keys include the video ID, language code, and whether metadata was requested. Cache failures are silently caught, so they never break a request.
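A cache wrapper that follows this flow might look like the sketch below. The asynchronous `CacheStore` shape, the key layout in `buildCacheKey`, and the `getOrFetch` helper are assumptions for illustration; only the get/set calls, the key ingredients, and the silent failure handling come from the description above.

```typescript
// Assumed contract: string values and async methods.
interface CacheStore {
  get(key: string): Promise<string | null>
  set(key: string, value: string): Promise<void>
}

// Hypothetical key layout covering video ID, language, and the metadata flag.
function buildCacheKey(videoId: string, lang: string, includeMeta: boolean): string {
  return `transcript:${videoId}:${lang}:${includeMeta ? 'meta' : 'nometa'}`
}

async function getOrFetch<T>(
  cache: CacheStore | undefined,
  key: string,
  fetchFresh: () => Promise<T>,
): Promise<T> {
  if (cache) {
    try {
      const hit = await cache.get(key)
      if (hit !== null) return JSON.parse(hit) as T
    } catch {
      // Read failed or the entry held invalid JSON: fall through to the network.
    }
  }

  // Cache miss: run the three-phase fetch supplied by the caller.
  const result = await fetchFresh()

  if (cache) {
    try {
      await cache.set(key, JSON.stringify(result))
    } catch {
      // Write failed: the fresh result is still returned.
    }
  }
  return result
}
```

Including the metadata flag in the key prevents a result cached without video details from being served to a caller that asked for them.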
Retry integration
All three HTTP phases are independently wrapped with fetchWithRetry(). Each phase gets its own retry counter:
Phase 1: fetchWithRetry(watchPage, retries, delay, signal)
Phase 2: fetchWithRetry(playerAPI, retries, delay, signal)
Phase 3: fetchWithRetry(transcriptXML, retries, delay, signal)
The retry logic uses delay * 2^attempt for exponential backoff and checks AbortSignal between attempts.
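The backoff schedule can be illustrated with a small generic helper. This is a sketch rather than the module's fetchWithRetry: the name `retryWithBackoff`, the error thrown on abort, and the choice to retry on any thrown error are assumptions, and the signal parameter is typed structurally so that an AbortSignal can be passed in.

```typescript
// delay * 2^attempt, as described above.
function backoffDelay(baseDelayMs: number, attempt: number): number {
  return baseDelayMs * 2 ** attempt
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms))
}

// Hypothetical helper mirroring the (fn, retries, delay, signal) argument order.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  retries: number,
  delayMs: number,
  signal?: { readonly aborted: boolean }, // structural stand-in for AbortSignal
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt <= retries; attempt++) {
    // Checked before every attempt, so an abort stops further retries.
    if (signal?.aborted) throw new Error('Aborted')
    try {
      return await fn()
    } catch (error) {
      lastError = error
      // Wait only if another attempt will follow.
      if (attempt < retries) await sleep(backoffDelay(delayMs, attempt))
    }
  }
  throw lastError
}
```

With a base delay of 500 ms and three retries, the waits between attempts would be 500 ms, 1000 ms, and 2000 ms.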
Rate limiting considerations
Since this module doesn't use the YouTube Data API, there's no formal quota system. However, YouTube may still rate-limit requests:
- YouTube may show a reCAPTCHA page if too many requests come from the same IP
- The module detects reCAPTCHA and throws TranscriptRateLimitError
- Use the customFetch option to route through rotating proxies if needed
- Caching dramatically reduces the number of HTTP requests
- Retry with backoff helps handle transient 429 responses
Comparison with the core SDK
| Aspect | Core SDK | Transcript module |
|---|---|---|
| Authentication | API key (yt(key)) | None |
| Data source | googleapis.com/youtube/v3 | youtube.com (Innertube) |
| HTTP client | HttpClient class | Native fetch |
| Quota | Consumed | Not applicable |
| Rate limiting | Formal quota system | Informal (reCAPTCHA) |
| Import path | lyra-sdk | lyra-sdk/transcript |
| Sub-entry exports | /url, /fmt | /transcript |