Transcript — Architecture

How the transcript module works internally

Overview

The transcript module fetches YouTube captions without using the YouTube Data API v3. Instead, it interacts with YouTube's internal Innertube API — the same API used by the YouTube web player and mobile apps.


Three-phase fetch flow

Every transcribeVideo() call performs up to 3 HTTP requests:

┌─────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  1. Watch    │────▶│  2. Innertube    │────▶│  3. Transcript  │
│     Page     │     │     Player API   │     │     XML         │
│  (GET)       │     │  (POST)          │     │  (GET)          │
└─────────────┘     └──────────────────┘     └─────────────────┘
│                   │                        │
│ Extract API key   │ Get caption tracks     │ Parse XML into
│ from HTML         │ and video details      │ TranscriptLine[]

Phase 1: Watch page

A GET request to https://www.youtube.com/watch?v={videoId} fetches the video's HTML page. The module extracts the INNERTUBE_API_KEY from embedded JavaScript.

const watchUrl = `https://www.youtube.com/watch?v=${identifier}`
const watchRes = await fetch(watchUrl, { headers: { 'User-Agent': userAgent } })
const watchBody = await watchRes.text()
// match() returns null when the key is absent, so avoid indexing the result directly
const apiKey = watchBody.match(/"INNERTUBE_API_KEY":"([^"]+)"/)?.[1]

Phase 2: Innertube Player API

A POST request to https://www.youtube.com/youtubei/v1/player?key={apiKey} is sent using the Android client context. The response contains:

  • Caption tracks — available languages with transcript URLs
  • Video details — title, author, view count, description (when includeMeta: true)
  • Playability status — whether the video is playable

const playerBody = JSON.stringify({
  context: {
    client: {
      clientName: 'ANDROID',
      clientVersion: '20.10.38',
    },
  },
  videoId: identifier,
})

The Android client context is used because it provides caption track URLs in the player response, which some other client contexts do not.
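
The request and the caption-track lookup could be sketched roughly as follows. This is an illustration rather than the module's actual code: the captions.playerCaptionsTracklistRenderer.captionTracks path and the baseUrl / languageCode fields reflect the commonly observed shape of the player response and are assumptions here, as are the names PlayerResponse, fetchPlayerResponse, and findCaptionTrack.

```typescript
// Assumed (partial) shape of the Innertube player response.
interface CaptionTrack {
  baseUrl: string
  languageCode: string
}

interface PlayerResponse {
  playabilityStatus?: { status?: string }
  captions?: {
    playerCaptionsTracklistRenderer?: { captionTracks?: CaptionTrack[] }
  }
}

// Hypothetical helper: POST the Android client context to the player endpoint.
async function fetchPlayerResponse(apiKey: string, videoId: string): Promise<PlayerResponse> {
  const res = await fetch(`https://www.youtube.com/youtubei/v1/player?key=${apiKey}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      context: { client: { clientName: 'ANDROID', clientVersion: '20.10.38' } },
      videoId,
    }),
  })
  return (await res.json()) as PlayerResponse
}

// Hypothetical helper: return the track for `lang`, or undefined if it is absent.
function findCaptionTrack(player: PlayerResponse, lang: string): CaptionTrack | undefined {
  const tracks = player.captions?.playerCaptionsTracklistRenderer?.captionTracks ?? []
  return tracks.find((track) => track.languageCode === lang)
}
```

Keeping the lookup separate from the network call means the track selection can be exercised against a recorded response without touching YouTube.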

Phase 3: Transcript XML

A GET request to the caption track URL (extracted from phase 2) downloads the raw XML transcript. The fmt query parameter is stripped to get the raw XML format.

<transcript>
  <text start="0" dur="3.36">Hello and welcome to this video.</text>
  <text start="3.36" dur="2.64">Today we are going to talk about</text>
</transcript>
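
Stripping the fmt parameter can be done with the standard URL API. The helper name toRawXmlUrl is hypothetical, and the URL in the usage note is only illustrative; the module may implement this step differently (for example with a string replace).

```typescript
// Hypothetical helper: drop the `fmt` query parameter from a caption track URL
// so that the endpoint answers with the raw XML format.
function toRawXmlUrl(trackUrl: string): string {
  const url = new URL(trackUrl)
  url.searchParams.delete('fmt') // no-op when the parameter is not present
  return url.toString()
}
```

For example, toRawXmlUrl('https://example.com/timedtext?v=abc&fmt=srv3&lang=en') yields 'https://example.com/timedtext?v=abc&lang=en'.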

Video ID resolution

The module's resolveVideoId() function (in parse.ts) reuses the existing extractVideoId() utility from lyra-sdk/url to support multiple input formats:

import { extractVideoId } from '../utils/url-patterns.js'

// All of these resolve to 'dQw4w9WgXcQ'
resolveVideoId('dQw4w9WgXcQ')
resolveVideoId('https://www.youtube.com/watch?v=dQw4w9WgXcQ')
resolveVideoId('https://youtu.be/dQw4w9WgXcQ')
resolveVideoId('https://www.youtube.com/embed/dQw4w9WgXcQ')
resolveVideoId('https://www.youtube.com/shorts/dQw4w9WgXcQ')

The function first checks if the input is a raw 11-character ID (/^[a-zA-Z0-9_-]{11}$/), then falls back to URL pattern extraction.
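
The two-step resolution described above might look like the sketch below. The URL patterns are a hypothetical stand-in for extractVideoId(), which may cover more URL shapes, and returning null for unrecognized input is an assumption; the Sketch suffix marks names that are not part of the SDK.

```typescript
const RAW_ID = /^[a-zA-Z0-9_-]{11}$/

// Hypothetical stand-in for the patterns behind extractVideoId().
const URL_PATTERNS: RegExp[] = [
  /[?&]v=([a-zA-Z0-9_-]{11})/, // watch?v=ID
  /youtu\.be\/([a-zA-Z0-9_-]{11})/, // youtu.be/ID
  /\/(?:embed|shorts)\/([a-zA-Z0-9_-]{11})/, // /embed/ID and /shorts/ID
]

function extractVideoIdSketch(input: string): string | null {
  for (const pattern of URL_PATTERNS) {
    const id = input.match(pattern)?.[1]
    if (id) return id
  }
  return null
}

function resolveVideoIdSketch(input: string): string | null {
  const trimmed = input.trim()
  // Step 1: a bare 11-character ID is returned unchanged.
  if (RAW_ID.test(trimmed)) return trimmed
  // Step 2: otherwise fall back to URL pattern extraction.
  return extractVideoIdSketch(trimmed)
}
```

Checking the raw-ID pattern first keeps the common case cheap and avoids running URL patterns against strings that are already valid IDs.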


XML parsing and entity decoding

Transcript XML uses standard XML entities that need decoding:

const XML_ENTITIES = {
  '&amp;': '&',
  '&lt;': '<',
  '&gt;': '>',
  '&quot;': '"',
  '&#39;': "'",
  '&apos;': "'",
}

The parser uses a regex (/<text start="([^"]*)" dur="([^"]*)">([^<]*)<\/text>/g) to extract start, dur, and text content from each <text> element.
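
Put together, the entity table and the regex could drive a parser along these lines. The function names come from parse.ts, but the bodies are a sketch, and the field names text, start, and duration are assumptions about TranscriptLine rather than its confirmed shape.

```typescript
// Assumed shape of a parsed line; the module's TranscriptLine type may differ.
interface TranscriptLineSketch {
  text: string
  start: number
  duration: number
}

const XML_ENTITIES: Record<string, string> = {
  '&amp;': '&',
  '&lt;': '<',
  '&gt;': '>',
  '&quot;': '"',
  '&#39;': "'",
  '&apos;': "'",
}

// Decode in a single pass so that "&amp;lt;" becomes "&lt;" rather than "<".
function decodeXmlEntities(text: string): string {
  return text.replace(/&(?:amp|lt|gt|quot|apos|#39);/g, (entity) => XML_ENTITIES[entity] ?? entity)
}

function parseTranscriptXml(xml: string): TranscriptLineSketch[] {
  const pattern = /<text start="([^"]*)" dur="([^"]*)">([^<]*)<\/text>/g
  const lines: TranscriptLineSketch[] = []
  let match: RegExpExecArray | null
  // exec() advances through the string because the regex has the global flag.
  while ((match = pattern.exec(xml)) !== null) {
    lines.push({
      start: Number(match[1]),
      duration: Number(match[2]),
      text: decodeXmlEntities(match[3] ?? ''),
    })
  }
  return lines
}
```

Replacing entities one at a time with chained replace() calls would decode "&amp;lt;" twice if "&amp;" is handled first; the single regex pass avoids that ordering problem.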


Module structure

packages/core/src/
  modules/
    transcript.ts              # Public API: TranscriptClient, transcribeVideo, listCaptionTracks
  transcript/
    constants.ts               # Default UA, regex patterns, Innertube config, cache/retry defaults
    types.ts                   # All transcript types + CacheStore interface
    errors.ts                  # 7 error classes extending TranscriptError
    parse.ts                   # resolveVideoId, validateLang, parseTranscriptXml, decodeXmlEntities
    fetch.ts                   # 3-phase HTTP flow, cache integration, retry wrapping
    format.ts                  # toSRT, toVTT, toPlainText converters
    retry.ts                   # fetchWithRetry with exponential backoff
    cache/
      index.ts                 # Barrel exports
      memory-store.ts          # InMemoryCache (Map-based, TTL, maxEntries)
      file-store.ts            # FsCache (JSON files, TTL, auto-mkdir)

Error detection logic

The module distinguishes several failure scenarios based on the responses from each of the three phases:

Condition                                      Error thrown
Watch page returns non-200                     TranscriptVideoUnavailableError
Watch page contains reCAPTCHA                  TranscriptRateLimitError
No INNERTUBE_API_KEY found in page             TranscriptNotFoundError
Player response has no captions + playable     TranscriptDisabledError
Player response has no captions + unplayable   TranscriptNotFoundError
Captions exist but requested lang missing      TranscriptLanguageError
Transcript XML has zero parsed lines           TranscriptNotFoundError
Transcript fetch returns 429                   TranscriptRateLimitError
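
The decision order implied by the table can be written as one pure function. This is only a sketch: FetchObservations and classifyFailure are hypothetical names, the function returns the error class name instead of throwing, and the real module presumably raises each error inside the phase where it is detected.

```typescript
type TranscriptErrorName =
  | 'TranscriptVideoUnavailableError'
  | 'TranscriptRateLimitError'
  | 'TranscriptNotFoundError'
  | 'TranscriptDisabledError'
  | 'TranscriptLanguageError'

// Hypothetical summary of what the three phases observed.
interface FetchObservations {
  watchStatus: number
  watchHasRecaptcha: boolean
  apiKeyFound: boolean
  playable: boolean
  captionLangs: string[]
  requestedLang: string
  transcriptStatus: number
  parsedLineCount: number
}

// Returns the name of the error to throw, or null when the fetch succeeded.
function classifyFailure(o: FetchObservations): TranscriptErrorName | null {
  // Phase 1: watch page
  if (o.watchStatus !== 200) return 'TranscriptVideoUnavailableError'
  if (o.watchHasRecaptcha) return 'TranscriptRateLimitError'
  if (!o.apiKeyFound) return 'TranscriptNotFoundError'
  // Phase 2: player response
  if (o.captionLangs.length === 0) {
    return o.playable ? 'TranscriptDisabledError' : 'TranscriptNotFoundError'
  }
  if (o.captionLangs.indexOf(o.requestedLang) === -1) return 'TranscriptLanguageError'
  // Phase 3: transcript XML
  if (o.transcriptStatus === 429) return 'TranscriptRateLimitError'
  if (o.parsedLineCount === 0) return 'TranscriptNotFoundError'
  return null
}
```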

Caching integration

When a cache option is provided, the module checks cache before making any HTTP requests and stores the result after a successful fetch:

transcribeVideo()
  ├── cache.get(key) → hit? → return parsed JSON
  ├── (miss) → Phase 1 → Phase 2 → Phase 3 → parse XML
  └── cache.set(key, JSON.stringify(result)) → return result

Cache keys include the video ID, language code, and whether metadata was requested. Cache failures are silently caught — they never break a request.
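
A Map-based store with TTL and a size cap, in the spirit of InMemoryCache, might be sketched as follows. Everything here is an assumption made for illustration: the key format, the class and helper names, the injectable clock, and the synchronous methods (the module's CacheStore interface may well be asynchronous).

```typescript
// Hypothetical key format combining video ID, language, and the metadata flag.
function buildCacheKey(videoId: string, lang: string, includeMeta: boolean): string {
  return `${videoId}:${lang}:${includeMeta ? 'meta' : 'nometa'}`
}

class MemoryCacheSketch {
  private entries = new Map<string, { value: string; expiresAt: number }>()
  private ttlMs: number
  private maxEntries: number
  private now: () => number

  constructor(ttlMs: number, maxEntries: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs
    this.maxEntries = maxEntries
    this.now = now
  }

  get(key: string): string | null {
    const entry = this.entries.get(key)
    if (!entry) return null
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key) // expired entries are evicted lazily on read
      return null
    }
    return entry.value
  }

  set(key: string, value: string): void {
    this.entries.delete(key) // re-inserting moves the key to the end of the order
    if (this.entries.size >= this.maxEntries) {
      // A Map iterates in insertion order, so the first key is the oldest entry.
      const oldest = this.entries.keys().next().value
      if (oldest !== undefined) this.entries.delete(oldest)
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs })
  }
}

// Cache errors are swallowed so that a broken store never fails the request.
function safeCacheGet(cache: MemoryCacheSketch, key: string): string | null {
  try {
    return cache.get(key)
  } catch {
    return null
  }
}
```

Including the metadata flag in the key prevents a result cached without video details from being served to a later call that asked for them.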


Retry integration

All three HTTP phases are independently wrapped with fetchWithRetry(). Each phase gets its own retry counter:

Phase 1: fetchWithRetry(watchPage, retries, delay, signal)
Phase 2: fetchWithRetry(playerAPI, retries, delay, signal)
Phase 3: fetchWithRetry(transcriptXML, retries, delay, signal)

The retry logic uses delay * 2^attempt for exponential backoff and checks AbortSignal between attempts.
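
The backoff formula and the abort check can be illustrated with the sketch below. The parameter order follows the calls shown above, but the body is an assumption: it retries on any thrown error and throws a plain Error on abort, whereas the real fetchWithRetry may be selective about which failures it retries. The helpers backoffDelay and sleep are hypothetical.

```typescript
// Delay before the retry that follows attempt number `attempt` (0-based): delay * 2^attempt.
function backoffDelay(delay: number, attempt: number): number {
  return delay * 2 ** attempt
}

// Resolve after `ms`, or reject early if the signal aborts while waiting.
function sleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal?.aborted) {
      reject(new Error('Aborted'))
      return
    }
    const onAbort = () => {
      clearTimeout(timer)
      reject(new Error('Aborted'))
    }
    const timer = setTimeout(() => {
      signal?.removeEventListener('abort', onAbort)
      resolve()
    }, ms)
    signal?.addEventListener('abort', onAbort, { once: true })
  })
}

// Run `operation` up to retries + 1 times, backing off between attempts.
async function fetchWithRetry<T>(
  operation: () => Promise<T>,
  retries: number,
  delay: number,
  signal?: AbortSignal,
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt <= retries; attempt++) {
    if (signal?.aborted) throw new Error('Aborted') // checked between attempts
    try {
      return await operation()
    } catch (error) {
      lastError = error
      if (attempt < retries) await sleep(backoffDelay(delay, attempt), signal)
    }
  }
  throw lastError
}
```

With delay = 500 and retries = 3, the waits between attempts are 500 ms, 1000 ms, and 2000 ms. Because each phase calls the function separately, a failure in phase 3 does not consume the retry budget of phases 1 and 2.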


Rate limiting considerations

Since this module doesn't use the YouTube Data API, there's no formal quota system. However, YouTube may still rate-limit requests:

  • YouTube may show a reCAPTCHA page if too many requests come from the same IP
  • The module detects reCAPTCHA and throws TranscriptRateLimitError
  • Use the customFetch option to route through rotating proxies if needed
  • Caching avoids all three HTTP requests whenever a video's transcript is already in the cache
  • Retry with backoff helps handle transient 429 responses

Comparison with the core SDK

Aspect              Core SDK                    Transcript module
Authentication      API key (yt(key))           None
Data source         googleapis.com/youtube/v3   youtube.com (Innertube)
HTTP client         HttpClient class            Native fetch
Quota               Consumed                    Not applicable
Rate limiting       Formal quota system         Informal (reCAPTCHA)
Import path         lyra-sdk                    lyra-sdk/transcript
Sub-entry exports   /url, /fmt                  /transcript
